
Spark-Redis in Practice: Problems Encountered When Running Massive Insert and Query Jobs

Posted by 我爱次火锅锅 on 2020/11/27 11:46:17

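[Abstract] The previous post was an introduction to Spark-Redis, covering the basic concepts and the important classes and methods. Spark-Redis is a package for reading and writing Redis data from Spark, and it supports all of Redis's data structures. Because Redis is an in-memory database, its stability is not very high, especially in standalone mode. As a result, using Spark-Redis at work raises quite a few problems, particularly when running massive insert and query jobs.

The previous post was an introduction to Spark-Redis, covering the basic concepts and the important classes and methods. Spark-Redis is a package for reading and writing Redis data from Spark. It supports all of Redis's data structures: String, Hash, List, Set and Sorted Set. The module works with standalone Redis as well as with a Redis cluster. Because Redis is an in-memory database, its stability is not very high, especially in standalone mode. As a result, using Spark-Redis at work raises quite a few problems, particularly when running massive insert and query jobs.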

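Querying massive data

Redis serves reads from memory, so it reads faster than most other databases. Even so, a query over tens of millions of records takes a long time, Redis or not. If we decide to abort such a select job, what we want is for every running task to be killed immediately.

Spark has a job scheduling mechanism. SparkContext is the entry point of Spark, comparable to the main function of an application. Its cancelJobGroup function cancels the running jobs that belong to a given job group.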

/**
  * Cancel active jobs for the specified group. See `org.apache.spark.SparkContext.setJobGroup`
  * for more information.
  */
 def cancelJobGroup(groupId: String) {
   assertNotStopped()
   dagScheduler.cancelJobGroup(groupId)
 }

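In principle, once a job is cancelled, all of its tasks should stop as well. When we cancel the select job, the executor marks the task as killed. The next time the TaskContext responsible for that task runs killTaskIfInterrupted, it sees the kill reason and throws TaskKilledException, which ends the task.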

 // If this task has been killed before we deserialized it, let's quit now. Otherwise,
 // continue executing the task.
 val killReason = reasonIfKilled
 if (killReason.isDefined) {
   // Throw an exception rather than returning, because returning within a try{} block
   // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
   // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
   // for the task.
   throw new TaskKilledException(killReason.get)
 }

/**
 * If the task is interrupted, throws TaskKilledException with the reason for the interrupt.
 */
 private[spark] def killTaskIfInterrupted(): Unit

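In Spark-Redis, however, a job can be cancelled while its tasks keep running. The reason is that the task's computation is ultimately implemented in RedisRDD, whose compute fetches the keys through Jedis, and nothing checks the kill flag while that call is in progress. To solve the problem, the running task has to be cancelled inside RedisRDD. There are two ways to do this:

Method 1: follow Spark's JDBCRDD, define close(), and combine it with InterruptibleIterator.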

def close() {
   if (closed) return
   try {
     if (null != rs) {
       rs.close()
     }
   } catch {
     case e: Exception => logWarning("Exception closing resultset", e)
   }
   try {
     if (null != stmt) {
       stmt.close()
     }
   } catch {
     case e: Exception => logWarning("Exception closing statement", e)
   }
   try {
     if (null != conn) {
       if (!conn.isClosed && !conn.getAutoCommit) {
         try {
           conn.commit()
         } catch {
           case NonFatal(e) => logWarning("Exception committing transaction", e)
         }
       }
       conn.close()
     }
     logInfo("closed connection")
   } catch {
     case e: Exception => logWarning("Exception closing connection", e)
   }
   closed = true
 }
 
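 // Release the resources when the task completes, and also as soon as the
 // iterator is exhausted; close() is idempotent, so calling it twice is safe.
 context.addTaskCompletionListener{ context => close() }
 CompletionIterator[InternalRow, Iterator[InternalRow]](
   new InterruptibleIterator(context, rowsIterator), close())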
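To make the close-on-completion pattern easier to see, here is a minimal sketch that runs without Spark. It is not Spark's CompletionIterator, only a simplified stand-in for the idea: `SketchResource`, `CompletionIteratorSketch` and `CloseOnCompletionDemo` are hypothetical names made up for this illustration, and the resource plays the role that a JDBC connection (or a Jedis connection in RedisRDD) would play.

```scala
// Hypothetical resource standing in for a JDBC or Jedis connection.
final class SketchResource {
  private var closed = false
  // Counts how many times close() actually released something.
  var closeCalls = 0

  // Idempotent close: the second and later calls do nothing,
  // like the `if (closed) return` guard in the close() above.
  def close(): Unit = {
    if (!closed) {
      closeCalls += 1
      closed = true
    }
  }

  def isClosed: Boolean = closed
}

// Simplified completion iterator: runs `completion` exactly once,
// as soon as the wrapped iterator reports that it is exhausted.
final class CompletionIteratorSketch[A](sub: Iterator[A], completion: () => Unit)
  extends Iterator[A] {
  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      completion()
    }
    more
  }

  override def next(): A = sub.next()
}

object CloseOnCompletionDemo {
  // Returns (rows read, whether the resource is closed, effective close count).
  def run(): (Int, Boolean, Int) = {
    val resource = new SketchResource
    val rows = new CompletionIteratorSketch(Iterator("a", "b", "c"), () => resource.close())
    val n = rows.size // consuming the iterator triggers the completion callback
    // A task-completion listener would call close() again; the guard makes it harmless.
    resource.close()
    (n, resource.isClosed, resource.closeCalls)
  }
}
```

Calling `CloseOnCompletionDemo.run()` reads three rows and closes the resource once, even though close() is invoked twice. This is why registering close() both as a completion listener and as the iterator's completion function does not release the connection twice.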

Method 2: run compute on an asynchronous thread and check on the main thread whether the task is interrupted

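 // Shared between the worker thread and the task thread, hence @volatile.
 // (The declarations are assumed; the original fragment did not show them.)
 @volatile var isRequestFinished = false
 @volatile var keys: Iterator[T] = Iterator.empty

 val thread = new Thread() {
   override def run(): Unit = {
     try {
       keys = doCall
     } catch {
       case e: Exception =>
         logWarning("Fetching keys from Redis failed.", e)
     } finally {
       isRequestFinished = true
     }
   }
 }

 // Poll so that the task thread can quit if the user interrupts the job.
 thread.start()
 while (!context.isInterrupted() && !isRequestFinished) {
   Thread.sleep(GetKeysWaitInterval)
 }
 if (context.isInterrupted() && !isRequestFinished) {
   logInfo(s"try to kill task ${context.getKillReason()}")
   context.killTaskIfInterrupted() // throws TaskKilledException
 }
 thread.join()
 CompletionIterator[T, Iterator[T]](
   new InterruptibleIterator(context, keys), close)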

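We run compute on an asynchronous thread and, on the task's own thread, keep checking whether the task has been interrupted; if it has, we call TaskContext's killTaskIfInterrupted. In case killTaskIfInterrupted does not manage to kill the task, we also wrap the result in InterruptibleIterator, an iterator that provides task-kill functionality by checking the interrupt flag in TaskContext.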
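The way an interruptible iterator works can be shown without Spark. The sketch below is a simplified model, not Spark's implementation: `KillFlag`, `TaskKilledSketchException`, `InterruptibleIteratorSketch` and `InterruptibleIteratorDemo` are hypothetical names introduced only for illustration, with `KillFlag` standing in for the interrupt flag held by TaskContext.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for Spark's TaskKilledException.
final class TaskKilledSketchException(val reason: String)
  extends RuntimeException(s"task killed: $reason")

// Hypothetical stand-in for the interrupt flag kept in TaskContext.
final class KillFlag {
  private val reason = new AtomicReference[Option[String]](None)

  def markInterrupted(why: String): Unit = reason.set(Some(why))

  def isInterrupted: Boolean = reason.get.isDefined

  // Same contract as the doc comment quoted earlier: throw if the flag is set.
  def killTaskIfInterrupted(): Unit =
    reason.get.foreach(r => throw new TaskKilledSketchException(r))
}

// Wraps an iterator and checks the flag before handing out each element.
final class InterruptibleIteratorSketch[A](flag: KillFlag, delegate: Iterator[A])
  extends Iterator[A] {
  override def hasNext: Boolean = {
    flag.killTaskIfInterrupted()
    delegate.hasNext
  }

  override def next(): A = delegate.next()
}

object InterruptibleIteratorDemo {
  // Consumes `total` elements, setting the flag once `killAfter` have been read.
  def consume(total: Int, killAfter: Int): Either[String, Int] = {
    val flag = new KillFlag
    val it = new InterruptibleIteratorSketch(flag, Iterator.range(0, total))
    var seen = 0
    try {
      while (it.hasNext) {
        it.next()
        seen += 1
        if (seen == killAfter) flag.markInterrupted("user cancelled")
      }
      Right(seen)
    } catch {
      case e: TaskKilledSketchException => Left(s"${e.reason} after $seen elements")
    }
  }
}
```

With `consume(10, 3)` the loop stops right after the third element, because the next hasNext call sees the flag and throws. The check only happens between elements, which is why the wrapper cannot help while a single blocking call (such as fetching the keys) is still in progress, and why Method 2 polls the flag from a separate thread.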

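Inserting massive data

We all know that Redis keeps its data in memory. Redis also supports persistence, so the data can be backed up to disk. When inserting massive data, if Redis does not have enough memory, some of the data will obviously be lost. What confuses users is this: when the memory Redis has already used exceeds the maximum available memory, Redis reports the error "command not allowed when used memory > 'maxmemory'". But when the data of an insert job is larger than the memory Redis has available, part of the data is lost and no error is reported at all.

The reason, as far as we can tell, is that neither the Jedis client nor the Redis server signals the failure: when memory runs out during an insert, the write does not succeed, but no response comes back either. So the only remedy we have found so far is to enlarge Redis's memory when insert data goes missing.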
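Since the loss is silent, one precaution is to check the memory headroom before starting the insert job. The following is only a sketch and not part of Spark-Redis: `RedisMemoryHeadroom` and its methods are hypothetical names, it assumes the caller has already fetched the text of the Redis `INFO memory` command (for example through Jedis's `info("memory")`), and `estimatedBytes` is the caller's own rough estimate of the payload size. It relies on the `used_memory` and `maxmemory` fields of that output, where `maxmemory:0` means no limit is configured.

```scala
object RedisMemoryHeadroom {
  // Parses the "key:value" lines of a Redis INFO reply into a map,
  // skipping blank lines and "# Section" headers.
  def parseInfo(info: String): Map[String, String] =
    info.split("\r?\n").iterator
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map(_.split(":", 2))
      .collect { case Array(key, value) => key -> value }
      .toMap

  // True when the estimated payload still fits under maxmemory.
  // maxmemory = 0 (or a missing field) is treated as "no limit configured".
  def fits(info: String, estimatedBytes: Long): Boolean = {
    val fields = parseInfo(info)
    val used = fields.get("used_memory").map(_.toLong).getOrElse(0L)
    val max = fields.get("maxmemory").map(_.toLong).getOrElse(0L)
    max == 0L || used + estimatedBytes <= max
  }
}
```

A driver could call `RedisMemoryHeadroom.fits(infoText, estimatedBytes)` before submitting the insert job and refuse to start, or enlarge Redis's memory first, when it returns false. The estimate is necessarily rough, because Redis's per-key overhead is not the same as the size of the raw data.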

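Summary

Spark-Redis is an open-source project that is not yet widely used, and unlike Spark JDBC it has not been commercialized. It therefore still has quite a few problems. With the efforts of its committers, Spark-Redis should keep getting more capable.

[Notice] This content comes from a blogger in the Huawei Cloud developer community and does not represent the views or position of Huawei Cloud or the Huawei Cloud developer community. Reposts must state the source of the article (Huawei Cloud community), the article link, the author and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence; once verified, the community will remove the infringing content immediately. Report email: cloudbbs@huaweicloud.com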
