Building Spark 3.0.1 from Source with the Huawei Cloud Mirror

凌晨五点起床的打工人 · published 2020/11/28 18:39
Abstract: make a small change to the Spark 3.0.1 source, then build it using the Huawei Cloud Maven mirror.

1. Environment

  • git version 1.8.3.1

  • java version "1.8.0_221"

  • scala version 2.12.8

  • apache-maven-3.6.1

1.1 Install git

 [root@hadoop001 spark]# yum install -y git

1.2 Download the Spark source

 git clone https://github.com/apache/spark.git 

You may hit the following problems here.

Problem 1: github.com cannot be resolved

 Cloning into 'spark'...
 fatal: unable to access 'https://github.com/apache/spark.git/': Could not resolve host: github.com; Unknown error

The fix is to pin the github.com address in /etc/hosts:

 [root@hadoop001 ~]# ping github.com
 PING github.com (192.30.255.113) 56(84) bytes of data.
 
 # Add the github.com address to /etc/hosts
 [root@hadoop001 sourcecode]$ vim /etc/hosts
 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
 192.168.100.100 hadoop001
 192.30.255.113 github.com

Problem 2: the connection to port 443 is refused

 [root@hadoop001 ~]# git clone https://github.com/apache/spark.git
 Cloning into 'spark'...
 fatal: unable to access 'https://github.com/apache/spark.git/': Failed connect to github.com:443; Connection refused

The fix: set git's global HTTP/HTTPS proxy, then unset it again:

 [root@hadoop001 ~]# git config --global http.proxy http://127.0.0.1:1080
 [root@hadoop001 ~]# git config --global https.proxy http://127.0.0.1:1080
 [root@hadoop001 ~]# git config --global --unset http.proxy
 [root@hadoop001 ~]# git config --global --unset https.proxy

Inspect the Spark source tree

 [root@hadoop001 ~]# cd spark/
 [root@hadoop001 spark]# ll
 total 380
 -rw-r--r-- 1 root root 2643 Nov 28 11:24 appveyor.yml
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 assembly
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 bin
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 binder
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 build
 drwxr-xr-x 9 root root 4096 Nov 28 11:24 common
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 conf
 -rw-r--r-- 1 root root 997 Nov 28 11:24 CONTRIBUTING.md
 drwxr-xr-x 4 root root 4096 Nov 28 11:24 core
 drwxr-xr-x 5 root root 4096 Nov 28 11:24 data
 drwxr-xr-x 6 root root 4096 Nov 28 11:24 dev
 drwxr-xr-x 9 root root 12288 Nov 28 11:24 docs
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 examples
 drwxr-xr-x 12 root root 4096 Nov 28 11:24 external
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 graphx
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 hadoop-cloud
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 launcher
 -rw-r--r-- 1 root root 13453 Nov 28 11:24 LICENSE
 -rw-r--r-- 1 root root 23221 Nov 28 11:24 LICENSE-binary
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 licenses
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 licenses-binary
 drwxr-xr-x 4 root root 4096 Nov 28 11:24 mllib
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 mllib-local
 -rw-r--r-- 1 root root 2002 Nov 28 11:24 NOTICE
 -rw-r--r-- 1 root root 57677 Nov 28 11:24 NOTICE-binary
 -rw-r--r-- 1 root root 122016 Nov 28 11:24 pom.xml
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 project
 drwxr-xr-x 7 root root 4096 Nov 28 11:24 python
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 R
 -rw-r--r-- 1 root root 4488 Nov 28 11:24 README.md
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 repl
 drwxr-xr-x 5 root root 4096 Nov 28 11:24 resource-managers
 drwxr-xr-x 2 root root 4096 Nov 28 11:24 sbin
 -rw-r--r-- 1 root root 20431 Nov 28 11:24 scalastyle-config.xml
 drwxr-xr-x 6 root root 4096 Nov 28 11:24 sql
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 streaming
 drwxr-xr-x 3 root root 4096 Nov 28 11:24 tools
 

List the Spark branches

 [root@hadoop001 spark]# git branch -a
 * master
  remotes/origin/HEAD -> origin/master
  remotes/origin/branch-0.5
  remotes/origin/branch-0.6
  remotes/origin/branch-0.7
  remotes/origin/branch-0.8
  remotes/origin/branch-0.9
  remotes/origin/branch-1.0
  remotes/origin/branch-1.0-jdbc
  remotes/origin/branch-1.1
  remotes/origin/branch-1.2
  remotes/origin/branch-1.3
  remotes/origin/branch-1.4
  remotes/origin/branch-1.5
  remotes/origin/branch-1.6
  remotes/origin/branch-2.0
  remotes/origin/branch-2.1
  remotes/origin/branch-2.2
  remotes/origin/branch-2.3
  remotes/origin/branch-2.4
  remotes/origin/branch-3.0
  remotes/origin/master

Check out Spark 3.0.1 (the v3.0.1 tag)

 [root@hadoop001 spark]# git checkout v3.0.1
 Note: checking out 'v3.0.1'.
 
 You are in 'detached HEAD' state. You can look around, make experimental
 changes and commit them, and you can discard any commits you make in this
 state without impacting any branches by performing another checkout.
 
 If you want to create a new branch to retain commits you create, you may
 do so (now or later) by using -b with the checkout command again. Example:
 
  git checkout -b new_branch_name
 
 HEAD is now at 2b147c4... Preparing Spark release v3.0.1-rc3
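
If you plan to keep your modification on a branch, you can create one from the tag, as the git hint above suggests; the branch name here is just an example:

 [root@hadoop001 spark]# git checkout -b spark-3.0.1-custom v3.0.1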

2. Modify the Spark source

Now make a small change to the source before building. The desired result: after the rebuild, running spark-shell prints an extra line, "Hitman who wakes up at five in the morning".

 [root@hadoop001 deploy]# vim /root/spark/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
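
The post does not show the actual diff, so here is a minimal sketch of the kind of edit, assuming the extra println goes into printVersion() in SparkSubmit. (As section 4.2 will reveal, this location only affects the --version output, not the interactive shell banner.)

 // core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (sketch)
 private def printVersion(): Unit = {
   printStream.println("""Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/ '_/
    /___/ .__/\_,_/_/ /_/\_\   version %s
       /_/
                         """.format(SPARK_VERSION))
   // Added line (assumption -- the original edit is not shown):
   printStream.println("Hitman who wakes up at five in the morning")
   // ... rest of the method unchanged
 }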


3. Build Spark from source

3.1 Add the Huawei Cloud mirror to Maven

Add the following to Maven's settings.xml. I use the Huawei Cloud mirror; the Aliyun mirror works just as well, and with unrestricted network access you can skip domestic mirrors altogether.

 [root@hadoop001 ~]# vim /usr/maven/apache-maven-3.6.1/conf/settings.xml
 <!-- Huawei Cloud mirror; "!cloudera" stops the mirror from masking
      the Cloudera repository configured below -->
 <mirrors>
   <mirror>
     <id>huaweicloud</id>
     <mirrorOf>*,!cloudera</mirrorOf>
     <url>https://mirrors.huaweicloud.com/repository/maven/</url>
   </mirror>

   <!-- Alternative: the Aliyun central mirror
   <mirror>
     <id>nexus-aliyun</id>
     <mirrorOf>central</mirrorOf>
     <name>Nexus aliyun</name>
     <url>http://maven.aliyun.com/nexus/content/groups/public</url>
   </mirror>
   -->
 </mirrors>
 ......

 <!-- Configure the Cloudera repository through a profile -->
 <profile>
   <id>cloudera-profile</id>
   <repositories>
     <repository>
       <id>cloudera</id>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
       <releases>
         <enabled>true</enabled>
       </releases>
       <snapshots>
         <enabled>false</enabled>
       </snapshots>
     </repository>
   </repositories>
 </profile>

 </profiles>

 <!-- Activate the profile -->
 <activeProfiles>
   <activeProfile>cloudera-profile</activeProfile>
 </activeProfiles>

3.2 Add the CDH Maven repository to Spark's pom.xml

My Hadoop build is a CDH release, so the Cloudera (CDH) Maven repository also has to be added to the pom.xml at the root of the Spark source tree.

 [root@hadoop001 spark]# vim /root/spark/pom.xml
 
 <repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
 </repository>

Note: before building, confirm that the required software is installed and the environment variables are set:

 [root@hadoop001 spark]# echo $HADOOP_HOME 
 /home/hadoop/app/hadoop-2.6.0-cdh5.7.0
 [root@hadoop001 spark]#
 [root@hadoop001 spark]# echo $JAVA_HOME
 /usr/java/jdk1.8.0_221
 [root@hadoop001 spark]# echo $SCALA_HOME
 /usr/scala/scala-2.12.8
 [root@hadoop001 spark]# echo $MAVEN_HOME
 /usr/maven/apache-maven-3.6.1

3.3 Build command

 [root@hadoop001 spark]# /root/spark/dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Dhadoop.version=2.6.0-cdh5.7.0
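
A full build is memory-hungry. The Building Spark page cited in the references recommends raising Maven's memory limits first, roughly as follows (if I recall correctly, make-distribution.sh also applies a similar default when MAVEN_OPTS is unset):

 # recommended by the Building Spark docs; adjust to your machine
 export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"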

3.4 Output of a successful build

 [INFO] ------------------------------------------------------------------------
 [INFO] Reactor Summary for Spark Project Parent POM 3.0.1:
 [INFO]
 [INFO] Spark Project Parent POM ........................... SUCCESS [ 2.695 s]
 [INFO] Spark Project Tags ................................. SUCCESS [ 5.812 s]
 [INFO] Spark Project Sketch ............................... SUCCESS [ 7.493 s]
 [INFO] Spark Project Local DB ............................. SUCCESS [ 2.248 s]
 [INFO] Spark Project Networking ........................... SUCCESS [ 4.814 s]
 [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 1.749 s]
 [INFO] Spark Project Unsafe ............................... SUCCESS [ 11.155 s]
 [INFO] Spark Project Launcher ............................. SUCCESS [01:24 min]
 [INFO] Spark Project Core ................................. SUCCESS [08:23 min]
 [INFO] Spark Project ML Local Library ..................... SUCCESS [01:20 min]
 [INFO] Spark Project GraphX ............................... SUCCESS [01:00 min]
 [INFO] Spark Project Streaming ............................ SUCCESS [01:42 min]
 [INFO] Spark Project Catalyst ............................. SUCCESS [05:16 min]
 [INFO] Spark Project SQL .................................. SUCCESS [06:20 min]
 [INFO] Spark Project ML Library ........................... SUCCESS [04:25 min]
 [INFO] Spark Project Tools ................................ SUCCESS [ 31.632 s]
 [INFO] Spark Project Hive ................................. SUCCESS [04:32 min]
 [INFO] Spark Project REPL ................................. SUCCESS [ 29.968 s]
 [INFO] Spark Project Assembly ............................. SUCCESS [ 5.102 s]
 [INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 36.694 s]
 [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [01:48 min]
 [INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [01:26 min]
 [INFO] Spark Project Examples ............................. SUCCESS [ 59.937 s]
 [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 4.603 s]
 [INFO] Spark Avro ......................................... SUCCESS [01:01 min]
 [INFO] ------------------------------------------------------------------------
 [INFO] BUILD SUCCESS
 [INFO] ------------------------------------------------------------------------
 [INFO] Total time: 42:07 min
 [INFO] Finished at: 2020-11-28T14:31:20+08:00
 [INFO] ------------------------------------------------------------------------

4. Verify the result

4.1 Extract the built Spark 3.0.1

 [root@hadoop001 spark]# tar -zxvf /root/spark/spark-3.0.1-bin-2.6.0-cdh5.7.0.tgz -C /home/hadoop/app/

4.2 Run spark-shell to verify

 [root@hadoop001 bin]# ./spark-shell
 20/11/28 15:45:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
 Spark context Web UI available at http://hadoop001:4040
 Spark context available as 'sc' (master = local[*], app id = local-1606549507863).
 Spark session available as 'spark'.
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/ '_/
    /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
       /_/
 
 Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
 Type in expressions to have them evaluated.
 Type :help for more information.
 
 scala>

Unfortunately, this is not the result we wanted. Checking the build, the version is indeed Spark 3.0.1, exactly as intended, so where is the mistake?

 [root@hadoop001 deploy]# git branch -vv
 * (detached from v3.0.1) 2b147c4 Preparing Spark release v3.0.1-rc3
  master 13fd272 [origin/master] Spelling r common dev mlib external project streaming resource managers python

Run spark-submit --version and spark-shell --version on the command line and compare; the two checks (output elided here) are:
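
 [root@hadoop001 bin]# ./spark-submit --version
 [root@hadoop001 bin]# ./spark-shell --version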

It turns out that spark-shell --version shows the custom line, but a plain spark-shell does not. The only explanation is that the source was modified in the wrong place, and after some digging that was confirmed: the change was made where the --version banner is printed, while the interactive shell's welcome message lives in the spark repl module. The edit belongs in the file below:

 [root@hadoop001 repl]# vim /root/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
 ......
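
A minimal sketch of the needed change, assuming the extra line is added to printWelcome() in SparkILoop, the method that prints the interactive shell's banner (the post elides its exact edit):

 // repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala (sketch)
 override def printWelcome(): Unit = {
   import org.apache.spark.SPARK_VERSION
   echo("""Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/ '_/
    /___/ .__/\_,_/_/ /_/\_\   version %s
       /_/
          """.format(SPARK_VERSION))
   // Added line (assumption):
   echo("Hitman who wakes up at five in the morning")
   val welcomeMsg = "Using Scala %s (%s, Java %s)".format(
     versionString, javaVmName, javaVersion)
   echo(welcomeMsg)
   echo("Type in expressions to have them evaluated.")
   echo("Type :help for more information.")
 }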

4.3 Rebuild the repl module

The whole of Spark has already been built and only the repl module was changed, so rebuilding that one submodule is enough. How? The Spark documentation explains submodule builds in detail. First open the repl module's pom.xml, which defines <artifactId>spark-repl_2.12</artifactId>:

 [root@hadoop001 repl]# vim /root/spark/repl/pom.xml
 ......
 <parent>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-parent_2.12</artifactId>
   <version>3.0.1</version>
   <relativePath>../pom.xml</relativePath>
 </parent>

 <artifactId>spark-repl_2.12</artifactId>
 <packaging>jar</packaging>
 <name>Spark Project REPL</name>
 <url>http://spark.apache.org/</url>

The Apache Spark documentation (the Building Spark page linked in the references) describes how to build an individual submodule with the mvn -pl option.

Accordingly, the repl submodule is rebuilt as follows:

 [root@hadoop001 spark]# ./build/mvn -pl :spark-repl_2.12 clean install
 ......
 [INFO] --- maven-install-plugin:3.0.0-M1:install (default-install) @ spark-repl_2.12 ---
 [INFO] Installing /root/spark/repl/target/spark-repl_2.12-3.0.1.jar to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1.jar
 [INFO] Installing /root/spark/repl/dependency-reduced-pom.xml to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1.pom
 [INFO] Installing /root/spark/repl/target/spark-repl_2.12-3.0.1-tests.jar to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1-tests.jar
 [INFO] Installing /root/spark/repl/target/spark-repl_2.12-3.0.1-sources.jar to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1-sources.jar
 [INFO] Installing /root/spark/repl/target/spark-repl_2.12-3.0.1-test-sources.jar to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1-test-sources.jar
 [INFO] Installing /root/spark/repl/target/spark-repl_2.12-3.0.1-javadoc.jar to /root/.m2/repository/org/apache/spark/spark-repl_2.12/3.0.1/spark-repl_2.12-3.0.1-javadoc.jar
 [INFO] ------------------------------------------------------------------------
 [INFO] BUILD SUCCESS
 [INFO] ------------------------------------------------------------------------
 [INFO] Total time: 04:31 min
 [INFO] Finished at: 2020-11-28T16:47:29+08:00
 [INFO] ------------------------------------------------------------------------

Replace the original spark-repl jar:

 [root@hadoop001 jars]# mv /home/hadoop/app/spark-3.0.1-bin-2.6.0-cdh5.7.0/jars/spark-repl_2.12-3.0.1.jar spark-repl_2.12-3.0.1.jar_bak
 [root@hadoop001 target]# cp /root/spark/repl/target/spark-repl_2.12-3.0.1.jar /home/hadoop/app/spark-3.0.1-bin-2.6.0-cdh5.7.0/jars/

4.4 Verify again

This time spark-shell prints exactly the result we planned for.
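
Roughly what the new banner looks like (an abbreviated sketch; the exact position of the extra line depends on where the echo call was added):

 [root@hadoop001 bin]# ./spark-shell
 ......
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/ '_/
    /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
       /_/
 
 Hitman who wakes up at five in the morning
 Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
 Type in expressions to have them evaluated.
 Type :help for more information.
 
 scala>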

5. Summary

Take care to modify the source in the right place. Building from source is also a fundamental skill that deserves real fluency. Without unrestricted network access you are likely to hit quite a few errors during the build, and most of them are network-related, which is exactly what the mirror setup above works around.

References

Building Spark (official documentation):

http://spark.apache.org/docs/latest/building-spark.html
