Using Apache Mahout on an MRS Cluster
This article describes how to install, deploy, and use Mahout (0.13.1) on an MRS cluster, covering both the MapReduce and Spark compute engines.
Download and Build Mahout
From the Mahout releases page on GitHub, download the 0.13.1-rc1 package.
Release page --> https://github.com/apache/mahout/releases/tag/mahout-0.13.1-rc1
Download link --> https://github.com/apache/mahout/archive/mahout-0.13.1-rc1.zip
After downloading, unzip the package on a build machine and compile Mahout.
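A sketch of the fetch-and-unzip steps (assuming wget and unzip are available on the build machine; the GitHub archive extracts to a directory named mahout-mahout-0.13.1-rc1):
wget https://github.com/apache/mahout/archive/mahout-0.13.1-rc1.zip
unzip mahout-0.13.1-rc1.zip
cd mahout-mahout-0.13.1-rc1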
mvn install -DskipTests
Repackage the compiled mahout directory as a zip; assume the new zip is named mahout-mahout-0.13.1-rc1-compiled.zip.
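For example, from the directory containing the compiled tree (assuming the zip utility is installed):
zip -r mahout-mahout-0.13.1-rc1-compiled.zip mahout-mahout-0.13.1-rc1
Upload this zip to the MRS master node; assume it lands in the root user's home directory.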
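For example, with scp, where <master_node_ip> is a placeholder for your master node's address:
scp mahout-mahout-0.13.1-rc1-compiled.zip root@<master_node_ip>:/root/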
With that, the first step is complete.
Deploy
Preparation
Source the MRS environment variables
source /opt/client/bigdata_env
Unzip the package
unzip mahout-mahout-0.13.1-rc1-compiled.zip
Export the MAHOUT_HOME environment variable
export MAHOUT_HOME=/root/mahout-mahout-0.13.1-rc1
Grant execute permissions
cd mahout-mahout-0.13.1-rc1 && dos2unix bin/* && chmod a+x bin/*
Modify the Script
Back up the original first
cp bin/compute-classpath.sh bin/compute-classpath.sh.bak
Delete lines 109 through 139 of bin/compute-classpath.sh; the removed code is shown below.
num_jars=0
for f in "${assembly_folder}"/spark-assembly*hadoop*.jar; do
  if [[ ! -e "$f" ]]; then
    echo "Failed to find Spark assembly in $assembly_folder" 1>&2
    echo "You need to build Spark before running this program." 1>&2
    exit 1
  fi
  ASSEMBLY_JAR="$f"
  num_jars=$((num_jars+1))
done

if [ "$num_jars" -gt "1" ]; then
  echo "Found multiple Spark assembly jars in $assembly_folder:" 1>&2
  ls "${assembly_folder}"/spark-assembly*hadoop*.jar 1>&2
  echo "Please remove all but one jar." 1>&2
  exit 1
fi

# Only able to make this check if 'jar' command is available
if [ $(command -v "$JAR_CMD") ] ; then
  # Verify that versions of java used to build the jars and run Spark are compatible
  jar_error_check=$("$JAR_CMD" -tf "$ASSEMBLY_JAR" nonexistent/class/path 2>&1)
  if [[ "$jar_error_check" =~ "invalid CEN header" ]]; then
    echo "Loading Spark jar with '$JAR_CMD' failed. " 1>&2
    echo "This is likely because Spark was compiled with Java 7 and run " 1>&2
    echo "with Java 6. (see SPARK-1703). Please use Java 7 to run Spark " 1>&2
    echo "or build Spark with Java 6." 1>&2
    exit 1
  fi
fi
In place of the deleted block, add the following two lines:
ASSEMBLY_JAR=`find ${SPARK_HOME}/jars/ |tr '\n' ':'`
appendToClasspath ${SPARK_HOME}/examples/jars/scopt_2.11*.jar
After the change, the surrounding code should look like this, with the newly added content at lines 109 and 110:
102 # Use spark-assembly jar from either RELEASE or assembly directory
103 if [ -f "$FWDIR/RELEASE" ]; then
104   assembly_folder="$FWDIR"/lib
105 else
106   assembly_folder="$ASSEMBLY_DIR"
107 fi
108
109 ASSEMBLY_JAR=`find ${SPARK_HOME}/jars/ |tr '\n' ':'`
110 appendToClasspath ${SPARK_HOME}/examples/jars/scopt_2.11*.jar
111 appendToClasspath "$ASSEMBLY_JAR"
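After editing, a quick sanity check that the modified script still parses (bash -n only parses the script, it does not execute it):
bash -n bin/compute-classpath.sh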
Next, we can try the following commands to verify the setup.
mahout mapreduce
bin/mahout
spark-itemsimilarity
bin/mahout spark-itemsimilarity
spark-shell submitted to YARN
bin/mahout spark-shell --master yarn
The corresponding application can also be seen on the ResourceManager web UI.
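The same application can also be listed from the command line with the standard YARN CLI:
yarn application -list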
Run the Examples
MapReduce example
We use MovieLens, a classic recommender-system dataset, as the example.
# Fetch the data
cd ~
wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
unzip ml-1m.zip

# Simple preprocessing: convert :: delimiters to commas and keep userID,itemID,rating
cat ml-1m/ratings.dat | sed 's/::/,/g' | cut -f1-3 -d, > ratings.csv

# Upload to HDFS
hadoop fs -put ratings.csv /ratings.csv

# Run the recommender job
${MAHOUT_HOME}/bin/mahout recommenditembased --input /ratings.csv --output recommendations --numRecommendations 10 --outputPathForSimilarityMatrix similarity-matrix --similarityClassname SIMILARITY_COSINE
Check the results
hadoop fs -ls recommendations
hadoop fs -cat recommendations/part-r-00000 | head
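Each output line should contain a user ID, a tab, and a bracketed list of itemID:score pairs, schematically (placeholders, not actual results):
<userID>	[<itemID>:<score>,<itemID>:<score>,...]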
Spark example
Run the command
bin/mahout spark-itemsimilarity -i /ratings.csv -o /tmp/mahout-spark-output --master yarn
Check the results
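A minimal check of the job output on HDFS (the exact subdirectory layout may vary with the Mahout version):
hadoop fs -ls -R /tmp/mahout-spark-output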
The corresponding application also shows up on YARN.
References
https://mahout.apache.org/docs/latest/tutorials/intro-cooccurrence-spark/