A Simple SparkNLP Example (MRS-online)
Prerequisites:
- Create an MRS 2.1.0 non-secure cluster and bind an elastic IP (EIP) to each node.
- Sample code
On the node used to submit the job (for example, master1), the code path is /opt/Bigdata/program.
fly.py is as follows:
from pyspark import SparkContext
from pyspark import SparkConf
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
import numpy as np

conf = SparkConf().setAppName("Bird")
sc = SparkContext.getOrCreate(conf)
sparknlp.start()

# Load the pretrained explain_document_ml pipeline and annotate a sample sentence.
explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
print(annotations)

# Simple RDD job to verify that Spark itself works: sum the numbers 1..20.
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
da = sc.parallelize(a)
rep = da.repartition(3)
reduce = rep.reduce(lambda a, b: a + b)
print(reduce)

sc.stop()
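For reference, annotate() returns a plain Python dict that maps each pipeline output (tokens, lemmas, POS tags, and so on) to a list of strings, and the RDD part of the script simply sums 1 through 20. A trivial check of the number to expect in the driver log (my own note, not part of fly.py):

print(sum(range(1, 21)))  # 210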
- Upload the dependency JAR; the path is /root/data/ (choose the version that matches your environment):
https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.0.3.jar
In the file name, "spark23" indicates the Spark version (2.3) and "3.0.3" is the Spark NLP version.
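If you are not sure which assembly matches your cluster, a quick way to print the Spark version is the small sketch below (my own addition; run it through the MRS client's pyspark or spark-submit):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print("Spark version:", sc.version)  # a 2.3.x result means the spark23 assembly jar is the right one
sc.stop()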
Procedure:
- Install Anaconda3, Python 3.7, and the third-party libraries the program depends on
1.1 Install Anaconda3
cd /opt/Bigdata
# Download the Anaconda3 installer with wget
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2019.07-Linux-x86_64.sh
# Install it into the anaconda/ directory with bash:
bash Anaconda3-2019.07-Linux-x86_64.sh -p anaconda/ -u
Follow the prompts: keep answering yes or pressing Enter, including the step that adds Anaconda to the environment variables.
If conda --version reports that the command is not found, add Anaconda to PATH with the following two commands (or adjust the environment variable for your own setup):
echo 'export PATH="/opt/Bigdata/anaconda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
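To confirm the PATH change took effect, a quick check (my own addition) you can run with the python now on PATH, for example as a hypothetical check_path.py:

import sys

# After sourcing ~/.bashrc this should point somewhere under /opt/Bigdata/anaconda/.
print(sys.executable)
print(sys.version)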
1.2 Create the runtime environment, including Python 3.7 and the third-party dependencies
conda create -n sparknlp python=3.7
conda activate sparknlp
pip install spark-nlp==3.0.3
conda install numpy
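As a quick sanity check (my own addition, not part of the original steps), you can verify the installed packages inside the activated env. sparknlp itself is not imported here, because pyspark is only provided later by the MRS client at submit time:

import numpy as np
import pkg_resources  # from setuptools, available in the conda env

print("numpy:", np.__version__)
print("spark-nlp:", pkg_resources.get_distribution("spark-nlp").version)  # expect 3.0.3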
1.3 Package the runtime environment and upload it to HDFS
chmod -R 777 /opt/Bigdata/anaconda
cd /opt/Bigdata/anaconda/envs/
zip -r sparknlp.zip ./sparknlp/
source /opt/client/bigdata_env
hadoop fs -mkdir /sparknlp
hadoop fs -put ./sparknlp.zip /sparknlp
- Run fly.py for testing
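Before submitting, you can optionally confirm that the interpreter path inside the uploaded archive matches the ./sparknlp.zip/sparknlp/bin/python value used in the spark-submit commands below; a minimal local check (my own sketch, assuming the zip was created exactly as in step 1.3):

import zipfile

# List the python interpreter entries inside the archive; they should sit under
# sparknlp/bin/ so that ./sparknlp.zip/sparknlp/bin/python resolves on the YARN containers.
with zipfile.ZipFile("/opt/Bigdata/anaconda/envs/sparknlp.zip") as zf:
    print([n for n in zf.namelist() if n.startswith("sparknlp/bin/python")])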
2.1 Run in cluster mode
/opt/client/Spark/spark/bin/spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./sparknlp.zip/sparknlp/bin/python --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./sparknlp.zip/sparknlp/bin/python --master yarn-cluster --jars /opt/Bigdata/program/spark-nlp-spark23-assembly-3.0.3.jar --archives hdfs:///sparknlp/sparknlp.zip /opt/Bigdata/program/fly.py
or
/opt/client/Spark/spark/bin/spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./sparknlp.zip/sparknlp/bin/python --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./sparknlp.zip/sparknlp/bin/python --master yarn-cluster --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.3 --archives hdfs:///sparknlp/sparknlp.zip /opt/Bigdata/program/fly.py
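Either command ships sparknlp.zip to the YARN containers and points both the driver and the executors at the Python interpreter inside it. To double-check that this wiring works, the following small diagnostic (my own sketch, submitted the same way as fly.py; the file name which_python.py is hypothetical) prints which interpreter each side actually uses; look for the output in the YARN application logs:

from pyspark import SparkConf, SparkContext
import sys

conf = SparkConf().setAppName("WhichPython")
sc = SparkContext.getOrCreate(conf)

def executor_python(_):
    # Runs on the executors; returns the interpreter path they use.
    import sys
    return sys.executable

# In yarn-cluster mode both lines end up in the application (AM) log.
print("driver python:", sys.executable)
print("executor python:", sc.parallelize(range(4), 2).map(executor_python).distinct().collect())
sc.stop()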