Table of Contents
Setting up a PySpark development environment
1. Software preparation
2. Install JDK 1.8
3. Install Anaconda3
Installing on Windows
Installing on Linux (not needed if you only configure the local Windows environment)
4. Install Spark 2.3.0
a) Extract spark-2.3.0-bin-hadoop2.6.tgz
b) Configure environment variables
c) Put winutils.exe into the %SPARK_HOME%\bin directory
5. Configuration files for connecting to the remote Spark cluster
a) Configure the local Windows hosts file
b) Sync the cluster configuration files to the local Windows machine
c) Test the remote Spark connection
d) Configure Spyder on Windows to connect to pyspark remotely
e) Test pyspark access to HBase from Windows
A lesson learned the hard way: never leave any spaces in a software installation path.
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Note: the JDK is required because the Hadoop platform depends on it.
Windows:Anaconda3-5.2.0-Windows-x86_64.exe
https://repo.anaconda.com/archive/Anaconda3-5.2.0-Windows-x86_64.exe
Linux: Anaconda3-5.2.0-Linux-x86_64.sh
https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.6.tgz
Also required: winutils.exe, which stands in for the Linux utilities that Hadoop expects on Windows.
https://github.com/sdravida/hadoop2.6_Win_x64/blob/master/bin/winutils.exe
The downloaded installer is jdk-8u211-windows-x64.exe; just double-click it to install.
Then set the following environment variables:
JAVA_HOME C:\self\software\java1.8
PATH %JAVA_HOME%\bin
CLASSPATH .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;%JAVA_HOME%\bin;
For the Windows installer, just double-click it and choose the option that adds the environment variables automatically.
For the Linux installer, confirm the installation directory after installing;
mine is /data/opt/anaconda3
for a in {2..5};do scp -r /data/opt/anaconda3 n$a:/data/opt/ ;done   # copy the install to nodes n2-n5
Because the python command on the Linux nodes defaults to version 2.7.3, make the Anaconda Python 3.6 the default.
Run on every node:
rm -f /usr/bin/python
ln -s /data/opt/anaconda3/bin/python3 /usr/bin/python3
ln -s /usr/bin/python3 /usr/bin/python
This step resolves the following error:
Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
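The driver and the executors must run the same Python minor version. When launching from code (for example from Spyder), the worker interpreter can be pinned before the session is created; a minimal sketch, assuming the /data/opt/anaconda3 path used above (the app name is just an example):

import os
from pyspark.sql import SparkSession

# Point the executors at the cluster-side Python 3 (the Anaconda install synced above).
# This must be set before the SparkSession / SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/data/opt/anaconda3/bin/python3"

spark = SparkSession.builder.master("yarn").appName("python-version-check").getOrCreate()
# Run a one-element job so an executor reports its own Python version.
worker_ver = spark.sparkContext.parallelize([0], 1).map(lambda _: __import__("sys").version).first()
print("worker python:", worker_ver)
spark.stop()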
Extract spark-2.3.0-bin-hadoop2.6.tgz locally to C:\self\software\spark-2.3.0-bin-hadoop2.6, then set the following environment variables:
SPARK_HOME C:\self\software\spark-2.3.0-bin-hadoop2.6
HADOOP_HOME C:\self\software\spark-2.3.0-bin-hadoop2.6
PATH %SPARK_HOME%\bin
PYSPARK_DRIVER_PYTHON ipython
With PYSPARK_DRIVER_PYTHON set to ipython, the pyspark interactive shell starts in IPython instead of the plain Python REPL.
Test 1: verify that the Spark environment variables take effect and that the local Spark installation works.
This step may fail with the error below; the fix is described in the next step.
C:\Users\80681>pyspark
2019-05-15 16:07:34 ERROR Shell:396 - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable C:\self\software\spark-2.3.0-bin-hadoop2.6\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2464)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:222)
at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:393)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:393)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:400)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Test 2: verify that the local Spark environment satisfies its dependencies.
At the pyspark prompt, run:
l = [('Alice', 1)]
print(l)
s = spark.createDataFrame(l)
s.collect()
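If the dependencies are satisfied, s.collect() returns something like [Row(_1='Alice', _2=1)]; the column names default to _1 and _2 because none were specified.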
winutils.exe stands in for the Linux utilities that Hadoop expects on Windows.
Test whether the winutils.exe build is compatible with your operating system:
cd %SPARK_HOME%\bin
winutils.exe ls
You may get this error:
C:\self\software\spark-2.3.0-bin-hadoop2.6\bin>winutils.exe ls
This version of C:\self\software\spark-2.3.0-bin-hadoop2.6\bin\winutils.exe is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.
If so, download a 64-bit build instead; the winutils.exe from a hadoop 2.7.1 bin directory works:
http://dx1.pc0359.cn/soft/w/winutilsmaster.rar
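Once a compatible binary is in place, a quick sanity check from any local Python prompt that HADOOP_HOME points at a directory that actually contains winutils.exe (a sketch; nothing in it is specific to a particular install path):

import os
import subprocess

hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
print("HADOOP_HOME:", hadoop_home)
print("winutils.exe found:", os.path.isfile(winutils))

# If the binary matches the Windows architecture, this lists the current
# directory instead of failing with an incompatibility error.
if os.path.isfile(winutils):
    subprocess.run([winutils, "ls", "."])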
Map each cluster node's IP address to its hostname in the Windows hosts file.
Look up the IP addresses and hostnames of the cluster nodes on the Linux side, then add the corresponding entries to the Windows hosts file.
My local hosts file is at C:\Windows\System32\Drivers\etc\hosts
Ping each node by hostname to make sure every node is reachable.
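A quick way to confirm that the new hosts entries resolve, sketched with hypothetical aliases n1 to n5 (replace them with the hostnames from your own hosts file):

import socket

nodes = ["n1", "n2", "n3", "n4", "n5"]   # hypothetical node aliases
for node in nodes:
    try:
        print(node, "->", socket.gethostbyname(node))
    except socket.gaierror:
        print(node, "-> not resolvable, check the hosts file")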
To connect to the remote Linux cluster, sync the following four configuration files from the remote server into the %SPARK_HOME%\conf directory:
core-site.xml -- HDFS is the underlying framework, so both HDFS-related files must be synced
hdfs-site.xml
yarn-site.xml -- needed for remote (YARN) operations
hive-site.xml -- needed if Hive is used
Set the environment variable:
YARN_CONF_DIR %SPARK_HOME%\conf
Then run: pyspark --master yarn --deploy-mode client --name 'test'
If the pyspark shell starts and the Spark context comes up with master = yarn, the connection works.
One possible error at this point is caused by a missing YARN configuration directory;
setting the following environment variable fixes it:
YARN_CONF_DIR %SPARK_HOME%\conf
Another error that may show up is the topology-script warning below:
2019-05-15 16:56:55 WARN ScriptBasedMapping:254 - Exception running /etc/hadoop/conf.cloudera.yarn/topology.py 172.168.200.20
java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "C:\Users\80681"): CreateProcess error=2, The system cannot find the file specified.
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:519)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:328)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:317)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:317)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:247)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:136)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified.
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 25 more
To fix it, edit the core-site.xml file in the %SPARK_HOME%\conf directory
and comment out the following property:
<!--property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property-->
Another problem: the spark client application stays in ACCEPTED and never runs, failing with "Failed to connect to driver at ...".
Many different situations can lead to this, but in general it means the cluster cannot send messages back to the client (the driver).
In my case it was a combination of causes 1, 4 and 5. Part of the fix is to disable YARN's physical and virtual memory checks in yarn-site.xml on the cluster:
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
Another part of the fix is to limit the RPC dispatcher threads when submitting: --conf spark.rpc.netty.dispatcher.numThreads=2
To check the number of cores on a Linux node, run top and then press 1.
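The same setting can also be passed when building the session from code; a sketch (the app name is just an example):

from pyspark.sql import SparkSession

# Equivalent to passing --conf spark.rpc.netty.dispatcher.numThreads=2 on the command line.
spark = SparkSession.builder \
    .master("yarn") \
    .appName("dispatcher-threads-test") \
    .config("spark.rpc.netty.dispatcher.numThreads", "2") \
    .getOrCreate()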
Extract py4j-0.10.6-src.zip and pyspark.zip from the %SPARK_HOME%\python\lib directory,
then copy the extracted packages into the Anaconda directory C:\Anaconda3\Lib\site-packages.
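A quick check, run from Spyder's IPython console, that the copied packages are importable and the versions match the Spark 2.3.0 install:

import py4j.version
import pyspark

print("pyspark:", pyspark.__version__)      # expected: 2.3.0
print("py4j:", py4j.version.__version__)    # expected: 0.10.6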
Run the following test script in Spyder:
#####################################e.g 1
from pyspark.sql import SparkSession
import time

print("Starting the Spark session ..................")
ss = SparkSession.builder \
    .master("yarn-client") \
    .appName("test spyder") \
    .config("spark.some.config.option", "some-value") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.executor.instances", "3") \
    .enableHiveSupport() \
    .getOrCreate()
print("Session started ..................")

print("Starting parallelize ..................")
sc = ss.sparkContext
data = sc.parallelize(range(800), 7)
print(data.count())
print("parallelize finished ..................")
ss.stop()
If the script runs through and prints 800 for the count, the connection from Spyder to the cluster works.
Connecting to HBase requires the cluster's configuration file and jar packages:
1. Sync the cluster's hbase-site.xml into the local %SPARK_HOME%\conf directory.
2. Sync the HBase-related jars from the cluster into %SPARK_HOME%\jars.
Sync down all the jars from the HBase lib directory under the CDH installation on the cluster; on my cluster the directories are:
/data/opt/cloudera-manager/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hbase/lib/
/data/opt/cloudera-manager/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hbase
If you are unsure where the installation directory is, locate it with:
find /data/ -name "hbase*.jar"
In addition, sync down metrics-core-2.2.0.jar.
Also put the compiled SHC jar, shc-core-spark2.3.0-hbase1.2.0.jar, into the %SPARK_HOME%\jars directory,
and upload it to the spark2 installation directory on the cluster, so the jar does not have to be specified when submitting Spark applications:
/data/opt/cloudera-manager/cloudera/parcels/SPARK2/lib/spark2/jars/
3. Test whether the HBase connection from Spyder works.
Test code:
#####################################e.g 2
from pyspark.sql import SparkSession
from pyspark import SQLContext
import time

print("Starting the Spark session ..................")
spark = SparkSession.builder \
    .master("yarn-client") \
    .appName("test spyder") \
    .config("spark.some.config.option", "some-value") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.executor.instances", "3") \
    .enableHiveSupport() \
    .getOrCreate()
print("Session started ..................")

dep = "org.apache.spark.sql.execution.datasources.hbase"

# Catalog describing the HBase table structure
catalog = """{
    "table":{"namespace":"default", "name":"student"},
    "rowkey":"key",
    "columns":{
        "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
        "age":{"cf":"info", "col":"age", "type":"string"},
        "name":{"cf":"info", "col":"name", "type":"string"}}
}
"""

# SQLContext expects the SparkContext, not the SparkSession
sql_sc = SQLContext(spark.sparkContext)
# Read data from the HBase table via the SHC connector
df = sql_sc.read.options(catalog=catalog).format(dep).load()
# Register the data as a temporary view and display it
df.createOrReplaceTempView("test1")
spark.sql("select * from test1").show()
spark.stop()
If the rows of the student table are displayed, the configuration is successful.