Setting Up a PySpark Development Environment on Windows


PySpark Development Environment Setup

Table of Contents

PySpark Development Environment Setup

1. Software preparation

2. Install JDK 1.8

3. Install Anaconda3

  • Installing on Windows

  • Installing on Linux (not needed when configuring only the local Windows environment)

4. Install Spark 2.3.0

a) Unpack spark-2.3.0-bin-hadoop2.6.tgz

b) Configure environment variables

c) Put winutils.exe into the %SPARK_HOME%\bin directory

5. Configuration for connecting to a remote Spark cluster

a) Edit the local Windows hosts file

b) Sync the cluster configuration files to the local Windows machine

c) Test the remote Spark connection

d) Configure Spyder on Windows to connect to pyspark remotely

e) Test pyspark-to-HBase operations on Windows

Software Preparation

A hard-learned lesson: never leave any spaces in a software installation path.

  • Java: 1.8.0

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Note: the JDK is required because the Hadoop platform depends on it.

  • Anaconda3/Python : 3.6

Windows: Anaconda3-5.2.0-Windows-x86_64.exe

https://repo.anaconda.com/archive/Anaconda3-5.2.0-Windows-x86_64.exe

Linux: Anaconda3-5.2.0-Linux-x86_64.sh

https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh

  • Spark : 2.3.0

https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.6.tgz

It also depends on winutils.exe, which stands in for the Linux utilities Hadoop expects:

https://github.com/sdravida/hadoop2.6_Win_x64/blob/master/bin/winutils.exe

Install JDK 1.8

The downloaded version is jdk-8u211-windows-x64.exe; just double-click it to install.

The following environment variables need to be set:

JAVA_HOME  C:\selfsoftware\java1.8

PATH  %JAVA_HOME%\bin

CLASSPATH  .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;%JAVA_HOME%\bin;
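
A quick way to confirm the JDK variables took effect is to check them from Python (a minimal sketch; it only prints JAVA_HOME and shells out to java -version):

# Sketch: verify JAVA_HOME is set and the java launcher is reachable on PATH.
import os
import subprocess

print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
subprocess.run(["java", "-version"], check=True)   # prints something like: java version "1.8.0_211"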

Install Anaconda3

Installing on Windows

Just double-click the installer and choose the option to add the environment variables automatically.

Installing on Linux (not needed when configuring only the local Windows environment)

  1. First install on any single node (here, devnode1). After installation, confirm the installation directory; mine is /data/opt/anaconda3.

  2. Distribute the installation directory from devnode1 to the other nodes:

for a in {2..5};do scp -r /data/opt/anaconda3 n$a:/data/opt/ ;done

  3. Change the system's default Python version. The python command on the Linux nodes currently launches version 2.7.3; make the 3.6 version the default instead.

Run on all nodes:

rm -f /usr/bin/python

ln -s /data/opt/anaconda3/bin/python3 /usr/bin/python3

ln -s /usr/bin/python3 /usr/bin/python

This step resolves the following error:

Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
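
If you would rather not relink the system Python, an alternative (a sketch, assuming the Anaconda path used above) is to point PySpark at the Anaconda interpreter through environment variables, set before the session starts:

# Sketch: pin the executor and driver Python explicitly instead of relinking
# /usr/bin/python. These must be set before the SparkSession (and its JVM) starts.
import os

os.environ["PYSPARK_PYTHON"] = "/data/opt/anaconda3/bin/python3"   # Python used by the executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"                    # Python used where the driver runs

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
print(spark.sparkContext.pythonVer)   # expect 3.6 on both sides once the versions match
spark.stop()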

Install Spark 2.3.0

Unpack spark-2.3.0-bin-hadoop2.6.tgz

Locally it is placed at C:\selfsoftware\spark-2.3.0-bin-hadoop2.6.

Configure environment variables

SPARK_HOME  C:\selfsoftware\spark-2.3.0-bin-hadoop2.6

HADOOP_HOME  C:\selfsoftware\spark-2.3.0-bin-hadoop2.6

PATH  %SPARK_HOME%\bin

PYSPARK_DRIVER_PYTHON  ipython

With PYSPARK_DRIVER_PYTHON set to ipython, the pyspark interactive shell runs under IPython.
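
Before running the tests below, it can save time to confirm from Python that the Spark-related variables are visible; a minimal sketch:

# Sketch: confirm the Spark-related environment variables and the bin directory.
import os

for var in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_DRIVER_PYTHON"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

bin_dir = os.path.join(os.environ["SPARK_HOME"], "bin")
print("bin directory exists:", os.path.isdir(bin_dir))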

Test 1: check that the Spark environment variables take effect and that the local Spark installation works.

Run pyspark from a command prompt. This step may fail with the error below; the next step fixes it.

C:\Users\80681>pyspark

2019-05-15 16:07:34 ERROR Shell:396 - Failed to locate the winutils binary in the hadoop binary path

java.io.IOException: Could not locate executable C:\selfsoftware\spark-2.3.0-bin-hadoop2.6\bin\winutils.exe in the Hadoop binaries.

at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)

at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)

at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)

at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)

at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)

at org.apache.hadoop.security.Groups.<init>(Groups.java:93)

at org.apache.hadoop.security.Groups.<init>(Groups.java:73)

at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)

at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)

at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)

at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)

at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)

at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)

at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)

at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)

at scala.Option.getOrElse(Option.scala:121)

at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2464)

at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:222)

at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:393)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:393)

at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)

at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)

at scala.Option.map(Option.scala:146)

at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:400)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Test 2: check that the local Spark environment satisfies the dependencies.

In the pyspark shell:

l = [('Alice', 1)]
print(l)
s = spark.createDataFrame(l)   # build a DataFrame from a local list of tuples
s.collect()                    # bring the rows back to the driver


Put winutils.exe into the %SPARK_HOME%\bin directory

winutils.exe stands in for the Linux utilities that Hadoop expects on Windows.

Check that the winutils.exe build is compatible with your operating system:

cd %SPARK_HOME%\bin
winutils.exe ls

You may see this error:

C:\selfsoftware\spark-2.3.0-bin-hadoop2.6\bin>winutils.exe ls

This version of C:\selfsoftware\spark-2.3.0-bin-hadoop2.6\bin\winutils.exe is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.

If so, find a 64-bit build instead; the winutils.exe from a hadoop-2.7.1 bin directory works:

http://dx1.pc0359.cn/soft/w/winutilsmaster.rar


Configuration for Connecting to a Remote Spark Cluster

Edit the local Windows hosts file

Map each cluster node's IP address to its hostname in the Windows hosts file.

First look up the node information on the Linux cluster, then add the entries to the Windows hosts file.

My local hosts file is at C:\Windows\System32\drivers\etc\hosts.

Run the same ping test against every node to make sure each one is reachable.
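
A quick way to confirm that all the hosts-file entries resolve and respond is to loop over them from Python; the node names below are placeholders for whatever you put in your hosts file:

# Sketch: verify that each cluster hostname resolves (via the hosts file) and answers a ping.
# The hostnames are placeholders; substitute the entries you actually added.
import socket
import subprocess

nodes = ["devnode1", "devnode2", "devnode3", "devnode4", "devnode5"]

for node in nodes:
    try:
        ip = socket.gethostbyname(node)                  # resolved through the hosts file
    except socket.gaierror:
        print(f"{node}: does not resolve, check the hosts file")
        continue
    ok = subprocess.call(["ping", "-n", "1", node],      # -n 1: single ping on Windows
                         stdout=subprocess.DEVNULL) == 0
    print(f"{node} -> {ip} ({'reachable' if ok else 'no ping reply'})")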

Sync the cluster configuration files to the local Windows machine

To connect to the remote Linux cluster, the following four configuration files need to be copied from the remote servers into the %SPARK_HOME%\conf directory:

core-site.xml  -- HDFS is the underlying framework, so sync both of these
hdfs-site.xml
yarn-site.xml  -- needed for remote operations
hive-site.xml  -- needed if you run Hive operations

Set the environment variable:

YARN_CONF_DIR  %SPARK_HOME%\conf
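
A small sketch to confirm the four files actually landed in the conf directory (file names as listed above):

# Sketch: check that the synced cluster config files are present in %SPARK_HOME%\conf.
import os

conf_dir = os.path.join(os.environ["SPARK_HOME"], "conf")
for name in ("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "hive-site.xml"):
    status = "found" if os.path.exists(os.path.join(conf_dir, name)) else "MISSING"
    print(f"{name}: {status}")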

Test the remote Spark connection

Command: pyspark --master yarn --deploy-mode client --name 'test'

Output like the following means it worked.

You may run into errors here.

One is fixed simply by setting the environment variable:

YARN_CONF_DIR  %SPARK_HOME%\conf

Another is the topology-script exception below:

2019-05-15 16:56:55 WARN ScriptBasedMapping:254 - Exception running /etc/hadoop/conf.cloudera.yarn/topology.py 172.168.200.20

java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "C:\Users\80681"): CreateProcess error=2, The system cannot find the file specified.

at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

at org.apache.hadoop.util.Shell.runCommand(Shell.java:519)

at org.apache.hadoop.util.Shell.run(Shell.java:478)

at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)

at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)

at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)

at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)

at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)

at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)

at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)

at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:328)

at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:317)

at scala.collection.Iterator$class.foreach(Iterator.scala:893)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:317)

at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:247)

at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:136)

at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)

at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)

at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)

at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified.

at java.lang.ProcessImpl.create(Native Method)

at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)

at java.lang.ProcessImpl.start(ProcessImpl.java:137)

at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)

... 25 more

Edit core-site.xml in the %SPARK_HOME%\conf directory

To fix the topology.py error above, comment out the following property in core-site.xml:

<!--property>

<name>net.topology.script.file.name</name>

<value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>

</property-->

spark-client stays in ACCEPTED forever, the job cannot be submitted, and the log reports "Failed to connect to driver at ..."

This problem has many possible causes, but in general it means that replies cannot get back to the client. The case I hit was a combination of causes 1, 4 and 5 below.

  1. The most common cause: the firewall on the client (i.e. the Windows machine) is not turned off.
  2. The driver host is not set explicitly; pass --conf spark.driver.host=$your_ip_address.
  3. Memory is too tight and YARN kills the task (this is extremely rare); relax the memory checks:

<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>

  4. The local Windows firewall is off, but in stricter corporate production environments there is tight gateway control: Windows can connect and send to the cluster, while the cluster is not allowed to send or reply back to Windows. (For this you have to get the network administrators involved.)
  5. Some development clusters are built from virtual machines created with only one core; when a remote connection comes in, that single core is occupied and no core is left to send the reply back. Specify the following parameter (see the sketch after this list):

--conf spark.rpc.netty.dispatcher.numThreads=2

To confirm the number of cores on the Linux system, run top and then press 1.

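
Putting the client-side settings together, here is a minimal sketch of a YARN client session that sets the driver host and RPC dispatcher thread count explicitly; the IP address is a placeholder for your own Windows address:

# Sketch: YARN client session with the driver host and dispatcher thread count
# set explicitly. Replace the IP address with your local Windows IP.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn-client")
         .appName("connectivity-check")
         .config("spark.driver.host", "192.168.0.10")            # placeholder: your Windows IP
         .config("spark.rpc.netty.dispatcher.numThreads", "2")   # workaround for single-core VMs
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()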

Configure Spyder on Windows to connect to pyspark remotely

Unzip py4j-0.10.6-src.zip and pyspark.zip from the %SPARK_HOME%\python\lib directory, then put the extracted packages into the Anaconda directory C:\Anaconda3\Lib\site-packages.
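
Alternatively, if you prefer not to copy anything into site-packages, a sketch that puts the Spark Python sources on sys.path at the top of the script (assuming the install path used above):

# Sketch: make pyspark importable in Spyder without unpacking into site-packages.
# SPARK_HOME and the py4j zip name are assumed to match the installation above.
import os
import sys

spark_home = os.environ.get("SPARK_HOME", r"C:\selfsoftware\spark-2.3.0-bin-hadoop2.6")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.6-src.zip"))

from pyspark.sql import SparkSession   # should now import cleanly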


Test the following script in Spyder:

#####################################e.g 1

from pyspark.sql import SparkSession
import time

print("Starting the Spark session ...")

ss = (SparkSession.builder
      .master("yarn-client")
      .appName('test spyder')
      .config("spark.some.config.option", "some-value")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("spark.executor.instances", "3")
      .enableHiveSupport()
      .getOrCreate())

print("Spark session started.")

print("Starting the parallelize test ...")

sc = ss.sparkContext
data = sc.parallelize(range(800), 7)   # 800 elements spread over 7 partitions
print(data.count())                    # should print 800

print("parallelize test finished.")

ss.stop()

Test output like the following means it worked.

Test pyspark-to-HBase operations on Windows

Connecting to HBase requires the cluster's configuration file and jars:

1. Sync the hbase-site.xml configuration file from the cluster to the local %SPARK_HOME%\conf directory.

2. Sync the cluster's HBase-related jars to the local %SPARK_HOME%\jars directory.

Copy down all the jars from the hbase lib directory under the cluster's CDH installation; on my cluster the directories are:

/data/opt/cloudera-manager/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hbase/lib/

/data/opt/cloudera-manager/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hbase

If you are not sure where the installation directory is, locate it with:

find /data/ -name "hbase*.jar"


Also sync down metrics-core-2.2.0.jar.

Put the compiled SHC jar shc-core-spark2.3.0-hbase1.2.0.jar into %SPARK_HOME%\jars as well, and upload it to the Spark2 installation directory on the cluster so that you do not have to specify the jar every time you launch a Spark application:

/data/opt/cloudera-manager/cloudera/parcels/SPARK2/lib/spark2/jars/

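
As an alternative to copying jars into both Spark installations, you can also point the session at them explicitly via spark.jars. This sketch only shows the mechanism; in practice the full set of HBase lib jars from step 2 still has to be on the list, and the paths below are placeholders:

# Sketch: supply the SHC/HBase jars at session creation time instead of copying
# them into %SPARK_HOME%\jars. The local jar paths are placeholders.
from pyspark.sql import SparkSession

jars = ",".join([
    r"C:\selfsoftware\jars\shc-core-spark2.3.0-hbase1.2.0.jar",
    r"C:\selfsoftware\jars\metrics-core-2.2.0.jar",
])

spark = (SparkSession.builder
         .master("yarn-client")
         .appName("hbase-jars-via-config")
         .config("spark.jars", jars)   # ships the listed jars to the driver and executors
         .getOrCreate())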

3. Test whether Spyder can connect to the cluster.

Test code:

#####################################e.g 2

from pyspark.sql import SparkSession
import time
from pyspark import SQLContext

print("Starting the Spark session ...")

spark = (SparkSession.builder
         .master("yarn-client")
         .appName('test spyder')
         .config("spark.some.config.option", "some-value")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .config("spark.executor.instances", "3")
         .enableHiveSupport()
         .getOrCreate())

print("Spark session started.")

dep = "org.apache.spark.sql.execution.datasources.hbase"

# Catalog describing the HBase table schema for the SHC connector
catalog = """{
"table":{"namespace":"default", "name":"student"},
"rowkey":"key",
"columns":{
"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
"age":{"cf":"info", "col":"age", "type":"string"},
"name":{"cf":"info", "col":"name", "type":"string"}}
}
"""

sql_sc = SQLContext(spark)

# Read data from the HBase table
df = sql_sc.read.options(catalog=catalog).format(dep).load()

# Register the data as a temporary view and show it
df.createOrReplaceTempView("test1")
spark.sql("select * from test1").show()

spark.stop()

Output like the following means the configuration works.
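
For completeness, writing back through SHC uses the same catalog mechanism. This is only a hedged sketch: it assumes SHC's documented write options (catalog plus "newtable", the number of regions to create if the table does not exist), and it reuses spark, dep and catalog from the script above, so it would have to run before the spark.stop() call; verify the options against the SHC build you compiled.

# Hedged sketch: write rows back to the HBase table through SHC.
# Column names must match the keys declared in the catalog ("rowkey", "age", "name").
new_rows = spark.createDataFrame(
    [("1001", "21", "Tom"), ("1002", "22", "Jerry")],
    ["rowkey", "age", "name"])

(new_rows.write
 .options(catalog=catalog, newtable="5")   # "newtable" per SHC's documented write API
 .format(dep)
 .save())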