Learning objectives:
1. What is Spark
2. Why learn Spark
   - Limitations of the MapReduce framework
   - The Hadoop ecosystem
   - The need for a flexible framework that can handle batch processing, stream processing, and interactive computation in a single engine
   - Spark's drawbacks: it is memory-hungry and not always stable
3. Spark's key features
Starting pyspark
Run the pyspark script located in the $SPARK_HOME/bin directory.
```python
# In the pyspark shell a SparkSession is already available as `spark`.
sc = spark.sparkContext

# Classic word count: split each line into words, map every word to (word, 1),
# sum the counts per word, and bring the result back to the driver.
words = sc.textFile('file:///home/hadoop/tmp/word.txt') \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()
```
Output:
[('python', 2), ('hadoop', 1), ('bc', 1), ('foo', 4), ('test', 2), ('bar', 2), ('quux', 2), ('abc', 2), ('ab', 1), ('you', 1), ('ac', 1), ('bec', 1), ('by', 1), ('see', 1), ('labs', 2), ('me', 1), ('welcome', 1)]
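As a small extension of the word count above (not part of the original), the pairs can be sorted by count before being returned; a minimal sketch, assuming the same word.txt file and the shell's `sc`, with the top-5 cutoff being an arbitrary choice:

```python
# Word count again, but sorted by frequency (descending) and truncated
# to the top 5 entries -- the cutoff of 5 is purely illustrative.
top_words = sc.textFile('file:///home/hadoop/tmp/word.txt') \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda kv: kv[1], ascending=False) \
    .take(5)

print(top_words)  # e.g. [('foo', 4), ('python', 2), ...]
```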
Learning objectives:
1. How to create a SparkContext
2. How to create RDDs

Step 1: Create a SparkContext
from pyspark import SparkConf, SparkContext

# appName is the name shown in the Spark web UI; master is a cluster URL
# (e.g. a standalone or YARN master) or "local[*]" for local mode.
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
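For reference, here is a minimal self-contained script built around those two lines; the application name "rdd-demo", the master "local[*]", and the small sanity-check job are illustrative choices, not part of the original:

```python
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # "rdd-demo" and "local[*]" are example values; in a real deployment
    # the master is usually supplied via spark-submit instead.
    conf = SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Tiny sanity check: distribute a list and sum it.
    total = sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda a, b: a + b)
    print(total)  # 15

    sc.stop()  # release the SparkContext when done
```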
Creating RDDs
In the pyspark shell, a SparkContext has already been created for you and is exposed as sc:

[hadoop@hadoop000 ~]$ pyspark
Python 3.5.0 (default, Nov 13 2018, 15:43:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
19/03/08 12:19:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.5.0 (default, Nov 13 2018 15:43:53)
SparkSession available as 'spark'.
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
Creating an RDD from a parallelized collection

Call SparkContext's parallelize method and pass it an existing iterable or collection:
>>> data = [1, 2, 3, 4, 5]
>>> distData = sc.parallelize(data)
>>> data
[1, 2, 3, 4, 5]
>>> distData
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:175
When creating an RDD with parallelize, you can also specify the number of partitions:
>>> distData = sc.parallelize(data, 5)
>>> distData.reduce(lambda a, b: a + b)
15
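To inspect how the elements land in the 5 partitions requested above, one option (not shown in the original) is getNumPartitions together with glom, which returns each partition's elements as a separate list; a rough sketch, with the exact distribution possibly varying:

```python
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 5)   # same data and partition count as above
print(rdd.getNumPartitions())   # 5
print(rdd.glom().collect())     # e.g. [[1], [2], [3], [4], [5]]
```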
Creating an RDD from external data
```
>>> rdd1 = sc.textFile('file:///home/hadoop/tmp/word.txt')
>>> rdd1.collect()
['foo foo quux labs foo bar quux abc bar see you by test welcome test', 'abc labs foo me python hadoop ab ac bc bec python']
```
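Like parallelize, textFile also accepts a minimum number of partitions as an optional second argument; the sketch below reuses the same file, with the partition count of 3 and the follow-up calls being illustrative additions:

```python
rdd2 = sc.textFile('file:///home/hadoop/tmp/word.txt', 3)  # ask for at least 3 partitions
print(rdd2.count())              # number of lines in the file (2 here)
print(rdd2.getNumPartitions())   # actual number of partitions used
```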