First, let's create an RDD from a list.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

data = ["Project", "Gutenberg’s", "Alice’s", "Adventures",
        "in", "Wonderland", "Project", "Gutenberg’s", "Adventures",
        "in", "Wonderland", "Project", "Gutenberg’s"]

rdd = spark.sparkContext.parallelize(data)
```
Next, we use map() to pair each element with the value 1:
```python
rdd2 = rdd.map(lambda x: (x, 1))
for element in rdd2.collect():
    print(element)
```
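With each word paired with a 1, the natural next step is to sum the counts per word. A minimal sketch of that follow-up (not in the original, but using the rdd2 built above and the standard reduceByKey() pair-RDD API):

```python
# Sum the 1s for each distinct word to get a word count.
wordCounts = rdd2.reduceByKey(lambda a, b: a + b)
for word, count in wordCounts.collect():
    print(word, count)
```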
map() can also be applied to a DataFrame by going through its underlying RDD. First, create a sample DataFrame:

```python
data = [('James', 'Smith', 'M', 30),
        ('Anna', 'Rose', 'F', 41),
        ('Robert', 'Williams', 'M', 62)]

columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
```

This produces:

```
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|    30|
|     Anna|    Rose|     F|    41|
|   Robert|Williams|     M|    62|
+---------+--------+------+------+
```

```python
# Concatenate x[0] and x[1] with a comma separator, and double the salary
rdd2 = df.rdd.map(lambda x:
    (x[0] + "," + x[1], x[2], x[3] * 2)
)
df2 = rdd2.toDF(["name", "gender", "new_salary"])
df2.show()
```

This produces:

```
+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|        60|
|      Anna,Rose|     F|        82|
|Robert,Williams|     M|       124|
+---------------+------+----------+
```
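Positional indexes like x[0] break silently if the column order changes. As a sketch of an alternative (names rdd3/df3 are illustrative), each record arrives as a pyspark.sql.Row, so its fields can also be read by attribute:

```python
# Access Row fields by name instead of by position.
rdd3 = df.rdd.map(lambda x:
    (x.firstname + "," + x.lastname, x.gender, x.salary * 2)
)
df3 = rdd3.toDF(["name", "gender", "new_salary"])
df3.show()
```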
flatMap() applies a function to every element and flattens the results into a single RDD. First, let's create an RDD from a list of sentences.
- data = ["Project Gutenberg’s",
- "Alice’s Adventures in Wonderland",
- "Project Gutenberg’s",
- "Adventures in Wonderland",
- "Project Gutenberg’s"]
- rdd=spark.sparkContext.parallelize(data)
- for element in rdd.collect():
- print(element)
Now apply flatMap() to split each line into words; collecting the result prints every word on its own line:
```python
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)
```

This yields the following output:

```
Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
Project
Gutenberg’s
Adventures
in
Wonderland
Project
Gutenberg’s
```
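Note that the DataFrame API has no flatMap() of its own; the usual equivalent is to split a string column into an array and then explode() it, one row per element. A minimal sketch under that approach (df_lines and df_words are illustrative names, built here from the same lines as above):

```python
from pyspark.sql.functions import explode, split

# Single-column DataFrame holding one sentence per row.
df_lines = spark.createDataFrame([(line,) for line in data], ["line"])

# split() turns each line into an array of words;
# explode() emits one row per array element, mimicking flatMap().
df_words = df_lines.select(explode(split(df_lines.line, " ")).alias("word"))
df_words.show()
```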