
Spark SQL – map() vs mapPartitions() | flatMap()
  • Note 1: A DataFrame has no map() transformation of its own, so you need to convert the DataFrame to an RDD first (via df.rdd).
  • Note 2: If you have heavy initialization, use the PySpark mapPartitions() transformation instead of map(): with mapPartitions() the heavy initialization runs only once per partition rather than once per record (a sketch follows right after these notes).
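
To make Note 2 concrete, here is a minimal, self-contained sketch (not part of the original examples; the names and the bonus value are made up for illustration). The expensive setup runs once inside the partition function, and every record in that partition reuses it:

  from pyspark.sql import SparkSession
  spark = SparkSession.builder.master("local[1]") \
      .appName("SparkByExamples.com").getOrCreate()

  rdd = spark.sparkContext.parallelize([("James", 3000), ("Anna", 4100)], 2)

  def add_bonus(rows):
      # Heavy initialization (e.g. opening a database connection) would go here;
      # with mapPartitions() it runs once per partition, not once per record.
      bonus = 200  # pretend this value came from that expensive setup
      for name, salary in rows:
          yield (name, salary + bonus)

  rdd2 = rdd.mapPartitions(add_bonus)
  print(rdd2.collect())  # [('James', 3200), ('Anna', 4300)]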

map() Example 1

First, let's create an RDD from a list.

  from pyspark.sql import SparkSession
  spark = SparkSession.builder.master("local[1]") \
      .appName("SparkByExamples.com").getOrCreate()
  data = ["Project","Gutenberg’s","Alice’s","Adventures",
    "in","Wonderland","Project","Gutenberg’s","Adventures",
    "in","Wonderland","Project","Gutenberg’s"]
  rdd = spark.sparkContext.parallelize(data)

Next, we pair each element with the value 1:

  rdd2 = rdd.map(lambda x: (x, 1))
  for element in rdd2.collect():
      print(element)

This prints each word paired with 1, e.g. ('Project', 1), ('Gutenberg’s', 1), and so on.

map() Example 2

  data = [('James','Smith','M',30),
    ('Anna','Rose','F',41),
    ('Robert','Williams','M',62),
  ]
  columns = ["firstname","lastname","gender","salary"]
  df = spark.createDataFrame(data=data, schema=columns)
  df.show()

  +---------+--------+------+------+
  |firstname|lastname|gender|salary|
  +---------+--------+------+------+
  |    James|   Smith|     M|    30|
  |     Anna|    Rose|     F|    41|
  |   Robert|Williams|     M|    62|
  +---------+--------+------+------+
  # Concatenate x[0] and x[1] with a comma as separator, and double the salary
  rdd2 = df.rdd.map(lambda x:
      (x[0] + "," + x[1], x[2], x[3] * 2)
  )
  df2 = rdd2.toDF(["name","gender","new_salary"])
  df2.show()

  +---------------+------+----------+
  |           name|gender|new_salary|
  +---------------+------+----------+
  |    James,Smith|     M|        60|
  |      Anna,Rose|     F|        82|
  |Robert,Williams|     M|       124|
  +---------------+------+----------+
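
To follow the map() vs mapPartitions() comparison in the title, the same transformation can also be written with mapPartitions(). This is my own sketch, reusing the df from Example 2 (df3 is just a name chosen to avoid clashing with df2); the result is identical, but any per-partition setup would only run once:

  # Same output as df2 above, built with mapPartitions() instead of map()
  def transform_partition(rows):
      for x in rows:
          yield (x[0] + "," + x[1], x[2], x[3] * 2)

  df3 = df.rdd.mapPartitions(transform_partition).toDF(["name","gender","new_salary"])
  df3.show()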

 

flatMap() Example

Again, let's create an RDD from a list.

  1. data = ["Project Gutenberg’s",
  2. "Alice’s Adventures in Wonderland",
  3. "Project Gutenberg’s",
  4. "Adventures in Wonderland",
  5. "Project Gutenberg’s"]
  6. rdd=spark.sparkContext.parallelize(data)
  7. for element in rdd.collect():
  8. print(element)

This yields the following output:

  Project Gutenberg’s
  Alice’s Adventures in Wonderland
  Project Gutenberg’s
  Adventures in Wonderland
  Project Gutenberg’s

 

  rdd2 = rdd.flatMap(lambda x: x.split(" "))
  for element in rdd2.collect():
      print(element)
  Project
  Gutenberg’s
  Alice’s
  Adventures
  in
  Wonderland
  Project
  Gutenberg’s
  Adventures
  in
  Wonderland
  Project
  Gutenberg’s
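
As a small follow-up (my own illustration, not from the original post), flatMap() is often chained with the map() pattern from Example 1 to turn the split words into (word, 1) pairs:

  rdd3 = rdd.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1))
  for element in rdd3.collect():
      print(element)  # ('Project', 1), ('Gutenberg’s', 1), ...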

 
