
Data Analysis with Python and PySpark: 4. Analyzing tabular data with pyspark.sql

Jonathan Rioux, Data Analysis with Python and PySpark; chapter notes and end-of-chapter exercises

Creating a SparkSession object to get started with PySpark

  from pyspark.sql import SparkSession
  import pyspark.sql.functions as F

  spark = SparkSession.builder.getOrCreate()

How does PySpark represent tabular data?

  my_grocery_list = [
      ["Banana", 2, 1.74],
      ["Apple", 4, 2.04],
      ["Carrot", 1, 1.09],
      ["Cake", 1, 10.99],
  ]

  df_grocery_list = spark.createDataFrame(
      my_grocery_list, ["Item", "Quantity", "Price"]
  )

  df_grocery_list.printSchema()

The first argument is the data itself: you can provide a list of items (here, a list of lists), a data frame, or a resilient distributed dataset (RDD). The second argument is the schema of the data frame; passing a list of column names makes PySpark infer the types of the columns (string, long, and double, respectively).
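For reference, with this data the printSchema() call above should report the inferred types roughly as follows (output format from Spark 3.x; nullability flags may vary):

  root
   |-- Item: string (nullable = true)
   |-- Quantity: long (nullable = true)
   |-- Price: double (nullable = true)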

The master node knows the structure of the data frame, but the actual data is represented on the worker nodes. Each column maps to data stored somewhere on the cluster managed by PySpark. We operate on the abstract structure and let the master node delegate the work efficiently.


Using PySpark to analyze and process tabular data

Term to know: exploratory data analysis (or EDA)

PySpark does not provide any charting capability, and it does not plug into charting libraries such as Matplotlib, seaborn, Altair, or plot.ly either. The usual solution is to transform the data with PySpark, convert the PySpark data frame into a pandas data frame with the toPandas() method, and then use your favorite charting library.
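As a minimal sketch of that workflow (assuming the logs data frame introduced below; the aggregation on LogServiceID and the choice of Matplotlib are illustrative, not from the book):

  import matplotlib.pyplot as plt

  # Do the heavy lifting in PySpark so only a small result is collected.
  top_channels = (
      logs.groupBy("LogServiceID")
      .count()
      .orderBy("count", ascending=False)
      .limit(10)
  )

  # toPandas() brings the (small) aggregated result back to the driver
  # as a pandas data frame, ready for any plotting library.
  pdf = top_channels.toPandas()

  pdf.plot(kind="bar", x="LogServiceID", y="count")
  plt.show()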

Data preparation:

Download the file on the Canada Open Data portal (http://mng.bz/y4YJ);
select the BroadcastLogs_2018_Q3_M8 file.

You also need to download the Data Dictionary in .doc form, as well as the Reference Tables zip file, unzipping them into a ReferenceTables directory in data/broadcast_logs.

Make sure you have all of the above in place before moving on.

Reading and assessing delimited data in PySpark

Specializing the SparkReader to handle CSV files
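The reader call itself did not make it into this post; here is a minimal sketch of how the logs data frame is usually built, assuming the file path below and the pipe delimiter used by the BroadcastLogs export:

  logs = spark.read.csv(
      "data/broadcast_logs/BroadcastLogs_2018_Q3_M8.CSV",  # assumed local path
      sep="|",                       # the file is pipe-delimited
      header=True,                   # first row contains the column names
      inferSchema=True,              # let Spark infer column types (extra pass over the data)
      timestampFormat="yyyy-MM-dd",
  )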

The basics of data manipulation: selecting, dropping, renaming, ordering, and diagnosing columns

  logs.select("BroadcastLogID", "LogServiceID", "LogDate").show(5, False)
  # +--------------+------------+-------------------+
  # |BroadcastLogID|LogServiceID|LogDate            |
  # +--------------+------------+-------------------+
  # |1196192316    |3157        |2018-08-01 00:00:00|
  # |1196192317    |3157        |2018-08-01 00:00:00|
  # |1196192318    |3157        |2018-08-01 00:00:00|
  # |1196192319    |3157        |2018-08-01 00:00:00|
  # |1196192320    |3157        |2018-08-01 00:00:00|
  # +--------------+------------+-------------------+
  # only showing top 5 rows

Four ways to select columns in PySpark, all equivalent in terms of results

  # Using the string to column conversion
  logs.select("BroadCastLogID", "LogServiceID", "LogDate")
  logs.select(*["BroadCastLogID", "LogServiceID", "LogDate"])

  # Passing the column object explicitly
  logs.select(
      F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")
  )
  logs.select(
      *[F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")]
  )

When selecting a handful of columns explicitly, you don't have to wrap them in a list. If you're already working with a list of columns, you can unpack it with the * prefix.

A data frame keeps track of its columns in the columns attribute; logs.columns is a Python list containing all the column names of the logs data frame.
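One common use of logs.columns, sketched here under the assumption that numpy is installed, is to peek at a wide table a few columns at a time (the chunk size of three is arbitrary):

  import numpy as np

  # Split the list of column names into groups of roughly three columns,
  # then display each group separately.
  column_split = np.array_split(
      np.array(logs.columns), len(logs.columns) // 3
  )

  for columns in column_split:
      logs.select(*columns).show(5, False)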

Getting rid of columns with the drop() method

logs = logs.drop("BroadcastLogID", "SequenceNO")

Getting rid of columns, select style

  logs = logs.select(
      *[x for x in logs.columns if x not in ["BroadcastLogID", "SequenceNO"]]
  )

Creating what isn't there: new columns with withColumn()

Extracting the hours, minutes, and seconds from the Duration column
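The code listing for this caption did not survive the copy. A sketch of the usual approach, assuming the Duration column is a string that starts with HH:MM:SS (the dur_* and duration_seconds column names are illustrative):

  logs.select(
      F.col("Duration"),
      F.col("Duration").substr(1, 2).cast("int").alias("dur_hours"),
      F.col("Duration").substr(4, 2).cast("int").alias("dur_minutes"),
      F.col("Duration").substr(7, 2).cast("int").alias("dur_seconds"),
  ).distinct().show(5)

  # The same extraction, folded into a single duration_seconds column
  # added to the data frame with withColumn().
  logs = logs.withColumn(
      "duration_seconds",
      F.col("Duration").substr(1, 2).cast("int") * 60 * 60
      + F.col("Duration").substr(4, 2).cast("int") * 60
      + F.col("Duration").substr(7, 2).cast("int"),
  )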

 

WARNING If you create a column with withColumn() and give it a name that already exists in your data frame, PySpark will happily overwrite the column.

WARNING Creating many (100+) new columns using withColumn() will slow Spark down to a grind. If you need to create a lot of columns at once, use the select() approach. While it will generate the same work, it is less taxing on the query planner.
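To make the warning concrete, here is a sketch contrasting the two styles; the flag_* columns and the count of 100 are made up for illustration:

  # Chained withColumn() calls: each call adds another projection to the
  # query plan, which the planner must analyze one by one.
  df_slow = logs
  for i in range(100):
      df_slow = df_slow.withColumn(f"flag_{i}", F.lit(i))

  # A single select() adds the same columns in one projection.
  df_fast = logs.select(
      "*", *[F.lit(i).alias(f"flag_{i}") for i in range(100)]
  )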

 
