
Synapse Spark


rm -r dp203 -f

git clone https://github.com/MicrosoftLearning/Dp-203-azure-data-engineer dp203

cd dp203/Allfiles/labs/05

./setup.ps1

  1. Select any of the files in the orders folder, and then in the New notebook list on the toolbar, select Load to DataFrame. A dataframe is a structure in Spark that represents a tabular dataset.

    %%pyspark
    df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/sales/orders/2019.csv', format='csv'
    ## If header exists uncomment line below
    ##, header=True
    )
    display(df.limit(10))

    %%pyspark
    df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/sales/orders/*.csv', format='csv'
    )
    display(df.limit(100))

    %%pyspark
    from pyspark.sql.types import *
    from pyspark.sql.functions import *

    orderSchema = StructType([
        StructField("SalesOrderNumber", StringType()),
        StructField("SalesOrderLineNumber", IntegerType()),
        StructField("OrderDate", DateType()),
        StructField("CustomerName", StringType()),
        StructField("Email", StringType()),
        StructField("Item", StringType()),
        StructField("Quantity", IntegerType()),
        StructField("UnitPrice", FloatType()),
        StructField("Tax", FloatType())
    ])

    df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/sales/orders/*.csv', format='csv', schema=orderSchema)
    display(df.limit(100))

  1. df.printSchema()

    customers = df['CustomerName', 'Email']
    print(customers.count())
    print(customers.distinct().count())
    display(customers.distinct())

  2. Run the new code cell, and review the results. Observe the following details:

    • When you perform an operation on a dataframe, the result is a new dataframe (in this case, a new customers dataframe is created by selecting a specific subset of columns from the df dataframe).
    • Dataframes provide functions such as count and distinct that can be used to summarize and filter the data they contain.
    • The dataframe['Field1', 'Field2', ...] syntax is a shorthand way of defining a subset of columns. You can also use the select method, so the first line of the code above could be written as customers = df.select("CustomerName", "Email")
  customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
  print(customers.count())
  print(customers.distinct().count())
  display(customers.distinct())

  productSales = df.select("Item", "Quantity").groupBy("Item").sum()
  display(productSales)

  yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
  display(yearlySales)

  df.createOrReplaceTempView("salesorders")
  spark_df = spark.sql("SELECT * FROM salesorders")
  display(spark_df)

  %%sql
  SELECT YEAR(OrderDate) AS OrderYear,
         SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
  FROM salesorders
  GROUP BY YEAR(OrderDate)
  ORDER BY OrderYear;

  %%sql
  SELECT * FROM salesorders

  sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
              SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
              FROM salesorders \
              GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
              ORDER BY OrderYear"
  df_spark = spark.sql(sqlQuery)
  df_spark.show()

  from matplotlib import pyplot as plt
  # matplotlib requires a Pandas dataframe, not a Spark one
  df_sales = df_spark.toPandas()
  # Create a bar plot of revenue by year
  plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])
  # Display the plot
  plt.show()
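  In a notebook, plt.show() renders the chart inline; to keep a copy of it, the figure can also be written to a file with savefig. A minimal standalone sketch with made-up sample values (the Agg backend and the output filename are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
from matplotlib import pyplot as plt

# Invented sample values standing in for df_sales
years = [2019, 2020, 2021]
revenue = [2.1e6, 2.6e6, 3.0e6]

plt.bar(x=years, height=revenue, color='orange')
plt.title('Revenue by Year')

# Write the chart to a file as well as (or instead of) showing it inline
plt.savefig('revenue_by_year.png', dpi=150, bbox_inches='tight')
```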

  # Clear the plot area
  plt.clf()
  # Create a bar plot of revenue by year
  plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
  # Customize the chart
  plt.title('Revenue by Year')
  plt.xlabel('Year')
  plt.ylabel('Revenue')
  plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
  plt.xticks(rotation=45)
  # Show the figure
  plt.show()

  # Clear the plot area
  plt.clf()
  # Create a Figure
  fig = plt.figure(figsize=(8,3))
  # Create a bar plot of revenue by year
  plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
  # Customize the chart
  plt.title('Revenue by Year')
  plt.xlabel('Year')
  plt.ylabel('Revenue')
  plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
  plt.xticks(rotation=45)
  # Show the figure
  plt.show()

  # Clear the plot area
  plt.clf()
  # Create a figure for 2 subplots (1 row, 2 columns)
  fig, ax = plt.subplots(1, 2, figsize = (10,4))
  # Create a bar plot of revenue by year on the first axis
  ax[0].bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
  ax[0].set_title('Revenue by Year')
  # Create a pie chart of yearly order counts on the second axis
  yearly_counts = df_sales['OrderYear'].value_counts()
  ax[1].pie(yearly_counts)
  ax[1].set_title('Orders per Year')
  ax[1].legend(yearly_counts.keys().tolist())
  # Add a title to the Figure
  fig.suptitle('Sales Data')
  # Show the figure
  plt.show()

  import seaborn as sns
  # Clear the plot area
  plt.clf()
  # Create a bar chart
  ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
  plt.show()

  # Clear the plot area
  plt.clf()
  # Set the visual theme for seaborn
  sns.set_theme(style="whitegrid")
  # Create a bar chart
  ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
  plt.show()

  # Clear the plot area
  plt.clf()
  # Create a line chart
  ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
  plt.show()
