
[Python Notes] spark.read.csv


1 Problem Encountered

from pyspark.sql.types import StructField, StructType, StringType

# Define the schema for the Spark DataFrame
schema = StructType(
    [
        StructField('ip', StringType(), True),
        StructField('city', StringType(), True)
    ]
)
ip_city_path = job + '/abcdefg'  # `job` is a base path defined elsewhere in the script
ip_city_df = spark.read.csv(ip_city_path, header=True, schema=schema, encoding='utf-8', sep=',')

or, equivalently, via option() calls:

(spark.read
     .option("charset", "utf-8")
     .option("header", "true")
     .option("quote", "\"")
     .option("delimiter", ",")
     .csv(...))
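Both forms produce the same DataFrame: the keyword arguments of csv() map one-to-one onto option() calls. A quick sanity check on the result (assuming the read above succeeded):

ip_city_df.printSchema()            # expect: ip (string), city (string)
ip_city_df.show(5, truncate=False)  # eyeball the first rows for encoding issues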

2 Ways Spark Reads External Data


2.1 Reading CSV Files (Four Ways)

// Way one: call the csv method directly
// (import spark.implicits._ is needed for the $ column syntax;
// header is false, so the columns come in as _c0, _c1, ...)
val sales4: DataFrame = spark.read.option("header", false)
  .csv("file:///D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\sales.csv")
  .withColumnRenamed("_c0", "time")
  .withColumnRenamed("_c1", "id")
  .withColumnRenamed("_c2", "salary")
  .withColumn("salary", $"salary".cast("Double"))

// Way two: use format
val salesDF3 = spark.read.format("com.databricks.spark.csv")
  // header defaults to false (first row read as data); "true" makes the first row the column names
  // .option("header", "true")
  // automatic type inference
  // .option("inferSchema", true)
  .load("D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\sales.csv")
  // rename the columns
  .withColumnRenamed("_c0", "time")
  .withColumnRenamed("_c1", "id")
  .withColumnRenamed("_c2", "salary")
  // cast the type after renaming
  .withColumn("salary", $"salary".cast("Double"))

// Way three: map each split line into a case class object, then call toDF
// (the case class must be defined outside the method that calls toDF)
case class Sales(time: String, id: String, salary: Double)

private val salesDF: DataFrame = sc.textFile("file:///D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\sales.csv")
  .map(_.split(",")).map(x => Sales(x(0), x(1), x(2).toDouble)).toDF()

// Way four: build from an explicit schema
val userRDD: RDD[Row] = sc.textFile("file:///D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\demo02.txt").map(_.split(","))
  .map(x => Row(x(0), x(1), x(2).toInt))

val schema = StructType(Array(
  StructField("username", StringType, true),
  StructField("month", StringType, true),
  StructField("visitNum", IntegerType, true)
))

val userDF: DataFrame = spark.createDataFrame(userRDD, schema)
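For Python readers, here is a rough PySpark analogue of ways three and four. This is a sketch, assuming an active SparkContext `sc` and SparkSession `spark`, and reusing the file paths from the Scala snippets (with forward slashes):

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Way three analogue: split each line, wrap it in a Row, convert to a DataFrame
sales_df = (sc.textFile("file:///D:/Software/idea_space/spark_streaming/src/data/exam/sales.csv")
              .map(lambda line: line.split(","))
              .map(lambda x: Row(time=x[0], id=x[1], salary=float(x[2])))
              .toDF())

# Way four analogue: explicit schema plus createDataFrame
schema = StructType([
    StructField("username", StringType(), True),
    StructField("month", StringType(), True),
    StructField("visitNum", IntegerType(), True),
])
user_rdd = (sc.textFile("file:///D:/Software/idea_space/spark_streaming/src/data/exam/demo02.txt")
              .map(lambda line: line.split(","))
              .map(lambda x: (x[0], x[1], int(x[2]))))
user_df = spark.createDataFrame(user_rdd, schema)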

Performance difference between spark.read.format("csv") and spark.read.csv

DF1 took 42 seconds while DF2 took only 10, on a CSV file of 60+ GB. The gap comes from inferSchema=true in DF1: schema inference forces an extra full pass over the data to guess column types, while DF2 reads everything as strings in a single pass.

DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")

DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")

https://www.it1352.com/1847378.html
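If typed columns are needed without paying for the inference pass, supplying an explicit schema gives DF1-style types at DF2-style speed. A sketch, where the column names and types are placeholders (the real layout of xxx.csv is not given in the post):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder schema: substitute the real columns of xxx.csv
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", DoubleType(), True),
])
DF3 = spark.read.option("header", "true").schema(schema).csv("hdfs://bda-ns/user/project/xxx.csv")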

2.2 Reading JSON Files

// Way one:
private val userJsonDF: DataFrame = spark.read.json("file:///D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\users.json")
// Way two:
private val userJsonDF: DataFrame = spark.read.format("json").load("D:\\Software\\idea_space\\spark_streaming\\src\\data\\exam\\users.json")
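One caveat worth knowing: by default spark.read.json expects one JSON object per line (JSON Lines). If the file is a pretty-printed document or a top-level array, enable the multiLine option (available since Spark 2.2). A PySpark sketch against the same file, assuming it is not line-delimited:

user_json_df = (spark.read
    .option("multiLine", "true")
    .json("file:///D:/Software/idea_space/spark_streaming/src/data/exam/users.json"))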

2.3 Reading from and Writing to MySQL

Reading

import java.util.Properties

val url = "jdbc:mysql://hadoop1:3306/test"
val tableName = "sales"
private val prop = new Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "ok")
prop.setProperty("driver", "com.mysql.jdbc.Driver")

private val salesDF2: DataFrame = spark.read.jdbc(url, tableName, prop)
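For large tables, the single-connection read above can become a bottleneck. Spark's jdbc reader can partition on a numeric column and pull the table over several parallel connections; a PySpark sketch, where the partitioning column and bounds are assumptions, not from the original:

props = {"user": "root", "password": "ok", "driver": "com.mysql.jdbc.Driver"}

# Assumes sales has a numeric id column; the bounds only steer partition split points
sales_df = spark.read.jdbc(
    "jdbc:mysql://hadoop1:3306/test", "sales",
    column="id", lowerBound=1, upperBound=100000, numPartitions=8,
    properties=props)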

Writing

val url = "jdbc:mysql://hadoop1:3306/test"
val tableName = "sales"
private val prop = new Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "ok")
prop.setProperty("driver", "com.mysql.jdbc.Driver")

// write modes: "overwrite" replaces existing data, "append" adds to it
salesDF.write.mode("overwrite").jdbc(url, tableName, prop)
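Note that mode("overwrite") drops and recreates the target table by default, which loses indexes and column definitions. If the table definition should survive, the JDBC truncate option (Spark 2.1+) empties the table instead; a PySpark sketch with the same connection settings:

props = {"user": "root", "password": "ok", "driver": "com.mysql.jdbc.Driver"}

(sales_df.write
    .mode("overwrite")
    .option("truncate", "true")  # TRUNCATE TABLE instead of DROP + CREATE
    .jdbc("jdbc:mysql://hadoop1:3306/test", "sales", properties=props))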

2.4 Working with Hive Tables

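A minimal sketch of the idea (the database and table names below are illustrative): Hive access goes through a SparkSession built with enableHiveSupport, after which Hive tables behave like any other DataFrame source.

from pyspark.sql import SparkSession

# Requires a Hive-aware deployment (hive-site.xml visible to Spark)
spark = (SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()
sales_hive_df = spark.sql("SELECT * FROM test.sales")                 # read a Hive table
sales_hive_df.write.mode("overwrite").saveAsTable("test.sales_copy")  # write one back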
