The Spark SQL DataType class is the base class of all data types in Spark, defined in the package org.apache.spark.sql.types. Data types are primarily used when working with DataFrames. In this article, you will learn the different data types and their utility methods, with Scala examples.
All data types in the table below are supported in Spark SQL, and the DataType class is the base class for all of them. Some types, such as IntegerType, DecimalType, and ByteType, are subclasses of NumericType, which is itself a subclass of DataType.
StringType | ShortType
ArrayType | IntegerType
MapType | LongType
StructType | FloatType
DateType | DoubleType
TimestampType | DecimalType
BooleanType | ByteType
CalendarIntervalType | HiveStringType
BinaryType | ObjectType
NumericType | NullType
All Spark SQL data types extend the DataType class and provide implementations of the methods shown in this example.
-
- val arrayType = ArrayType(StringType, true)
- println("json() : "+arrayType.json) // Represents json string of datatype
- println("prettyJson() : "+arrayType.prettyJson) // Gets json in pretty format
- println("simpleString() : "+arrayType.simpleString) // simple string
- println("sql() : "+arrayType.sql) // SQL format
- println("typeName() : "+arrayType.typeName) // type name
- println("catalogString() : "+arrayType.catalogString) // catalog string
- println("defaultSize() : "+arrayType.defaultSize) // default size
This yields the output below.
-
- json() : {"type":"array","elementType":"string","containsNull":true}
- prettyJson() : {
- "type" : "array",
- "elementType" : "string",
- "containsNull" : true
- }
- simpleString() : array<string>
- sql() : ARRAY<STRING>
- typeName() : array
- catalogString() : array<string>
- defaultSize() : 20
Besides these, the DataType class provides the following static methods. If you have a JSON string and want to convert it to a DataType, use fromJson(). For example, to convert a JSON schema string into its corresponding DataType:
-
- val typeFromJson = DataType.fromJson(
- """{"type":"array",
- |"elementType":"string","containsNull":false}""".stripMargin)
- println(typeFromJson.getClass)
- val typeFromJson2 = DataType.fromJson("\"string\"")
- println(typeFromJson2.getClass)
-
- //This prints
- class org.apache.spark.sql.types.ArrayType
- class org.apache.spark.sql.types.StringType$
Like loading a structure from a JSON string, we can also create one from a DDL string using fromDDL().
-
- val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING," +
- "`middle`: STRING>,`age` INT,`gender` STRING"
- val ddlSchema = DataType.fromDDL(ddlSchemaStr)
- println(ddlSchema.getClass)
-
- // This prints
- class org.apache.spark.sql.types.StructType
In order to get or create a specific data type, use the objects and factory methods provided by the org.apache.spark.sql.types.DataTypes class. For example, use the object DataTypes.StringType to get a StringType, and the factory method DataTypes.createArrayType(StringType) to get an ArrayType of string.
-
- //Below are some examples
- val strType = DataTypes.StringType
- val arrayType = DataTypes.createArrayType(StringType)
- val structType = DataTypes.createStructType(
- Array(DataTypes.createStructField("fieldName",StringType,true)))
StringType (org.apache.spark.sql.types.StringType) is used to represent string values. To create a string type, use either DataTypes.StringType or the StringType object; both return a StringType.
-
- val strType = DataTypes.StringType
- println("json : "+strType.json)
- println("prettyJson : "+strType.prettyJson)
- println("simpleString : "+strType.simpleString)
- println("sql : "+strType.sql)
- println("typeName : "+strType.typeName)
- println("catalogString : "+strType.catalogString)
- println("defaultSize : "+strType.defaultSize)
This yields the output below.
-
- json : "string"
- prettyJson : "string"
- simpleString : string
- sql : STRING
- typeName : string
- catalogString : string
- defaultSize : 20
Use ArrayType to represent arrays in a DataFrame, and use either the factory method DataTypes.createArrayType() or the ArrayType() constructor to get an array object of a specific type.
On an ArrayType object you can access all the methods described in section 1.1; additionally, it provides containsNull, elementType, and productElement(), to name a few.
-
- val arr = ArrayType(IntegerType, false)                    // via the constructor
- val arrayType = DataTypes.createArrayType(StringType, true) // via the factory method
- println("containsNull : "+arrayType.containsNull)
- println("elementType : "+arrayType.elementType)
- println("productElement : "+arrayType.productElement(0))
This yields the output below.
-
- containsNull : true
- elementType : StringType
- productElement : StringType
For more examples and usage, please refer to Using ArrayType on DataFrame.
Use MapType to represent key-value pair maps in a DataFrame, and use either the factory method DataTypes.createMapType() or the MapType() constructor to get a map object with specific key and value types.
On a MapType object you can access all the methods described in section 1.1; additionally, it provides keyType, valueType, valueContainsNull, and productElement(), to name a few.
-
- val mapType1 = MapType(StringType, IntegerType)                // via the constructor
- val mapType = DataTypes.createMapType(StringType, IntegerType) // via the factory method
- println("keyType() : "+mapType.keyType)
- println("valueType() : "+mapType.valueType)
- println("valueContainsNull() : "+mapType.valueContainsNull)
- println("productElement(1) : "+mapType.productElement(1))
This yields the output below.
-
- keyType() : StringType
- valueType() : IntegerType
- valueContainsNull() : true
- productElement(1) : IntegerType
For more examples and usage, please refer to Using MapType on DataFrame.
Use DateType (org.apache.spark.sql.types.DateType) to represent a date in a DataFrame, and use DataTypes.DateType or the DateType object to get a date type.
On a DateType object you can access all the methods described in section 1.1.
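For instance, the common methods from section 1.1 work the same way on DateType; a minimal sketch:

```scala
import org.apache.spark.sql.types.DataTypes

val dateType = DataTypes.DateType
println("json : " + dateType.json)                   // "date"
println("simpleString : " + dateType.simpleString)   // date
println("catalogString : " + dateType.catalogString) // date
println("defaultSize : " + dateType.defaultSize)     // size in bytes of a date value
```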
Use TimestampType (org.apache.spark.sql.types.TimestampType) to represent a timestamp in a DataFrame, and use DataTypes.TimestampType or the TimestampType object to get a timestamp type.
On a TimestampType object you can access all the methods described in section 1.1.
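Likewise for TimestampType; a minimal sketch:

```scala
import org.apache.spark.sql.types.DataTypes

val timestampType = DataTypes.TimestampType
println("json : " + timestampType.json)                 // "timestamp"
println("sql : " + timestampType.sql)                   // TIMESTAMP
println("typeName : " + timestampType.typeName)         // timestamp
println("defaultSize : " + timestampType.defaultSize)   // size in bytes of a timestamp value
```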
Use StructType (org.apache.spark.sql.types.StructType) to define the nested structure, or schema, of a DataFrame, and use either DataTypes.createStructType() or the StructType() constructor to get a struct object.
A StructType object provides many functions, such as toDDL(), fields, fieldNames, and length, to name a few.
-
- //StructType
- val structType = DataTypes.createStructType(
- Array(DataTypes.createStructField("fieldName",StringType,true)))
-
- val simpleSchema = StructType(Array(
- StructField("name",StringType,true),
- StructField("id", IntegerType, true),
- StructField("gender", StringType, true),
- StructField("salary", DoubleType, true)
- ))
-
- val anotherSchema = new StructType()
- .add("name",new StructType()
- .add("firstname",StringType)
- .add("lastname",StringType))
- .add("id",IntegerType)
- .add("salary",DoubleType)
For more examples and usage, please refer to StructType.
Similar to the types described above, for the rest of the data types use the appropriate method on the DataTypes class, or the data type's constructor, to create an object of the desired type. All the common methods described in section 1.1 are available on these types as well.
In this article, you learned the different Spark SQL data types, the DataType and DataTypes classes, and their methods, using Scala examples. I recommend referring to the DataType and DataTypes API documentation for more details.
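For instance, BooleanType and DecimalType follow the same pattern (a short sketch; note that DecimalType additionally takes a precision and a scale):

```scala
import org.apache.spark.sql.types.DataTypes

val boolType = DataTypes.BooleanType
val decimalType = DataTypes.createDecimalType(10, 2) // precision 10, scale 2
println("sql : " + boolType.sql)                        // BOOLEAN
println("sql : " + decimalType.sql)                     // DECIMAL(10,2)
println("catalogString : " + decimalType.catalogString) // decimal(10,2)
```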
Thanks for reading. If you liked it, please share the article using the social links below; comments and suggestions are welcome in the comments section!