赞
踩
本文主要记录一次Spark SQL在多表关联时出现的OOM问题以及解决。看似是OOM问题,实则是数据倾斜问题。在讨论这个问题之前,我们有必要先来看一下Spark SQL有哪些Join类型
Spark SQL共提供了7种Join类型,包括:内联接、左联接、右联接、全连接、交叉联接、半联接和反联接。以下是各Join类型的介绍以及联接示例:
准备示例数据如下:
-- 使用员工和部门表来演示不同类型的联接 create table employee( id string, name string, deptno bigint ); insert into employee values ('105','Chloe', 5), ('103','Paul', 3), ('101','John', 1), ('102','Lisa', 2), ('104','Evan', 4), ('106','Amy', 6); SELECT * FROM employee; +---+-----+------+ | id| name|deptno| +---+-----+------+ |105|Chloe| 5| |103| Paul| 3| |101| John| 1| |102| Lisa| 2| |104| Evan| 4| |106| Amy| 6| +---+-----+------+ create table department( deptno bigint, deptname string ); insert into department values (3,'Engineering'), (2,'Sales'), (1,'Marketing'); SELECT * FROM department; +------+-----------+ |deptno| deptname| +------+-----------+ | 3|Engineering| | 2| Sales| | 1| Marketing| +------+-----------+
1)内联接(Inner Join)
语法:
A [ INNER ] JOIN B
内联接是Spark SQL中的默认联接。它选择在两个表中具有公共匹配值的行
示例:
SELECT id, name, employee.deptno, deptname FROM employee INNER JOIN department ON employee.deptno = department.deptno;
+---+-----+------+-----------|
| id| name|deptno| deptname|
+---+-----+------+-----------|
|103| Paul| 3|Engineering|
|101| John| 1| Marketing|
|102| Lisa| 2| Sales|
+---+-----+------+-----------|
2)左联接(Left Join)
语法:
A LEFT [ OUTER ] JOIN B
左联接返回左侧表中的所有值和右侧表中的匹配值,如果没有匹配项,则追加NULL
示例:
SELECT id, name, employee.deptno, deptname FROM employee LEFT JOIN department ON employee.deptno = department.deptno;
+---+-----+------+-----------|
| id| name|deptno| deptname|
+---+-----+------+-----------|
|105|Chloe| 5| NULL|
|
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。