Window functions are used to solve in-group ranking (top-N within a group) problems:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Create a window: rows with the same uid form one partition,
# ordered by d descending first, then ts descending
window = Window.partitionBy("uid") \
    .orderBy(F.col('d').desc(), F.col('ts').desc())
# Number the rows within each window
df = df.withColumn("topn", F.row_number().over(window))
Note: for the values [100, 100, 98], the three ranking functions assign positions as follows:
F.rank: 1, 1, 3
F.dense_rank: 1, 1, 2
F.row_number: 1, 2, 3
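A minimal sketch illustrating this difference on tied values (the grp and score column names are made up for this illustration, and spark is assumed to be an existing SparkSession):

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical data: a single group containing the tied scores [100, 100, 98]
scores = spark.createDataFrame(
    pd.DataFrame({'grp': ['x', 'x', 'x'], 'score': [100, 100, 98]}))
w = Window.partitionBy('grp').orderBy(F.col('score').desc())
scores.select(
    'score',
    F.rank().over(w).alias('rank'),
    F.dense_rank().over(w).alias('dense_rank'),
    F.row_number().over(w).alias('row_number'),
).show()
# Expected: rank -> 1, 1, 3; dense_rank -> 1, 1, 2; row_number -> 1, 2, 3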
The examples below use the following sample data:

import pandas as pd

d1 = {'name1': ["A", "A", "A", "B", "B", "B"],
      'age': [65, 64, 50, 48, 49, 68],
      'height': [171, 172, 173, 175, 174, 176]}
df1 = spark.createDataFrame(pd.DataFrame(d1))
df1.createOrReplaceTempView("table1")
df1.show()
+-----+---+------+
|name1|age|height|
+-----+---+------+
| A| 65| 171|
| A| 64| 172|
| A| 50| 173|
| B| 48| 175|
| B| 49| 174|
| B| 68| 176|
+-----+---+------+
PySpark code for the window function example:
from pyspark.sql.window import Window

window = Window.partitionBy("name1") \
    .orderBy(F.col('age').desc(), F.col('height').desc())
# Number the rows within each window
df1 = df1.withColumn("topn", F.row_number().over(window))
df1.show()
+-----+---+------+----+
|name1|age|height|topn|
+-----+---+------+----+
| A| 65| 171| 1|
| A| 64| 172| 2|
| A| 50| 173| 3|
| B| 68| 176| 1|
| B| 49| 174| 2|
| B| 48| 175| 3|
+-----+---+------+----+
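Because row_number gives every row in a group a unique position, the topn column can be used directly to keep only the top N rows per group. A minimal sketch continuing from the df1 computed above (the cutoff 2 is chosen arbitrarily for illustration):

# Keep the 2 oldest rows per name1 group (the cutoff 2 is arbitrary)
df1.filter(F.col("topn") <= 2).show()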
SQL code for the window function example:
spark.sql(
    """
    select *,
           row_number() over (partition by name1 order by age desc) as topn
    from table1
    """
).show()
+-----+---+------+----+
|name1|age|height|topn|
+-----+---+------+----+
| A| 65| 171| 1|
| A| 64| 172| 2|
| A| 50| 173| 3|
| B| 68| 176| 1|
| B| 49| 174| 2|
| B| 48| 175| 3|
+-----+---+------+----+
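In Spark SQL a window function cannot be referenced directly in a WHERE clause, so the same top-N filtering is usually done by wrapping the ranked query in a subquery. A minimal sketch (again, the cutoff 2 is arbitrary):

spark.sql(
    """
    select * from (
        select *,
               row_number() over (partition by name1 order by age desc) as topn
        from table1
    ) ranked
    where topn <= 2
    """
).show()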