(1) In pyspark, read the local Linux file "/home/hadoop/test.txt" (if the file does not exist, create it and add some content of your choice), then count the number of lines in the file;
- cat /home/hadoop/test.txt
- pyspark
- lines = sc.textFile("file:///home/hadoop/test.txt")
- line_count = lines.count()
- print("Line count:", line_count)
(2) In pyspark, read the HDFS file "/user/hadoop/test.txt" (if the file does not exist, create it and add some content of your choice), then count the number of lines in the file;
- hadoop fs -cat /user/hadoop/test.txt
- pyspark
- lines = sc.textFile("hdfs:///user/hadoop/test.txt")
- line_count = lines.count()
- print("Line count:", line_count)
(3) Write a standalone application that reads the HDFS file "/user/hadoop/test.txt" (if the file does not exist, create it and add some content of your choice) and counts the number of lines in the file; submit the program to Spark with spark-submit.
- cd /opt/module/spark-3.0.3-bin-without-hadoop/mycode/
- touch File_Count.py
- vim File_Count.py
- from pyspark import SparkConf, SparkContext
- # Run locally on a single thread, with "File Count" as the application name
- conf = SparkConf().setMaster("local").setAppName("File Count")
- sc = SparkContext(conf=conf)
- # Read the HDFS file as an RDD of lines and count them
- lines = sc.textFile("hdfs:///user/hadoop/test.txt")
- line_count = lines.count()
- print("Line count:", line_count)
- sc.stop()
- spark-submit File_Count.py
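On Spark 2.x and later (including the 3.0.3 build used here), the same program can also be written against the SparkSession entry point; a minimal sketch, functionally equivalent to File_Count.py:
- from pyspark.sql import SparkSession
- # getOrCreate() builds the session (and the underlying SparkContext)
- spark = SparkSession.builder.master("local").appName("File Count").getOrCreate()
- # RDD operations such as textFile remain available via spark.sparkContext
- line_count = spark.sparkContext.textFile("hdfs:///user/hadoop/test.txt").count()
- print("Line count:", line_count)
- spark.stop()
Note that spark-submit typically prints the result among Spark's INFO logging on stderr, so running spark-submit File_Count.py 2>/dev/null makes the printed count easier to spot.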