File -> Settings -> Plugins: type scala, search for the Scala plugin, then download and install it. Pay attention to the version:
First download and unpack Scala; simply unpacking a copy taken from the Linux server works fine.
File -> Project Structure -> Libraries. Once this is configured, Scala Class shows up under the New menu:
Configure the Spark and Scala environment variables:
Download and unpack Hadoop, Spark, and Scala, then add the environment variables:
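For example, on Windows the variables might look like the following. The install paths are hypothetical; point them at wherever you actually unpacked each package (the versions simply mirror the ones used in the pom below):
SCALA_HOME=D:\scala-2.11.8
SPARK_HOME=D:\spark-2.4.0-bin-hadoop2.7
HADOOP_HOME=D:\hadoop-2.8.5
Path=%PATH%;%SCALA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin
On Windows, %HADOOP_HOME%\bin usually also needs winutils.exe, otherwise local Hadoop/Spark file operations fail.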
File -> New Project -> Maven.
There are two XML configuration files, as follows:
(1) pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>Learn-BigData</groupId>
    <artifactId>bigdata</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.4.0</spark.version>
        <hadoop.version>2.8.5</hadoop.version>
        <encoding>UTF-8</encoding>
    </properties>

    <repositories>
        <repository>
            <id>nexus-aliyun</id>
            <name>Nexus aliyun</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </repository>
    </repositories>

    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>

        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <!-- Pin the hadoop-client API version -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.3.1</version>
            <scope>compile</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.6</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <!-- Plugin that compiles Scala -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- Plugin that compiles Java -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <!-- Plugin that builds the (shaded) jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
(2) dependency-reduced-pom.xml. This file is generated during packaging and is not something you need to care about.
The Java WordCount program:
package cn.edu360.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class JavaWordCount {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        // create the SparkContext
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // specify where the data will be read from
        JavaRDD<String> lines = jsc.textFile(args[0]);
        // split each line and flatten the result
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        // pair each word with a 1
        JavaPairRDD<String, Integer> wordAndOne = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });

        // aggregate the counts per word
        JavaPairRDD<String, Integer> reduced = wordAndOne.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // swap key and value so we can sort by count
        JavaPairRDD<Integer, String> swaped = reduced.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> tp) throws Exception {
                //return new Tuple2<>(tp._2, tp._1);
                return tp.swap();
            }
        });

        // sort by count, descending
        JavaPairRDD<Integer, String> sorted = swaped.sortByKey(false);

        // swap back to (word, count)
        JavaPairRDD<String, Integer> result = sorted.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> tp) throws Exception {
                return tp.swap();
            }
        });

        // save the result to HDFS
        result.saveAsTextFile(args[1]);

        // release resources
        jsc.stop();
    }
}
View -> Tool Windows -> Maven Projects. If the directory tree shown below does not appear, click the + button and select the pom.xml file:
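Before submitting, build the jar and copy it to the Linux server. A minimal sketch, run from the project root: the actual jar name depends on your artifactId, and the server path here simply matches the spark-submit command below.
mvn clean package
scp target/original-SparkTest-1.0-SNAPSHOT.jar root@hdp-01:/root/learn_dh/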
Go into the bin directory under the Spark installation directory and run the following command:
spark-submit --master spark://hdp-01:7077 --class cn.edu360.spark.JavaWordCount /root/learn_dh/original-SparkTest-1.0-SNAPSHOT.jar hdfs://hdp-01:9000/spark/input/test.txt hdfs://hdp-01:9000/spark/output/wc1005
Explanation of the command:
1. --master spark://hdp-01:7077 specifies the master of the Spark cluster.
2. --class cn.edu360.spark.JavaWordCount specifies the fully qualified name of the main class.
3. /root/learn_dh/original-SparkTest-1.0-SNAPSHOT.jar is the absolute path of the jar on the Linux server.
4. hdfs://hdp-01:9000/spark/input/test.txt is the input file path on HDFS.
5. hdfs://hdp-01:9000/spark/output/wc1005 is the output path on HDFS. (This path must not already exist, otherwise the job fails with an error.)
Execution status can be checked at http://hdp-01:8080/.
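If you rerun the job with the same output path, delete the previous output first, for example:
hdfs dfs -rm -r hdfs://hdp-01:9000/spark/output/wc1005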
To run locally instead, set setMaster to "local" and configure the input and output file paths:
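A minimal local-run sketch, assuming hypothetical local paths (adjust them, or pass them as program arguments via Run -> Edit Configurations and read args[0]/args[1]); the word-count logic is the same as in JavaWordCount above:
package cn.edu360.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class LocalJavaWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark inside the IDE process on all local cores;
        // no cluster is needed, and no jars beyond the pom dependencies.
        SparkConf conf = new SparkConf()
                .setAppName("LocalJavaWordCount")
                .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Hypothetical local paths; the output directory must not exist yet.
        String input = "D:/spark_test/input/test.txt";
        String output = "D:/spark_test/output/wc_local";

        JavaRDD<String> lines = jsc.textFile(input);
        // Java 8 lambdas replace the anonymous classes used above.
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile(output);

        jsc.stop();
    }
}
On Windows this still relies on the HADOOP_HOME/winutils setup described earlier.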
9. When opening the project in IDEA, you need to select down to the src directory; otherwise the project structure tree is not visible after the project opens, which is a real pitfall:
For example, this is what it looks like when you open the project by selecting the project root directly. Beginners should watch out for this; it is rather confusing.
10. Running MapReduce and Spark programs locally:
With the pom file above configured, there is no need to add the Spark and Hadoop jars the way many online guides suggest; setting setMaster("local") is enough to run the Spark program.
Two setup steps:
1. Set up the run configuration via Edit Configurations:
2.
When creating a new Maven project, once the pom is configured do not add any other Spark or Hadoop jars; doing so easily leads to baffling errors, presumably caused by dependency conflicts.
After configuring the Scala, Hadoop, and Spark environment variables on Windows, you can type scala or spark-shell at a cmd prompt and write Scala and Spark programs directly on the local machine.
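For example, a quick check from cmd (the output will reflect whichever versions you installed):
scala -version
spark-shell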