Introduction to Spark
Apache Spark is an open-source cluster-computing framework for fast, large-scale data processing, including near-real-time workloads. It is among the most successful projects of the Apache Software Foundation and has become a leader in the big-data processing market. Today Spark is used by major companies such as Amazon, eBay, and Yahoo, and many organizations run it on clusters with thousands of nodes.
By contrast, the MapReduce (MR) model is inefficient for two kinds of operations that are common in data processing. The first is iterative algorithms, such as ALS in machine learning or gradient descent in convex optimization, which repeatedly query and transform a dataset or data derived from it. MR is a poor fit here: even with several MR jobs chained serially, performance and runtime suffer, because data can only be shared between jobs through disk. The second is interactive data mining, which MR is clearly not suited for.
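Spark sidesteps this by keeping working sets in memory: an RDD can be cached once and reused across iterations instead of being re-read from disk on each pass. A minimal sketch in spark-shell (the input path and the toy gradient-descent update are made up for illustration; the file is assumed to hold one number per line):
- // Cache the dataset once; every later pass reads it from memory rather
- // than re-reading HDFS, which is where Spark beats chained MR jobs.
- val data = sc.textFile("hdfs://ns1/spark/input/nums").map(_.toDouble).cache()
- var w = 0.0
- for (_ <- 1 to 20) {
-   // gradient step toward the mean; the cached RDD is reused each iteration
-   w -= 0.5 * data.map(x => w - x).mean()
- }
- println(s"converged mean = $w")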
Installing Spark
Configuration:
slaves
- hadoop-senior01.zhangbk.com
- hadoop-senior02.zhangbk.com
spark-env.sh
- SPARK_MASTER_HOST=hadoop-senior01.zhangbk.com
- SPARK_MASTER_PORT=7077
- export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://ns1/spark-history"
spark-defaults.conf
- spark.master spark://hadoop-senior01.zhangbk.com:7077
- spark.eventLog.enabled true
- spark.eventLog.dir hdfs://ns1/spark-history
- spark.eventLog.compress true
Create the event-log directory on HDFS: hdfs dfs -mkdir /spark-history
Because the Hadoop cluster runs in HA mode, copy hdfs-site.xml into Spark's ./conf directory, then distribute the configuration to the other nodes.
Start Spark
- sbin/start-all.sh
- sbin/start-history-server.sh
Open the master web UI
http://192.168.159.21:8080
Run your first Spark program
- bin/spark-submit \
- --class org.apache.spark.examples.SparkPi \
- --master spark://hadoop-senior01.zhangbk.com:7077 \
- --executor-memory 1G \
- --total-executor-cores 2 \
- examples/jars/spark-examples_2.11-2.3.0.jar \
- 100
Parameter notes:
--class CLASS_NAME Your application's main class (for Java / Scala apps)
--master spark://master01:7077  the master URL of the cluster to connect to
--executor-memory 1G  gives each executor 1G of memory
--total-executor-cores 2  caps the total number of CPU cores across all executors at 2 (not per executor)
This example estimates the value of Pi with a Monte Carlo method, roughly as sketched below.
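The core of that estimate looks roughly like this (a sketch rather than the exact SparkPi source; it can be pasted into spark-shell, where sc already exists):
- // Sample n random points in the square [-1,1]x[-1,1] and count how many
- // land inside the unit circle; that fraction approximates Pi/4.
- val n = 1000000
- val inside = sc.parallelize(1 to n).filter { _ =>
-   val x = math.random * 2 - 1
-   val y = math.random * 2 - 1
-   x * x + y * y <= 1
- }.count()
- println(s"Pi is roughly ${4.0 * inside / n}")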
Submitting a Spark Application
Once the application is packaged, it can be launched with the bin/spark-submit script. The script sets up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes:
- bin/spark-submit \
- --class <main-class> \
- --master <master-url> \
- --deploy-mode <deploy-mode> \
- --conf <key>=<value> \
- ... # other options
- <application-jar> \
- [application-arguments]
Some common options (a configuration-property equivalent is sketched after the list):
1) --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
2) --master: the master URL of the cluster (e.g. spark://23.195.26.187:7077)
3) --deploy-mode: whether to deploy your driver on a worker node (cluster) or run it locally as an external client (client) (default: client)
4) --conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, quote the whole pair as "key=value". Values set here override the defaults from spark-defaults.conf.
5) application-jar: the path to the bundled jar containing your application and its dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage; for a file:// path, the same jar must exist at that path on every node.
6) application-arguments: arguments passed to the main() method of your main class
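Most of these flags map onto plain configuration properties, so the same settings can be passed with --conf or set from code. A sketch of the programmatic equivalent (standard property names; spark.cores.max is what --total-executor-cores sets on a standalone cluster):
- import org.apache.spark.SparkConf
-
- // Equivalent of --master / --executor-memory / --total-executor-cores.
- val conf = new SparkConf()
-   .setAppName("WC")
-   .setMaster("spark://hadoop-senior01.zhangbk.com:7077")
-   .set("spark.executor.memory", "1g") // memory per executor
-   .set("spark.cores.max", "2")        // total cores across all executors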
Starting the Spark Shell
- bin/spark-shell \
- --master spark://hadoop-senior01.zhangbk.com:7077 \
- --executor-memory 2g \
- --total-executor-cores 2
Note:
If spark-shell is started without a master address, it still launches and can run programs normally; it just does not connect to the cluster. On a single node with no master specified (and no slaves file configured), spark-shell defaults to local mode.
In local mode the master and the worker run inside the same process; in cluster mode they run in separate processes.
The Spark shell initializes the SparkContext as the object sc by default; user code that needs a context can use sc directly, as below.
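A one-line job is enough to confirm that sc is live:
- // sc is pre-built by the shell; no SparkConf/SparkContext setup is needed.
- println(sc.parallelize(1 to 100).sum()) // prints 5050.0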
Writing WordCount in the Spark Shell
Write the Spark program in Scala directly in the spark shell:
sc.textFile("hdfs://ns1/spark/input/RELEASE").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://ns1/spark/output/out1")
Explanation (a step-by-step version follows the list):
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://ns1/spark/input/RELEASE") reads data from HDFS
flatMap(_.split(" ")) maps each line to words, then flattens the result
map((_,1)) pairs each word with a 1 as a tuple
reduceByKey(_+_) reduces by key, accumulating the values
saveAsTextFile("hdfs://ns1/spark/output/out1") writes the result back to HDFS
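The same pipeline can be unchained into named steps, which makes the data flow easier to follow (a readability sketch of the identical job; a fresh output directory is used because saveAsTextFile fails if the path already exists):
- val lines  = sc.textFile("hdfs://ns1/spark/input/RELEASE") // one record per line
- val words  = lines.flatMap(_.split(" "))                   // split and flatten
- val pairs  = words.map((_, 1))                             // (word, 1) tuples
- val counts = pairs.reduceByKey(_ + _)                      // sum counts per word
- counts.saveAsTextFile("hdfs://ns1/spark/output/out2")      // path must not exist yet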
Writing WordCount in IDEA
- package com.zhangbk.spark
-
- import org.apache.spark.{SparkConf, SparkContext}
- import org.slf4j.LoggerFactory
-
- object WordCount {
- val logger = LoggerFactory.getLogger(WordCount.getClass)
-
- def main(args: Array[String]) {
-   // args(0) = input path, args(1) = output path
-   val conf = new SparkConf().setAppName("WC")
-   val sc = new SparkContext(conf)
-
-   // count words, reduce into a single partition, sort by count descending
-   sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1))
-     .reduceByKey(_ + _, 1).sortBy(_._2, false).saveAsTextFile(args(1))
-
-   logger.info("=========================completed================================")
-
-   sc.stop()
- }
- }
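For quick debugging runs inside IDEA before packaging, a common pattern (an addition for illustration, not part of the original listing) is to force local mode:
- // Local-mode variant: master and executors run in-process, no cluster needed.
- // Remove setMaster() before packaging so --master on spark-submit takes effect.
- val conf = new SparkConf().setAppName("WC").setMaster("local[*]")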
Configure pom.xml
- <!-- wordcount pom.xml-->
- <?xml version="1.0" encoding="UTF-8"?>
- <project xmlns="http://maven.apache.org/POM/4.0.0"
- xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
- <parent>
- <artifactId>spark</artifactId>
- <groupId>com.zhangbk</groupId>
- <version>1.0-SNAPSHOT</version>
- </parent>
- <modelVersion>4.0.0</modelVersion>
-
- <artifactId>wordcount</artifactId>
-
- <dependencies>
- <dependency>
- <groupId>org.scala-lang</groupId>
- <artifactId>scala-library</artifactId>
- <version>${scala.version}</version>
- <!--<scope>provided</scope>-->
- </dependency>
- <dependency>
- <groupId>org.apache.spark</groupId>
- <artifactId>spark-core_2.11</artifactId>
- <version>${spark.version}</version>
- <!--<scope>provided</scope>-->
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-client</artifactId>
- <version>${hadoop.version}</version>
- <!--<scope>provided</scope>-->
- </dependency>
-
- <!-- Logging -->
- <dependency>
- <groupId>org.slf4j</groupId>
- <artifactId>jcl-over-slf4j</artifactId>
- <version>${slf4j.version}</version>
- </dependency>
- <dependency>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-api</artifactId>
- <version>${slf4j.version}</version>
- </dependency>
- <dependency>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-log4j12</artifactId>
- <version>${slf4j.version}</version>
- </dependency>
- <dependency>
- <groupId>log4j</groupId>
- <artifactId>log4j</artifactId>
- <version>${log4j.version}</version>
- </dependency>
- <!-- Logging End -->
- </dependencies>
- <build>
- <finalName>wordcount</finalName>
- <plugins>
- <plugin>
- <groupId>org.apache.maven.plugins</groupId>
- <artifactId>maven-assembly-plugin</artifactId>
- <version>2.2-beta-5</version>
- <configuration>
- <archive>
- <manifest>
- <mainClass>com.zhangbk.spark.WordCount</mainClass>
- </manifest>
- </archive>
- <descriptorRefs>
- <descriptorRef>jar-with-dependencies</descriptorRef>
- </descriptorRefs>
- </configuration>
- </plugin>
- </plugins>
- </build>
- </project>
-
-
- <!-- spark pom.xml-->
- <?xml version="1.0" encoding="UTF-8"?>
- <project xmlns="http://maven.apache.org/POM/4.0.0"
- xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
- <modelVersion>4.0.0</modelVersion>
-
- <groupId>com.zhangbk</groupId>
- <artifactId>spark</artifactId>
- <packaging>pom</packaging>
- <version>1.0-SNAPSHOT</version>
- <modules>
- <module>wordcount</module>
- </modules>
-
- <properties>
- <mysql.version>6.0.5</mysql.version>
- <spring.version>4.3.6.RELEASE</spring.version>
- <spring.data.jpa.version>1.11.0.RELEASE</spring.data.jpa.version>
- <log4j.version>1.2.17</log4j.version>
- <quartz.version>2.2.3</quartz.version>
- <slf4j.version>1.7.22</slf4j.version>
- <hibernate.version>5.2.6.Final</hibernate.version>
- <camel.version>2.18.2</camel.version>
- <config.version>1.10</config.version>
- <jackson.version>2.8.6</jackson.version>
- <servlet.version>3.0.1</servlet.version>
- <net.sf.json.version>2.4</net.sf.json.version>
- <activemq.version>5.14.3</activemq.version>
- <spark.version>2.1.1</spark.version>
- <scala.version>2.11.8</scala.version>
- <hadoop.version>2.5.0</hadoop.version>
- </properties>
-
- <build>
- <plugins>
- <plugin>
- <groupId>org.apache.maven.plugins</groupId>
- <artifactId>maven-compiler-plugin</artifactId>
- <version>3.6.0</version>
- <configuration>
- <source>1.8</source>
- <target>1.8</target>
- </configuration>
- </plugin>
- <plugin>
- <groupId>net.alchim31.maven</groupId>
- <artifactId>scala-maven-plugin</artifactId>
- <version>3.2.2</version>
- <executions>
- <execution>
- <goals>
- <goal>compile</goal>
- <goal>testCompile</goal>
- </goals>
- </execution>
- </executions>
- </plugin>
-
- </plugins>
- <pluginManagement>
- <plugins>
- <plugin>
- <groupId>org.apache.maven.plugins</groupId>
- <artifactId>maven-assembly-plugin</artifactId>
- <version>2.2-beta-5</version>
- <executions>
- <execution>
- <id>make-assembly</id>
- <phase>package</phase>
- <goals>
- <goal>single</goal>
- </goals>
- </execution>
- </executions>
- </plugin>
- </plugins>
- </pluginManagement>
- </build>
- </project>
Run the WordCount program
- bin/spark-submit \
- --class com.zhangbk.spark.WordCount \
- --master spark://hadoop-senior01.zhangbk.com:7077 \
- --executor-memory 1G \
- --total-executor-cores 2 \
- spark-jars/wordcount-jar-with-dependencies.jar \
- hdfs://ns1/spark/input/RELEASE \
- hdfs://ns1/spark/output/out5
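To spot-check the result, the output can be read back from spark-shell (assuming the job above completed):
- // The job sorts by count descending, so take(10) is the ten most frequent words.
- sc.textFile("hdfs://ns1/spark/output/out5").take(10).foreach(println)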