OS: Ubuntu 18.04
JDK: 1.8
Hadoop: 3.0.3
Reference: the post "Ubuntu18.04安装hadoop".
Set up passwordless SSH authentication on the machine. Hadoop's start/stop scripts log into localhost over SSH, and without key-based authentication they prompt for a password (and can hit permission problems) every time Hadoop is used.
Install SSH:
sudo apt-get install ssh
ssh-keygen -t rsa
When asked for a passphrase, just press Enter to leave it empty.
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
ssh localhost
tar -zxf /home/yusong/Desktop/hadoop-3.0.3.tar.gz -C /home/yusong/software # extract into the target directory
cd /home/yusong/software # change into that directory
mv ./hadoop-3.0.3/ ./hadoop # rename the folder to hadoop
chown -R yusong ./hadoop # give ownership to your own user (yusong here)
Verify that the installation succeeded:
cd /home/yusong/software/hadoop
./bin/hadoop version
If the installation succeeded, the version information is printed:
Hadoop 3.0.3
Source code repository https://yjzhangal@git-wip-us.apache.org/repos/asf/hadoop.git -r 37fd7d752db73d984dc31e0cdfd590d252f5e075
Compiled by yzhang on 2018-05-31T17:12Z
Compiled with protoc 2.5.0
From source with checksum 736cdcefa911261ad56d2d120bf1fa
This command was run using /home/yusong/software/hadoop/share/hadoop/common/hadoop-common-3.0.3.jar
Reference: the official guide, Hadoop: Setting up a Single Node Cluster.
Blog post: hadoop3的简单安装方法(单节点).
By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*
The output is:
1 dfsadmin
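Note that the job refuses to run if the output directory already exists, so delete it before re-running the example:
$ rm -r output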
Edit etc/hadoop/core-site.xml under the Hadoop installation directory:
vim ./etc/hadoop/core-site.xml
Fill in the (currently empty) <configuration> element as follows:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
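Optionally (this is not in the original post, but many setups add it): set hadoop.tmp.dir in the same <configuration> block so that HDFS data is not kept under /tmp, which is cleared on reboot and would force a NameNode re-format. A sketch, assuming a tmp directory under the Hadoop installation:
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/yusong/software/hadoop/tmp</value>
    </property>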
Edit etc/hadoop/hdfs-site.xml:
vim ./etc/hadoop/hdfs-site.xml
Fill in the <configuration> element as follows:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Open hadoop-env.sh and set the Java path:
vim ./etc/hadoop/hadoop-env.sh
Add:
export JAVA_HOME=/home/yusong/software/jvm/jdk1.8.0_271
The instructions below run a MapReduce job locally. To execute the job on YARN instead, see the YARN-on-a-single-node part further down.
bin/hdfs namenode -format
sudo sbin/start-dfs.sh
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [yusong-desktop]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation
To fix this, do the following:
vim sbin/start-dfs.sh
Add the following near the top of the script:
HDFS_DATANODE_USER=yusong
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=yusong
HDFS_SECONDARYNAMENODE_USER=yusong
vim sbin/stop-dfs.sh
Add the same lines:
HDFS_DATANODE_USER=yusong
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=yusong
HDFS_SECONDARYNAMENODE_USER=yusong
Then re-run the command above: sudo sbin/start-dfs.sh
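A side note that is not in the original post: since ./hadoop was chown'ed to the regular user earlier, an alternative is to start the daemons as that user without sudo, which never triggers the root-user checks above:
sbin/start-dfs.sh # run as the normal user (yusong here), no sudo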
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/yusong # replace yusong with your own username
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep input /user/yusong/output 'dfs[a-z.]+'
bin/hdfs dfs -get /user/yusong/output output
cat output/*
bin/hdfs dfs -cat /user/yusong/output/*
sbin/stop-dfs.sh
The data produced by the operations above can also be viewed in the web UI:
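In Hadoop 3.x the NameNode web UI listens on port 9870 by default, so the files written above can be browsed (Utilities -> Browse the file system) at:
NameNode - http://localhost:9870/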
MapReduce jobs can be run on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.
The following instructions assume the steps above have already been carried out.
Configure the parameters as follows:
vim etc/hadoop/mapred-site.xml
Add the following:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
Then configure yarn-site.xml:
vim etc/hadoop/yarn-site.xml
Add the following:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
Start the ResourceManager and NodeManager daemons:
sbin/start-yarn.sh
This may fail with the following errors:
Starting resourcemanagers on []
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.
This is because YARN_RESOURCEMANAGER_USER and YARN_NODEMANAGER_USER have not been defined. Go to the sbin directory, edit start-yarn.sh and stop-yarn.sh, and add the user variables at the top of both scripts.
Open the files in turn:
vim sbin/start-yarn.sh
vim sbin/stop-yarn.sh
Add:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Re-run: sbin/start-yarn.sh
Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
Run a MapReduce job. HDFS still has to be running; if it is not, start it first:
sbin/start-dfs.sh
For example, running:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount input_2 /user/yusong/output_2
produced the following error:
2020-11-28 22:10:24,373 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
java.net.ConnectException: Call From yusong-desktop/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Metho
Check the hosts file:
cat /etc/hosts
The contents were:
127.0.0.1 localhost
127.0.1.1 yusong-desktop
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Reference: https://blog.csdn.net/Julyaaaa/article/details/94865033
The fix:
Edit /etc/hosts and comment out the line starting with ::1 (just put a # in front of it).
Note: modifying this file requires root privileges:
sudo vim /etc/hosts
(In vim, press i to insert, press Esc when finished, then type :wq and press Enter to save.)
Edit core-site.xml and change localhost to the machine's hostname.
To look up the hostname: hostname
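For example, with the hostname yusong-desktop seen in /etc/hosts above (substitute your own), fs.defaultFS in core-site.xml becomes:
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://yusong-desktop:9000</value>
    </property>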
Re-format the NameNode: bin/hadoop namenode -format
Restart Hadoop: ./sbin/start-all.sh
Stop Hadoop: ./sbin/stop-all.sh
Turn off the firewall: service iptables stop
If that command fails (Ubuntu 18.04 uses ufw rather than the iptables service), use the following instead:
sudo ufw status # check the current firewall status
sudo ufw enable # enable the firewall
sudo ufw disable # disable the firewall
Check whether port 9000 is listening: lsof -i:9000
If it is not, check whether the core-site.xml configuration is correct.
If it is, check whether port 9000 has connections: netstat -tlpn
(At this step everything worked for me; the host IP connected to port 9000 successfully.)
When you are done, stop the daemons with:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
This uses the share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar that ships with Hadoop.
Reference: hadoop3.1.3wordcount实例讲解.
Create a file somewhere, for example:
vim /home/yusong/test.txt
With the following contents:
hello
world
qit
stop
sdf
apple
EA
steam
orange
world
hello
bin/hdfs dfs -mkdir input_2
bin/hdfs dfs -put /home/yusong/test.txt /user/yusong/input_2
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount input_2 /user/yusong/output_2
bin/hdfs dfs -cat /user/yusong/output_2/*
The result:
EA 1
apple 1
hello 2
orange 1
qit 1
sdf 1
steam 1
stop 1
world 2
To implement wordcount without the jar provided by Hadoop, you have to write the code yourself, package it into a jar, and run wordcount with the jar you built.
The idea is simple, but online tutorials differ widely, and following the official tutorial also failed for me.
Below is the procedure I arrived at after some trial and error. IDEA installation was covered in the first section of this article (jdk and idea); the following is based on IDEA.
First create a Maven project in IDEA: File -> New -> Project -> select Maven (the Project SDK must match the JDK version your Hadoop is configured with!) -> Next -> name it HadoopDamo.
The code is taken from the official tutorial.
Under src/main/java in the project structure, create a new class file named WordCount.java with the following contents:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Before packaging, configure the dependencies. There is a pom.xml file in the project root directory; add the following to it (note that I set the JDK to 1.8 in the compiler plugin).
<dependencies>
    <!-- hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.0.3</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.3</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>
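As an aside (not in the original post): once the pom is in place, the jar can also be built from the command line instead of the IDEA Maven panel described next, assuming mvn is installed and the project lives under ~/IdeaProjects/HadoopDamo:
cd /home/yusong/IdeaProjects/HadoopDamo
mvn clean package # the finished jar is written to the target/ directory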
After adding and saving this, open the Maven panel on the right and click Maven -> HadoopDamo -> Lifecycle -> package (if in doubt, see the post Hadoop实例学习(五)打jar包). After double-clicking package, IDEA builds the jar automatically; when it finishes, the newly generated jar appears under the target directory in the project root: HadoopDamo-1.0-SNAPSHOT.jar.
I renamed HadoopDamo-1.0-SNAPSHOT.jar to wc2.jar and put it into a newly created myjar folder under the Hadoop directory:
mkdir myjar
cp /home/yusong/IdeaProjects/HadoopDamo/target/wc2.jar myjar/wc2.jar
bin/hadoop jar myjar/wc2.jar WordCount input_2 /user/yusong/output_2
The operations above assume Hadoop is already up and running.
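One more caveat: /user/yusong/output_2 already exists from the earlier wordcount run, and MapReduce refuses to write to an existing output directory, so delete it (or choose a new output path) before running the jar:
bin/hdfs dfs -rm -r /user/yusong/output_2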
View the result of the run:
bin/hdfs dfs -cat /user/yusong/output_2/*