Hadoop Distributed File System (HDFS)
1) Distributed
2) Runs on inexpensive, commodity hardware
3) Highly fault-tolerant
4) High throughput
5) Suited to large data sets
It has the functions of an ordinary file system: a directory tree holding files and folders, with operations exposed to clients: create, modify, delete, view, move, and so on.
An ordinary file system is confined to a single machine; a distributed file system can span many machines.
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
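One of the interfaces this paragraph refers to is FileSystem#getFileBlockLocations, which asks the NameNode which DataNode hosts hold each block of a file, so a scheduler can place computation next to the data. A minimal sketch, assuming a reachable cluster; the URI, user name, and path are placeholders from the examples later in these notes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.util.Arrays;

public class BlockLocationsDemo {

    public static void main(String[] args) throws Exception {
        // Placeholder cluster URI, user, and file; adjust to your environment
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"),
                new Configuration(), "hadoop");
        Path p = new Path("/hdfsapi/test/hello.txt");

        FileStatus status = fs.getFileStatus(p);
        // Ask the NameNode where each block of the file physically lives
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // Each entry lists the DataNode hosts holding the replicas of one block
            System.out.println("offset=" + b.getOffset()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}
```

This is the same placement information MapReduce uses when assigning map tasks to nodes.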
Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture.
NN (NameNode): An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
DN (DataNode): there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes (for fault tolerance).
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.
Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
HDFS supports a traditional hierarchical file organization.
A user or an application can create directories and store files inside these directories.
The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file.
HDFS supports user quotas and access permissions.
HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
While HDFS follows the naming convention of the FileSystem, some paths and names (e.g. /.reserved and .snapshot) are reserved. Features such as transparent encryption and snapshots use reserved paths.
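These namespace operations (create, rename, delete) are all exposed through the FileSystem client API. A minimal sketch, assuming a reachable cluster; the URI, user name, and paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class NamespaceOpsDemo {

    public static void main(String[] args) throws Exception {
        // Placeholder cluster URI and user; adjust to your environment
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"),
                new Configuration(), "hadoop");

        fs.mkdirs(new Path("/hdfsapi/demo"));                 // create a directory
        fs.rename(new Path("/hdfsapi/demo"),                  // move/rename within the namespace
                  new Path("/hdfsapi/demo2"));
        fs.delete(new Path("/hdfsapi/demo2"), true);          // recursive delete

        fs.close();
    }
}
```

All three calls are metadata operations handled by the NameNode; no block data moves when a file is renamed.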
The NameNode maintains the file system namespace.
Any change to the file system namespace or its properties is recorded by the NameNode.
An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
HDFS is designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance.
The block size and replication factor are configurable per file.
All blocks in a file except the last block are the same size; since support for variable-length blocks was added to append and hsync, users can start a new block without filling the last block out to the configured block size.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
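The replication factor can also be set per file from the client API. A sketch, assuming a reachable cluster; the URI, user name, and path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ReplicationDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        conf.set("dfs.replication", "1");

        // Placeholder cluster URI and user; adjust to your environment
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"),
                conf, "hadoop");

        // Change the replication factor of an existing file to 3;
        // the NameNode schedules the additional copies asynchronously
        fs.setReplication(new Path("/hdfsapi/test/hello.txt"), (short) 3);

        fs.close();
    }
}
```

Raising the factor does not block until the copies exist; the NameNode re-replicates in the background based on Blockreports.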
mkdir software    # installation packages
mkdir app         # installed software
mkdir data        # data files
mkdir lib         # job jars
mkdir shell       # scripts
mkdir maven_resp  # Maven dependencies
Switch to the root user: sudo -i
Switch to the hadoop user: su hadoop
Check the Linux version: uname -a
Hadoop version used: CDH: Apache Hadoop 2.6.0-cdh5.15.1
https://archive.cloudera.com/cdh5/cdh/5/
For learning, a single-machine setup is used (Hadoop, Hive, Spark)
Java 1.8
source .bash_profile # apply the changes
java -version
.bash_profile:
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin

export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH

export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH

export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.15.1
export PATH=$HIVE_HOME/bin:$PATH

export PATH
ll -a
ll -la
Check the firewall status: sudo firewall-cmd --state
Stop the firewall: sudo systemctl stop firewalld.service
package com.imooc.bigdata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Map;
import java.util.Set;

/**
 * Word count implemented with the HDFS API.
 *
 * Requirement: compute the word count of a file on HDFS and write the result back to HDFS.
 *
 * Steps:
 * 1) Read the file from HDFS ==> HDFS API
 * 2) Business logic (word counting): process every line of the file (split on the delimiter) ==> Mapper
 * 3) Cache the intermediate results ==> Context
 * 4) Write the result to HDFS ==> HDFS API
 */
public class HDFSWCApp01 {

    public static void main(String[] args) throws Exception {

        // 1) Read the file from HDFS ==> HDFS API
        Path input = new Path("/hdfsapi/test/hello.txt");

        // Obtain the HDFS file system to operate on
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"),
                new Configuration(), "hadoop");

        RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(input, false);

        ImoocMapper mapper = new WordCountMapper();
        ImoocContext context = new ImoocContext();

        while (iterator.hasNext()) {
            LocatedFileStatus file = iterator.next();
            FSDataInputStream in = fs.open(file.getPath());
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = reader.readLine()) != null) {
                // 2) Business logic (word counting), e.g. (hello, 3);
                // the mapper writes its results into the cache (context)
                mapper.map(line, context);
            }

            reader.close();
            in.close();
        }

        // 3) The processing results are cached in a Map
        Map<Object, Object> contextMap = context.getCacheMap();

        // 4) Write the result to HDFS ==> HDFS API
        Path output = new Path("/hdfsapi/output/");
        FSDataOutputStream out = fs.create(new Path(output, new Path("wc.out")));

        // Write the cached results from step 3 to the output stream
        Set<Map.Entry<Object, Object>> entries = contextMap.entrySet();
        for (Map.Entry<Object, Object> entry : entries) {
            out.write((entry.getKey().toString() + " \t " + entry.getValue() + "\n").getBytes());
        }

        out.close();
        fs.close();

        System.out.println("HDFS API word count finished successfully....");
    }
}
package com.imooc.bigdata.hadoop.hdfs;

import java.util.HashMap;
import java.util.Map;

/**
 * Custom context; essentially a cache.
 */
public class ImoocContext {

    private Map<Object, Object> cacheMap = new HashMap<Object, Object>();

    public Map<Object, Object> getCacheMap() {
        return cacheMap;
    }

    /**
     * Write a value into the cache.
     * @param key   the word
     * @param value the count
     */
    public void write(Object key, Object value) {
        cacheMap.put(key, value);
    }

    /**
     * Read a value from the cache.
     * @param key the word
     * @return the count for that word
     */
    public Object get(Object key) {
        return cacheMap.get(key);
    }
}
package com.imooc.bigdata.hadoop.hdfs;

/**
 * Custom Mapper interface.
 */
public interface ImoocMapper {

    /**
     * @param line    a line of input data
     * @param context the context/cache
     */
    void map(String line, ImoocContext context);
}
package com.imooc.bigdata.hadoop.hdfs;

/**
 * Word count Mapper implementation.
 */
public class WordCountMapper implements ImoocMapper {

    public void map(String line, ImoocContext context) {
        String[] words = line.split("\t");

        for (String word : words) {
            Object value = context.get(word);
            if (value == null) { // the word has not been seen before
                context.write(word, 1);
            } else {
                int v = Integer.parseInt(value.toString());
                context.write(word, v + 1); // increment the word's count
            }
        }
    }
}
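The properties file below names a CaseIgnoreWordCountMapper that is not shown in these notes. A plausible sketch, reusing WordCountMapper's logic but lower-casing each line first; only the class name comes from the configuration, the body is an assumption:

```java
package com.imooc.bigdata.hadoop.hdfs;

/**
 * Case-insensitive word count Mapper (sketch; the original class is not shown in these notes).
 */
public class CaseIgnoreWordCountMapper implements ImoocMapper {

    public void map(String line, ImoocContext context) {
        // Lower-case the line so that "Hello" and "hello" are counted together
        String[] words = line.toLowerCase().split("\t");

        for (String word : words) {
            Object value = context.get(word);
            if (value == null) { // the word has not been seen before
                context.write(word, 1);
            } else {
                context.write(word, Integer.parseInt(value.toString()) + 1);
            }
        }
    }
}
```

Because HDFSWCApp02 instantiates the mapper by reflection, swapping between this class and WordCountMapper only requires editing MAPPER_CLASS in the properties file.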
INPUT_PATH=/hdfsapi/test/h.txt
OUTPUT_PATH=/hdfsapi/output/
OUTPUT_FILE=wc.out5
HDFS_URI=hdfs://192.168.199.233:8020
MAPPER_CLASS=com.imooc.bigdata.hadoop.hdfs.CaseIgnoreWordCountMapper
package com.imooc.bigdata.hadoop.hdfs;

/**
 * Configuration key constants.
 */
public class Constants {

    public static final String INPUT_PATH = "INPUT_PATH";

    public static final String OUTPUT_PATH = "OUTPUT_PATH";

    public static final String OUTPUT_FILE = "OUTPUT_FILE";

    public static final String HDFS_URI = "HDFS_URI";

    public static final String MAPPER_CLASS = "MAPPER_CLASS";
}
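HDFSWCApp02 below reads its settings through a ParamsUtils helper that is not shown in these notes. A minimal sketch, assuming the settings live in a wc.properties file on the classpath; the file name and location are assumptions:

```java
package com.imooc.bigdata.hadoop.hdfs;

import java.io.InputStream;
import java.util.Properties;

/**
 * Loads job parameters from a properties file
 * (sketch; the original class is not shown in these notes).
 */
public class ParamsUtils {

    private static Properties properties = new Properties();

    static {
        // Assumed location: src/main/resources/wc.properties, on the classpath
        try (InputStream in = ParamsUtils.class.getClassLoader()
                .getResourceAsStream("wc.properties")) {
            properties.load(in);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static Properties getProperties() {
        return properties;
    }
}
```

Loading once in a static initializer means every caller shares the same Properties instance.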
package com.imooc.bigdata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/**
 * Word count implemented with the HDFS API, driven by a properties file.
 *
 * Requirement: compute the word count of a file on HDFS and write the result back to HDFS.
 *
 * Steps:
 * 1) Read the file from HDFS ==> HDFS API
 * 2) Business logic (word counting): process every line of the file (split on the delimiter) ==> Mapper
 * 3) Cache the intermediate results ==> Context
 * 4) Write the result to HDFS ==> HDFS API
 */
public class HDFSWCApp02 {

    public static void main(String[] args) throws Exception {

        // 1) Read the file from HDFS ==> HDFS API
        Properties properties = ParamsUtils.getProperties();
        Path input = new Path(properties.getProperty(Constants.INPUT_PATH));

        // Obtain the HDFS file system to operate on
        FileSystem fs = FileSystem.get(new URI(properties.getProperty(Constants.HDFS_URI)),
                new Configuration(), "hadoop");

        RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(input, false);

        // Create the Mapper instance via reflection, from the class name in the configuration
        Class<?> clazz = Class.forName(properties.getProperty(Constants.MAPPER_CLASS));
        ImoocMapper mapper = (ImoocMapper) clazz.newInstance();
        //ImoocMapper mapper = new WordCountMapper();

        ImoocContext context = new ImoocContext();

        while (iterator.hasNext()) {
            LocatedFileStatus file = iterator.next();
            FSDataInputStream in = fs.open(file.getPath());
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = reader.readLine()) != null) {
                // 2) Business logic (word counting), e.g. (hello, 3);
                // the mapper writes its results into the cache (context)
                mapper.map(line, context);
            }

            reader.close();
            in.close();
        }

        // 3) The processing results are cached in a Map
        Map<Object, Object> contextMap = context.getCacheMap();

        // 4) Write the result to HDFS ==> HDFS API
        Path output = new Path(properties.getProperty(Constants.OUTPUT_PATH));
        FSDataOutputStream out = fs.create(new Path(output,
                new Path(properties.getProperty(Constants.OUTPUT_FILE))));

        // Write the cached results from step 3 to the output stream
        Set<Map.Entry<Object, Object>> entries = contextMap.entrySet();
        for (Map.Entry<Object, Object> entry : entries) {
            out.write((entry.getKey().toString() + " \t " + entry.getValue() + "\n").getBytes());
        }

        out.close();
        fs.close();

        System.out.println("HDFS API word count finished successfully....");
    }
}
MAPPER_CLASS=com.imooc.bigdata.hadoop.hdfs.CaseIgnoreWordCountMapper
public static final String MAPPER_CLASS = "MAPPER_CLASS";
// Create the Mapper instance via reflection
Class<?> clazz = Class.forName(properties.getProperty(Constants.MAPPER_CLASS));
ImoocMapper mapper = (ImoocMapper)clazz.newInstance();