
The Distributed File System HDFS

1. HDFS Overview

Hadoop Distributed File System (HDFS)

1) Distributed

2) Runs on cheap, general-purpose machines

3) Highly fault-tolerant

4) High throughput

5) Suited to large data sets

Like an ordinary file system, it has a directory structure that stores files and folders, and it offers the usual operations: create, modify, delete, view, move, and so on.

An ordinary file system is confined to a single machine.

A distributed file system spans multiple machines.
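To make this concrete, here is a minimal sketch of a client talking to HDFS through the Java FileSystem API: it connects to the cluster and lists the root directory. The cluster URI and the hadoop user are the ones used later in these notes; the class itself is an illustration, not the course's own code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ListRootSketch {

    public static void main(String[] args) throws Exception {
        // Connect to the NameNode of the (single-node) cluster used in this course
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // List the entries directly under "/", much like ls on a local file system
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println((status.isDirectory() ? "d " : "- ") + status.getPath());
        }

        fs.close();
    }
}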

2. HDFS Design Goals

Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending the content to the end of the files is supported but cannot be updated at arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.

“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

3. HDFS Architecture in Detail

1) NameNode and DataNodes

2) HDFS has a master/slave architecture.

3) NN (NameNode): An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

4) DN (DataNode): there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

5) HDFS exposes a file system namespace and allows user data to be stored in files.

6) Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes (this is the basis for fault tolerance).

7) The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
It also determines the mapping of blocks to DataNodes.

8) The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.


The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS).

HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.

Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines.

A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software.

The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

4. The File System Namespace

HDFS supports a traditional hierarchical file organization.

A user or an application can create directories and store files inside these directories.

The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file.

HDFS supports user quotas and access permissions.

HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

While HDFS follows naming convention of the FileSystem, some paths and names (e.g. /.reserved and .snapshot ) are reserved. Features such as transparent encryption and snapshot use reserved paths.

The NameNode maintains the file system namespace.

Any change to the file system namespace or its properties is recorded by the NameNode.

An application can specify the number of replicas of a file that should be maintained by HDFS.

The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
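The namespace operations described above map directly onto the FileSystem API. A rough sketch (the paths are made-up examples; the URI and user come from the environment used later in these notes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class NamespaceOpsSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // Create a directory (like mkdir -p); the NameNode records the namespace change
        fs.mkdirs(new Path("/hdfsapi/demo"));

        // Move/rename a file within the namespace
        fs.rename(new Path("/hdfsapi/demo/a.txt"), new Path("/hdfsapi/demo/b.txt"));

        // Remove a directory and everything under it (true = recursive)
        fs.delete(new Path("/hdfsapi/demo"), true);

        fs.close();
    }
}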

5. HDFS Replication

HDFS is designed to reliably store very large files across machines in a large cluster.

It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance.

The block size and replication factor are configurable per file.

All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size after the support for variable length block was added to append and hsync.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
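Because block size and replication factor are per-file settings, they can be supplied when a file is created, and the replication factor can be changed afterwards. A hedged sketch (path, buffer size, and values are arbitrary examples, not settings from the course):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class PerFileBlockSettingsSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // create(path, overwrite, bufferSize, replication, blockSize):
        // this file is written with replication factor 2 and a 64 MB block size
        FSDataOutputStream out = fs.create(new Path("/hdfsapi/demo/blocks.txt"),
                true, 4096, (short) 2, 64 * 1024 * 1024L);
        out.writeUTF("hello hdfs");
        out.close();

        // The replication factor can be raised (or lowered) later; the NameNode
        // schedules the extra copies (or deletions) on the DataNodes
        fs.setReplication(new Path("/hdfsapi/demo/blocks.txt"), (short) 3);

        fs.close();
    }
}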


6. The Linux Environment Used in This Course

On the Linux machine used in this course, the following directories are created:

mkdir software     # installation packages
mkdir app          # installed software
mkdir data         # data files
mkdir lib          # job jars
mkdir shell        # scripts
mkdir maven_resp   # Maven dependencies

Switch to the root user:    sudo -i
Switch to the hadoop user:  su hadoop

Check the Linux version:    uname -a

7. Hadoop Deployment Prerequisites

Hadoop version used: CDH, i.e. Apache Hadoop 2.6.0-cdh5.15.1

https://archive.cloudera.com/cdh5/cdh/5/

For learning, single-node installations are used (Hadoop, Hive, Spark).

JDK: Java 1.8

8. Deploying JDK 1.8

source .bash_profile    # reload the environment variables
java -version           # verify the JDK installation

Contents of ~/.bash_profile:

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
	. ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/.local/bin:$HOME/bin


export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH

export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH

export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.15.1
export PATH=$HIVE_HOME/bin:$PATH

export PATH


9. Setting Up Passwordless SSH Login

ll -a     # list all entries, including hidden ones such as ~/.ssh
ll -la

10. The Hadoop Installation Directory and hadoop-env Configuration


11. Formatting and Starting HDFS


12. Common HDFS Problems: Firewall Interference


Check the firewall state:  sudo firewall-cmd --state
Stop the firewall:         sudo systemctl stop firewalld.service

13. Stopping the Cluster and Starting Individual Daemons

Stop the cluster, or start single daemons on their own (in Hadoop 2.x this is typically sbin/stop-dfs.sh to stop HDFS, and sbin/hadoop-daemon.sh start namenode / datanode / secondarynamenode to start one process at a time).

14. Hadoop Command-Line Operations


15. A Deep Dive into How Hadoop Stores Files


16. HDFS API Programming: Setting Up the Development Environment

17. HDFS API Programming: Developing the First Application

18. HDFS API Programming: Wrapping the Setup with JUnit

19. HDFS API Programming: Viewing the Contents of an HDFS File
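A minimal sketch of this step, using the standard FileSystem and IOUtils calls (the file path is the test file used elsewhere in these notes; this is an illustration, not the course's original code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class CatFileSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // Open the HDFS file and copy its bytes to stdout, 1 KB at a time
        FSDataInputStream in = fs.open(new Path("/hdfsapi/test/hello.txt"));
        IOUtils.copyBytes(in, System.out, 1024, false);

        IOUtils.closeStream(in);
        fs.close();
    }
}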

20. HDFS API Programming: Creating a File and Writing to It

21. HDFS API Programming: A Closer Look at the Replication Factor

22. HDFS API Programming: Renaming Files

23. HDFS API Programming: copyFromLocalFile
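copyFromLocalFile uploads a file from the local file system into HDFS. A minimal sketch (both paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class UploadSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // Copy a local file into an HDFS directory
        fs.copyFromLocalFile(new Path("/home/hadoop/data/hello.txt"),
                new Path("/hdfsapi/test/"));

        fs.close();
    }
}

The inverse operation, copyToLocalFile, downloads an HDFS file to the local file system (section 25).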

24. HDFS API Programming: Uploading Large Files with a Progress Callback

25. HDFS API Programming: Downloading Files

26. HDFS API Programming: Listing the Contents of a Directory

27. HDFS API Programming: Recursively Listing All Files in a Directory

28. HDFS API Programming: Viewing File Block Information
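Block information can be inspected with getFileBlockLocations, which returns one BlockLocation per block together with the DataNodes that hold a replica. A sketch against the same test file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.util.Arrays;

public class BlockInfoSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        FileStatus status = fs.getFileStatus(new Path("/hdfsapi/test/hello.txt"));

        // Offset, length, and replica hosts for every block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + " " + block.getLength()
                    + " " + Arrays.toString(block.getHosts()));
        }

        fs.close();
    }
}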

29. HDFS API Programming: Deleting Files
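delete takes a recursive flag; with true, a directory and everything beneath it is removed. A sketch (the path is a hypothetical example; be careful with recursive deletes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class DeleteSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        // true = delete recursively; the return value indicates success
        boolean deleted = fs.delete(new Path("/hdfsapi/test/demo"), true);
        System.out.println("deleted: " + deleted);

        fs.close();
    }
}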

30. HDFS Project: Requirements Analysis

31. HDFS Project: Writing the Code Skeleton

package com.imooc.bigdata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Word count implemented with the HDFS API.
 *
 * Requirement: compute the word count of a file stored on HDFS and write the result back to HDFS.
 *
 * Breakdown:
 * 1) Read the file from HDFS ==> HDFS API
 * 2) Business logic (word counting): process every line of the file, splitting on a separator ==> Mapper
 * 3) Cache the intermediate results ==> Context
 * 4) Write the results back to HDFS ==> HDFS API
 *
 */
public class HDFSWCApp01 {

    public static void main(String[] args) throws Exception {

        // 1) Read the file from HDFS ==> HDFS API
        Path input = new Path("/hdfsapi/test/hello.txt");


        // Get a handle to the HDFS file system to operate on
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.233:8020"), new Configuration(), "hadoop");

        RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(input, false);

        ImoocMapper mapper = new WordCountMapper();
        ImoocContext context = new ImoocContext();

        while (iterator.hasNext()) {
            LocatedFileStatus file = iterator.next();
            FSDataInputStream in = fs.open(file.getPath());
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = reader.readLine()) != null) {

                // 2) Business logic (word counting), e.g. (hello, 3)
                // TODO... after processing a line, write the result into the cache
                mapper.map(line, context);
            }

            reader.close();
            in.close();
        }


        // TODO... 3) the intermediate results are cached in a Map

        Map<Object, Object> contextMap = context.getCacheMap();

        // 4) Write the results back to HDFS ==> HDFS API
        Path output = new Path("/hdfsapi/output/");

        FSDataOutputStream out = fs.create(new Path(output, new Path("wc.out")));

        // TODO... write the cached results from step 3 to the output stream
        Set<Map.Entry<Object, Object>> entries = contextMap.entrySet();
        for (Map.Entry<Object, Object> entry : entries) {
            out.write((entry.getKey().toString() + " \t " + entry.getValue() + "\n").getBytes());
        }

        out.close();
        fs.close();

        System.out.println("PK哥's HDFS API word count finished successfully....");

    }

}


32. HDFS Project: A Custom Context

package com.imooc.bigdata.hadoop.hdfs;

import java.util.HashMap;
import java.util.Map;

/**
 * Custom context; essentially just a cache.
 */
public class ImoocContext {

    private Map<Object, Object> cacheMap = new HashMap<Object, Object>();

    public Map<Object, Object> getCacheMap() {
        return cacheMap;
    }

    /**
     * Write a value into the cache.
     * @param key   the word
     * @param value the count
     */
    public void write(Object key, Object value) {
        cacheMap.put(key, value);
    }

    /**
     * Read a value from the cache.
     * @param key the word
     * @return the count associated with the word
     */
    public Object get(Object key) {
        return cacheMap.get(key);
    }


}


33. HDFS Project: Implementing the Custom Mapper

package com.imooc.bigdata.hadoop.hdfs;

/**
 * Custom Mapper interface.
 */
public interface ImoocMapper {

    /**
     *
     * @param line    a single line read from the input file
     * @param context the context / cache
     */
    void map(String line, ImoocContext context);
}

package com.imooc.bigdata.hadoop.hdfs;

/**
 * Word-count implementation of ImoocMapper.
 */
public class WordCountMapper implements ImoocMapper {

    public void map(String line, ImoocContext context) {
        String[] words = line.split("\t");

        for (String word : words) {
            Object value = context.get(word);
            if (value == null) { // the word has not been seen before
                context.write(word, 1);
            } else {
                int v = Integer.parseInt(value.toString());
                context.write(word, v + 1);  // increment the existing count
            }
        }

    }
}


34. HDFS Project: Implementing the Functionality

The completed HDFSWCApp01 is the same code already shown in section 31 above.

35. HDFS Project: Refactoring with a Custom Configuration File

The job parameters are moved out of the code into a properties file (loaded by ParamsUtils in the refactored app below):

INPUT_PATH=/hdfsapi/test/h.txt
OUTPUT_PATH=/hdfsapi/output/
OUTPUT_FILE=wc.out5
HDFS_URI=hdfs://192.168.199.233:8020
MAPPER_CLASS=com.imooc.bigdata.hadoop.hdfs.CaseIgnoreWordCountMapper
package com.imooc.bigdata.hadoop.hdfs;

/**
 * Constants: names of the configuration keys.
 */
public class Constants {

    public static final String INPUT_PATH = "INPUT_PATH";

    public static final String OUTPUT_PATH = "OUTPUT_PATH";

    public static final String OUTPUT_FILE = "OUTPUT_FILE";

    public static final String HDFS_URI = "HDFS_URI";

    public static final String MAPPER_CLASS = "MAPPER_CLASS";

}

package com.imooc.bigdata.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/**
 * Word count implemented with the HDFS API.
 *
 * Requirement: compute the word count of a file stored on HDFS and write the result back to HDFS.
 *
 * Breakdown:
 * 1) Read the file from HDFS ==> HDFS API
 * 2) Business logic (word counting): process every line of the file, splitting on a separator ==> Mapper
 * 3) Cache the intermediate results ==> Context
 * 4) Write the results back to HDFS ==> HDFS API
 *
 */
public class HDFSWCApp02 {

    public static void main(String[] args) throws Exception {

        // 1) Read the file from HDFS ==> HDFS API


        Properties properties = ParamsUtils.getProperties();
        Path input = new Path(properties.getProperty(Constants.INPUT_PATH));


        // Get a handle to the HDFS file system to operate on
        FileSystem fs = FileSystem.get(new URI(properties.getProperty(Constants.HDFS_URI)), new Configuration(), "hadoop");

        RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(input, false);

        // TODO... create the mapper object via reflection
        Class<?> clazz = Class.forName(properties.getProperty(Constants.MAPPER_CLASS));
        ImoocMapper mapper = (ImoocMapper) clazz.newInstance();


        //ImoocMapper mapper = new WordCountMapper();
        ImoocContext context = new ImoocContext();

        while (iterator.hasNext()) {
            LocatedFileStatus file = iterator.next();
            FSDataInputStream in = fs.open(file.getPath());
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));

            String line = "";
            while ((line = reader.readLine()) != null) {

                // 2) Business logic (word counting), e.g. (hello, 3)
                // TODO... after processing a line, write the result into the cache
                mapper.map(line, context);
            }

            reader.close();
            in.close();
        }


        // TODO... 3) the intermediate results are cached in a Map

        Map<Object, Object> contextMap = context.getCacheMap();

        // 4) Write the results back to HDFS ==> HDFS API
        Path output = new Path(properties.getProperty(Constants.OUTPUT_PATH));

        FSDataOutputStream out = fs.create(new Path(output, new Path(properties.getProperty(Constants.OUTPUT_FILE))));

        // TODO... write the cached results from step 3 to the output stream
        Set<Map.Entry<Object, Object>> entries = contextMap.entrySet();
        for (Map.Entry<Object, Object> entry : entries) {
            out.write((entry.getKey().toString() + " \t " + entry.getValue() + "\n").getBytes());
        }

        out.close();
        fs.close();

        System.out.println("PK哥's HDFS API word count finished successfully....");

    }

}


36. HDFS Project: Creating the Custom Mapper via Reflection

The mapper implementation is selected through the MAPPER_CLASS property and instantiated via reflection:

MAPPER_CLASS=com.imooc.bigdata.hadoop.hdfs.CaseIgnoreWordCountMapper

    public static final String MAPPER_CLASS = "MAPPER_CLASS";

        // Create the mapper object via reflection
        Class<?> clazz = Class.forName(properties.getProperty(Constants.MAPPER_CLASS));
        ImoocMapper mapper = (ImoocMapper) clazz.newInstance();

37. HDFS Project: Pluggable Business Logic

38. HDFS Replica Placement Policy

39. The HDFS Write Path, Illustrated

40. The HDFS Read Path, Illustrated

41. HDFS Checkpointing

42. HDFS Safemode
