Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.
管理网络中跨多台计算机存储的文件系统被称之为分布式文件系统。
因为是以网络为基础,也就引入了网络编程的复杂性,因此使得分布式文件系统比普通的磁盘文件系统更加复杂。
1) The Design of HDFS
a) HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS被设计成以流数据訪问模式来进行存储超大型文件。执行在商业硬件集群上。
b) HDFS is built around the idea that the most efficient data processing pattern is a writeonce,read-many-times pattern. A dataset is typically generated or copied from source,and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
HDFS建立在一次写入,多次读取这样一个最高效的数据处理模式的理念之上。数据集通常有数据源生成或者从数据源复制而来,接着在此数据集上进行长时间的数据分析操作。每一次分析都会涉及到一大部分数据,甚至整个数据集,因此读取整个数据集的时间比读取第一条记录的延迟更重要。
c) Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors)for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
hadoop不须要昂贵的、高可靠性的硬件,hadoop执行在商业硬件集群上(普通硬件能够从各种供应商来获得),因此整个集群节点发生问题的机会是非常高的。至少是对于大集群而言。
2) HDFS Concepts
a) A disk has a block size, which is the minimum amount of data that it can read or write.Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
每个磁盘都有一个默认的数据块大小,这是磁盘能够读写的最小数据量。单个磁盘文件管理系统构建于处理磁盘块数据之上,它的大小是磁盘块大小的整数倍。磁盘文件系统的块大小一般是几KB,然而磁盘块大小一般是512字节。
b) Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
跟单个磁盘文件系统不同的是,HDFS中比磁盘块小的文件不会沾满整个块的潜在存储空间。
c) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.
HDFS的块比磁盘块大,这样会最小化搜索代价。假设块足够大,那么从磁盘数据传输的时间就会比搜寻块開始位置的时间长得多,因此,传输由多个块组成的大文件取决于磁盘传输速率
d) Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Furthermore, blocks fit well with replication for providing fault tolerance and availability.
对分布式文件系统的块进行抽象能够带来几个优点 。
首先一个显而易见的是。一个文件的大小能够大于网络中不论什么一个磁盘的容量。
其次,用抽象块而不是整个文件能够使得存储子系统得到简化。最后,抽象块非常适合于备份,这样能够提高容错性和有用性。
e) An HDFS cluster has two types of nodes operating in a master−worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace.
在主机-从机模式下工作的HDFS集群有两种类型的节点能够操作:一个namenode(主机上)和若干个datanode(从机上)。namenode管理整个文件系统的命名空间。
f) Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
datanode是文件系统的直接工作点。它们存储和检索数据块(受client或者namenode通知),而且周期性的向namenode报告它们所存储的块列表信息。
g) For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata.
基于这个原因。确保namenode对故障的弹性机制非常重要,为此,hadoop提供了两种机制。第一种机制是备份那些由文件系统元数据持久状态组成的文件。
h) It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
还有一种可能的方法是执行一个辅助的namenode,虽然它不会被用作namenode。
它的主要角色是通过可编辑的日志周期性的融合命名空间镜像,以防止可编辑日志过大。
i) However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
然后,辅助namenode中的状态总是滞后于主节点,因此,在主节点的整个故障事件中。数据丢失差点儿是肯定的。在这样的情况下。通常的做法是把存储在NFS上的元数据文件复制到辅助namenode中,而且作为一个新的主节点namenode来执行。
j) Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
通常一个节点会从磁盘中读取块数据,可是对已频繁訪问的文件,其块数据可能会缓存在节点的内存中。一个非堆形式的块缓存。
k) HDFS federation,introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
HDFS中的federation,是在2.X系列公布中引进的,它同意一个集群通过添加namenode节点来扩展。每个namenode节点管理文件系统命名空间中的一部分。
l) Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:
hadoop 2 通过添加对HA的支持纠正了这样的情况。在这样的实现方式中,将会有2个namenode实现双机热备份。
在发生主活动节点故障的时候,备份主节点就能够在不发生明显的中断的情况下接管继续响应client请求的责任。下面这些结构性的变化是同意发生的:
m) The namenodes must use highly available shared storage to share the edit log. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk. Clients must be configured to handle namenode failover, using a mechanism that is transparent to users. The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
namenode必须使用高有用性的共享存储来实现可编辑日志的共享,因为块映射信息是存储在namenode的内存中,而不是磁盘上,所以datanode必须发送块信息报告至双机热备份的namenode,client必须进行故障切换的操作配置。这个能够通过一个对用户透明的机制来实现。辅助节点的角色通过备份被包括进来,其含有活动主节点命名空间周期性检查点信息。
n) There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
对于高有用性共享存储有两种选择:NFS文件。QJM(quorum journal manager)。QJM专注于HDFS的实现。其唯一目的就是提供一个高有用性的可编辑日志。也是大多是HDFS安装时所推荐的。
o) If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.
假设活动namenode发生问题 。备份节点会迅速接管任务(在数秒内)。因为在内存中备份节点有最新的可用状态,包括最新的可编辑日志记录和块映射信息。
p) The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active.Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures and trigger a failover should a namenode fail.
从活动主节点到备份节点的故障切换是由系统中一个新的实体——故障切换控制器来管理的。虽然有多种版本号的故障切换控制器。可是hadoop默认的是ZooKeeper,它也可确保唯独一个namenode是处于活动状态。每个namenode节点上都执行一个轻量级的故障切换控制器进程。它的任务就是去监控namenode的故障。一旦namenode发生问题,它就会触发故障切换。
q) The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption — a method known as fencing.
HA的实现会竭尽全力的去确保之前的活动主节点不会做出不论什么导致故障的有害举动。这种方法就是fencing。
r) The QJM only allows one namenode to write to the edit log at one time; however, it is still possible for the previously active namenode to serve stale read requests to clients, so setting up an SSH fencing command that will kill the namenode’s process is a good idea.
QJM只同意一个namenode在同一时刻进行可编辑日志的写入操作。然而,对已之前的活动节点来说,响应来自client的陈旧的读取请求服务是可能的。因此建立一个能够杀死namenode进程的fencing命令式一个好方法。
3) The Command-Line Interface
a) You can type hadoop fs -help to get detailed help on every command.
你能够在每个命令上使用hadoop fs –help来获得具体的帮助。
b) Let’s copy the file back to the local filesystem and check whether it’s the same:
- % hadoop fs -copyToLocal quangle.txt quangle.copy.txt
- % md5 input/docs/quangle.txt quangle.copy.txt
- MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
- MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2
- 1
- 2
- 3
- 4
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
让我们来拷贝这个文件到本地,检查它们是否是一样的文件。
MD5是一样的,表明这个文件传输到了HDFS,而且原封不动的回来了。
4) Hadoop Filesystems
a) Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations.
hadoop对于文件系统有一个抽象的概念,HDFS仅是当中一个实现。Java的抽象类org.apache.hadoop.fs.FileSystem定义了client到文件系统之间的接口,而且该抽象类还有几个具体的实现。
b) Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API.
hadoop是用Java编写的,因此大多数hadoop文件系统的交互通过Java API来进行的。
c) By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible.
把文件系统的接口作为一个Java API。会让非Java应用程序訪问HDFS时有些麻烦。通过WebHDFS协议实现的HTTP REST API能够非常easy的让其它语言与HDFS进行交互。注意,HTTP接口比本地Javaclient要慢,因此,有可能的话,应该避免使用该接口进行大文件数据传输。