Block-device I/O is very slow, nowhere near memory or CPU speed. To reduce the number of accesses to a block device, the Linux file system provides an in-memory buffer cache. When data from a block device is needed, the cache is searched first: on a hit the buffer is returned immediately; on a miss the block is read into the cache and then copied to the user's buffer. The cached block does not disappear right away but stays in memory for a while, so repeated reads of the same file can be served directly from the cache. When no buffer blocks are available, the requesting process must sleep and wait to be explicitly woken. The cache has its own management policy: blocks unused for a long time can be handed to other processes, and dirty blocks must first be written back to disk. Write-back only happens under certain conditions, so before unplugging a device you should unmount it, which forces the write-back; otherwise data may be lost and the file system may be corrupted.
This article walks top-down through the Linux 0.11 source to explore the write mechanism for block-device files; for character devices, see the companion article "Linux 0.11字符设备的使用".
System calls such as sys_read and sys_write use the inode's mode field to identify the concrete file type and then call the type-specific read/write function. Linux 0.11 supports three kinds of block devices: the RAM disk, the hard disk, and the floppy. Their write function lives in fs/block_dev.c (p293, line 14). This article focuses on the write path rather than the read path, because a write must first read data into the buffer cache, then modify the buffer, and finally write the buffer back to disk at some later point; that process is fairly complete and complex, and once it is understood, reading the source of the read path poses no problem.
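As a quick orientation, this is where the dispatch happens in sys_write (fs/read_write.c); for a block-device file the device number is stored in i_zone[0]. The excerpt below is quoted from memory, so treat it as a sketch; block_write itself follows.

if (S_ISBLK(inode->i_mode))
    return block_write(inode->i_zone[0],&file->f_pos,buf,count);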
int block_write(int dev, long * pos, char * buf, int count)
{
    int block = *pos >> BLOCK_SIZE_BITS;
    int offset = *pos & (BLOCK_SIZE-1);
    int chars;
    int written = 0;
    struct buffer_head * bh;
    register char * p;

    while (count>0) {
        chars = BLOCK_SIZE - offset;
        if (chars > count)
            chars=count;
        if (chars == BLOCK_SIZE)
            bh = getblk(dev,block);
        else
            bh = breada(dev,block,block+1,block+2,-1);
        block++;
        if (!bh)
            return written?written:-EIO;
        p = offset + bh->b_data;
        offset = 0;
        *pos += chars;
        written += chars;
        count -= chars;
        while (chars-->0)
            *(p++) = get_fs_byte(buf++);
        bh->b_dirt = 1;
        brelse(bh);
    }
    return written;
}
First, BLOCK_SIZE_BITS and BLOCK_SIZE are defined in include/linux/fs.h (p394, line 49):
#define BLOCK_SIZE 1024
#define BLOCK_SIZE_BITS 10
Here the whole device is treated as one big file, and pos is an offset into that file; it counts from the very first block of the device, with no regard for the boot block or superblock. For a block device the basic unit of operation is the block, defined here as 1024 bytes, i.e. two 512-byte sectors. block_write maps pos to a block number block and an in-block offset offset, reads the block from disk into a buffer if needed, copies the user data into the buffer (overwriting what is there), and finally releases the buffer with brelse, which decrements the b_count reference count. Note the read-ahead: when a whole block is to be overwritten, getblk suffices, since there is no need to read the old contents; otherwise breada reads the block and also reads the two following blocks ahead of time, so that subsequent sequential accesses can find their blocks already in the cache.
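A small standalone demo of the pos-to-(block, offset) arithmetic, with made-up numbers:

#include <stdio.h>

/* Standalone demo (hypothetical numbers) of block_write's mapping from a
 * byte offset into the device to a block number and an in-block offset. */
int main(void)
{
    long pos   = 3000;              /* example byte offset into the device */
    int block  = pos >> 10;         /* BLOCK_SIZE_BITS = 10 -> block 2 */
    int offset = pos & (1024 - 1);  /* BLOCK_SIZE = 1024    -> offset 952 */
    int chars  = 1024 - offset;     /* bytes writable in this block: 72 */
    printf("block=%d offset=%d chars=%d\n", block, offset, chars);
    return 0;
}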
The getblk function is in fs/buffer.c (p247, line 206):
/*
 * Ok, this is getblk, and it isn't very clear, again to hinder
 * race-conditions. Most of the code is seldom used, (ie repeating),
 * so it should be much more efficient than it looks.
 *
 * The algorithm is changed: hopefully better, and an elusive bug
 * removed.
 */
#define BADNESS(bh) (((bh)->b_dirt<<1)+(bh)->b_lock)
struct buffer_head * getblk(int dev,int block)
{
    struct buffer_head * tmp, * bh;

repeat:
    if ((bh = get_hash_table(dev,block)))
        return bh;
    tmp = free_list;
    do {
        if (tmp->b_count)
            continue;
        if (!bh || BADNESS(tmp)<BADNESS(bh)) {
            bh = tmp;
            if (!BADNESS(tmp))
                break;
        }
/* and repeat until we find something good */
    } while ((tmp = tmp->b_next_free) != free_list);
    if (!bh) {
        sleep_on(&buffer_wait);
        goto repeat;
    }
    wait_on_buffer(bh);
    if (bh->b_count)
        goto repeat;
    while (bh->b_dirt) {
        sync_dev(bh->b_dev);
        wait_on_buffer(bh);
        if (bh->b_count)
            goto repeat;
    }
/* NOTE!! While we slept waiting for this block, somebody else might */
/* already have added "this" block to the cache. check it */
    if (find_buffer(dev,block))
        goto repeat;
/* OK, FINALLY we know that this buffer is the only one of it's kind, */
/* and that it's unused (b_count=0), unlocked (b_lock=0), and clean */
    bh->b_count=1;
    bh->b_dirt=0;
    bh->b_uptodate=0;
    remove_from_queues(bh);
    bh->b_dev=dev;
    bh->b_blocknr=block;
    insert_into_queues(bh);
    return bh;
}
This function first checks via get_hash_table whether the buffer block already exists; if so, it is returned directly. Otherwise it walks free_list looking for the least "bad" candidate (BADNESS combines the dirty and lock bits, so a clean, unlocked block is ideal). If every buffer on free_list is in use (b_count > 0), the process sleeps on the buffer_wait queue and starts over when woken. Otherwise it waits for the chosen buffer to be unlocked; note the many repeated checks here, which guard against race conditions: after each sleep the conditions must be verified again. If the block is unused but its dirty flag is set, sync_dev issues write requests for all dirty inodes and blocks of that device; the wait_on_buffer that follows sleeps because the buffer is locked while being written out. After the write-back, the code must also check whether the block has meanwhile appeared in the hash queue; if so it starts over, and if not it finally owns a clean buffer. The buffer is then removed from its old queues and inserted into the new ones, at the head of its hash chain and the tail of the free list, so that the now-existing block can be found quickly and survives in the cache the longest.
The hash function XORs the device number with the logical block number; each hash chain is a doubly linked list, and the free list is a doubly linked circular list. The block that getblk returns may already contain valid data, or it may be an unused block with no data; the caller must check b_uptodate and, if the block is not up to date, call ll_rw_block to read it in.
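For comparison, the plain one-block read path bread() in fs/buffer.c follows exactly that pattern, getblk first, then ll_rw_block only when the block is not up to date (quoted from the 0.11 source as I recall it):

struct buffer_head * bread(int dev,int block)
{
    struct buffer_head * bh;

    if (!(bh=getblk(dev,block)))
        panic("bread: getblk returned NULL\n");
    if (bh->b_uptodate)
        return bh;
    ll_rw_block(READ,bh);
    wait_on_buffer(bh);
    if (bh->b_uptodate)
        return bh;
    brelse(bh);
    return NULL;
}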
The hash function is defined as follows; the hash table has NR_HASH = 307 buffer_head head pointers, with collisions resolved by chaining.
#define _hashfn(dev,block) (((unsigned)(dev^block))%NR_HASH)
#define hash(dev,block) hash_table[_hashfn(dev,block)]
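A tiny standalone check of which bucket a (dev, block) pair lands in, with made-up values:

#include <stdio.h>

#define NR_HASH 307
#define _hashfn(dev,block) (((unsigned)(dev^block))%NR_HASH)

/* Hypothetical example: dev 0x300 (first hard disk), block 50 */
int main(void)
{
    printf("bucket=%u\n", _hashfn(0x300, 50));  /* (768^50)%307 = 818%307 = 204 */
    return 0;
}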
find_buffer first uses the hash function to locate the right hash chain, then walks the chain comparing (dev, block) to see whether the buffer block exists. Note that this function is static, so it is not used by other files.
static struct buffer_head * find_buffer(int dev, int block)
{
    struct buffer_head * tmp;

    for (tmp = hash(dev,block) ; tmp != NULL ; tmp = tmp->b_next)
        if (tmp->b_dev==dev && tmp->b_blocknr==block)
            return tmp;
    return NULL;
}
get_hash_table wraps find_buffer with race-condition handling: it first increments the reference count, then, if the buffer is locked, sleeps until it is released. Afterwards it must re-check that the buffer's device and block numbers have not been changed; only then does it return the buffer. The reason is that unlocking a buffer wakes every process sleeping on it, and those processes may then operate on the buffer concurrently and modify it. Multiple processes are allowed to share one buffer block, as long as it is not locked.
/*
 * Why like this, I hear you say... The reason is race-conditions.
 * As we don't lock buffers (unless we are reading them, that is),
 * something might happen to it while we sleep (ie a read-error
 * will force it bad). This shouldn't really happen currently, but
 * the code is ready.
 */
struct buffer_head * get_hash_table(int dev, int block)
{
    struct buffer_head * bh;

    for (;;) {
        if (!(bh=find_buffer(dev,block)))
            return NULL;
        bh->b_count++;
        wait_on_buffer(bh);
        if (bh->b_dev == dev && bh->b_blocknr == block)
            return bh;
        bh->b_count--;
    }
}
wait_on_buffer puts the current process to sleep while a buffer's data is being read, or while a dirty block is being written out before reuse. Note that when several processes request the same locked buffer, they form a sleep chain. This acts somewhat like a mutex whose granularity is one buffer block; a buffer is locked while it sits on a device's request list.
static inline void wait_on_buffer(struct buffer_head * bh)
{
    cli();
    while (bh->b_lock)
        sleep_on(&bh->b_wait);
    sti();
}
sync_dev writes back all dirty blocks and inodes of the given device by issuing write requests.
int sync_dev(int dev)
{
    int i;
    struct buffer_head * bh;

    bh = start_buffer;
    for (i=0 ; i<NR_BUFFERS ; i++,bh++) {
        if (bh->b_dev != dev)
            continue;
        wait_on_buffer(bh);
        if (bh->b_dev == dev && bh->b_dirt)
            ll_rw_block(WRITE,bh);
    }
    sync_inodes();
    bh = start_buffer;
    for (i=0 ; i<NR_BUFFERS ; i++,bh++) {
        if (bh->b_dev != dev)
            continue;
        wait_on_buffer(bh);
        if (bh->b_dev == dev && bh->b_dirt)
            ll_rw_block(WRITE,bh);
    }
    return 0;
}
Note that the buffers are scanned twice: the first pass writes out the data blocks that are already dirty, then sync_inodes copies dirty in-memory inodes into their buffer blocks (dirtying those buffers), and the second pass writes those out as well. sync_inodes is in fs/inode.c (p258, line 59); it writes all dirty in-memory inodes to their corresponding buffer blocks.
void sync_inodes(void)
{
    int i;
    struct m_inode * inode;

    inode = 0+inode_table;
    for(i=0 ; i<NR_INODE ; i++,inode++) {
        wait_on_inode(inode);
        if (inode->i_dirt && !inode->i_pipe)
            write_inode(inode);
    }
}
remove_from_queues unlinks bh from its hash chain and from the free list.
static inline void remove_from_queues(struct buffer_head * bh)
{
/* remove from hash-queue */
    if (bh->b_next)
        bh->b_next->b_prev = bh->b_prev;
    if (bh->b_prev)
        bh->b_prev->b_next = bh->b_next;
    if (hash(bh->b_dev,bh->b_blocknr) == bh)
        hash(bh->b_dev,bh->b_blocknr) = bh->b_next;
/* remove from free list */
    if (!(bh->b_prev_free) || !(bh->b_next_free))
        panic("Free block list corrupted");
    bh->b_prev_free->b_next_free = bh->b_next_free;
    bh->b_next_free->b_prev_free = bh->b_prev_free;
    if (free_list == bh)
        free_list = bh->b_next_free;
}
insert_into_queues inserts bh at the tail of the free list and at the head of its hash chain.
static inline void insert_into_queues(struct buffer_head * bh)
{
/* put at end of free list */
    bh->b_next_free = free_list;
    bh->b_prev_free = free_list->b_prev_free;
    free_list->b_prev_free->b_next_free = bh;
    free_list->b_prev_free = bh;
/* put the buffer in new hash-queue if it has a device */
    bh->b_prev = NULL;
    bh->b_next = NULL;
    if (!bh->b_dev)
        return;
    bh->b_next = hash(bh->b_dev,bh->b_blocknr);
    hash(bh->b_dev,bh->b_blocknr) = bh;
    bh->b_next->b_prev = bh;
}
Back in block_write, breada is the other way a buffer is obtained. It first gets the buffer for the requested block and checks whether it is already up to date, i.e. readable; if so, the data is already present in the hash queue. Otherwise it must issue a read request with ll_rw_block. Note a bug in the loop below: the bh passed to ll_rw_block(READA, ...) should be tmp. Read requests are also issued for the extra read-ahead blocks, but the function only waits for the buffer of first to be unlocked and verifies that its data has actually been read from disk, so breada only guarantees the (dev, first) block. This function is part of the interface the buffer cache provides.
/*
 * Ok, breada can be used as bread, but additionally to mark
 * other blocks for reading as well. End the argument list with
 * a negative number.
 */
struct buffer_head * breada(int dev,int first, ...)
{
    va_list args;
    struct buffer_head * bh, *tmp;

    va_start(args,first);
    if (!(bh=getblk(dev,first)))
        panic("bread: getblk returned NULL\n");
    if (!bh->b_uptodate)
        ll_rw_block(READ,bh);
    while ((first=va_arg(args,int))>=0) {
        tmp=getblk(dev,first);
        if (tmp) {
            if (!tmp->b_uptodate)
                ll_rw_block(READA,bh);  /* bug in the original source: should be tmp */
            tmp->b_count--;
        }
    }
    va_end(args);
    wait_on_buffer(bh);
    if (bh->b_uptodate)
        return bh;
    brelse(bh);
    return (NULL);
}
After block_write copies the user data into the buffer, it sets the dirty flag and calls brelse(bh). brelse decrements the reference count and then wakes the buffer_wait chain, whose processes are waiting for a currently unused buffer block.
void brelse(struct buffer_head * buf)
{
    if (!buf)
        return;
    wait_on_buffer(buf);
    if (!(buf->b_count--))
        panic("Trying to free free buffer");
    wake_up(&buffer_wait);
}
The ll_rw_block function is in kernel/blk_drv/ll_rw_blk.c (p153, line 145):
void ll_rw_block(int rw, struct buffer_head * bh)
{
    unsigned int major;

    if ((major=MAJOR(bh->b_dev)) >= NR_BLK_DEV ||
    !(blk_dev[major].request_fn)) {
        printk("Trying to read nonexistent block-device\n\r");
        return;
    }
    make_request(major,rw,bh);
}
Here rw indicates a read or write request, and bh carries the data (for a write) or receives it (for a read). The major device number is first checked for validity, and the request function must exist, i.e. there must be a driver; if both hold, make_request adds the request to the device's list.

make_request first checks for read-ahead or write-ahead. If the buffer is locked, an ahead request is simply dropped, since it is optional; otherwise it is converted into a plain read or write. The buffer is then locked; the matching unlock happens in the interrupt handler once the transfer completes. If it is a write but the buffer is not dirty, or a read but the buffer is already up to date, the function returns immediately.

Next it searches the request array for a free slot. Note that writes may only use the first two thirds of the array: with NR_REQUEST = 32, a read scans all 32 slots from the end, while a write starts scanning from slot 21, so the last third is reserved for reads, which take precedence.
static void make_request(int major,int rw, struct buffer_head * bh)
{
    struct request * req;
    int rw_ahead;

/*
 * WRITEA/READA is special case - it is not really needed, so if the
 * buffer is locked, we just forget about it, else it's a normal read
 */
    if ((rw_ahead = (rw == READA || rw == WRITEA))) {
        if (bh->b_lock)
            return;
        if (rw == READA)
            rw = READ;
        else
            rw = WRITE;
    }
    if (rw!=READ && rw!=WRITE)
        panic("Bad block dev command, must be R/W/RA/WA");
    lock_buffer(bh);
    if ((rw == WRITE && !bh->b_dirt) || (rw == READ && bh->b_uptodate)) {
        unlock_buffer(bh);
        return;
    }
repeat:
/* we don't allow the write-requests to fill up the queue completely:
 * we want some room for reads: they take precedence. The last third
 * of the requests are only for reads.
 */
    if (rw == READ)
        req = request+NR_REQUEST;
    else
        req = request+((NR_REQUEST*2)/3);
/* find an empty request */
    while (--req >= request)
        if (req->dev<0)
            break;
/* if none found, sleep on new requests: check for rw_ahead */
    if (req < request) {
        if (rw_ahead) {
            unlock_buffer(bh);
            return;
        }
        sleep_on(&wait_for_request);
        goto repeat;
    }
/* fill up the request-info, and add it to the queue */
    req->dev = bh->b_dev;
    req->cmd = rw;
    req->errors=0;
    req->sector = bh->b_blocknr<<1;
    req->nr_sectors = 2;
    req->buffer = bh->b_data;
    req->waiting = NULL;
    req->bh = bh;
    req->next = NULL;
    add_request(major+blk_dev,req);
}
Now look at lock_buffer, which simply acquires the buffer's lock; if another process already holds it, the caller sleeps.
static inline void lock_buffer(struct buffer_head * bh)
{
    cli();
    while (bh->b_lock)
        sleep_on(&bh->b_wait);
    bh->b_lock=1;
    sti();
}
If the device has no pending requests, add_request calls the request function directly; for the hard disk this is do_hd_request. Otherwise it walks the request list and inserts the new req using the elevator scheduling algorithm.

Note that req cannot be inserted at the head, because the head entry is the one currently being serviced. The idea behind the elevator is that moving the disk arm is expensive, so requests are serviced while sweeping in one direction, either inward or outward; if req happens to lie along the direction the head is already moving, it can be handled on the way, saving I/O time.
/*
 * add-request adds a request to the linked list.
 * It disables interrupts so that it can muck with the
 * request-lists in peace.
 */
static void add_request(struct blk_dev_struct * dev, struct request * req)
{
    struct request * tmp;

    req->next = NULL;
    cli();
    if (req->bh)
        req->bh->b_dirt = 0;
    if (!(tmp = dev->current_request)) {
        dev->current_request = req;
        sti();
        (dev->request_fn)();
        return;
    }
    for ( ; tmp->next ; tmp=tmp->next)
        if ((IN_ORDER(tmp,req) ||
            !IN_ORDER(tmp,tmp->next)) &&
            IN_ORDER(req,tmp->next))
            break;
    req->next=tmp->next;
    tmp->next=req;
    sti();
}
IN_ORDER is defined in kernel/blk_drv/blk.h (p134, line 35):
/*
* This is used in the elevator algorithm: Note that
* reads always go before writes. This is natural: reads
* are much more time-critical than writes.
*/
#define IN_ORDER(s1,s2) \
((s1)->cmd<(s2)->cmd || ((s1)->cmd==(s2)->cmd && \
((s1)->dev < (s2)->dev || ((s1)->dev == (s2)->dev && \
(s1)->sector < (s2)->sector))))
The macro means: read requests sort before write requests (READ = 0 < WRITE = 1); for the same command, the lower device number comes first, i.e. the lower partition; and for the same device, i.e. the same partition, the lower sector number comes first.
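A minimal standalone sketch, using a stripped-down struct request and made-up requests, showing the ordering the macro produces:

#include <stdio.h>

/* Stripped-down struct request plus the IN_ORDER macro from blk.h,
 * just to demonstrate the ordering; READ=0, WRITE=1 as in the kernel. */
#define READ  0
#define WRITE 1

struct request { int cmd; int dev; unsigned long sector; };

#define IN_ORDER(s1,s2) \
((s1)->cmd<(s2)->cmd || ((s1)->cmd==(s2)->cmd && \
((s1)->dev < (s2)->dev || ((s1)->dev == (s2)->dev && \
(s1)->sector < (s2)->sector))))

int main(void)
{
    struct request read_hi  = { READ,  0x301, 100 };
    struct request read_lo  = { READ,  0x301,  10 };
    struct request write_lo = { WRITE, 0x301,   5 };

    printf("%d\n", IN_ORDER(&read_lo, &read_hi));   /* 1: same cmd/dev, lower sector first */
    printf("%d\n", IN_ORDER(&read_hi, &write_lo));  /* 1: reads go before writes */
    printf("%d\n", IN_ORDER(&write_lo, &read_lo));  /* 0: a write never precedes a read */
    return 0;
}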
Now take a look at the blk.h file itself (in kernel/blk_drv/, p133):
#ifndef _BLK_H
#define _BLK_H

#define NR_BLK_DEV 7
/*
 * NR_REQUEST is the number of entries in the request-queue.
 * NOTE that writes may use only the low 2/3 of these: reads
 * take precedence.
 *
 * 32 seems to be a reasonable number: enough to get some benefit
 * from the elevator-mechanism, but not so much as to lock a lot
 * of buffers when they are in the queue. 64 seems to be too many
 * (easily long pauses in reading when heavy writing/syncing is
 * going on)
 */
#define NR_REQUEST 32

/*
 * Ok, this is an expanded form so that we can use the same
 * request for paging requests when that is implemented. In
 * paging, 'bh' is NULL, and 'waiting' is used to wait for
 * read/write completion.
 */
struct request {
    int dev;        /* -1 if no request */
    int cmd;        /* READ or WRITE */
    int errors;
    unsigned long sector;
    unsigned long nr_sectors;
    char * buffer;
    struct task_struct * waiting;
    struct buffer_head * bh;
    struct request * next;
};

/*
 * This is used in the elevator algorithm: Note that
 * reads always go before writes. This is natural: reads
 * are much more time-critical than writes.
 */
#define IN_ORDER(s1,s2) \
((s1)->cmd<(s2)->cmd || ((s1)->cmd==(s2)->cmd && \
((s1)->dev < (s2)->dev || ((s1)->dev == (s2)->dev && \
(s1)->sector < (s2)->sector))))

struct blk_dev_struct {
    void (*request_fn)(void);
    struct request * current_request;
};

extern struct blk_dev_struct blk_dev[NR_BLK_DEV];
extern struct request request[NR_REQUEST];
extern struct task_struct * wait_for_request;

#ifdef MAJOR_NR

/*
 * Add entries as needed. Currently the only block devices
 * supported are hard-disks and floppies.
 */

#if (MAJOR_NR == 1)
/* ram disk */
#define DEVICE_NAME "ramdisk"
#define DEVICE_REQUEST do_rd_request
#define DEVICE_NR(device) ((device) & 7)
#define DEVICE_ON(device)
#define DEVICE_OFF(device)

#elif (MAJOR_NR == 2)
/* floppy */
#define DEVICE_NAME "floppy"
#define DEVICE_INTR do_floppy
#define DEVICE_REQUEST do_fd_request
#define DEVICE_NR(device) ((device) & 3)
#define DEVICE_ON(device) floppy_on(DEVICE_NR(device))
#define DEVICE_OFF(device) floppy_off(DEVICE_NR(device))

#elif (MAJOR_NR == 3)
/* harddisk */
#define DEVICE_NAME "harddisk"
#define DEVICE_INTR do_hd
#define DEVICE_REQUEST do_hd_request
#define DEVICE_NR(device) (MINOR(device)/5)
#define DEVICE_ON(device)
#define DEVICE_OFF(device)

#elif 1
/* unknown blk device */
#error "unknown blk device"

#endif

#define CURRENT (blk_dev[MAJOR_NR].current_request)
#define CURRENT_DEV DEVICE_NR(CURRENT->dev)

#ifdef DEVICE_INTR
void (*DEVICE_INTR)(void) = NULL;
#endif
static void (DEVICE_REQUEST)(void);

static inline void unlock_buffer(struct buffer_head * bh)
{
    if (!bh->b_lock)
        printk(DEVICE_NAME ": free buffer being unlocked\n");
    bh->b_lock=0;
    wake_up(&bh->b_wait);
}

static inline void end_request(int uptodate)
{
    DEVICE_OFF(CURRENT->dev);
    if (CURRENT->bh) {
        CURRENT->bh->b_uptodate = uptodate;
        unlock_buffer(CURRENT->bh);
    }
    if (!uptodate) {
        printk(DEVICE_NAME " I/O error\n\r");
        printk("dev %04x, block %d\n\r",CURRENT->dev,
            CURRENT->bh->b_blocknr);
    }
    wake_up(&CURRENT->waiting);
    wake_up(&wait_for_request);
    CURRENT->dev = -1;
    CURRENT = CURRENT->next;
}

#define INIT_REQUEST \
repeat: \
    if (!CURRENT) \
        return; \
    if (MAJOR(CURRENT->dev) != MAJOR_NR) \
        panic(DEVICE_NAME ": request list destroyed"); \
    if (CURRENT->bh) { \
        if (!CURRENT->bh->b_lock) \
            panic(DEVICE_NAME ": block not locked"); \
    }

#endif

#endif
This file defines the blk_dev structure, which holds a device's request function (initialized in each device's init function; for the hard disk it is do_hd_request) and the head of its request list (initially NULL; the array itself is defined in ll_rw_blk.c); there are 7 such entries. It also defines the request array with 32 entries, where dev = -1 marks an unused request, and declares request functions for the three block devices. Clearly this header is meant to be included by each driver, which must define the macro MAJOR_NR, the major device number, beforehand to select its device. CURRENT is the head of the device's request list, and CURRENT_DEV is the drive number (0 or 1). The header also defines end_request, which runs when an interrupt completes a request: it releases the finished request (dev = -1), advances to the next entry in the list, and wakes wait_for_request and friends to signal that a request slot is free. Most importantly, it unlocks the buffer and wakes the processes waiting on that buffer block.
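This is why each driver defines MAJOR_NR before including the header; hd.c, for instance, begins essentially like this (excerpt, quoted from memory):

/* kernel/blk_drv/hd.c: select the hard-disk branch of blk.h */
#define MAJOR_NR 3
#include "blk.h"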
In add_request, if the current list head dev->current_request is empty, (dev->request_fn)() is called directly. This function is the key to starting the actual read or write; every device has its own request_fn. Here we take the hard disk as the example.
do_hd_request is in kernel/blk_drv/hd.c (p145, line 294):
void do_hd_request(void)
{
    int i,r = 0;
    unsigned int block,dev;
    unsigned int sec,head,cyl;
    unsigned int nsect;

    INIT_REQUEST;
    dev = MINOR(CURRENT->dev);
    block = CURRENT->sector;
    if (dev >= 5*NR_HD || block+2 > hd[dev].nr_sects) {
        end_request(0);
        goto repeat;
    }
    block += hd[dev].start_sect;
    dev /= 5;
    __asm__("divl %4":"=a" (block),"=d" (sec):"0" (block),"1" (0),
        "r" (hd_info[dev].sect));
    __asm__("divl %4":"=a" (cyl),"=d" (head):"0" (block),"1" (0),
        "r" (hd_info[dev].head));
    sec++;
    nsect = CURRENT->nr_sectors;
    if (reset) {
        reset = 0;
        recalibrate = 1;
        reset_hd(CURRENT_DEV);
        return;
    }
    if (recalibrate) {
        recalibrate = 0;
        hd_out(dev,hd_info[CURRENT_DEV].sect,0,0,0,
            WIN_RESTORE,&recal_intr);
        return;
    }
    if (CURRENT->cmd == WRITE) {
        hd_out(dev,nsect,sec,head,cyl,WIN_WRITE,&write_intr);
        for(i=0 ; i<3000 && !(r=inb_p(HD_STATUS)&DRQ_STAT) ; i++)
            /* nothing */ ;
        if (!r) {
            bad_rw_intr();
            goto repeat;
        }
        port_write(HD_DATA,CURRENT->buffer,256);
    } else if (CURRENT->cmd == READ) {
        hd_out(dev,nsect,sec,head,cyl,WIN_READ,&read_intr);
    } else
        panic("unknown hd-command");
}
do_hd_request first checks (via INIT_REQUEST) whether the hard-disk request list is empty, returning immediately if so. Otherwise it takes the minor device number of the head request, converts its starting sector into an absolute sector number (LBA) by adding the partition's start sector, and then converts that into sector, head, and cylinder numbers. For a write, it passes the concrete drive (first or second disk), the sector count, sector, head, and cylinder numbers, the write command, and the corresponding write-interrupt handler to hd_out, which loads these parameters into the disk's registers; it then polls for a while and writes one sector of data to the disk. For a read, passing the parameters to hd_out is all that is needed.
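The two inline divl instructions implement the LBA-to-CHS conversion; a plain-C equivalent (the helper name is invented here for illustration) would look like this:

/* Hypothetical C rendering of the two divl instructions in do_hd_request.
 * spt = sectors per track, heads = number of heads, both from hd_info[]. */
static void lba_to_chs(unsigned int lba, unsigned int spt, unsigned int heads,
                       unsigned int *cyl, unsigned int *head, unsigned int *sec)
{
    *sec  = lba % spt + 1;          /* controller sector numbers are 1-based */
    *head = (lba / spt) % heads;
    *cyl  = (lba / spt) / heads;
}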
hd_out mainly writes the relevant parameters into the disk's registers and sets the global interrupt handler do_hd, the function the next disk interrupt will call: do_hd = read_intr for a read, do_hd = write_intr for a write.
static void hd_out(unsigned int drive,unsigned int nsect,unsigned int sect,
        unsigned int head,unsigned int cyl,unsigned int cmd,
        void (*intr_addr)(void))
{
    register int port asm("dx");

    if (drive>1 || head>15)
        panic("Trying to write bad sector");
    if (!controller_ready())
        panic("HD controller not ready");
    do_hd = intr_addr;
    outb_p(hd_info[drive].ctl,HD_CMD);
    port=HD_DATA;
    outb_p(hd_info[drive].wpcom>>2,++port);
    outb_p(nsect,++port);
    outb_p(sect,++port);
    outb_p(cyl,++port);
    outb_p(cyl>>8,++port);
    outb_p(0xA0|(drive<<4)|head,++port);
    outb(cmd,++port);
}
Why does a disk interrupt end up calling do_hd? The answer is in how the hard-disk interrupt handler is installed:
void hd_init(void)
{
    blk_dev[MAJOR_NR].request_fn = DEVICE_REQUEST;  /* do_hd_request */
    set_intr_gate(0x2E,&hd_interrupt);              /* int 0x2E = IRQ 14 */
    outb_p(inb_p(0x21)&0xfb,0x21);  /* unmask IRQ 2 (cascade) on the master 8259A */
    outb(inb_p(0xA1)&0xbf,0xA1);    /* unmask IRQ 14 on the slave 8259A */
}
Clearly, hd_interrupt is installed above as the hard-disk interrupt entry. It is in kernel/system_call.s (p89, line 221):
hd_interrupt:
    pushl %eax
    pushl %ecx
    pushl %edx
    push %ds
    push %es
    push %fs
    movl $0x10,%eax
    mov %ax,%ds
    mov %ax,%es
    movl $0x17,%eax
    mov %ax,%fs
    movb $0x20,%al
    outb %al,$0xA0      # EOI to interrupt controller #1
    jmp 1f              # give port chance to breathe
1:  jmp 1f
1:  xorl %edx,%edx
    xchgl do_hd,%edx
    testl %edx,%edx
    jne 1f
    movl $unexpected_hd_interrupt,%edx
1:  outb %al,$0x20
    call *%edx          # "interesting" way of handling intr.
    pop %fs
    pop %es
    pop %ds
    popl %edx
    popl %ecx
    popl %eax
    iret
This code mainly sends the end-of-interrupt command to the 8259A controllers, then checks whether do_hd is set, clearing it atomically with xchgl; if it is set, do_hd is called, otherwise unexpected_hd_interrupt is.
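In C, the dispatch around do_hd amounts to roughly the following sketch (the helper name is invented; the real code does the fetch-and-clear with xchgl):

extern void (*do_hd)(void);             /* set by hd_out before each command */
extern void unexpected_hd_interrupt(void);

static void hd_dispatch(void)           /* hypothetical helper, for illustration */
{
    void (*handler)(void) = do_hd;      /* fetch-and-clear, xchgl in the real code */
    do_hd = NULL;
    if (!handler)
        handler = unexpected_hd_interrupt;
    handler();                          /* read_intr, write_intr, recal_intr, ... */
}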
Now look at read_intr and write_intr:
static void read_intr(void)
{
    if (win_result()) {
        bad_rw_intr();
        do_hd_request();
        return;
    }
    port_read(HD_DATA,CURRENT->buffer,256);
    CURRENT->errors = 0;
    CURRENT->buffer += 512;
    CURRENT->sector++;
    if (--CURRENT->nr_sectors) {
        do_hd = &read_intr;
        return;
    }
    end_request(1);
    do_hd_request();
}

static void write_intr(void)
{
    if (win_result()) {
        bad_rw_intr();
        do_hd_request();
        return;
    }
    if (--CURRENT->nr_sectors) {
        CURRENT->sector++;
        CURRENT->buffer += 512;
        do_hd = &write_intr;
        port_write(HD_DATA,CURRENT->buffer,256);
        return;
    }
    end_request(1);
    do_hd_request();
}
Both handlers decrement the remaining sector count, advance the request's starting sector, and move the buffer pointer forward by one sector's length. As long as sectors remain, they keep servicing the current request: a read uses port_read to copy the data from the disk into the buffer, while a write uses port_write to copy the next sector from the buffer into the disk's controller. No further command needs to be sent to the registers, because the original command already requested two sectors of data. Only when the current request is complete does end_request run and the next request get processed, rather like walking a linked list.
From the above analysis we can see that the file system works through the buffer cache, copying data from the cache into the user's data area, while the low-level driver reads disk data into the cache. With the cache as an intermediate layer, I/O efficiency improves, but dirty blocks must eventually be written back to disk, or the file system may be corrupted. There are three write-back paths: inside getblk when a dirty buffer is reclaimed, when the file system is unmounted (umount), and via the sys_sync system call.
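For reference, the third path, sys_sync in fs/buffer.c, simply pushes dirty inodes into their buffers and then queues every dirty buffer for writing (quoted from the 0.11 source as I recall it):

int sys_sync(void)
{
    int i;
    struct buffer_head * bh;

    sync_inodes();      /* write out inodes into buffers */
    bh = start_buffer;
    for (i=0 ; i<NR_BUFFERS ; i++,bh++) {
        wait_on_buffer(bh);
        if (bh->b_dirt)
            ll_rw_block(WRITE,bh);
    }
    return 0;
}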
References
Linux内核完全注释 (The Complete Annotation of the Linux Kernel), Zhao Jiong (赵炯)