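The pdflush kernel thread pool is the process-context working environment that Linux creates for writing back filesystem data. Its implementation is quite compact: the whole file is under 250 lines.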

  1 /*
  2  * mm/pdflush.c - worker threads for writing back filesystem data
  3  *
  4  * Copyright (C) 2002, Linus Torvalds.
  5  *
  6  * 09Apr2002    akpm@zip.com.au
  7  *      Initial version
  8  * 29Feb2004    kaos@sgi.com
  9  *      Move worker thread creation to kthread to avoid chewing
 10  *      up stack space with nested calls to kernel_thread.
 11  */

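The file header holds the copyright notice and the main change log. kaos@sgi.com moved the creation of the worker threads over to kthread, to avoid chewing up stack space with nested calls to kernel_thread. The effect of this change is also visible in the output of ps: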

root         5     1     5  0    1 21:31 ?        00:00:00 [kthread]
root       114     5   114  0    1 21:31 ?        00:00:00 [pdflush]
root       115     5   115  0    1 21:31 ?        00:00:00 [pdflush]

The parent of every pdflush kernel thread is the kthread process (PID 5).

 13 #include <linux/sched.h>
 14 #include <linux/list.h>
 15 #include <linux/signal.h>
 16 #include <linux/spinlock.h>
 17 #include <linux/gfp.h>
 18 #include <linux/init.h>
 19 #include <linux/module.h>
 20 #include <linux/fs.h>           // Needed by writeback.h
 21 #include <linux/writeback.h>    // Prototypes pdflush_operation()
 22 #include <linux/kthread.h>
 23 #include <linux/cpuset.h>

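These are the necessary header files. One thing I do not like much: although C++-style line comments have made their way into C, seeing them in kernel code still feels uncomfortable. Perhaps I am being too picky. There is nothing actually wrong with them, and I probably need to move with the times.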

 26 /*
 27  * Minimum and maximum number of pdflush instances
 28  */
 29 #define MIN_PDFLUSH_THREADS 2
 30 #define MAX_PDFLUSH_THREADS 8
 31
 32 static void start_one_pdflush_thread(void);

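Lines 29 and 30 define the minimum and maximum number of pdflush kernel thread instances, 2 and 8 respectively. The minimum keeps the latency of an operation low; the maximum keeps an excess of threads from degrading system performance. The maximum is not strictly enforced, though, and we will come back to it when analyzing the race conditions below.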

 35 /*
 36  * The pdflush threads are worker threads for writing back dirty data.
 37  * Ideally, we'd like one thread per active disk spindle.  But the disk
 38  * topology is very hard to divine at this level.   Instead, we take
 39  * care in various places to prevent more than one pdflush thread from
 40  * performing writeback against a single filesystem.  pdflush threads
 41  * have the PF_FLUSHER flag set in current->flags to aid in this.
 42  */

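This comment is a brief explanation of the pdflush thread pool. In short: the pdflush threads are worker threads that write back dirty data. Ideally there would be one thread per active disk spindle, but the disk topology is hard to determine at this level, so instead care is taken in various places to keep more than one pdflush thread from performing writeback against a single filesystem. The PF_FLUSHER flag in current->flags helps with this.
Clearly the kernel developers are rather "stingy" where efficiency is concerned and think things through thoroughly. They are just as careful about layering, never daring to step across the boundary between layers.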

 44 /*
 45  * All the pdflush threads.  Protected by pdflush_lock
 46  */
 47 static LIST_HEAD(pdflush_list);
 48 static DEFINE_SPINLOCK(pdflush_lock);
 49
 50 /*
 51  * The count of currently-running pdflush threads.  Protected
 52  * by pdflush_lock.
 53  *
 54  * Readable by sysctl, but not writable.  Published to userspace at
 55  * /proc/sys/vm/nr_pdflush_threads.
 56  */
 57 int nr_pdflush_threads = 0;
 58
 59 /*
 60  * The time at which the pdflush thread pool last went empty
 61  */
 62 static unsigned long last_empty_jifs;

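These are the necessary global variables. To avoid polluting the kernel namespace, every variable that does not need to be exported is declared static, which limits its scope to this compilation unit (the current pdflush.c file). All idle pdflush threads are linked on the doubly linked list pdflush_list. The variable nr_pdflush_threads counts the current pdflush threads, both active and idle. last_empty_jifs records the jiffies time at which the pool last went empty (that is, had no thread available). Every operation on the pool that needs mutual exclusion is protected by the spinlock pdflush_lock.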

 64 /*
 65  * The pdflush thread.
 66  *
 67  * Thread pool management algorithm:
 68  *
 69  * - The minimum and maximum number of pdflush instances are bound
 70  *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
 71  *
 72  * - If there have been no idle pdflush instances for 1 second, create
 73  *   a new one.
 74  *
 75  * - If the least-recently-went-to-sleep pdflush thread has been asleep
 76  *   for more than one second, terminate a thread.
 77  */

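Yet another long comment. I do not know whether you are tired of them, but I am getting a little weary myself. I only meant to say a few words about the races in this file and did not expect to drag in so much. The comment above describes the pool management algorithm:

- The number of pdflush thread instances stays between MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
- If the pool has had no idle thread for 1 second, create a new thread.
- If the thread that went to sleep earliest has been asleep for more than 1 second, terminate one thread instance.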
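
The three rules above can be sketched as two pure decision functions. This is a minimal userspace sketch under stated assumptions, not the kernel code: `struct pool_state`, `should_create` and `should_exit` are hypothetical names, and time is counted in abstract ticks with `TICKS_PER_SEC` standing in for HZ.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_THREADS   2        /* mirrors MIN_PDFLUSH_THREADS */
#define MAX_THREADS   8        /* mirrors MAX_PDFLUSH_THREADS */
#define TICKS_PER_SEC 100UL    /* assumed tick rate, standing in for HZ */

/* Hypothetical snapshot of the pool; not a kernel structure. */
struct pool_state {
    int nr_threads;              /* active + idle workers */
    int nr_idle;                 /* workers waiting for work */
    unsigned long last_empty;    /* tick at which the idle list last went empty */
    unsigned long oldest_sleep;  /* tick at which the longest-idle worker slept */
};

/* Rule 2: no idle worker for more than one second, and room to grow. */
static bool should_create(const struct pool_state *p, unsigned long now)
{
    assert(p->nr_idle >= 0 && p->nr_idle <= p->nr_threads);
    return p->nr_idle == 0 &&
           now - p->last_empty > TICKS_PER_SEC &&
           p->nr_threads < MAX_THREADS;
}

/* Rule 3: the longest-idle worker slept for more than one second,
 * and the pool is above its minimum size. */
static bool should_exit(const struct pool_state *p, unsigned long now)
{
    assert(p->nr_idle >= 0 && p->nr_idle <= p->nr_threads);
    return p->nr_idle > 0 &&
           p->nr_threads > MIN_THREADS &&
           now - p->oldest_sleep > TICKS_PER_SEC;
}
```

Both functions only read a snapshot. In the real file the same tests are spread through the worker loop, and whether they run with or without the lock held matters, as the race discussion later in the article shows.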

 79 /*
 80  * A structure for passing work to a pdflush thread.  Also for passing
 81  * state information between pdflush threads.  Protected by pdflush_lock.
 82  */
 83 struct pdflush_work {
 84         struct task_struct *who;        /* The thread */
 85         void (*fn)(unsigned long);      /* A callback function */
 86         unsigned long arg0;             /* An argument to the callback */
 87         struct list_head list;          /* On pdflush_list, when idle */
 88         unsigned long when_i_went_to_sleep;
 89 };

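This is the node structure kept for each thread instance. It is simple enough to need no further comment.
Having gone through the basic data structures and variables, we now start the analysis from the module_init entry point: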

232 static int __init pdflush_init(void)
233 {
234         int i;
235
236         for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
237                 start_one_pdflush_thread();
238         return 0;
239 }
240
241 module_init(pdflush_init);

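This creates MIN_PDFLUSH_THREADS pdflush thread instances. Note that only module_init() is defined here, with no module_exit(). The implication is that even if this code were built as a kernel module, it could be loaded but never removed. See the implementation of sys_delete_module():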
File: kernel/module.c

 609     /* If it has an init func, it must have an exit func to unload */
 610     if ((mod->init != NULL && mod->exit == NULL)
 611         || mod->unsafe) {
 612         forced = try_force(flags);
 613         if (!forced) {
 614             /* This module can't be removed */
 615             ret = -EBUSY;
 616             goto out;
 617         }
 618     }

 498 #ifdef CONFIG_MODULE_FORCE_UNLOAD
 499 static inline int try_force(unsigned int flags)
 500 {
 501     int ret = (flags & O_TRUNC);
 502     if (ret)
 503         add_taint(TAINT_FORCED_MODULE);
 504     return ret;
 505 }
 506 #else
 507 static inline int try_force(unsigned int flags)
 508 {
 509     return 0;
 510 }
 511 #endif /* CONFIG_MODULE_FORCE_UNLOAD */

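
The check in the two excerpts above can be restated as a single predicate. The following is a userspace sketch only: `struct fake_module`, `can_unload`, `dummy_init` and `dummy_exit` are hypothetical names, and the two boolean parameters stand in for CONFIG_MODULE_FORCE_UNLOAD and for the O_TRUNC bit in `flags`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct module fields used by the check. */
struct fake_module {
    int  (*init)(void);
    void (*exit)(void);
    bool unsafe;
};

/*
 * Returns true if the module may be unloaded.
 * force_built:     models CONFIG_MODULE_FORCE_UNLOAD being enabled.
 * force_requested: models the caller passing O_TRUNC in flags.
 */
static bool can_unload(const struct fake_module *mod,
                       bool force_built, bool force_requested)
{
    assert(mod != NULL);

    /* An init function without an exit function, or an unsafe module,
     * can only be removed by force. */
    bool needs_force = (mod->init != NULL && mod->exit == NULL) || mod->unsafe;
    if (!needs_force)
        return true;
    return force_built && force_requested;
}

/* Placeholder callbacks used to populate a fake_module. */
static int dummy_init(void) { return 0; }
static void dummy_exit(void) { }
```

A module shaped like pdflush, with `init` set and `exit` left NULL, passes this check only when both booleans are true. The sketch leaves out the tainting side effect of the real `try_force()`.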
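As this shows, such a module cannot be unloaded unless the kernel was built with forced module unloading (note: this option is dangerous, do not try it) and the caller explicitly asks for a forced removal. Back to pdflush: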

227 static void start_one_pdflush_thread(void)
228 {
229         kthread_run(pdflush, NULL, "pdflush");
230 }

kthread_run creates the pdflush kernel thread instance with the help of the kthread helper thread:

164 /*
165  * Of course, my_work wants to be just a local in __pdflush().  It is
166  * separated out in this manner to hopefully prevent the compiler from
167  * performing unfortunate optimisations against the auto variables.  Because
168  * these are visible to other tasks and CPUs.  (No problem has actually
169  * been observed.  This is just paranoia).
170  */
This comment is interesting. my_work is visible to other tasks and CPUs, so to keep the compiler from applying unfortunate optimizations to it (keeping it in registers, for example), the flow is split into pdflush() wrapping __pdflush(). Compared with dynamically allocated memory, a local variable is also the better choice in both space and time.
171 static int pdflush(void *dummy)
172 {
173         struct pdflush_work my_work;
174         cpumask_t cpus_allowed;
175
176         /*
177          * pdflush can spend a lot of time doing encryption via dm-crypt.  We
178          * don't want to do that at keventd's priority.
179          */
180         set_user_nice(current, 0);
Adjust the priority: the nice value is reset to 0 so that potentially long-running work does not run at keventd's priority, which improves overall system responsiveness.
181
182         /*
183          * Some configs put our parent kthread in a limited cpuset,
184          * which kthread() overrides, forcing cpus_allowed == CPU_MASK_ALL.
185          * Our needs are more modest - cut back to our cpusets cpus_allowed.
186          * This is needed as pdflush's are dynamically created and destroyed.
187          * The boottime pdflush's are easily placed w/o these 2 lines.
188          */
189         cpus_allowed = cpuset_cpus_allowed(current);
190         set_cpus_allowed(current, cpus_allowed);
Set the mask of CPUs this thread is allowed to run on.
191
192         return __pdflush(&my_work);
193 }


 91 static int __pdflush(struct pdflush_work *my_work)
 92 {
 93         current->flags |= PF_FLUSHER;
 94         my_work->fn = NULL;
 95         my_work->who = current;
 96         INIT_LIST_HEAD(&my_work->list);
Some initialization.
 97
 98         spin_lock_irq(&pdflush_lock);
nr_pdflush_threads and pdflush_list are about to be modified, so a lock is required. To be safe, hard interrupts are disabled at the same time (work may be handed to pdflush from hard-interrupt context).
 99         nr_pdflush_threads++;
Increment nr_pdflush_threads, since there is now one more pdflush kernel thread instance.
100         for ( ; ; ) {
101                 struct pdflush_work *pdf;
102
103                 set_current_state(TASK_INTERRUPTIBLE);
104                 list_move(&my_work->list, &pdflush_list);
105                 my_work->when_i_went_to_sleep = jiffies;
106                 spin_unlock_irq(&pdflush_lock);
107
108                 schedule();
Put itself on the idle list pdflush_list, then give up the CPU and wait to be woken.
109                 if (try_to_freeze()) {
110                         spin_lock_irq(&pdflush_lock);
111                         continue;
112                 }
If the thread was frozen, re-take the lock and go around the loop again.
113
114                 spin_lock_irq(&pdflush_lock);
115                 if (!list_empty(&my_work->list)) {
116                         printk("pdflush: bogus wakeup!\n");
117                         my_work->fn = NULL;
118                         continue;
119                 }
120                 if (my_work->fn == NULL) {
121                         printk("pdflush: NULL work function\n");
122                         continue;
123                 }
124                 spin_unlock_irq(&pdflush_lock);
The code above handles unexpected wakeups.
125
126                 (*my_work->fn)(my_work->arg0);
127
Run the work function with the argument arg0.
128                 /*
129                  * Thread creation: For how long have there been zero
130                  * available threads?
131                  */
132                 if (jiffies - last_empty_jifs > 1 * HZ) {
133                         /* unlocked list_empty() test is OK here */
134                         if (list_empty(&pdflush_list)) {
135                                 /* unlocked test is OK here */
136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
137                                         start_one_pdflush_thread();
138                         }
139                 }
If pdflush_list has been empty for more than 1 second and the thread count still has room to grow, start a new pdflush thread instance.
140
141                 spin_lock_irq(&pdflush_lock);
142                 my_work->fn = NULL;
143
144                 /*
145                  * Thread destruction: For how long has the sleepiest
146                  * thread slept?
147                  */
148                 if (list_empty(&pdflush_list))
149                         continue;
If pdflush_list is still empty, continue the loop.
150                 if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
151                         continue;
If the thread count is not above the minimum, continue the loop.
152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
154                         /* Limit exit rate */
155                         pdf->when_i_went_to_sleep = jiffies;
156                         break;                                  /* exeunt */
157                 }
If the last kernel thread on pdflush_list has been asleep for more than 1 second, the system has probably become fairly idle, so this thread exits. Why the last one? Because the list is used as a stack, so the element at the bottom of the stack is necessarily the oldest.
158         }
159         nr_pdflush_threads--;
160         spin_unlock_irq(&pdflush_lock);
161         return 0;
Decrement nr_pdflush_threads and exit this thread.
162 }

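
The remark that the idle list behaves as a stack can be checked with a small userspace model. This is a sketch under stated assumptions: `struct list_head` is re-implemented here in simplified form rather than taken from the kernel headers, and `struct worker`, `list_push_front`, `list_unlink`, `newest_idle` and `oldest_idle` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly linked list modelled on the kernel's
 * struct list_head; this is not the kernel header itself. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

/* Insert right after the head, which is where a worker parks itself. */
static void list_push_front(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

/* Unlink a node and leave it self-linked, in the spirit of list_del_init(). */
static void list_unlink(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

#define entry_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical idle-worker record holding only what this sketch needs. */
struct worker {
    int id;
    unsigned long went_to_sleep;
    struct list_head list;
};

/* Most recently parked worker: the one a dispatcher wakes first. */
static struct worker *newest_idle(struct list_head *idle)
{
    assert(!list_is_empty(idle));
    return entry_of(idle->next, struct worker, list);
}

/* Longest-idle worker: the one examined by the exit rule. */
static struct worker *oldest_idle(struct list_head *idle)
{
    assert(!list_is_empty(idle));
    return entry_of(idle->prev, struct worker, list);
}
```

Workers that park at the head and are dispatched from the head give last-in, first-out order, so the tail entry (`idle->prev`) is always the one that has slept longest.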
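Has something been left out? Indeed: the SIGCHLD signal does not seem to be handled. In fact, threads created through kthread clean up after themselves. The parent never needs to wait for them, and no zombie processes are produced. See: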
File: kernel/workqueue.c

 200     /* SIG_IGN makes children autoreap: see do_notify_parent(). */
 201     sa.sa.sa_handler = SIG_IGN;
 202     sa.sa.sa_flags = 0;
 203     siginitset(&sa.sa.sa_mask, sigmask(SIGCHLD));
 204     do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0);

The sigaction manual page describes the "consequences" of ignoring SIGCHLD in detail:

       POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN.
       POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can
       be used to prevent the creation of zombies (see wait(2)).
       Nevertheless, the historical BSD and System V behaviours for
       ignoring SIGCHLD differ, so that the only completely portable
       method of ensuring that terminated children do not become zombies
       is to catch the SIGCHLD signal and perform a wait(2) or similar.

Linux undoubtedly follows the newer POSIX standard here, which gives us an "easy" way to avoid zombie processes. Note, however, that this technique is not portable.
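
The same behaviour can be observed from user space. The sketch below assumes a Linux system that follows the POSIX.1-2001 semantics quoted above; `child_is_autoreaped` is a hypothetical helper name, and the function changes the calling process's SIGCHLD disposition as a side effect.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ignore SIGCHLD, fork one child that exits immediately, then call wait().
 * Returns 1 if wait() failed with ECHILD (the child was reaped automatically),
 * 0 if wait() handed back a child, and -1 if the setup itself failed.
 */
static int child_is_autoreaped(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                /* child: terminate right away */
    assert(pid > 0);             /* only the parent continues from here */

    /* With SIGCHLD ignored, wait() blocks until the child is gone and
     * then fails with ECHILD because nothing is left to reap. */
    errno = 0;
    pid_t reaped = wait(NULL);
    if (reaped == -1 && errno == ECHILD)
        return 1;
    return 0;
}
```

On systems with the older BSD or System V behaviour the result may differ, which is exactly the portability caveat in the manual page.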
Now go back to the function __pdflush() once more, this time looking at the races in it:

135                                 /* unlocked test is OK here */
136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
137                                         start_one_pdflush_thread();

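Testing the thread count without holding the lock cannot corrupt any data. But if several threads evaluate nr_pdflush_threads concurrently, all conclude that there is still room to grow, and all call start_one_pdflush_thread() to create a new pdflush thread instance, the number of threads can exceed MAX_PDFLUSH_THREADS. In the worst case it may reach roughly twice that value.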
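
The worst case can be illustrated with a deterministic model of the interleaving rather than with real threads. This is a sketch, not kernel code: `unlocked_check_then_spawn` and `locked_check_and_reserve` are hypothetical names, and the second function shows only one possible remedy, which the original file does not use.

```c
#include <assert.h>

#define MAX_THREADS 8   /* mirrors MAX_PDFLUSH_THREADS */

/*
 * Worst-case interleaving of "check, then spawn" without a lock:
 * all `checkers` workers read the counter before any newly created
 * thread has incremented it, so each of them sees room to grow.
 * Returns the resulting thread count.
 */
static int unlocked_check_then_spawn(int nr_threads, int checkers)
{
    int approved = 0;

    assert(checkers >= 0 && checkers <= nr_threads);

    for (int i = 0; i < checkers; i++)      /* phase 1: everyone checks */
        if (nr_threads < MAX_THREADS)
            approved++;

    return nr_threads + approved;           /* phase 2: everyone spawns */
}

/*
 * Same scenario, but the check and the reservation of a slot happen as
 * one step, as they would under a lock, so each later checker sees the
 * updates made by the earlier ones.
 */
static int locked_check_and_reserve(int nr_threads, int checkers)
{
    assert(checkers >= 0 && checkers <= nr_threads);

    for (int i = 0; i < checkers; i++)
        if (nr_threads < MAX_THREADS)
            nr_threads++;

    return nr_threads;
}
```

With 7 busy workers all checking at once, `unlocked_check_then_spawn(7, 7)` yields 14, while `locked_check_and_reserve(7, 7)` stops at 8.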
Now look at the lines that follow:

152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
154                         /* Limit exit rate */
155                         pdf->when_i_went_to_sleep = jiffies;
156                         break;                                  /* exeunt */
157                 }

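Consider a sudden burst of requests that all finish at the same moment. On the way out, no thread satisfies the test on line 153, so they all go to sleep. If no new request is triggered during the next n seconds, the number of pdflush kernel threads stays at its peak for those n seconds, because a sleeping thread only re-evaluates the test after it has been woken to do work. This violates rule 3 of the design.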
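
The burst scenario can be traced with a small model of the exit test. This is a simplified sketch with hypothetical names (`workers_left_after_burst`, `TICKS_PER_SEC`); it models only the three conditions checked before line 153 and the test on that line, with all workers finishing at the same tick.

```c
#include <assert.h>

#define MIN_THREADS   2        /* mirrors MIN_PDFLUSH_THREADS */
#define TICKS_PER_SEC 100UL    /* assumed tick rate, standing in for HZ */

/*
 * `nr` workers all finish their job at tick `now`, one after another.
 * Each one exits only if the idle list is non-empty, the pool is above
 * its minimum, and the longest-idle worker has slept for more than one
 * second. Otherwise it parks itself on the idle list.
 * Returns the number of workers left after all of them ran the test.
 */
static int workers_left_after_burst(int nr, unsigned long now)
{
    int idle = 0;
    int remaining = nr;
    unsigned long oldest_sleep = 0;   /* meaningful only while idle > 0 */

    assert(nr >= 0);

    for (int i = 0; i < nr; i++) {
        int may_exit = idle > 0 &&
                       remaining > MIN_THREADS &&
                       now - oldest_sleep > TICKS_PER_SEC;
        if (may_exit) {
            remaining--;
            oldest_sleep = now;       /* "Limit exit rate" */
            continue;
        }
        if (idle == 0)
            oldest_sleep = now;       /* first sleeper becomes the oldest */
        idle++;
    }
    return remaining;
}
```

Because the oldest sleeper went idle at the same tick, the one-second condition is false for every worker, and all of them stay: `workers_left_after_burst(8, 1000)` returns 8.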

195 /*
196  * Attempt to wake up a pdflush thread, and get it to do some work for you.
197  * Returns zero if it indeed managed to find a worker thread, and passed your
198  * payload to it.
199  */
200 int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
201 {
202         unsigned long flags;
203         int ret = 0;
204
205         if (fn == NULL)
206                 BUG();          /* Hard to diagnose if it's deferred */
207
208         spin_lock_irqsave(&pdflush_lock, flags);
209         if (list_empty(&pdflush_list)) {
210                 spin_unlock_irqrestore(&pdflush_lock, flags);
211                 ret = -1;
212         } else {
213                 struct pdflush_work *pdf;
214
215                 pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
216                 list_del_init(&pdf->list);
217                 if (list_empty(&pdflush_list))
218                         last_empty_jifs = jiffies;
219                 pdf->fn = fn;
220                 pdf->arg0 = arg0;
221                 wake_up_process(pdf->who);
222                 spin_unlock_irqrestore(&pdflush_lock, flags);
223         }
224         return ret;
225 }

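This function hands work to a pdflush thread. If an idle thread is available, it assigns the work to that thread and wakes it up to run it. If none is available, it returns -1 without queuing anything.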
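
The "hand over or fail immediately" contract can be modelled in a few lines of user-space C. Everything here is a hypothetical sketch: `struct slot`, `submit`, `run_pending` and `add_to_total` are invented names, a fixed array replaces the idle list, and there are no threads or locks.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 2

typedef void (*work_fn)(unsigned long);

/* Hypothetical worker slot; fn == NULL means the slot is idle. */
struct slot {
    work_fn fn;
    unsigned long arg0;
};

static struct slot pool[POOL_SIZE];

/* Hand fn to an idle slot. Returns 0 on success, -1 if every slot is busy. */
static int submit(work_fn fn, unsigned long arg0)
{
    assert(fn != NULL);   /* a NULL callback would be hard to diagnose later */

    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (pool[i].fn == NULL) {
            pool[i].fn = fn;
            pool[i].arg0 = arg0;
            return 0;
        }
    }
    return -1;
}

/* Run every pending job once and mark its slot idle. Returns jobs run. */
static int run_pending(void)
{
    int ran = 0;

    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (pool[i].fn != NULL) {
            pool[i].fn(pool[i].arg0);
            pool[i].fn = NULL;
            ran++;
        }
    }
    return ran;
}

/* Example payload: accumulate the argument into a global total. */
static unsigned long total;
static void add_to_total(unsigned long n) { total += n; }
```

A caller has to decide what to do when `submit()` returns -1, for instance retrying later. The sketch does not prescribe a policy, just as the function above leaves that decision to its callers.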
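Summary:
Kernel programming demands rigorous thinking. The slightest carelessness can cause surprises, and however short the code is, it calls for great caution. The pdflush thread pool does contain the two races described above, but neither has serious consequences. They merely fall short of the stated design, so the implementation cannot be held up as a model.
Note:
This article uses "kernel thread", "thread" and "process" interchangeably, but all of them mean "kernel thread". This is harmless: "thread" is short for "kernel thread", and a kernel thread is essentially one of a group of "processes" that share the kernel data space, so swapping the terms in some contexts does no harm.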
Original: http://blog.chinaunix.net/u/5251/showart_320793.html (Editor: 李大同)