Avoiding and resolving memory fragmentation

2024-11-18 09:21:50

Memory fragmentation is a kernel programming issue with a long history. As a system runs, pages are allocated for a variety of tasks with the result that memory fragments over time. A busy system with a long uptime may have very few blocks of pages which are physically-contiguous. Since Linux is a virtual memory system, fragmentation normally is not a problem; physically scattered memory can be made virtually contiguous by way of the page tables.

But there are a few situations where physically-contiguous memory is absolutely required. These include large kernel data structures (except those created with vmalloc()) and any memory which must appear contiguous to peripheral devices. DMA buffers for low-end devices (those which cannot do scatter/gather I/O) are a classic example. If a large ("high order") block of memory is not available when needed, something will fail and yet another user will start to consider switching to BSD.
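
To make "high order" concrete: in the kernel's page allocator, an order-n block is 2^n physically contiguous pages. The following Python sketch is purely illustrative. The 4 KiB page size, the function names, and the list-of-booleans free map are assumptions made for the example, not kernel interfaces. It shows how half of memory can be free while no order-1 block is available.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def block_pages(order):
    """Number of contiguous pages in an order-`order` block."""
    return 1 << order


def min_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order whose block holds at least `nbytes` bytes."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    pages = -(-nbytes // page_size)  # ceiling division
    return (pages - 1).bit_length()


def has_free_block(free_map, order):
    """True if some naturally aligned block of the given order is entirely free.

    free_map[i] is True when page frame i is free.
    """
    size = block_pages(order)
    for start in range(0, len(free_map) - size + 1, size):
        if all(free_map[start:start + size]):
            return True
    return False


# Half of the 16 pages are free, but every other page is in use.
fragmented = [i % 2 == 0 for i in range(16)]

print(min_order(64 * 1024))           # order needed for a 64 KiB buffer
print(has_free_block(fragmented, 0))  # single pages are available
print(has_free_block(fragmented, 1))  # two contiguous pages are not
```

A device that needs a 64 KiB physically contiguous buffer would, under these assumptions, need an order-4 block; the fragmented map above cannot even supply order 1.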

Over the years, a number of approaches to the memory fragmentation problem have been considered, but none have been merged. Adding any sort of overhead to the core memory management code tends to be a hard sell. But this resistance does not mean that people stop trying. One of the most persistent in this area has been Mel Gorman, who has been working on an anti-fragmentation patch set for some years. Mel is back with version 27 of his patch, now rebranded "page clustering." This version appears to have attracted some interest, and may yet get into the mainline.

(Translator's note: "clustering" here means gathering into groups, or clusters; it refers to the classification of page frames in the algorithm described below.)

The core observation in Mel's patch set remains that some types of memory are more easily reclaimed than others. A page which is backed up on a filesystem somewhere can be readily discarded and reused, for example, while a page holding a process's task structure is pretty well nailed down. One stubborn page is all it takes to keep an entire large block of memory from being consolidated and reused as a physically-contiguous whole. But if all of the easily-reclaimable pages could be kept together, with the non-reclaimable pages grouped into a separate region of memory, it should be much easier to create larger blocks of free memory.

(Translator's note: this patch set has been around for a long time. For earlier coverage, see the first introduction of Mel's patches (up to V6) and the second introduction (up to V19).)

So Mel's patch divides each memory zone into three types of blocks: non-reclaimable, easily reclaimable, and movable. The "movable" type is a new feature in this patch set; it is used for pages which can be easily shifted elsewhere using the kernel's page migration mechanism. In many cases, moving a page might be easier than reclaiming it, since there is no need to involve a backing store device. Grouping pages in this way should also make the creation of larger blocks "just happen" when a process is migrated from one NUMA node to another.
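
The effect of grouping can be illustrated with a toy model. The sketch below is not Mel's implementation; the `Mobility` enum, the layout helpers, and the block size of 8 pages are all assumptions chosen for illustration. It counts how many aligned blocks could in principle be emptied, first when non-reclaimable pages are sprinkled through memory and then when the same number of them are packed together.

```python
from enum import Enum


class Mobility(Enum):
    NON_RECLAIMABLE = "non-reclaimable"
    RECLAIMABLE = "easily reclaimable"
    MOVABLE = "movable"


def freeable_blocks(pages, block_size):
    """Count aligned blocks that contain no non-reclaimable page."""
    count = 0
    for start in range(0, len(pages) - block_size + 1, block_size):
        block = pages[start:start + block_size]
        if all(p is not Mobility.NON_RECLAIMABLE for p in block):
            count += 1
    return count


def mixed_layout(n_pages, every):
    """One non-reclaimable page every `every` pages; the rest movable."""
    return [
        Mobility.NON_RECLAIMABLE if i % every == 0 else Mobility.MOVABLE
        for i in range(n_pages)
    ]


def grouped_layout(n_pages, every):
    """The same number of non-reclaimable pages, packed at the start."""
    pinned = len(range(0, n_pages, every))
    return ([Mobility.NON_RECLAIMABLE] * pinned
            + [Mobility.MOVABLE] * (n_pages - pinned))


mixed = mixed_layout(64, 8)
grouped = grouped_layout(64, 8)
print(freeable_blocks(mixed, 8))    # every block holds one stubborn page
print(freeable_blocks(grouped, 8))  # stubborn pages fill a single block
```

With 64 pages and one stubborn page per block, the mixed layout leaves no block that can be emptied, while the grouped layout confines all eight stubborn pages to one block.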

So, in this patch, movable pages (those marked with __GFP_MOVABLE) are generally those belonging to user-space processes. Moving a user-space page is just a matter of copying the data and changing the page table entry, so it is a relatively easy thing to do. Reclaimable pages (__GFP_RECLAIMABLE), instead, usually belong to the kernel. They are either allocations which are expected to be short-lived (some kinds of DMA buffers, for example, which only exist for the duration of an I/O operation) or can be discarded if needed (various types of caches). Everything else is expected to be hard to reclaim.
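
The classification described above can be sketched as a mapping from allocation flags to a block type. This is an illustration only: the constants below are stand-ins for `__GFP_MOVABLE` and `__GFP_RECLAIMABLE` with hypothetical bit values, and rejecting a request that sets both flags is an assumption of this sketch rather than documented behavior of the patch.

```python
# Hypothetical bit values, chosen only for this sketch.
GFP_MOVABLE = 0x1
GFP_RECLAIMABLE = 0x2


def migrate_type(gfp_flags):
    """Classify an allocation request by the mobility flags it carries."""
    movable = bool(gfp_flags & GFP_MOVABLE)
    reclaimable = bool(gfp_flags & GFP_RECLAIMABLE)
    if movable and reclaimable:
        # Assumption of this sketch: the two flags are mutually exclusive.
        raise ValueError("request is marked both movable and reclaimable")
    if movable:
        return "movable"
    if reclaimable:
        return "reclaimable"
    # Anything unmarked is treated as hard to reclaim.
    return "non-reclaimable"


def group_requests(requests):
    """Group (name, gfp_flags) pairs by the block type they would land in."""
    groups = {"movable": [], "reclaimable": [], "non-reclaimable": []}
    for name, flags in requests:
        groups[migrate_type(flags)].append(name)
    return groups


groups = group_requests([
    ("user page", GFP_MOVABLE),
    ("short-lived I/O buffer", GFP_RECLAIMABLE),
    ("cache entry", GFP_RECLAIMABLE),
    ("task structure", 0),
])
print(groups)
```

Note that an allocation with no flag set falls into the non-reclaimable group, so a caller has to opt in explicitly to the more flexible types.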

By simply grouping different types of allocation in this way, Mel was able to get some pretty good results:

In benchmarks and stress tests, we are finding that 80% of memory is available as contiguous blocks at the end of the test. To compare, a standard kernel was getting < 1% of memory as large pages on a desktop and about 8-12% of memory as large pages at the end of stress tests.

Linus has, in the past, been generally opposed to efforts to reduce memory fragmentation. His comments this time around have been much more detail-oriented, however: should allocations be considered movable or non-movable by default? The answer would appear to be "non-movable," since somebody always has to make some effort to ensure that a specific allocation can be moved. Since the discussion is now happening at this level, some sort of fragmentation avoidance might just find its way into the kernel.

(Translator's note: the question concerns what a page allocation call such as alloc_page() should assume when the caller does not say otherwise. Making "movable" an explicit request pushes callers to state their intent and to recognize that movable pages mean extra work for the kernel, namely relocating them through the page migration mechanism described above.)

A related approach to fragmentation is the lumpy reclaim mechanism, posted by Andy Whitcroft but originally written by Peter Zijlstra. Memory reclaim in Linux is normally done by way of a least-recently-used (LRU) list; the hope is that, if a page must be discarded, going after the least recently used page will minimize the chances of throwing out a page which will be needed soon. This mechanism will tend to free pages which are scattered randomly in the physical address space, however, making it hard to create larger blocks of free memory.
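
A small model shows why reclaim in pure LRU order tends not to produce contiguous memory. In this sketch the LRU list is simply a Python list of page frame numbers ordered from least to most recently used; the function names and the sample numbers are made up for illustration.

```python
def lru_reclaim(lru, count):
    """Reclaim the `count` least recently used pages (front of the list)."""
    return lru[:count]


def longest_contiguous_run(pfns):
    """Length of the longest run of consecutive page frame numbers."""
    best = 0
    run = 0
    prev = None
    for pfn in sorted(set(pfns)):
        if prev is not None and pfn == prev + 1:
            run += 1
        else:
            run = 1
        best = max(best, run)
        prev = pfn
    return best


# Recency order has nothing to do with physical position.
lru = [40, 3, 17, 22, 41, 8]
freed = lru_reclaim(lru, 4)
print(freed, longest_contiguous_run(freed))
```

Four pages are freed in this example, yet no two of them are physically adjacent, so not even an order-1 block results.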

(Translator's note: the specific reason is that pages are freed in the order in which they appear on the LRU list, with no regard to whether their physical addresses are contiguous.)

The lumpy reclaim patch tries to address this problem by modifying the LRU algorithm slightly. When memory is needed, the next victim is chosen from the LRU list as before. The reclaim code then looks at the surrounding pages (enough of them to form a higher-order block) and tries to free them as well. If it succeeds, lumpy reclaim will quickly create a larger free block while reclaiming a minimal number of pages.
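
The idea can be sketched as follows. This is a simplified model, not the patch itself: it assumes the "surrounding pages" are the naturally aligned block of 2^order pages containing the victim, that a caller-supplied `can_free` predicate says whether a page may be reclaimed, and that the code falls back to plain LRU behavior when any neighbor cannot be freed. All names are hypothetical.

```python
def lumpy_reclaim(lru, can_free, order):
    """Pick the LRU victim, then try to free its whole aligned block.

    lru      -- page frame numbers, least recently used first
    can_free -- callable taking a page frame number, returning bool
    order    -- the block to assemble holds 2**order pages
    Returns the list of page frame numbers that get reclaimed.
    """
    if not lru:
        return []
    victim = lru[0]  # chosen from the LRU list as before
    size = 1 << order
    start = victim & ~(size - 1)  # round down to the block boundary
    block = range(start, start + size)
    if all(can_free(pfn) for pfn in block):
        return list(block)  # a full higher-order block becomes free
    return [victim]  # fall back to reclaiming only the victim


lru = [13, 2, 7]
pinned = {14}

print(lumpy_reclaim(lru, lambda pfn: True, 2))
print(lumpy_reclaim(lru, lambda pfn: pfn not in pinned, 2))
```

In the first call the whole block of frames 12 through 15 is reclaimed; in the second, the pinned frame 14 blocks the attempt and only the original victim is freed.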

Clearly, this approach will work better if the surrounding pages can be freed. As a result, it combines well with a clustering mechanism like Mel Gorman's. The distortion of the LRU approach could have performance implications, since the neighboring pages may be under heavy use when the lumpy reclaim code goes after them. In an attempt to minimize this effect, lumpy reclaim only happens when the kernel is having trouble satisfying a request for a larger block of memory.

Whether, and when, these patches will be merged remains to be seen. Core memory management patches tend to inspire a high level of caution; they can easily create chaos when exposed to real-world workloads. The problem will not go away by itself, however, so something is likely to happen sooner or later.

(Translator's note: in the end, the lumpy reclaim patch was merged into the mainline in 2.6.23, and Mel's patches, after some modification, were merged in 2.6.24.)

 
