How to Analyze the MongoDB Mmap Engine

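MongoDB used the mmap engine (MMAPv1) as its default storage engine until WiredTiger took over as the default in 3.2. This article analyzes the mmap engine from the source code. The industry has long held mixed opinions about 10gen's decision to build a storage engine on mmap; that debate is outside the scope of this article.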

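Storage is laid out as one directory per db. Each db directory contains a .ns file plus data files named {dbname}.0, {dbname}.1, and so on. The journal directory holds the WAL (write-ahead log) used for crash recovery. The directory structure looks like this: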

db
|------journal
           |----_j.0
           |----_j.1
           |----lsn
|------local
           |----local.ns
           |----local.0
           |----local.1
|------mydb
           |----mydb.ns
           |----mydb.0
           |----mydb.1

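These three kinds of files (.ns files, data files, and journal files) are the persistence units of the mmap engine. This article mainly analyzes the structure of each kind of file at the code level.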

Namespace metadata management

Mapping the .ns file

When the mmap engine loads a database, it first initializes namespaceIndex, which serves as the entry point to that database's metadata.
mongo/db/storage/mmap_v1/catalog/namespace_index.cpp

// Member declarations
DurableMappedFile _f{MongoFile::Options::SEQUENTIAL};
std::unique_ptr<NamespaceHashTable> _ht;

// Excerpts from the initialization code
const std::string pathString = nsPath.string();
_f.open(pathString);  // mmap the .ns file
p = _f.getView();     // pointer to the mapped view
_ht.reset(new NamespaceHashTable(p, (int)len, "namespace index"));

As shown above, an mmap is created over the .ns file and the in-memory view is handed directly to the hashtable, with no parsing at all. The .ns file is therefore a memory image of a hashtable.

The hashtable maps string -> NamespaceDetails (namespace_details.h) and uses open-addressing hashing.

int NamespaceHashTable::_find(const Namespace& k, bool& found) const {
    // Setup elided in this excerpt: h is the hash of k, i (== start) is the
    // first slot to probe, chain counts probes, firstNonUsed starts below 0.
    while (1) {
        if (!_nodes(i).inUse()) {
            // Remember the first free slot seen along the probe path.
            if (firstNonUsed < 0)
                firstNonUsed = i;
        }

        if (_nodes(i).hash == h && _nodes(i).key == k) {
            if (chain >= 200)
                log() << "warning: hashtable " << _name << " long chain " << std::endl;
            found = true;
            return i;
        }
        chain++;
        i = (i + 1) % n;  // linear probing: step to the next slot
        if (i == start) {
            // Wrapped around to the starting slot: the table is full.
            log() << "error: hashtable " << _name << " is full n:" << n << std::endl;
            return -1;
        }
        if (chain >= maxChain) {
            if (firstNonUsed >= 0)
                return firstNonUsed;
            log() << "error: hashtable " << _name << " max chain reached:" << maxChain << std::endl;
            return -1;
        }
    }
}

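This is the classic lookup procedure for open-addressing hashing with linear probing: on a collision, step forward one slot. If the probe wraps around to the starting slot without finding the key or a usable empty slot, the hashtable is full. The probe also gives up once it has examined maxChain slots, returning the first free slot it saw, if any.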
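The probing logic can be tried out in isolation. Below is a minimal sketch of a fixed-size, linear-probing table; `ProbingTable`, `ProbeSlot`, and the use of `std::hash` are assumptions made for illustration and are not taken from the MongoDB source. Unlike `_find`, it has no `maxChain` cutoff and simply probes until it wraps around.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for NamespaceHashTable.
struct ProbeSlot {
    bool inUse = false;
    std::size_t hash = 0;
    std::string key;
    int value = 0;
};

class ProbingTable {
public:
    explicit ProbingTable(std::size_t n) : _slots(n) {
        assert(n > 0);
    }

    // Returns the slot holding `key` (found == true), otherwise the first
    // free slot on the probe path, or -1 when the table is full.
    int find(const std::string& key, bool& found) const {
        found = false;
        const std::size_t n = _slots.size();
        const std::size_t h = std::hash<std::string>{}(key);
        std::size_t i = h % n;
        const std::size_t start = i;
        int firstFree = -1;
        do {
            const ProbeSlot& s = _slots[i];
            if (!s.inUse) {
                if (firstFree < 0) firstFree = static_cast<int>(i);
            } else if (s.hash == h && s.key == key) {
                found = true;
                return static_cast<int>(i);
            }
            i = (i + 1) % n;  // collision: step to the next slot
        } while (i != start);
        return firstFree;
    }

    // Inserts or overwrites; returns false when no slot is available.
    bool put(const std::string& key, int value) {
        bool found = false;
        const int i = find(key, found);
        if (i < 0) return false;
        ProbeSlot& s = _slots[static_cast<std::size_t>(i)];
        s.inUse = true;
        s.hash = std::hash<std::string>{}(key);
        s.key = key;
        s.value = value;
        return true;
    }

private:
    std::vector<ProbeSlot> _slots;
};
```

Because the table never grows, a full table is reported to the caller instead of triggering a rehash, which mirrors the fixed-size .ns file.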

A look at the metadata contents

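A NamespaceDetails object (namespace_details.h) holds the metadata for one collection in the db and is 496 bytes in size. By default mongod allocates 16MB for the .ns file; each database has exactly one .ns file, and it cannot grow dynamically. Dividing 16MB by 496 bytes therefore gives a rough upper bound of about 30,000 collections per database. The class has 22 fields; the six most important are listed below.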
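The estimate is plain integer division, sketched below. The 16MB and 496-byte figures come from the text; the 132 bytes of per-slot overhead (a 4-byte hash plus a 128-byte key) is an assumption added here to show that the real capacity is likely lower than the simple bound.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kNsFileBytes = 16LL * 1024 * 1024;  // default .ns file size
constexpr std::int64_t kDetailsBytes = 496;                // size of NamespaceDetails
constexpr std::int64_t kAssumedSlotOverhead = 132;         // assumption: hash + key per slot

// Number of hashtable slots that fit in the file when each slot stores
// one value plus `overheadBytes` of bookkeeping.
constexpr std::int64_t maxNamespaces(std::int64_t fileBytes,
                                     std::int64_t valueBytes,
                                     std::int64_t overheadBytes) {
    return fileBytes / (valueBytes + overheadBytes);
}
```

With no overhead the bound is 33,825 entries; with the assumed overhead it drops to 26,715.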

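struct NamespaceDetails {
    // An extent is a contiguous block of memory and, because of mmap, also a
    // contiguous region of a file. A collection has multiple extents,
    // organized as a doubly linked list; firstExtent and lastExtent point
    // to the head and tail of that list.
    DiskLoc firstExtent;
    DiskLoc lastExtent;
    // There are several (26) freelists, divided by minimum block size.
    // The blocks of records deleted from the collection go onto the
    // freelist that matches their size.
    DiskLoc deletedListSmall[SmallBuckets];
    // Deprecated field kept for compatibility with older versions of the mmap engine.
    DiskLoc deletedListLegacyGrabBag;
    // Whether the collection is capped. A capped collection is a ring-buffer
    // style collection; MongoDB uses one to store the oplog.
    int isCapped;
    // Part of the freelists, like deletedListSmall, but for larger block sizes.
    DiskLoc deletedListLarge[LargeBuckets];
    // ... remaining fields omitted
};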

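To make the rest of the article easier to follow, the author sketches the metadata structure below, combining the construction of namespaceIndex described above with the annotated metadata fields.

[Figure: metadata structure, from the .ns hashtable to each collection's NamespaceDetails]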

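Structure of a single collection

The previous section covered the meaning of the important fields in a collection's metadata (NamespaceDetails). This section looks at them in more depth.

How extents are organized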

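Each collection consists of a number of extents. An extent is a contiguous region of memory (and therefore a contiguous region on disk), and firstExtent and lastExtent record the head and tail of the extent list. Each extent has the following structure: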

/* extents are datafile regions where all the records within the region belong to the same namespace. */
struct Extent {
    DiskLoc myLoc;
    DiskLoc xnext;  // next node in the doubly linked list
    DiskLoc xprev;  // previous node in the doubly linked list
    Namespace nsDiagnostic;
    int length;
    // A Record corresponds to one document in the collection. Physically an
    // extent consists of Records at contiguous addresses, but the logical
    // order of those records is not necessarily their physical order.
    // firstRecord/lastRecord maintain the logical order and are used for
    // cursor iteration.
    DiskLoc firstRecord;
    DiskLoc lastRecord;
    char _extentData[4];
};

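The organization described above is shown in the following figure.

[Figure: the extents of a collection linked as a doubly linked list, each holding its records]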

Extent allocation and reclamation are managed by ExtentManager. ExtentManager first tries to allocate a suitable contiguous block from the existing files; if none is found, it creates a new {dbname}.i file.

void DataFile::open(OperationContext* txn,
                    const char* filename,
                    int minSize,
                    bool preallocateOnly) {
    long size = _defaultSize();

    // Keep doubling until the file can hold minSize, capped at maxSize().
    while (size < minSize) {
        if (size < maxSize() / 2) {
            size *= 2;
        } else {
            size = maxSize();
            break;
        }
    }

    if (size > maxSize()) {
        size = maxSize();
    }

    invariant(size >= 64 * 1024 * 1024 || mmapv1GlobalOptions.smallfiles);
    // ... remainder of the function omitted
}

The first data file, {dbname}.0, is 64MB by default. Each newly created file is twice the size of the previous one, up to maxSize (2GB by default).
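The growth rule is easy to tabulate. The sketch below assumes exactly what the text states (a 64MB first file, doubling per file, a 2GB cap); `dataFileSize` and the two constants are names made up for this example, and the engine's exact cap may differ slightly from a round 2GB.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kFirstFileBytes = 64LL * 1024 * 1024;      // {dbname}.0
constexpr std::int64_t kMaxFileBytes = 2LL * 1024 * 1024 * 1024;  // cap stated in the text

// Size of data file number `fileNo` when every new file doubles the
// previous one until the cap is reached.
std::int64_t dataFileSize(int fileNo) {
    assert(fileNo >= 0);
    std::int64_t size = kFirstFileBytes;
    for (int i = 0; i < fileNo && size < kMaxFileBytes; ++i) {
        size *= 2;
    }
    return size < kMaxFileBytes ? size : kMaxFileBytes;
}
```

Under these assumptions the sequence is 64MB, 128MB, 256MB, 512MB, 1GB, and then 2GB for every later file.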

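An extent is divided into Records, each of which corresponds to one row of the table (one document in a collection). Each collection is wrapped by the RecordStore class, which exposes the CRUD interface.

Record allocation

Allocation is first attempted from the existing freelists (the deleted lists mentioned above). Each collection maintains freelists of different size classes according to block size. Each freelist is a singly linked list, and when a Record is deleted it is placed on the freelist for its size.
The deleted-list buckets are scanned from small to large, as shown below; the first free block that is large enough is allocated: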

for (myBucket = bucket(lenToAlloc); myBucket < Buckets; myBucket++) {
    // Only look at the first entry in each bucket. This works because we are either
    // quantizing or allocating fixed-size blocks.
    const DiskLoc head = _details->deletedListEntry(myBucket);
    if (head.isNull())
        continue;
    DeletedRecord* const candidate = drec(head);
    if (candidate->lengthWithHeaders() >= lenToAlloc) {
        loc = head;
        dr = candidate;
        break;
    }
}

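The code above finds a block of suitable size, but that block may still be somewhat larger than the requested size. The mmap engine handles this by cutting off the surplus and returning it to the freelist.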

const int remainingLength = dr->lengthWithHeaders() - lenToAlloc;
if (remainingLength >= bucketSizes[0]) {
    txn->recoveryUnit()->writingInt(dr->lengthWithHeaders()) = lenToAlloc;
    const DiskLoc newDelLoc = DiskLoc(loc.a(), loc.getOfs() + lenToAlloc);
    DeletedRecord* newDel = txn->recoveryUnit()->writing(drec(newDelLoc));
    newDel->extentOfs() = dr->extentOfs();
    newDel->lengthWithHeaders() = remainingLength;
    newDel->nextDeleted().Null();

    addDeletedRec(txn, newDelLoc);
}

[Figure: a free block being split into the allocated record and a remainder that is returned to the freelist]
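The scan-then-split behaviour can be modelled in a few lines. This is a minimal sketch, not the engine's code: the four size classes, `FreeLists`, and `FreeBlock` are assumptions for illustration (the real table has 26 classes), and blocks are plain offset/length pairs rather than DiskLocs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size classes: bucket i holds blocks of at least kMinSizes[i]
// bytes and less than kMinSizes[i + 1] bytes.
constexpr int kMinSizes[] = {32, 64, 128, 256};
constexpr std::size_t kNumBuckets = sizeof(kMinSizes) / sizeof(kMinSizes[0]);

struct FreeBlock {
    int offset;  // start of the block inside a hypothetical extent
    int length;  // block length in bytes
};

class FreeLists {
public:
    // Puts a free block on the list for its size class.
    void release(FreeBlock b) {
        assert(b.length >= kMinSizes[0]);
        _buckets[bucketFor(b.length)].push_back(b);
    }

    // Scans buckets from small to large, looking only at the head of each,
    // and splits off any surplus that is itself large enough to be reused.
    bool allocate(int len, FreeBlock& out) {
        assert(len >= kMinSizes[0]);
        for (std::size_t i = bucketFor(len); i < kNumBuckets; ++i) {
            if (_buckets[i].empty()) continue;
            FreeBlock head = _buckets[i].back();
            if (head.length < len) continue;
            _buckets[i].pop_back();
            const int remaining = head.length - len;
            if (remaining >= kMinSizes[0]) {
                head.length = len;
                release(FreeBlock{head.offset + len, remaining});
            }
            out = head;
            return true;
        }
        return false;  // caller would have to obtain a new extent
    }

    std::size_t freeCount(std::size_t bucket) const {
        return _buckets[bucket].size();
    }

private:
    static std::size_t bucketFor(int len) {
        std::size_t i = 0;
        while (i + 1 < kNumBuckets && kMinSizes[i + 1] <= len) ++i;
        return i;
    }

    std::vector<FreeBlock> _buckets[kNumBuckets];
};
```

Releasing a 200-byte block and then allocating 64 bytes hands out the first 64 bytes and files the remaining 136 bytes under the 128-byte class. A surplus smaller than the smallest class is not split off; the caller simply receives the whole block.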

If allocation from the existing freelists fails, the engine requests a new extent, adds it to the freelist for the largest size class, and then retries the allocation from the freelists.

const int RecordStoreV1Base::bucketSizes[] = {
    ...
    MaxAllowedAllocation,      // 16.5M
    MaxAllowedAllocation + 1,  // Only MaxAllowedAllocation sized records go here.
    INT_MAX,                   // "oversized" bucket for unused parts of extents.
};

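That is an overview of how the mmap engine manages memory. Records are not allocated at a fixed size: the surplus of an allocated block is added back to a deleted list, and a freed record is linked into the deleted list for its size. Over time this produces a large amount of fragmentation, so the mmap engine also provides a compact procedure to improve memory utilization.

Compacting fragments

compact is exposed to clients as a command. The command operates on a collection; internally, the extent is the smallest unit of work.

[Figure: the two steps of compaction]

Compaction has two steps, as the figure shows. Step one detaches the extents from the freelist. Step two copies the used space of each extent into a new extent, packing the records tightly, which achieves the goal of compaction.

The orphanDeletedList step
The deleted lists under the collection's namespace are emptied, so that newly created records are not allocated from the existing extents.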

WriteUnitOfWork wunit(txn);
// Orphaning the deleted lists ensures that all inserts go to new extents rather than
// the ones that existed before starting the compact. If we abort the operation before
// completion, any free space in the old extents will be leaked and never reused unless
// the collection is compacted again or dropped. This is considered an acceptable
// failure mode as no data will be lost.
log() << "compact orphan deleted lists" << endl;
_details->orphanDeletedList(txn);

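Each extent records its first and last record. For every extent, all of its records are traversed and inserted into a new extent. New extents are allocated automatically when an insert runs out of space (see the allocation process above), and extent sizing restarts from the smallest size.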

// Start over from scratch with our extent sizing and growth
_details->setLastExtentSize(txn, 0);

// create a new extent so new records go there
increaseStorageSize(txn, _details->lastExtentSize(txn), true);
// ...
for (std::vector<DiskLoc>::iterator it = extents.begin(); it != extents.end(); it++) {
    txn->checkForInterrupt();
    invariant(_details->firstExtent(txn) == *it);
    // empties and removes the first extent
    _compactExtent(txn, *it, extentNumber++, adaptor, options, stats);
    invariant(_details->firstExtent(txn) != *it);
    pm.hit();
}

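During _compactExtent, the records of the extent are inserted one by one into the new extent and the space is gradually released. Once all records have been moved, the extent is once again a brand-new, unused extent.

[Figure: an extent being drained into a new extent during _compactExtent]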

while (!nextSourceLoc.isNull()) {
    txn->checkForInterrupt();

    WriteUnitOfWork wunit(txn);
    MmapV1RecordHeader* recOld = recordFor(nextSourceLoc);
    RecordData oldData = recOld->toRecordData();
    nextSourceLoc = getNextRecordInExtent(txn, nextSourceLoc);
    // ...
    CompactDocWriter writer(recOld, rawDataSize, allocationSize);
    StatusWith<RecordId> status = insertRecordWithDocWriter(txn, &writer);
    // ...
    _details->incrementStats(txn, -(recOld->netLength()), -1);  // line 398 in the source file
}

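The loop above is the part of _compactExtent that walks the records of the extent, inserts each one into another extent, and updates the collection statistics as the space is released (the incrementStats call, line 398 of the source file).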
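The two steps can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `Collection`, `RecordSlot`, and `compact` are invented for this example, every live record is copied into a single fresh extent, and none of the journaling or interrupt handling of the real implementation is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: an extent is a sequence of slots, and a deleted
// record leaves a hole (live == false) behind.
struct RecordSlot {
    bool live;
    std::string doc;
};
using Extent = std::vector<RecordSlot>;

struct Collection {
    std::vector<Extent> extents;
    // (extent index, slot index) of every hole available for reuse.
    std::vector<std::pair<std::size_t, std::size_t>> freeList;
};

void compact(Collection& c) {
    // Step 1: orphan the freelist so nothing is allocated from old extents.
    c.freeList.clear();

    // Step 2: drain the old extents, in order, into a fresh extent.
    std::vector<Extent> old;
    old.swap(c.extents);
    c.extents.emplace_back();
    for (Extent& e : old) {
        for (RecordSlot& r : e) {
            if (r.live) c.extents.back().push_back(std::move(r));
        }
        e.clear();  // the drained extent holds no records any more
    }
}
```

After `compact`, the live documents sit back to back in their original order and the freelist no longer references any hole in the old extents.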

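Writing mmap data back to disk

When describing the .ns file structure above, we noted that the .ns file is mapped through mmap onto an in-memory hashtable, and that this mapping is implemented by DurableMappedFile. Let us look at how that module persists data.
In the mmap engine's finishInit: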

void MMAPV1Engine::finishInit() {
    dataFileSync.go();
    // ...
}

This starts the periodic task of the DataFileSync class, which flushes data to disk at regular intervals in a background thread.

while (!inShutdown()) {
    if (storageGlobalParams.syncdelay == 0) {
        // in case at some point we add an option to change at runtime
        sleepsecs(5);
        continue;
    }

    sleepmillis(
        (long long)std::max(0.0, (storageGlobalParams.syncdelay * 1000) - time_flushing));
    // ...
    Date_t start = jsTime();
    StorageEngine* storageEngine = getGlobalServiceContext()->getGlobalStorageEngine();

    dur::notifyPreDataFileFlush();
    int numFiles = storageEngine->flushAllFiles(true);
    dur::notifyPostDataFileFlush();
    // ... (timing and logging omitted)
}

flushAllFiles eventually calls the flush method of every memory-mapped file.

void MemoryMappedFile::flush(bool sync) {
    if (views.empty() || fd == 0 || !sync)
        return;

    bool useFsync = !ProcessInfo::preferMsyncOverFSync();

    if (useFsync ? fsync(fd) != 0 : msync(viewForFlushing(), len, MS_SYNC) != 0) {
        // msync failed, this is very bad
        log() << (useFsync ? "fsync failed: " : "msync failed: ") << errnoWithDescription()
              << " file: " << filename() << endl;
        dataSyncFailedHandler();
    }
}

fsync vs msync

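Whether fsync or msync is used to flush, we expect the kernel to find the dirty pages efficiently and write them back. However, according to https://jira.mongodb.org/browse/SERVER-14129 and the code comment below, on some operating systems (such as SmartOS and certain versions of Solaris) msync cannot find dirty pages efficiently, so the mmap engine treats operating systems differently here.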

// On non-Solaris (ie, Linux, Darwin, *BSD) kernels, prefer msync.
// Illumos kernels do O(N) scans in memory of the page table during msync which
// causes high CPU, Oracle Solaris 11.2 and later modified ZFS to workaround mongodb
// Oracle Solaris Bug:
//  18658199 Speed up msync() on ZFS by 90000x with this one weird trick
bool preferMsyncOverFSync;
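To see both flush paths side by side, here is a minimal POSIX sketch of a shared file mapping that can be flushed with either call. `MappedFile` is a hypothetical class written for this example, not MongoDB's MemoryMappedFile, and the `preferMsync` flag is supplied by the caller instead of being detected from the operating system.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical RAII wrapper around one shared, writable file mapping.
class MappedFile {
public:
    MappedFile(const char* path, std::size_t len) : _len(len) {
        _fd = ::open(path, O_RDWR | O_CREAT, 0600);
        if (_fd < 0) return;
        if (::ftruncate(_fd, static_cast<off_t>(len)) != 0) return;
        void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, _fd, 0);
        if (p != MAP_FAILED) _view = static_cast<char*>(p);
    }

    ~MappedFile() {
        if (_view != nullptr) ::munmap(_view, _len);
        if (_fd >= 0) ::close(_fd);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    // Null when opening or mapping failed.
    char* view() { return _view; }

    // Writes dirty pages back to disk; returns false on failure.
    bool flush(bool preferMsync) {
        if (_view == nullptr) return false;
        if (preferMsync) {
            // Synchronously write back the dirty pages of this mapping.
            return ::msync(_view, _len, MS_SYNC) == 0;
        }
        // Flush all dirty data of the file through its descriptor instead.
        return ::fsync(_fd) == 0;
    }

private:
    int _fd = -1;
    char* _view = nullptr;
    std::size_t _len;
};
```

A caller writes through `view()` and then calls `flush(true)` or `flush(false)`; for a MAP_SHARED mapping either path should leave the written bytes readable from the file afterwards.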
