{"id":255590,"date":"2016-09-27T14:41:58","date_gmt":"2016-09-27T06:41:58","guid":{"rendered":"http:\/\/blog.zhenglei.net\/?p=255590"},"modified":"2016-09-27T14:41:58","modified_gmt":"2016-09-27T06:41:58","slug":"understanding-flash-unpredictable-write-performance","status":"publish","type":"post","link":"https:\/\/blog.zhenglei.net\/?p=255590","title":{"rendered":"Understanding Flash: Unpredictable Write Performance"},"content":{"rendered":"<p><a href=\"https:\/\/flashdba.com\/2014\/12\/10\/understanding-flash-unpredictable-write-performance\/\">https:\/\/flashdba.com\/2014\/12\/10\/understanding-flash-unpredictable-write-performance\/<\/a><\/p>\n<div class=\"post-info\">\n<p><span class=\"time\">December 10, 2014<\/span> <span class=\"post-comments\"><a href=\"https:\/\/flashdba.com\/2014\/12\/10\/understanding-flash-unpredictable-write-performance\/#comments\">6 Comments<\/a><\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/fast-page-slow-page.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2469\" src=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/fast-page-slow-page.jpg?w=1260&amp;h=557\" alt=\"fast-page-slow-page\" width=\"630\" height=\"371\" \/><\/a><\/p>\n<p>I\u2019ve spent a lot of time in this <a href=\"https:\/\/flashdba.com\/storage-for-dbas\/\">blog series<\/a> talking about the challenges involved in using flash, such as the way that\u00a0<a href=\"https:\/\/flashdba.com\/2014\/06\/20\/understanding-flash-blocks-pages-and-program-erases\/\">pages have to be erased before they are written<\/a> and the restriction that\u00a0erase operations take place on a whole block. I also described the problem of erase operations being slow in comparison to reads and writes \u2013 and the resulting <a href=\"https:\/\/flashdba.com\/2014\/10\/15\/understanding-flash-garbage-collection-matters\/\">processes we have to put in place<\/a> to manage that problem (i.e. 
<em>garbage collection<\/em>). And most recently I covered the way that garbage collection can result in <a href=\"https:\/\/flashdba.com\/2014\/11\/24\/understanding-flash-the-write-cliff\/\">unpredictable performance<\/a>.<\/p>\n<p>But so far we\u2019ve always worked under the assumption that reads and writes to NAND flash have the same predictably low latency. This post is all about bursting that particular bubble\u2026<\/p>\n<h2>Programming\u00a0NAND Flash: A Quick Recap<\/h2>\n<p>You might remember from my post on the subject of\u00a0<a href=\"https:\/\/flashdba.com\/2014\/07\/03\/understanding-flash-slc-mlc-and-tlc\/\">SLC, MLC and TLC<\/a> that I used the analogy\u00a0of electrons in a bucket to explain the programming of NAND flash cells:<\/p>\n<p><a href=\"https:\/\/flashdba.files.wordpress.com\/2014\/06\/slc-mlc-tlc-buckets.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2189\" src=\"https:\/\/flashdba.files.wordpress.com\/2014\/06\/slc-mlc-tlc-buckets.jpg?w=945&amp;h=194\" alt=\"slc-mlc-tlc-buckets\" width=\"630\" height=\"129\" \/><\/a><\/p>\n<p>I\u2019d now like to change that analogy slightly, so I\u2019m asking you to consider that you have an empty bucket and a powerful hose pipe. You can turn the hose on and off whenever you want to fill the bucket up, but you cannot empty water out of the bucket unless you completely empty it. OK, now we\u2019re ready.<\/p>\n<p>For SLC we simply say that an empty bucket denotes a binary value of 1\u00a0and a full bucket denotes binary 0. Thus when you want to program an SLC bucket you simply let rip with your hose pipe until it\u2019s full. No need to measure whether the water line is above or below the halfway point (the threshold), just go crazy. Blam! That was quick, wasn\u2019t it?<\/p>\n<p>For MLC, however, we have three thresholds \u2013 and again we start with the bucket empty (denoting binary 11). 
Now, if I want to program the binary values of 01 or 10 in the above diagram I need to be careful, because if I overfill I cannot go backwards. <a href=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/bucket.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright  wp-image-2471\" src=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/bucket.jpg?w=308&amp;h=207\" alt=\"bucket\" width=\"205\" height=\"138\" \/><\/a>I therefore have to fill a little, test, fill some more, test and so on. It\u2019s actually kind of tricky \u2013 and it\u2019s one of\u00a0the reasons that MLC is both slower than SLC and has a lower wear limit. But here\u2019s the thing\u2026\u00a0if I want to program my MLC to have a value of binary 00 in the above diagram, I have no such problems because (as with SLC) I can just open the hose up on full power and hit it.<\/p>\n<p>What we\u2019ve demonstrated here is that programming a <strong>full charge<\/strong>\u00a0value to an MLC cell is faster than programming any of the other available values. With a little more thought you can probably see that TLC has this problem to an even worse degree \u2013 imagine how accurate you need to be with that hose when you have seven thresholds to consider!<\/p>\n<p>One final thought. We read and write (program) to NAND flash at the page level, which means we are accessing a large collection of cells as if they are one single unit. What are the chances that when we write a page we will want\u00a0<em>every<\/em> cell to be programmed to full charge? I\u2019d say extremely low. So even if some cells are programmed \u201cthe fast way\u201d, just one \u201cslow\u201d program operation to a non-full-charge threshold will slow the whole program operation down. 
In other words, I can hardly\u00a0ever take advantage of the faster latency experienced by full charge operations.<\/p>\n<h2>Fast Pages and Slow Pages<\/h2>\n<p>The majority of flash seen in the data centre today is MLC, which contains two bits per cell. Is there a way to program MLC in order that, at least sometimes, I can program at the faster speeds of a full-charge operation?<\/p>\n<p><a href=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/mlc-bucket-msb-lsb.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"  wp-image-2474 alignleft\" src=\"https:\/\/flashdba.files.wordpress.com\/2014\/12\/mlc-bucket-msb-lsb.jpg?w=519&amp;h=468\" alt=\"mlc-bucket-msb-lsb\" width=\"346\" height=\"312\" \/><\/a>Let\u2019s\u00a0take my MLC bucket diagram from above and remap the binary values like the diagram on the left.\u00a0What have I\u00a0changed? Well most importantly I\u2019ve reordered the binary values that correspond to each voltage level;\u00a0empty charge still represents 11 but now full charge represents 10. Why did I do that?<\/p>\n<p>The clue is the dotted line separating the\u00a0<em>most significant bit<\/em> (MSB) and the\u00a0<em>least significant bit<\/em> (LSB) of each value. Let\u2019s consider two NAND flash pages, each comprising many cells. Now, instead of having both bits from each\u00a0MLC cell used for a single page, I will put all of the MSB values into one page and call that the\u00a0<strong>slow page<\/strong>. Then I\u2019ll take all of the LSB values and put that into the other page and call that the\u00a0<strong>fast page<\/strong>.<\/p>\n<p>Why did I do this? Well, consider what happens when I want to program my fast page: in the diagram you can see that it\u2019s possible to turn the LSB value from one\u00a0to zero by programming it to either of the two higher thresholds\u2026\u00a0<em>including the full charge threshold<\/em>. 
In fact, if you forget about the MSB side for a second, the LSB side is very similar to\u00a0an SLC cell \u2013 and therefore performs like one.<\/p>\n<p>The slow page, meanwhile, has to be programmed just like we discussed previously and therefore sees no benefit from this configuration. What\u2019s more, if I want to program\u00a0the fast page\u00a0in this way I can\u2019t store data in the corresponding slow page (the one with the matching MSBs) because every time I program a full charge to this cell the MSB ends up with a value of one. Also, when I want to program the slow page I have to erase the whole block\u00a0first and then program both pages together (slowly!).<\/p>\n<p>It\u2019s kind of complicated\u2026 but potentially we now\u00a0have the option to <em>program certain MLC pages using a faster operation<\/em>, with the trade-off that <em>other pages will be affected as a result<\/em>.<\/p>\n<h3>Getting To The Point<\/h3>\n<p>I\u00a0should point out here that this is pretty low-level stuff which\u00a0requires direct access to NAND flash (rather than via an SSD, for example). It may also require a working relationship with the flash manufacturer. So why am I mentioning it here?<\/p>\n<p>Well, first of all I want to show you that <strong>NAND flash is actually a difficult and unpredictable medium<\/strong> on which to store data \u2013 unless you truly understand how it works and make allowances for its behaviour. 
<a href=\"https:\/\/flashdba.files.wordpress.com\/2013\/05\/nand-flash.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-1470\" src=\"https:\/\/flashdba.files.wordpress.com\/2013\/05\/nand-flash.jpg?w=356&amp;h=252\" alt=\"NAND-flash\" width=\"178\" height=\"126\" \/><\/a>This is one of the reasons why so many flash products exist on the market with completely differing performance characteristics.<\/p>\n<p>When you look at the datasheet for an MLC flash product and you see write \/ program times shown as, for example, 1.4 milliseconds it\u2019s important to realise that this is the average of its bi-modal behaviour. Fast (LSB) pages may well have program times of 300 microseconds, while slow (MSB) pages might take up to 2.5 milliseconds.<\/p>\n<p>Secondly, I want to point out that direct access to the flash (instead of via an SSD) brings certain benefits. What if, in my all flash array, I send all inbound user writes to fast pages but then, later on during garbage collection, I move data to be stored in slow pages? If I could do that, I\u2019d effectively be hiding much\u00a0of the slower performance of MLC writes from my users. 
And that would be a wonderful thing\u2026<\/p>\n<p>\u2026which is why, at Violin, we\u2019ve been doing it for years<img decoding=\"async\" class=\"emoji\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/2\/svg\/1f642.svg\" alt=\"\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/flashdba.com\/2014\/12\/10\/understa &hellip; <a href=\"https:\/\/blog.zhenglei.net\/?p=255590\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[206],"tags":[310,308,309,311],"class_list":["post-255590","post","type-post","status-publish","format-standard","hentry","category-usb","tag-mlc","tag-nand","tag-slc","tag-smlc"],"_links":{"self":[{"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/posts\/255590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=255590"}],"version-history":[{"count":1,"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/posts\/255590\/revisions"}],"predecessor-version":[{"id":255591,"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=\/wp\/v2\/posts\/255590\/revisions\/255591"}],"wp:attachment":[{"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=255590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.zhenglei.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=255590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.zhenglei.net\/index.ph
p?rest_route=%2Fwp%2Fv2%2Ftags&post=255590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}