Using LVM cache for storage tiering

SSDs are small and expensive, but fast. HDDs are large and cheap, but slow. Let’s combine the two technologies to get the speed of SSDs with the price and capacity of HDDs. This can be achieved with storage tiering using LVM cache.

Hardware vs. Software solutions

There are so-called “Hybrid HDDs” on the market. Their SSD part is relatively small, and you can neither tune the cache nor get any statistics about cache hits and misses. Furthermore, modern SSDs provide an NVMe interface, which is much faster than old-school SATA ports. That makes hybrid HDD/SSD drives quite useless. A software solution is much faster and more flexible.

dm-cache vs. dm-writecache

There are two implementations available: dm-cache and dm-writecache. This article focuses on dm-cache.

dm-cache

dm-cache provides both a write and a read cache and is used where not only write operations but also read operations are critical. The use cases are very versatile and can be anything from VM storage to file servers and the like. Another benefit of dm-cache over dm-writecache is that the cache can be created, activated and destroyed online.

dm-writecache

dm-writecache provides only a write cache, no read cache. It has less overhead because there is no promotion/demotion of data for read caching; performance tests show slightly better results for dm-writecache than for dm-cache. From my point of view the use cases are a bit limited, as there is no read cache. If you need the maximum possible performance for write operations, then dm-writecache is your choice. Be aware that the LV must be offline to create the cache.
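
For completeness, here is a minimal sketch of attaching a dm-writecache, reusing the device and LV names from the dm-cache walkthrough below (/dev/vdc as the fast device, vg_data/lv_data as the origin LV); the size and LV name are just examples:

# Deactivate the origin LV first - the write cache cannot be attached while the LV is active
lvchange -an vg_data/lv_data
# Create a fast LV on the SSD and attach it as a write cache
lvcreate -n lv_writecache -L 5G vg_data /dev/vdc
lvconvert --type writecache --cachevol lv_writecache vg_data/lv_data
# Re-activate the now write-cached LV
lvchange -ay vg_data/lv_data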

I tested this with Fedora 31, RHEL 8.1 and the RHEL 8.2 beta. It does not work with RHEL 8.1, but it does with the RHEL 8.2 beta. At GA it will probably have Technology Preview status, which means it is not fully supported.

dmesg|tail -2
[  343.446581] TECH PREVIEW: DM writecache may not be fully supported.
               Please review the provided documentation for limitations.

Enterprise users should stick to dm-cache.

cachepool vs. cache volumes

There are two methods of setting up the cache: cache pools (cachepool) and cache volumes (cachevol). The main difference is that a cachepool uses two logical volumes, one for the actual cache data and one for the cache metadata, while a cachevol uses a single volume for both. If you have more than one LV to cache, it may be better to use a cachepool, as you can place the metadata on a different device, which provides better overall throughput.
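
As an illustration, a cachevol setup boils down to a single fast LV attached with --cachevol; the names here reuse those from the walkthrough below, which itself uses the cachepool method:

# One LV on the fast device holds both the cache data and the cache metadata
lvcreate -n lv_cachevol -L 5G vg_data /dev/vdc
lvconvert --type cache --cachevol lv_cachevol vg_data/lv_data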

Writeback vs. writethrough caching

Writeback means that the write is committed as soon as the data is written to the cache device. Writethrough means that the write is only committed once the data has been written to the backend device as well. For write performance, writethrough is therefore more or less useless: write operations are probably even slower (at least latency-wise) than with no caching at all.
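
The cache mode is chosen when attaching the cache (see the lvconvert call in the walkthrough below) and can also be changed later on an already cached LV:

# Switch an existing cached LV from writethrough to writeback
lvchange --cachemode writeback vg_data/lv_data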

Consider using redundant cache devices when running in writeback cache mode to ensure resilience. This can be done with software RAID or a hardware RAID controller.
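
As a sketch only: LVM can mirror the cache data and metadata LVs itself. Assuming a hypothetical second SSD /dev/vdd, the cache pool LVs from the walkthrough below could be created as RAID1 LVs before the conversion to a cache pool:

# Mirror the cache data and metadata across two SSDs so that a single
# SSD failure does not lose dirty writeback data
lvcreate --type raid1 -m 1 -n cachepool -L 5G vg_data /dev/vdc /dev/vdd
lvcreate --type raid1 -m 1 -n cachepool_meta -L 100M vg_data /dev/vdc /dev/vdd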

Performance

In my test setup I just ran some very basic performance tests with dd, which is not a real benchmark.

Write speed to the LV was about 340 MB/s and read speed approximately 410 MB/s, which is quite a bit slower than the native speed of the SATA SSD used (a Samsung SSD 840), but a massive gain compared to the native speed of the HDDs.

With more modern devices (NVMe over PCIe) you can expect much better performance.
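
For reference, a dd-based smoke test along these lines (not the exact commands used here) could look like the following, assuming the cached LV carries an ext4 filesystem mounted at /mnt/data:

# Sequential write test, bypassing the page cache
dd if=/dev/zero of=/mnt/data/testfile bs=1M count=4096 oflag=direct
# Drop the page cache, then run a sequential read test
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/data/testfile of=/dev/null bs=1M iflag=direct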

Assumptions

  • Your slow HDD device is /dev/vdb
  • Your fast SSD device is /dev/vdc

Let’s do it

Create the PVs

[root@f31-lvmtest ~]# pvcreate /dev/vdb
  Physical volume "/dev/vdb" successfully created.
[root@f31-lvmtest ~]# pvcreate /dev/vdc
  Physical volume "/dev/vdc" successfully created.
[root@f31-lvmtest ~]# 

Create the VG and LV

[root@f31-lvmtest ~]# vgcreate vg_data /dev/vdb /dev/vdc
  Volume group "vg_data" successfully created
[root@f31-lvmtest ~]# lvcreate -L 10G -n lv_data vg_data /dev/vdb
  Logical volume "lv_data" created.
[root@f31-lvmtest ~]#

The newly created logical volume “lv_data” is also referred to as the “origin LV” in the man pages and other available documentation.

Create the cache pool and metadata LVs

[root@f31-lvmtest ~]# lvcreate -n cachepool_meta -L 100M vg_data /dev/vdc
  Logical volume "cachepool_meta" created.
[root@f31-lvmtest ~]# lvcreate -n cachepool -L 5G vg_data /dev/vdc
  Logical volume "cachepool" created.
[root@f31-lvmtest ~]# 

Be aware that the cache pool and metadata LVs can reside on different devices to speed things up even more.
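
For example, assuming a hypothetical additional fast device /dev/vdd, the metadata LV could be placed there instead:

# Cache data on one fast device, cache metadata on another
lvcreate -n cachepool -L 5G vg_data /dev/vdc
lvcreate -n cachepool_meta -L 100M vg_data /dev/vdd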

Assemble the cache pool and attach it to the origin LV

[root@f31-lvmtest ~]# lvconvert --type cache-pool --poolmetadata cachepool_meta vg_data/cachepool
  Converted vg_data/cachepool and vg_data/cachepool_meta to cache pool.
[root@f31-lvmtest ~]# lvconvert --type cache --cache-pool cachepool --cachemode writethrough vg_data/lv_data
  Logical volume vg_data/lv_data is now cached.
[root@f31-lvmtest ~]#

Now the lv_data LV is cached. Let’s have a look at the output of lsblk:

[root@f31-lvmtest ~]# lsblk 
NAME                            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0                              11:0    1 1024M  0 rom  
vda                             252:0    0   20G  0 disk 
├─vda1                          252:1    0    1G  0 part /boot
└─vda2                          252:2    0   19G  0 part 
  ├─fedora-root         253:0    0   15G  0 lvm  /
  └─fedora-swap         253:1    0    2G  0 lvm  [SWAP]
vdb                             252:16   0  100G  0 disk 
└─vg_data-lv_data_corig         253:5    0   10G  0 lvm  
  └─vg_data-lv_data             253:2    0   10G  0 lvm  
vdc                             252:32   0  100G  0 disk 
├─vg_data-cachepool_cpool_cdata 253:3    0    5G  0 lvm  
│ └─vg_data-lv_data             253:2    0   10G  0 lvm  
└─vg_data-cachepool_cpool_cmeta 253:4    0  100M  0 lvm  
  └─vg_data-lv_data             253:2    0   10G  0 lvm 
[root@f31-lvmtest ~]#

Useful tools

The lvs command has some options to display cache statistics, but the output is not very easy to read.

f31-lvmtest:~# lvs -a -o +devices,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,segtype
  LV                      VG             Attr       LSize   Pool              Origin          Data%  Meta%  Move Log Cpy%Sync Convert SyncAction Mismatches Devices                  CacheTotalBlocks CacheUsedBlocks  CacheDirtyBlocks CacheReadHits    CacheReadMisses  CacheWriteHits   CacheWriteMisses Type      
  root                    fedora -wi-ao----  15.00g                                                                                                 /dev/vda2(0)                                                                                                                                    linear    
  swap                    fedora -wi-ao----   2.00g                                                                                                 /dev/vda2(3840)                                                                                                                                 linear    
  [cachepool_cpool]       vg_data        Cwi---C---   5.00g                                   0.01   0.68            0.00                                   cachepool_cpool_cdata(0)            81920                8                0                5               51                0                0 cache-pool
  [cachepool_cpool_cdata] vg_data        Cwi-ao----   5.00g                                                                                                 /dev/vdc(25)                                                                                                                                    linear    
  [cachepool_cpool_cmeta] vg_data        ewi-ao---- 100.00m                                                                                                 /dev/vdc(0)                                                                                                                                     linear    
  lv_data                 vg_data        Cwi-a-C---  10.00g [cachepool_cpool] [lv_data_corig] 0.01   0.68            0.00                                   lv_data_corig(0)                    81920                8                0                5               51                0                0 cache     
  [lv_data_corig]         vg_data        owi-aoC---  10.00g                                                                                                 /dev/vdb(0)                                                                                                                                     linear    
  [lvol0_pmspare]         vg_data        ewi------- 100.00m                                                                                                 /dev/vdb(2560)                                                                                                                                  linear    
f31-lvmtest:~#

It is recommended to create an alias for these options:

echo "alias lvs-cache='lvs -a -o +devices,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,segtype'" >> ~/.bashrc

For prettier output of this information I found this shell script: https://github.com/standard-error/lvmcache-statistics/blob/master/lvmcache-statistics.sh.

The LV is hardcoded; let’s change that:

--- lvmcache-statistics.sh.orig 2020-01-29 11:41:57.477584497 +0100
+++ lvmcache-statistics.sh      2020-01-28 12:17:01.667578929 +0100
@@ -24,7 +24,8 @@
 ##################################################################
 set -o nounset
 
-LVCACHED=/dev/vg00/lvol0
+#LVCACHED=/dev/vg00/lvol0
+LVCACHED=$1
 
 RESULT=$(dmsetup status ${LVCACHED})
 if [ $? -ne 0 ]; then

Afterwards you can provide your LV as a parameter:

[root@f31-lvmtest ~]# ./lvmcache-statistics.sh /dev/vg_data/lv_data
------------------------------------
LVM Cache report of /dev/vg_data/lv_data
------------------------------------
- Cache Usage: 6.3% - Metadata Usage: 1.4%
- Read Hit Rate: 86.0% - Write Hit Rate: 80.4%
- Demotions/Promotions/Dirty: 0/60180/0
- Features in use: metadata2 writethrough no_discard_passdown 
[root@f31-lvmtest ~]#

What happens when the cache disk fails?

In writethrough cache mode: no data is lost. In writeback mode: partial loss of data; all data that has not yet been written to the origin LV is lost. This means that in writeback mode you had better have redundancy or a fast backup/recovery solution. There may also be workloads where data loss does not matter at all.

When using a single, non-redundant cache device, ensure proper monitoring of the SSD with smartctl or similar SMART monitoring software, and replace it as soon as blocks get reallocated, which means the SSD is nearing the end of its life cycle.
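
A minimal monitoring sketch with smartctl (the exact attribute names vary between drive models):

# Overall SMART health verdict
smartctl -H /dev/vdc
# Watch reallocated sectors and wear indicators - rising values mean the SSD is wearing out
smartctl -A /dev/vdc | grep -i -e reallocat -e wear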

Removing the cache

Whether the cache device has failed or you just want to replace it with a faster device, un-caching can be done, even online.

Cache device is operational

If the cache device is operational, it is straightforward:

f31-lvmtest:~# lvconvert --uncache vg_data/lv_data
  Logical volume "cachepool_cpool" successfully removed
  Logical volume vg_data/lv_data is not cached.
f31-lvmtest:~#

This also removes the cache pool and cache pool metadata LVs from the VG.
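
If you only want to detach the cache temporarily, for example to resize the origin LV, and keep the cache pool LV for later re-use, --splitcache can be used instead of --uncache:

# Detach the cache; the cache pool LV stays in the VG, unused, ready to be re-attached
lvconvert --splitcache vg_data/lv_data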

Cache device failure

In the case of a failed cache device it looks like the following:

lvconvert --uncache vg_data/lv_data
  WARNING: Couldn't find device with uuid TDYR1f-4QWp-Z8n4-FgaS-T7ni-NVai-3hO3J9.
  WARNING: VG vg_data is missing PV TDYR1f-4QWp-Z8n4-FgaS-T7ni-NVai-3hO3J9 (last written to /dev/vdc).
  WARNING: Couldn't find device with uuid TDYR1f-4QWp-Z8n4-FgaS-T7ni-NVai-3hO3J9.
  WARNING: Cache pool data logical volume vg_data/cachepool_cpool_cdata is missing.
  WARNING: Cache pool metadata logical volume vg_data/cachepool_cpool_cmeta is missing.
Do you really want to uncache vg_data/lv_data with missing LVs? [y/n]: y
  WARNING: Couldn't find device with uuid TDYR1f-4QWp-Z8n4-FgaS-T7ni-NVai-3hO3J9.
  Logical volume "cachepool_cpool" successfully removed
  Logical volume vg_data/lv_data is not cached.

Ensure you remove the failed device from the VG as well:

vgreduce vg_data --removemissing

Afterwards you need to activate the LV again:

[root@f31-lvmtest ~]# lvchange -a y /dev/vg_data/lv_data

Try to mount the device and check what kind of data loss you experienced…
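
A minimal sketch of that check, assuming an ext4 filesystem on the LV:

# Read-only filesystem check first, then try to mount
fsck.ext4 -fn /dev/vg_data/lv_data
mount /dev/vg_data/lv_data /mnt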

Conclusion

dm-cache brings a huge performance gain for both read and write operations, which makes it really worth using. Use the writeback cache mode for better performance and, depending on your resilience requirements, use redundant cache devices (and of course redundant HDDs as well).

Happy caching 🙂

21 thoughts on “Using LVM cache for storage tiering”

  1. Peter G says:

    Nice write up. Thanks.

    In the “What happens when the cache disk fails?” section, should one of the modes you mentioned be “writethrough” mode, rather than “writeback” mode?

  2. Yannis M. says:

    Hello thanks for the write up.

    Quick question: I was testing this within a VM, and when I simulated a complete failure of the cache device, I could not read/write the data LV anymore. Is this the expected behavior?

    As I understand it, one has to issue ‘vgreduce vg_data --removemissing’ and then ‘lvchange -a y /dev/vg_data/lv_data’ in order to bring the data LV back into read/write mode? Is this correct?

    Many thanks,
    Yannis

  3. ERIC DUNCAN says:

    Why is your cache size only 8 GB? If this was a 240 GB SSD, what would you do with the rest of the space? Or, can you just increase the cache drive to something much bigger?

    Could you use the remaining SSD space for another volume?

    Could you provide more details around redundancy with writeback using SSDs? I’d be interested in running something like 2x 240GB SSDs in RAID1, most likely with mdadm.

    • Luc de Louw says:

      Hi,

      This was a virtual test system; one disk image was stored on a spinning device, the other on the SSD. Depending on the workload it can make sense not to use the whole SSD as a cache and to create other LVs for other purposes. Be aware that in such a case I/O will be shared between the cache and the other-purpose LVs.

      I would recommend using the whole SSD (or even better, Optane, a.k.a. 3D XPoint) as a cache to keep the I/O separate.

      Hope this helps.

      Thanks,

      Luc

  4. Stephan says:

    Hey Luc,
    a quick question. I’m intending to use LVM SSD caching for my NAS in writeback mode. Any idea what happens when the system is writing at the moment a power loss occurs? Does the system continue to write the data after the next start?

    Thanks for the answer!
    Cheers,
    Stephan

    • Luc de Louw says:

      Hi Stephan,

      The data is either already written to the slow device or still on the fast device. A loss of power should have no more consequences than it would at the file system level anyway. However, I never tested that scenario. You can also test this yourself using a virtual machine.

      Thanks,

      Luc

  5. Tommi says:

    Hi Stephan,

    What do you think about using LVM cache for virtual machine image storage?

    Through NFS, of course. It might be difficult, because there might be one large file per server that is opened over the network, or a limited number of files anyway. How will the cache handle those?

    I could imagine that it works better as cloud storage for Nextcloud etc.

      • Luc de Louw says:

        Hi Tommi,

        I actually use this setup for VM storage. Depending on how many VMs concurrently access the disk, you may consider Intel Optane for the best performance; for a few VMs, a fast SSD is good enough.

        HTH,

        Luc

        • Tommi says:

          Hi Luc,

          It’s still quite expensive to build a redundant SSD array of more than 10 terabytes. You can get 20+ TB redundant arrays with 10 to 15k rpm drives at a reasonable price.

          That’s why I’m thinking about the tiered system.

  6. Joey says:

    Hi Luc,

    Would you happen to know how to do things like sync checks and raid scrubbing on lvm-raid volumes behind the cache?

    I haven’t been able to find any way to do that, other than turning off caching. That would be a bit cumbersome if I have to do that after every power-outage or crash.

    • Luc de Louw says:

      Hi,

      On LVM-RAID1 I do my checks like:

      lvchange --syncaction check /dev/vg_data/lv_blah_corig

      Also, automatic resyncing after an unclean system shutdown works as expected.

      Hope this helps,

      Luc

  7. Arun says:

    Hi Luc.

    Currently, my laptop has a 500GB nvme and a 2TB HDD installed.
    My Partition layout on NVME is:

    100MB EFI (Made by windows – VFAT)
    16M Windows MSFT
    380GB Windows NTFS
    512M MS Recovery
    512M /boot partition (VFAT extended boot partition)
    100Gb / (Linux root EXT4)

    My 2TB HDD is:
    500GB – NTFS (Windows extra storage)
    1.5TB – EXT4 /home
    16GB SWAP partition.

    I have found my Linux is quite slow due to /home being on the HDD. But as my use case sometimes includes VMs, 100GB is often not enough for me to use for /.

    How can I convert this system in the best way to use LVM dm-cache, as you mentioned? My idea is to use the 100GB NVMe SSD partition as cache + metadata + LV, and extend that LV onto a 1.5TB HDD partition.

    How can this work well? Do you have any ideas?

    • Luc de Louw says:

      Hi,

      This should work; however, I recommend backing up all data and testing the procedure on a virtual machine first.

      Thanks,

      Luc

  8. labrunner says:

    Hi thanks for laying this out.

    I am testing this for myself during a rebuild of my lab.

    I created mdadm raid10 over 8 x 8TB
    Created mdadm raid1 with 2 nvme drives for cache
    Created mdadm raid1 with 2 nvme drives for meta
    System has a total of 4 nvme drives, no sharing of devices.

    Each component has been tested individually and operates as expected.
    The raid10 has been tested and benchmarked without lvm cache and works as expected (actually, a little better).

    I put it all together in lvm vg as you described to create the cache.

    When I benchmark it, I get almost double speeds on threaded random iops with 16k blocksize, and even more than double on 4k blocksizes.

    However, when I benchmark sequential reads with 1M block sizes, the speed is abysmal. It is about a third of what it should be, and it’s also very erratic, meaning the throughput is not stable; it fluctuates between 100MiB/s and 600MiB/s.

    Without the cache it does close to 800MiB/s, stable. This can be higher when I create the mdadm array with a larger chunk size, but I prefer small random IOPS over large-block throughput, so that’s good.

    Do you have any idea what’s going on?

    Thanks!

    • Luc de Louw says:

      Hi,

      I have no explanation for this. I suggest getting in touch with the developers; maybe they have an idea.

      Thanks,

      Luc

  9. Pix says:

    Hey Luc, great article. I tried this on my PC; when I tried to configure a writethrough cache, write performance degraded to lower than the write speed of the SATA hard disk.
    Write speed without caching on the SATA HDD is around 40 mbps,
    and write speed with writethrough caching is around 10 mbps.
    I am using a partition of a 2TB HDD (5400 rpm) as the slow disk
    and a partition of a WD NVMe drive as the fast caching disk (average read speed is around 2 gbps and write speeds around 200-250 mbps). Can you please share your email so that I can share more details?

  10. Raj says:

    Hi,

    I intend to build a simple, low-cost NAS appliance for home usage with a Linux OS. I was thinking of getting one SSD for faster writes and then pushing the files written to the SSD to a cheaper 7200 rpm HDD. I would keep the second tier in RAID 1 so that my important docs and pics are not lost.

    Since the mobile pics/videos and other docs are small in size, I don’t see redundancy as a need on the SSD layer, but it’s a must for the persistent storage tier.

    I will be building this all at home; I am looking at a software solution for tiering using a Linux OS on my custom desktop.

    Any links or references are much appreciated.

    Thanks in advance.
    Cheers

  11. Alexis Leroy says:

    Hi Luc,

    Funny that, of all times, today I needed to upgrade my 8TB hard disk + 80GB NVMe to 16TB with the same cache.
    So, while searching for the correct way to resize, I ended up re-reading your excellent article and some others.

    If you wish to add a bit of useful information, talking about resize might be a good subject, given that a cached LV cannot be resized as-is.
    Many people end up recreating the cache volumes entirely 🙁

    I ended up doing the following after the pvmove from my 8TB disk to the new 16TB:
    # lvconvert --splitcache spindles/slow-8T
    Logical volume spindles/slow-8T is not cached and spindles/CacheDataLV is unused.
    # lvresize -l +100%FREE /dev/spindles/slow-8T
    Size of logical volume spindles/slow-8T changed from <7,28 TiB (1907721 extents) to 14,55 TiB (3814911 extents).
    Logical volume spindles/slow-8T successfully resized.
    # lvconvert --type cache --cachepool /dev/spindles/CacheDataLV /dev/spindles/slow-8T
    Do you want wipe existing metadata of cache pool spindles/CacheDataLV? [y/n]: y
    Logical volume spindles/slow-8T is now cached.
    # resize2fs /dev/mapper/spindles-slow–8T

    All done with the filesystem mounted. Only the EXT4 resize took some time, the rest was immediate.
    I do love LVM, at home and at work 😀

    I hope this can help, as your article helped me earlier.
    Best regards,
