Leveraging Network-Bound Disk Encryption at Enterprise Scale

Tang and Clevis

Tang and Clevis

Network-Bound Disk Encryption (NBDE) adds scaling to LUKS by automated disk unlocking on system startup.

Why should I encrypt disks? If you dont want to see your corporate and private data leaked, you should do so as an additional security measure.

Use cases

There are basically two use cases for disk encryption. The first one is to prevent data leaks when a device gets stolen or lost (mobile computers, unsecured server rooms etc.). Theft of devices is usually not a threat for enterprise grade data centers with physical security.

Here comes the second use case for this enterprise grade data centers: At some point in time, disks will get disposed, either because of a defect or they get outdated technology wise. That means a data leak is possible at the end of the disks life cycle. A defect disk can not be wiped at all. For someone with deep pockets, there is still a chance to at least partially access the data. Wiping six TiB disks takes many hours just for overwriting them with zeros, not even with random data. An encrypted disk without a passphrase set can just simply get disposed without considering if it needs to be wiped or physically destroyed.

Note: Disk encryption does not help protecting you from a data theft by a person having access to the data, it also does not help against misbehaving software.

As you can imagine, it is a good idea to encrypt your storage. The standard for disk encryption in Linux is LUKS (Linux Unified Key Setup).

Adding Tang and Clevis for scaling

Unfortunately LUKS does not scale at all, because the passphrase must be entered manually on system startup, a no-go for data center operations. Tang and Clevis adds the scaling factor to the game.

Tang is the server component, Clevis and LUKS-meta the client component. The secret itself is stored on the client, the client asks the server for the data needed for the decryption of the key stored in the LUKS meta data. For more information on the crypto algorithms used, please see the Slide Deck “Tang and Clevis” by Fraser Tweedale

Availability and support

Tang and clevis have been added to RHEL 7.4 and are supported. The packages tang-nagios and clevis-udisk2 are in technical preview phase and are not supported. The packages are included in the base subscription.

It is also available for Fedora as well.

Set up the Tang servers

Setting up a Tang server is straight forward. For redundancy, please set up at least two Tang servers, a maximum of seven Tang Servers are supported by the client, which corresponds to the number of LUKS slots (eight) minus the one used for the initial passphrase.

[root@tang1 ~]# yum -y install tang
[root@tang1 ~]# systemctl enable --now tangd.socket
[root@tang1 ~]# jose jwk gen -i '{"alg":"ES512"}' -o /var/db/tang/new_sig.jwk
[root@tang1 ~]# jose jwk gen -i '{"alg":"ECMR"}' -o /var/db/tang/new_exc.jwk

Display the Thumbprint to be added to the Kickstart later on.

[root@tang1 ~]# jose jwk thp -i /var/db/tang/new_sig.jwk

Automated client setup during Kickstart

Be aware that you can run into problems when re-provisioning a system that contains old LUKS keys. You probably want to wipe them. In the following setup, all the slots are located on the second partition.

# Wipe LUKS keys on the second partition of disk vda
%pre
cryptsetup isLuks /dev/vda2  && dd if=/dev/zero of=/dev/vda2 bs=512 count=2097152
%end

part /boot      --fstype ext2 --size=512 --ondisk=vda
part pv.0       --size=1 --grow --ondisk=vda --encrypted --passphrase=dummy-master-pass

volgroup vg_luksclient pv.0

logvol /        --name=lv_root    --vgname=vg_luksclient --size=4096
logvol /home    --name=lv_home    --vgname=vg_luksclient --size=512 --fsoption=nosuid,nodev
logvol /tmp     --name=lv_tmp    --vgname=vg_luksclient --size=512 --fsoption=nosuid,nodev,noexec
logvol /var     --name=lv_var    --vgname=vg_luksclient --size=2048 --fsoption=nosuid,nodev
logvol /var/log --name=lv_var_log --vgname=vg_luksclient --size=2048 --fsoption=nosuid,nodev
logvol swap     --fstype swap --name=lv_swap    --vgname=vg_luksclient --size=4096

Be aware that the transfer of the Kickstart file will be done in clear text, that means that this dummy-master-pass is exposed. It should be automatically removed. You can add a master key via a secure way after the installation with Ansible, Puppet or simply manually via SSH.

Ensure you have the clevis-dracut package installed so that the init ramdisk will get created in the right way.

%packages
clevis-dracut
%end

In the %post section of the Kickstart file, add the following to register your system to the Tang servers.

%post
clevis bind luks -f -k- -d /dev/vda2 tang '{"url":"http://tang1.example.com","thp":"vkaGTzcBNEeF_X5KX-w9754Gl80"}' <<< "dummy-master-pass"
clevis bind luks -f -k- -d /dev/vda2 tang '{"url":"http://tang2.example.com","thp":"x_KcDG92bVP3SUL9KOzmzps4sZg"}' <<< "dummy-master-pass"
%end

In case you want to remove the master password, put the following line into your %post section of the Kickstart file:

%post
cryptsetup luksRemoveKey /dev/vda2 - <<<"dummy-master-pass"
%end

Usage of a passphrase

There are pros and cons about doing so. On one hand, if all Tang servers are unavailable, there is not a slight chance to access the data if there is no master password set. On the other hand, a master password can be leaked and it should be changed from time to time which needs to be automated (i.e. with Ansible) to scale.

I personally tend to use a master password. Choose wisely depending on your specific use case if you set a master password or not.

Good to know

Be aware that the password prompt on system startup will always show up. It disappears automatically after a few seconds if a Tang server have been reached.

Documentation

The following documents helps you further to get an idea about the Tang/Clevis setup:

A nice presentation from a conference is available here: https://www.usenix.org/conference/lisa16/conference-program/presentation/atkisson

Another more technical presentation is available here: http://redhat.slides.com/npmccallum/sad#/

Important commands

There are a few LUKS and clevis related commands you should know about.

cryptsetup

Cryptsetup is used to handle the LUKS slots, adding and removal of passphrases. More information is available in man 8 cryptsetup

luksmeta

luksmeta gives you access to the meta data of LUKS. I.e. showing which slots are in use:

[root@luksclient ~]# luksmeta show -d /dev/vda2 
0   active empty
1   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
2   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
3   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
4 inactive empty
5 inactive empty
6 inactive empty
7 inactive empty
[root@luksclient ~]#

The following command is reading the meta data and put the encrypted content to the file meta

luksmeta load -d /dev/vda2 -s 1  > meta

It looks like this:

eyJhbGciOiJFQ0RILUVTIiwiY2xldmlzIjp7InBpbiI6InRhbmciLCJ0YW5nIjp7ImFkdiI6eyJrZXlzIjpbeyJhbGciOiJFUzUxMiIsImNydiI6IlAtNTIxIiwia2V5X29wcyI6WyJ2ZXJpZnkiXSwia3R5IjoiRUMiLCJ4IjoiQVdCeFZSYk9MOXBYNjhRU0lqSEZyNzVuNUVXdDZGblkySmNaNVgxX0s4MldaNW9kMUNQTUJwQ0dsS1ZFZ29LOFQwMERPazFsMHJRQ2kyOEg4SDBsVXlfaCIsInkiOiJBTzNLdmsyc2pqYVpSM3RrbW5KcVQyWGYtd1lnbXZSa0JqNUpmNFgzWmtHTDRHbTYtbE5qemhzVEdraEZLRmZUdnJLUElDTHBSQndCTnNXc0JuZUlVTEViIn0seyJhbGciOiJFQ01SIiwiY3J2IjoiUC01MjEiLCJrZXlfb3BzIjpbImRlcml2ZUtleSJdLCJrdHkiOiJFQyIsIngiOiJBTm05WmQwUDFyT1F6MXhhQVFNTzJxRjRua3ZHMVpKS2VHNkFaWjdPTEo1ejhKS000N0otMUhZWnkyZk0zT29ZQVdiUndQdnJ6aUt4MFJmNWh0QlkzNXBxIiwieSI6IkFTdVZYR3JRQ0c4R3dKTENXbWpVbC1jN0llUUh4TC01cFRGYTJaOU1ESnU4Ym9JZFo3WlNiZHBHZUFWMnhMTzlCTnlqbE5zSzB2ZWJrR3ZDcmU5bDl0aFYifSx7ImFsZyI6IkVTNTEyIiwiY3J2IjoiUC01MjEiLCJrZXlfb3BzIjpbInZlcmlmeSJdLCJrdHkiOiJFQyIsIngiOiJBT01Yc2JqcGd3MUVwMEdmSHd2NFRGTzFhWlNxZlY2NWlURVpWWDc4Y3M1SkI2dlBHaUZwd2RiZnpVWlpzR2FCZVludXdzTHJ0UTZTYm0zWHVTdGNHTFlFIiwieSI6IkFSZUZMckNHZnB6S2tzaTZvVXdLdEZjOW9IbHVHdDJjd3AwNmR1M3dEUGgta0t5N3RfTmZEU1JOSXRuWkZIbWs3eVlwSnkxYlpiRDRUTklTcXFIZDlDbDIifV19LCJ1cmwiOiJodHRwOi8vdGFuZzEuaG9tZS5kZWxvdXcuY2gifX0sImVuYyI6IkEyNTZHQ00iLCJlcGsiOnsiY3J2IjoiUC01MjEiLCJrdHkiOiJFQyIsIngiOiJBQnlMWjZmcWJKVVdzalZVc1ZjN0hwWlhLQ1BIZjJWenhyTExkODdvajBERnhGeTJRUTJHSXNEbFJ6OTg0cmtkNDJVQ3pDVy1VcGE4bG9nTl9BT0hsU0syIiwieSI6IkFBcHJaLUxFMUk3NUxWMXZtTHhkYUl0TmlETnpjUUVpLXJsR1FwVjFnT2IwWU5rbDFyWVgxdU45OE9WcHdiWUowTEpYYnYtRGZnSjU2RjBPMkNFczJOck4ifSwia2lkIjoicTAzQXd4VG5sU3lpQjRnelBTYTBfcXhsVzU4In0..Ws5k2fgQ26yN-mMv.1NwlYoyaUmF5X0jqGDcKO3HWn02StXotqnjZKaZtSUXioyW0-rc8HxH6HkkJTMQJk_EXr8ZXB4hmTXfUqAtRqgpEW4SdzJIw_AsGbJm5h_8lQLPIF4o.fbbNxK51MC14hX46Dgkj6Q

You can decrypt it:

[root@luksclient ~]# clevis decrypt tang < meta 
OTQy6NGfqTjppwIrrM4cc15zr-sxy5PPmKExHul1m-pcMjEHjGdoN5uqD9vcEiuMM56VapPV_LedXYEkktYO-g[root@luksclient ~]#

OTQy6NGfqTjppwIrrM4cc15zr-sxy5PPmKExHul1m-pcMjEHjGdoN5uqD9vcEiuMM56VapPV_LedXYEkktYO-g is the cleartext passphrase returned. It actually can be used to type it in the console, I recommend a serial console where you can copy-paste 😉

If you run the same command again when both Tang servers are down, you will get an error:

[root@luksclient ~]# clevis decrypt tang < meta
Error communicating with the server!
[root@luksclient ~]#

As you can see, you don’t need to provide a Tang Server URL.

lsblk

Lsblk is a nice little tool which shows the available storage in a tree. You can see the different layers of the storage subsystem.

[root@luksclient ~]# lsblk 
NAME                                          MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
vda                                           252:0    0   20G  0 disk  
├─vda1                                        252:1    0  512M  0 part  /boot
└─vda2                                        252:2    0 19.5G  0 part  
  └─luks-f0a70f08-b745-429f-ba8e-ec07e8953c3d 253:0    0 19.5G  0 crypt 
    ├─vg_luksclient-lv_root                   253:1    0    4G  0 lvm   /
    ├─vg_luksclient-lv_swap                   253:2    0    4G  0 lvm   [SWAP]
    ├─vg_luksclient-lv_var_log                253:3    0    2G  0 lvm   /var/log
    ├─vg_luksclient-lv_var                    253:4    0    2G  0 lvm   /var
    ├─vg_luksclient-lv_tmp                    253:5    0  512M  0 lvm   /tmp
    └─vg_luksclient-lv_home                   253:6    0  512M  0 lvm   /home
[root@luksclient ~]# 

json_reformat

If you want to play with JSON, install the package yajl.

With json_reformat you can minimize JSON and you are required to do so as clevis encrypt sss does not allow spaces, it fails.

Lets reformat this:

[root@luksclient ~]# echo '{"t": 1,"pins": {"tang": [{"url": "http://tang1.example.com"}, {"url": "http://tang2.example.com"}]}}'|json_reformat -m && echo ""
{"t":1,"pins":{"tang":[{"url":"http://tang1.example.com"},{"url":"http://tang2.example.com"}]}}
[root@localhost ~]# 

How to figure out to which servers the client is enrolled

I was curious how clevis figures out what Tang server to connect to. There is nothing written to the initrd, that means it must be stored somewhere in the LUKS metadata. It was taking me some time to figure out how it works.

Just decode the meta data to JSON:

 luksmeta load -d /dev/vda2 -s 1|jose b64 dec -i- |json_reformat 

Unfortunately the JSON seems to be invalid, at least json_reformat brings an error parse error: premature EOF. However, you will see the URL.

Test scenarios

I made a few tests with to figure out how Tang and Clevis works when something is going south.

Tang server(s) not available during system installatioon

If only one Tang server is available, installation work, server gets enrolled to only one Tang server. The server must be enrolled to the second Tang server manually after it came up again.

If both servers are down during installation, the installation finished successful, the temporary passphrase is still active as LUKS will deny removing the last passphrase available. Of course, the LUKS metadata is not available. You can enroll the servers manually after one or both servers come back online. Remember to remove the temporary passphrase afterwards.

Tang Server(s) not available during reboot

If one Tang server is not available, the other one is used, no impact.

If both servers are down, Plymouth asks for the LUKS passphrase. If you removed the the passphrase, you will not be able to boot the server. After starting one or both Tang servers, boot continues.

Drawbacks

Tang and Clevis are both very young projects and not yet mature. I’ve figured out the following drawbacks:

Missing Registry

At the moment there is no way to report which servers are registered to what Tang server. This makes it hard to check from a central point if a server is really registered to two (or more) Tang servers to ensure smooth operation in the case of a failed Tang server.

This is particular true if one (or more) Tang server is down during install time of the client system. As a workaround, set up a monitoring script that checks if there are two active slots. I.e.

if [ $(luksmeta show -d /dev/vda2|grep " active"|grep -v empty|wc -l) -ne 2 ] || [ $? -eq 0 ]; then
        echo "Something is wrong with the LUKS metadata, please check"|mail -s "LUKS Metadata failure" monitoring@example.com
fi

Logging

Logging of Tang requests is very basic at the moment, some improvement is needed here as well. Again, the documentation for the return codes is lacking

Scalability

When using more than one Tang server, always that one defined in the first slot be be accessed. There is no round-robin or similar load-balancing method. This means that that the sequence of Tang Server must be shuffled on the client which involves some logic in the Kickstart file.

One Tang server should be able to handle more than 2k requests per second, so the problem only kicks in very large environments, where more than 2000 server are booting (or getting installed) at the same time.

Maturity

Its a brand new project using completely new ideas and methods. At the moment not much experience is there, an issue that will be solved over time.

Documentation

There is almost no documentation available which goes beyond a few lines to show how to set up the server and client. Whats missing is how to troubleshoot the environment. Another missing part is how to handle key rotation, its unclear for me if and what has to be done on the client.

Easy-to-read documentation is important, in particular for Tang and Clevis which is using some new style die-hard cryptographic mathematics.

Conclusion

Both, client and server have a very small footprint and are performing well. The idea of Tang and Clevis is brilliant and a first incarnation is ready to use. Due to the drawbacks mentioned above I think it is not yet ready for production and it will take a while until it is.

Due to the nature of the project, stability and reliability is a key point, that is why people should test it and provide feedback.

I would like to thank the involved engineers, cool stuff.

Have fun:-)