Building a virtual CEPH storage cluster

This post will guide you through the procedure to build up a testbed on RHEL7 for a complete CEPH cluster. At the end you will have an admin server, one monitoring node and three storage nodes. CEPH is an object and block storage system, mostly used for virtual machine images and bulk BLOBs such as video and other media. It is not (yet) intended to be used as a file storage.

Machine set up
I’ve set up five virtual machines: one admin server, one monitoring node and three OSD servers.

  • ceph-admin.example.com
  • ceph-mon01.example.com
  • ceph-osd01.example.com
  • ceph-osd02.example.com
  • ceph-osd03.example.com

Each of them has a 10GB disk for the OS; the OSD servers have three additional 10GB disks each for the storage, 90GB in total. Each virtual machine got 1GB RAM assigned, which is barely enough for some first tests.

Configure your network
It is recommended to have two separate networks, one public and one for the cluster interconnect (heartbeat, replication etc.). However, for this testbed only one network is used.

While it is recommended practice to configure your servers with the fully qualified hostname (FQHN), you must also make the short hostname resolvable for CEPH.

Check if this is working as needed:

[root@ceph-admin ~]# hostname
ceph-admin.example.com
[root@ceph-admin ~]# hostname -s
ceph-admin
[root@ceph-admin ~]# 

To be able to resolve the short hostname, edit your /etc/resolv.conf and enter a domain search path

[root@ceph-admin ~]# cat /etc/resolv.conf 
search example.com
nameserver 192.168.100.148
[root@ceph-admin ~]# 

Note: My network is fully IPv6-enabled, and I first tried to set CEPH up with IPv6 only. I was unable to get it working properly with IPv6! Disable IPv6 before you start. Disclaimer: maybe I made some mistakes.
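
One way to disable IPv6 on RHEL7 is via sysctl (a quick sketch, not the only way; make it persistent in /etc/sysctl.conf and repeat it on every node):

[root@ceph-admin ~]# sysctl -w net.ipv6.conf.all.disable_ipv6=1
[root@ceph-admin ~]# sysctl -w net.ipv6.conf.default.disable_ipv6=1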

You also need to keep time in sync. The usage of NTP or chrony is best practice anyway.
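
For example with chrony, which ships with RHEL7 (to be done on every node):

[root@ceph-admin ~]# yum -y install chrony
[root@ceph-admin ~]# systemctl enable chronyd
[root@ceph-admin ~]# systemctl start chronyd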

Register and subscribe the machines and attach the repositories needed

This procedure needs to be repeated on every node, including the admin server and the monitoring node(s).

[root@ceph-admin ~]# subscription-manager register
[root@ceph-admin ~]# subscription-manager list --available > pools

Search the pools file for the Ceph subscription.
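
A simple grep can help to locate the Pool ID (the search pattern is just a guess at the subscription name; adjust it as needed):

[root@ceph-admin ~]# grep -i -A 10 "ceph" pools

Then attach the pool in question: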

[root@ceph-admin ~]# subscription-manager attach --pool=<the-pool-id>

Disable all repositories and enable the needed ones

[root@ceph-admin ~]# subscription-manager repos --disable="*"
[root@ceph-admin ~]# subscription-manager repos --enable=rhel-7-server-rpms \
--enable=rhel-7-server-rhceph-1.2-calamari-rpms \
--enable=rhel-7-server-rhceph-1.2-installer-rpms \
--enable=rhel-7-server-rhceph-1.2-mon-rpms \
--enable=rhel-7-server-rhceph-1.2-osd-rpms

Set up a CEPH user
This user is needed on every node. Of course, you should set a secure password instead of this example 😉

[root@ceph-admin ~]# useradd -d /home/ceph -m -p $(openssl passwd -1 <super-secret-password>) ceph

Creating the sudoers rule for the ceph user

[root@ceph-admin ~]# echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
[root@ceph-admin ~]# chmod 0440 /etc/sudoers.d/ceph

Set up passwordless SSH logins. First create an SSH key for root. Do not set a passphrase!

[root@ceph-admin ~]# ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa

And add the key to ~/.ssh/authorized_keys of the ceph user on the other nodes.

[root@ceph-admin ~]# ssh-copy-id ceph@ceph-mon01
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd01
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd02
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd03

Configure your SSH client.

To make your life easier (so you do not have to pass --username ceph every time you run ceph-deploy), set up the SSH client config file. This can be done for the user root in ~/.ssh/config or in /etc/ssh/ssh_config.

Host ceph-mon01
     Hostname ceph-mon01
     User ceph

Host ceph-osd01
     Hostname ceph-osd01
     User ceph

Host ceph-osd02
     Hostname ceph-osd02
     User ceph

Host ceph-osd03
     Hostname ceph-osd03
     User ceph
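
A quick check: the following should log you in as the ceph user on the monitor node without asking for a password.

[root@ceph-admin ~]# ssh ceph-mon01 whoami
ceph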

Set up the admin server

Go to https://access.redhat.com and download the ISO image. Copy the image to your admin server and mount it loop-back.

[root@ceph-admin ~]# mount rhceph-1.2.3-rhel-7-x86_64.iso /mnt -o loop

Copy the required product certificates to /etc/pki/product

[root@ceph-admin ~]# cp /mnt/RHCeph-Calamari-1.2-x86_64-c1e8ca3b6c57-285.pem /etc/pki/product/285.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-Installer-1.2-x86_64-8ad6befe003d-281.pem /etc/pki/product/281.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-MON-1.2-x86_64-d8afd76a547b-286.pem /etc/pki/product/286.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-OSD-1.2-x86_64-25019bf09fe9-288.pem /etc/pki/product/288.pem

Install the setup files

[root@ceph-admin ~]# yum install /mnt/ice_setup-*.rpm

Set up a config directory:

[root@ceph-admin ~]# mkdir ~/ceph-config
[root@ceph-admin ~]# cd ~/ceph-config

and run the installer

[root@ceph-admin ~]# ice_setup -d /mnt

To initialize, run calamari-ctl:

[root@ceph-admin ceph-config]# calamari-ctl initialize
[INFO] Loading configuration..
[INFO] Starting/enabling salt...
[INFO] Starting/enabling postgres...
[INFO] Initializing database...
[INFO] Initializing web interface...
[INFO] You will now be prompted for login details for the administrative user account.  This is the account you will use to log into the web interface once setup is complete.
Username (leave blank to use 'root'): 
Email address: luc@example.com
Password: 
Password (again): 
Superuser created successfully.
[INFO] Starting/enabling services...
[INFO] Restarting services...
[INFO] Complete.
[root@ceph-admin ceph-config]#

Create the cluster

Ensure you are running the following command in the config directory! In this example it is ~/ceph-config.

[root@ceph-admin ceph-config]# ceph-deploy new ceph-mon01

Edit some settings in ceph.conf

osd_journal_size = 1000
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128

In production, the journal size should be bigger, at least 10G. The number of placement groups depends on the number of your cluster members, the OSD servers. For small clusters with up to 5 OSD servers, 128 PGs per pool are fine.

Install the CEPH software on the nodes.

[root@ceph-admin ceph-config]# ceph-deploy install ceph-admin ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03

Add the initial monitor server:

[root@ceph-admin ceph-config]# ceph-deploy mon create-initial

Connect all nodes to Calamari:

[root@ceph-admin ceph-config]# ceph-deploy calamari connect ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03 ceph-admin

Make your admin server an actual admin node:

[root@ceph-admin ceph-config]# yum -y install ceph ceph-common
[root@ceph-admin ceph-config]# ceph-deploy admin ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03 ceph-admin

Purge and add your data disks:

[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdd
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdd
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdd

[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdd
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdd
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdd

You now can check the health of your cluster:

[root@ceph-admin ceph-config]# ceph health
HEALTH_OK
[root@ceph-admin ceph-config]# 

Or with some more information:

[root@ceph-admin ceph-config]# ceph status
    cluster 117bf1bc-04fd-4ae1-8360-8982dd38d6f2
     health HEALTH_OK
     monmap e1: 1 mons at {ceph-mon01=192.168.100.150:6789/0}, election epoch 2, quorum 0 ceph-mon01
     osdmap e42: 9 osds: 9 up, 9 in
      pgmap v73: 192 pgs, 3 pools, 0 bytes data, 0 objects
            318 MB used, 82742 MB / 83060 MB avail
                 192 active+clean
[root@ceph-admin ceph-config]# 
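
If you like, a quick smoke test can be done by creating a test pool and a small RBD image (the names are arbitrary, this is just a sketch):

[root@ceph-admin ceph-config]# ceph osd pool create rbdtest 128
[root@ceph-admin ceph-config]# rbd create testimage --pool rbdtest --size 1024
[root@ceph-admin ceph-config]# rbd ls rbdtest
testimage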

What's next?
Storage is worthless if not used. A follow-up post will guide you through how to use CEPH as storage for libvirt.

Creating and managing iSCSI targets

If you want to create and manage iSCSI targets with Fedora or RHEL, you will stumble upon tgtd and tgtadm. These tools are easy to use but have some obstacles to take care of. This is a quick guide on how to use tgtd and tgtadm.

iSCSI terminology
In the iSCSI world, we do not talk about servers and clients, but about iSCSI targets, which are the servers, and iSCSI initiators, which are the clients.

Install the tool set
It is just one package to install; afterwards enable and start the service:

target:~# yum install scsi-target-utils
target:~# chkconfig tgtd on
target:~# service tgtd start

Or Systemd style:

target:~# systemctl start tgtd.service
target:~# systemctl enable tgtd.service

Online configuration vs. configuration file
There are basically two ways of configuring iSCSI targets:

  • Online configuration with tgtadm: changes become available instantly, but are not persistent across reboots
  • Configuration files: changes are persistent, but not instantly available

Well, there is the dump parameter, but passwords, for example, are replaced with "PLEASE_CORRECT_THE_PASSWORD", which makes the dump useless if you are using CHAP authentication.

If you do not use CHAP authentication and use IP-based ACLs instead, the dump can help you: just dump the config to /etc/tgt/conf.d.

Usage of tgtadm

After you have created the storage such as a logical volume (used in this example), a partition or even a file, you can add the first target:

target:~# tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn.2013-07.com.example.storage.ssd1

Then you can add a LUN (logical unit) to the target:

target:~# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 --backing-store /dev/vg_storage_ssd/lv_storage_ssd
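
To verify the result, tgtadm can show the current target configuration, including the LUNs:

target:~# tgtadm --lld iscsi --op show --mode target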

It is always a good idea to restrict access to your iSCSI targets. There are two ways to do so: IP-based and user-based (CHAP authentication) ACLs.

In this example, we first add two addresses and later remove one of them again, just as a demo:

target:~# tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address=192.168.0.106
target:~# tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address=192.168.0.107

Go to both initiators with these IP addresses and check if the target is visible:

iscsiadm --mode discovery --type sendtargets --portal 192.168.0.1

Let's remove the ACL for the IP address 192.168.0.107:

target:~# tgtadm --lld iscsi --mode target --op unbind --tid 1 --initiator-address=192.168.0.107

Check if the target is still visible on the host with IP address 192.168.0.107; it is not anymore.

If you want to use CHAP authentication, please be aware that tgt-admin --dump does not save passwords, so initiators will not be able to log in after a restart of tgtd.

To add a new user:

target:~# tgtadm --lld iscsi --op new --mode account --user iscsi-user --password secret

And add the ACL to the target:

target:~# tgtadm --lld iscsi --op bind --mode account --tid 2 --user iscsi-user

To remove an account for the target:

target:~# tgtadm --lld iscsi --op unbind --mode account --tid 2 --user iscsi-user
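
For completeness: on the initiator side, the CHAP credentials are typically configured in /etc/iscsi/iscsid.conf (a sketch using the example account from above), followed by a new discovery and login:

node.session.auth.authmethod = CHAP
node.session.auth.username = iscsi-user
node.session.auth.password = secret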

As I wrote further above, configurations done with tgtadm are not persistent over a reboot or a restart of tgtd. For basic configurations as described above, the dump parameter works fine. As configuration files in /etc/tgt/conf.d/ are automatically included, you can just dump the config into a separate file.

target:~# tgt-admin --dump |grep -v default-driver > /etc/tgt/conf.d/my-targets.conf

The other way round
If you are using a more sophisticated configuration, you probably want to manage your iSCSI configuration the other way round.

You can edit your configuration file(s) in /etc/tgt/conf.d and invoke tgt-admin with the respective parameters to update the config instantly.

tgt-admin (not to be mistaken as tgtadm) is a perl script which basically parses /etc/tgt/targets.conf and updates the targets by invoking tgtadm.
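
A minimal configuration file in /etc/tgt/conf.d/, reusing the target and backing store from above, might look like this (incominguser is only needed for CHAP):

<target iqn.2013-07.com.example.storage.ssd1>
    backing-store /dev/vg_storage_ssd/lv_storage_ssd
    initiator-address 192.168.0.106
    incominguser iscsi-user secret
</target>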

To update your target(s), issue:

tgt-admin --update ALL --force

For all your targets, including active ones (--force), or

tgt-admin --update --tid=1 --force

For updating target ID 1 only.

SIGKILL is nasty but sometimes needed
tgtd cannot be stopped like usual daemons; you need to use the sledgehammer and send kill -9 to the process, followed by a service tgtd start command.
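
In practice, that boils down to something like this:

target:~# kill -9 $(pidof tgtd)
target:~# service tgtd start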

Systemd shows how the start-up and stop process can be handled in a proper workaround way: have a look at /usr/lib/systemd/system/tgtd.service, which does not actually stop tgtd but just removes the targets.

Conclusion
tgtadm can be helpful, and sometimes harmful. Carefully consider which way is better for you: creating config files with tgtadm, or updating the configuration files and activating them with tgt-admin.

Confused about write barriers on file systems…

As ext3 is already known as a very robust file system, why is the default mount option still barrier=0? The problem is LVM and the device mapper: they do not support barriers.

When mounting ext3 on an LV with the option barrier=1, the option should be ignored and a warning written. So far so good. Trying this brings a lot of confusion: according to a Red Hat Bugzilla entry one should get a warning, but there is no sign of it, neither in /var/log/messages nor in the dmesg output. Even more confusing is the output of mount: it shows the LV as mounted with the barrier=1 option.
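
For reference, the test was roughly the following (a sketch; device and mount point are placeholders):

mount -o barrier=1 /dev/vg00/lv_test /mnt/test
mount | grep lv_test
dmesg | tail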

The conclusion is: enabling write barriers on physical disk partitions adds reliability to your file system; on LVM setups it is better disabled for the moment.

Is this fun? No…

How journaling options affect the performance of the ext3 filesystem

The need for speed
Everyone looks for the optimum of speed in their servers. Today's servers have plenty of spare CPU power, and RAM is dirt cheap. Today's common bottleneck is storage.

One way to solve the bottleneck is throwing money at it; the other, smarter way is choosing the file system and options that best match the purpose of the server.

On Linux systems, a bunch of file systems are available and ready to use. There are some high-performance file systems such as SGI's XFS or reiserfs. Both are known to be quite performant, but have the drawback of being unreliable in case of a hard crash or a power loss.

As file systems are the key point for reliability, XFS and reiserfs are out of the question. So what's left? ext3.

Problem
Ext3 is not known for its high performance; it is rather slow compared to XFS, especially if you handle a lot of files, such as on a web server.

Solution
Choose the right journaling options for ext3. You have the choice of three different journaling modes. data=writeback journals only metadata; data can be held in RAM and written to disk later, independently of the journal commit. This is the most performant option, but with the greatest risk of losing data in case of a crash or power loss. Before choosing this option, use XFS instead; it is more performant at the same risk.

The compromise is data=ordered. Let's quote the man page: "All data is forced directly out to the main file system prior to its metadata being committed to the journal." At the end of the day this means data loss hardly ever happens, but it is not impossible. This option offers a balance between write speed and reliability.

What about the third option, data=journal? All data is written to the journal first and then to its final destination on disk, so one would think that I/O performance decreases.

In theory, data=writeback is the fastest and data=journal the slowest option. Believe it or not: data=journal is in many cases the fastest option, especially in mostly-read applications where you concurrently read lots of small files (such as on web servers).
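
The journaling mode is simply set as a mount option, for example in /etc/fstab (device and mount point are placeholders):

/dev/vg00/lv_www  /var/www  ext3  defaults,data=journal  1 2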

At the end of the day: with the ext3 file system you get an extremely reliable file system with quite good performance, if you choose the journaling option that matches your needs. However, data=journal gives you a high performance penalty on write operations.

Further reading: the antique article from 2001 on the IBM developer network: http://www.ibm.com/developerworks/linux/library/l-fs8.html

Have fun!