etcd is the key-value store for all object definitions, as well as the persistent master state. Other components watch for changes, then bring themselves into the desired state.
{product-title} versions prior to 3.5 used etcd version 2 (v2), while 3.5 and later use version 3 (v3). The data model between the two versions of etcd is different. etcd v3 can use both the v2 and v3 data model, whereas etcd v2 can only use the v2 data model. In an etcd v3 server, the v2 and v3 data stores exist in parallel and are independent.
For both v2 and v3 operations, use the ETCDCTL_API environment variable to select the proper API:
$ etcdctl -v
etcdctl version: 3.2.5
API version: 2
$ ETCDCTL_API=3 etcdctl version
etcdctl version: 3.2.5
API version: 3.2
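The two data stores are independent: a key written through the v2 API is not visible through the v3 API, and vice versa. The following is a minimal illustration using the etcdctl2 and etcdctl3 aliases described later in this section; the /example/v2-only key is a hypothetical value used only for the demonstration:
# etcdctl2 set /example/v2-only "hello"
hello
# etcdctl3 get /example/v2-only
The second command returns no output because the key exists only in the v2 data store.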
See the Migrating etcd Data (v2 to v3) section in the {product-title} 3.7 documentation for information about how to migrate to v3.
The etcd backup process is composed of two different procedures:
- Configuration backup: includes the required etcd configuration and certificates
- Data backup: includes both the v2 and v3 data models
The data backup procedure can be done on any host that has connectivity to the etcd cluster, where the proper certificates are provided, and where the etcdctl tool is installed.
Note: The backup files must be copied to an external system, ideally outside the {product-title} environment, and then encrypted.
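For example, one way to encrypt and copy the /backup directory used in the procedures below off the host is sketched here. The GPG recipient (backup@example.com) and the destination host (backup.example.com) are hypothetical values; any equivalent encryption and transfer tooling can be used:
# tar -czf etcd-backup.tar.gz /backup/
# gpg --encrypt --recipient backup@example.com etcd-backup.tar.gz
# scp etcd-backup.tar.gz.gpg backup.example.com:/srv/backups/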
The etcd configuration files to be preserved are all stored in the /etc/etcd directory of the instances where etcd is running. This includes the etcd configuration file (/etc/etcd/etcd.conf) and the required certificates for cluster communication. All those files are generated at installation time by the Ansible installer.
To back up the etcd configuration:
$ ssh master-0
# mkdir -p /backup/etcd-config-$(date +%Y%m%d)/
# cp -R /etc/etcd/ /backup/etcd-config-$(date +%Y%m%d)/
Note: The backup is to be performed on every etcd member of the cluster, as the certificates and configuration files are unique.
The restore procedure for etcd configuration files replaces the appropriate files, then restarts the service.
If an etcd host has become corrupted and the /etc/etcd/etcd.conf file is lost, restore it using:
$ ssh master-0
# cp /backup/yesterday/master-0-files/etcd.conf /etc/etcd/etcd.conf
# restorecon -Rv /etc/etcd/etcd.conf
# systemctl restart etcd.service
In this example, the backup file is stored at the /backup/yesterday/master-0-files/etcd.conf path, where the /backup directory can be an external NFS share, S3 bucket, or other external storage location.
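If an external NFS share is used for the /backup location, it can be mounted before taking or restoring a backup. This is only a sketch; the NFS server name and export path are hypothetical values:
# mkdir -p /backup
# mount -t nfs nfs.example.com:/exports/etcd-backups /backup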
Note: The {product-title} installer creates aliases to avoid typing all the flags, named etcdctl2 for etcd v2 tasks and etcdctl3 for etcd v3 tasks. However, the etcdctl3 alias does not provide the full endpoint list to the etcdctl command, so the --endpoints option with all the endpoints must be provided.
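If the aliases are not available on a host, similar ones can be defined from the flags shown in the health check examples below. This is only a sketch and the installer-created aliases may differ; adjust the endpoint list to match the environment:
alias etcdctl2='etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --endpoints=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379'
alias etcdctl3='ETCDCTL_API=3 etcdctl --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379'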
Before backing up etcd:
- The etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available
- Ensure connectivity with the etcd cluster (port 2379/tcp)
- Ensure the proper certificates to connect to the etcd cluster are available
- To ensure the etcd cluster is working, check its health:
# etcdctl --cert-file=/etc/etcd/peer.crt \
    --key-file=/etc/etcd/peer.key \
    --ca-file=/etc/etcd/ca.crt \
    --peers="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379"\
    cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
Or if using the etcd v3 API:
# ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
    --key=/etc/etcd/peer.key \
    --cacert="/etc/etcd/ca.crt" \
    --endpoints="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379" endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
- Check the member list:
# etcdctl2 member list
2a371dd20f21ca8d: name=master-1.example.com peerURLs=https://192.168.55.12:2380 clientURLs=https://192.168.55.12:2379 isLeader=false
40bef1f6c79b3163: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=false
95dc17ffcce8ee29: name=master-2.example.com peerURLs=https://192.168.55.13:2380 clientURLs=https://192.168.55.13:2379 isLeader=true
Or, if using the etcd v3 API:
# etcdctl3 member list
2a371dd20f21ca8d, started, master-1.example.com, https://192.168.55.12:2380, https://192.168.55.12:2379
40bef1f6c79b3163, started, master-0.example.com, https://192.168.55.8:2380, https://192.168.55.8:2379
95dc17ffcce8ee29, started, master-2.example.com, https://192.168.55.13:2380, https://192.168.55.13:2379
Note: While the etcdctl2 member list output shows which member is the cluster leader (isLeader), the etcdctl3 member list output does not include leader information.
- Perform the backup:
# mkdir -p */backup/etcd-$(date +%Y%m%d)*
# systemctl stop etcd.service
# etcdctl2 backup \
    --data-dir /var/lib/etcd \
    --backup-dir */backup/etcd-$(date +%Y%m%d)*
# cp /var/lib/etcd/member/snap/db */backup/etcd-$(date +%Y%m%d)*
# systemctl start etcd.service
While stopping the etcd service is not strictly necessary, doing so ensures that the etcd data is fully synchronized.
The etcdctl2 backup command creates a backup of the etcd v2 data. Copying the db file while the etcd service is not running is equivalent to running etcdctl3 snapshot for the etcd v3 data backup:
# mkdir -p */backup/etcd-$(date +%Y%m%d)*
# etcdctl3 snapshot save */backup/etcd-$(date +%Y%m%d)*/db
Snapshot saved at /backup/etcd-<date>/db
# systemctl stop etcd.service
# etcdctl2 backup \
    --data-dir /var/lib/etcd \
    --backup-dir */backup/etcd-$(date +%Y%m%d)*
# systemctl start etcd.service
Note: The etcdctl snapshot save command requires the etcd service to be running.
In this example, a /backup/etcd-<date>/ directory is created, where <date> represents the current date; the directory must be an external NFS share, S3 bucket, or any other external storage location.
In the case of an all-in-one cluster, the etcd data directory is located in /var/lib/origin/openshift.local.etcd.
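For an all-in-one cluster, the same etcdctl2 backup command can be pointed at that data directory. This is only a minimal sketch, assuming the services that embed etcd have been stopped first:
# etcdctl2 backup \
    --data-dir /var/lib/origin/openshift.local.etcd \
    --backup-dir /backup/etcd-$(date +%Y%m%d)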
The following procedure restores healthy backup data files and starts the etcd cluster as a single node, then adds the rest of the nodes if a multi-node etcd cluster is required.
- Stop all etcd services:
# systemctl stop etcd.service
- Clean the etcd data directories to ensure the proper backup is restored, while keeping a copy of the current data:
# mv /var/lib/etcd /var/lib/etcd.old
# mkdir /var/lib/etcd
# chown -R etcd.etcd /var/lib/etcd/
# restorecon -Rv /var/lib/etcd/
Alternatively, you can wipe the etcd data directory:
# rm -Rf /var/lib/etcd/*
Note: In the case of an all-in-one cluster, the etcd data directory is located in /var/lib/origin/openshift.local.etcd.
- Restore a healthy backup data file to one of the etcd nodes:
# cp -R /backup/etcd-xxx/* /var/lib/etcd/
# mv /var/lib/etcd/db /var/lib/etcd/member/snap/db
Perform this step on all etcd hosts (including master hosts collocated with etcd).
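For example, if the backup directory has already been copied to each etcd host, the step can be run over SSH from a bastion host. This loop is only a sketch; the host names reuse the examples from this section and the /backup/etcd-xxx path must match the actual backup directory:
$ for host in master-0.example.com master-1.example.com master-2.example.com; do
    ssh root@$host 'cp -R /backup/etcd-xxx/* /var/lib/etcd/ && mv /var/lib/etcd/db /var/lib/etcd/member/snap/db'
  done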
- Run the etcd service, forcing a new cluster.
This creates a custom file for the etcd service, which overrides the execution command, adding the --force-new-cluster option:
# mkdir -p /etc/systemd/system/etcd.service.d/
# echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
# echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
# sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
    /usr/lib/systemd/system/etcd.service \
    >> /etc/systemd/system/etcd.service.d/temp.conf
# systemctl daemon-reload
# systemctl restart etcd
- Check for error messages:
$ journalctl -fu etcd.service
- Check the health status (in this case, a single node):
# etcdctl2 cluster-health
member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
cluster is healthy
- Restart the etcd service in cluster mode:
# rm -f /etc/systemd/system/etcd.service.d/temp.conf
# systemctl daemon-reload
# systemctl restart etcd
- Check the health status and member list:
# etcdctl2 cluster-health
member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
cluster is healthy
# etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
- Once the first instance is running, it is safe to restore the rest of the etcd servers.
Fix the peerURLs parameter
After restoring the data and creating a new cluster, the peerURLs parameter shows localhost instead of the IP where etcd is listening for peer communication:
# etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=http://*localhost*:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
- Get the member ID from the etcdctl member list output.
- Get the IP where etcd is listening for peer communication:
$ ss -l4n | grep 2380
- Update the member information with that IP:
# etcdctl2 member update *5ee217d17301* https://*192.168.55.8*:2380
Updated member with ID 5ee217d17301 in cluster
- To verify, check that the IP is in the output of the following:
$ etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=https://*192.168.55.8*:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
Add more members
In the instance joining the cluster:
- Get the etcd name for the instance from the ETCD_NAME variable:
# grep ETCD_NAME /etc/etcd/etcd.conf
- Get the IP where etcd listens for peer communication:
# grep ETCD_INITIAL_ADVERTISE_PEER_URLS /etc/etcd/etcd.conf
- Delete the previous etcd data:
# rm -Rf /var/lib/etcd/*
- On the etcd host where etcd is properly running, add the new member:
$ etcdctl2 member add <name> <advertise_peer_urls>
- The command outputs some variables. For example:
ETCD_NAME="master2"
ETCD_INITIAL_CLUSTER="master1=https://10.0.0.7:2380,master2=https://10.0.0.5:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Add those values to the /etc/etcd/etcd.conf file of the new host:
# vi /etc/etcd/etcd.conf
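With the example output above, the relevant lines in the /etc/etcd/etcd.conf file of the new host would look similar to the following; the remaining variables in the file stay as generated at installation time:
ETCD_NAME="master2"
ETCD_INITIAL_CLUSTER="master1=https://10.0.0.7:2380,master2=https://10.0.0.5:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"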
- Once those values are in place, start the etcd service on the node joining the cluster:
# systemctl start etcd.service
- Check for error messages:
$ journalctl -fu etcd.service
- Repeat the above for every etcd node joining the cluster.
- Verify the cluster status and cluster health once all the nodes have joined:
# etcdctl2 member list
5cd050b4d701: name=master1 peerURLs=https://10.0.0.7:2380 clientURLs=https://10.0.0.7:2379 isLeader=true
d0c57659d8990cbd: name=master2 peerURLs=https://10.0.0.5:2380 clientURLs=https://10.0.0.5:2379 isLeader=false
e4696d637de3eb2d: name=master3 peerURLs=https://10.0.0.6:2380 clientURLs=https://10.0.0.6:2379 isLeader=false
# etcdctl2 cluster-health
member 5cd050b4d701 is healthy: got healthy result from https://10.0.0.7:2379
member d0c57659d8990cbd is healthy: got healthy result from https://10.0.0.5:2379
member e4696d637de3eb2d is healthy: got healthy result from https://10.0.0.6:2379
cluster is healthy
The restore procedure for v3 data is similar to the procedure for v2 data.
Snapshot integrity may optionally be verified at restore time. If the snapshot is taken with etcdctl snapshot save, it has an integrity hash that is checked by etcdctl snapshot restore. If the snapshot is copied from the data directory, there is no integrity hash and it will only restore by using --skip-hash-check.
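For example, restoring from a db file that was copied out of the data directory (rather than produced by etcdctl snapshot save) might look like the following sketch, which reuses the values from the restore step below and only adds the --skip-hash-check flag:
# etcdctl3 snapshot restore /backup/etcd-<date>/db \
    --skip-hash-check \
    --data-dir /var/lib/etcd \
    --name master-0.example.com \
    --initial-cluster "master-0.example.com=https://192.168.55.8:2380" \
    --initial-cluster-token "etcd-cluster-1" \
    --initial-advertise-peer-urls https://192.168.55.8:2380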
Important: The procedure to restore only the v3 data must be performed on a single etcd host. You can then add the rest of the nodes to the cluster.
- Stop all etcd services:
# systemctl stop etcd.service
- Clear all old data, because etcdctl recreates it on the node where the restore procedure is performed:
recreates it in the node where the restore procedure is going to be performed:# rm -Rf /var/lib/etcd
- Run the snapshot restore command, substituting the values from the /etc/etcd/etcd.conf file:
# etcdctl3 snapshot restore */backup/etcd-xxxxxx/backup.db* \
    --data-dir /var/lib/etcd \
    --name *master-0.example.com* \
    --initial-cluster *"master-0.example.com=https://192.168.55.8:2380"* \
    --initial-cluster-token *"etcd-cluster-1"* \
    --initial-advertise-peer-urls *https://192.168.55.8:2380*
2017-10-03 08:55:32.440779 I | mvcc: restore compact to 1041269
2017-10-03 08:55:32.468244 I | etcdserver/membership: added member 40bef1f6c79b3163 [https://192.168.55.8:2380] to cluster 26841ebcf610583c
- Restore permissions and the SELinux context of the restored files:
# chown -R etcd.etcd /var/lib/etcd/
# restorecon -Rv /var/lib/etcd
- Start the etcd service:
# systemctl start etcd
- Check for any error messages:
$ journalctl -fu etcd.service
Adding more nodes
Once the first instance is running, it is safe to restore the rest of the etcd servers.
- Get the etcd name for the instance from the ETCD_NAME variable:
# grep ETCD_NAME /etc/etcd/etcd.conf
- Get the IP where etcd listens for peer communication:
# grep ETCD_INITIAL_ADVERTISE_PEER_URLS /etc/etcd/etcd.conf
- On the etcd host where etcd is still running, add the new member:
# etcdctl3 member add *<name>* \
    --peer-urls="*<advertise_peer_urls>*"
- The command outputs some variables. For example:
ETCD_NAME="master2"
ETCD_INITIAL_CLUSTER="master-0.example.com=https://192.168.55.8:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Add those values to the /etc/etcd/etcd.conf file of the new host:
# vi /etc/etcd/etcd.conf
- On the recently added etcd node, clean the etcd data directories to ensure the proper backup is restored, while keeping a copy of the current data:
# mv /var/lib/etcd /var/lib/etcd.old
# mkdir /var/lib/etcd
# chown -R etcd.etcd /var/lib/etcd/
# restorecon -Rv /var/lib/etcd/
Alternatively, wipe the etcd data directory:
# rm -Rf /var/lib/etcd/*
- Start the etcd service on the recently added etcd host:
# systemctl start etcd
- Check for errors:
# journalctl -fu etcd.service
- Repeat the previous steps for every etcd node that needs to be added.
- Verify that the cluster has been properly set up:
# etcdctl3 endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 1.423459ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.767481ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.599694ms
# etcdctl3 endpoint status
https://master-0.example.com:2379, 40bef1f6c79b3163, 3.2.5, 28 MB, true, 9, 2878
https://master-1.example.com:2379, 1ea57201a3ff620a, 3.2.5, 28 MB, false, 9, 2878
https://master-2.example.com:2379, 59229711e4bc65c8, 3.2.5, 28 MB, false, 9, 2878
Scaling the etcd cluster can be performed vertically by adding more resources to the etcd hosts, or horizontally by adding more etcd hosts.
If etcd is collocated on master instances, horizontally scaling etcd prevents the API and controller services from competing with etcd for resources.
Note: Due to the voting system etcd uses, the cluster must always contain an odd number of members. For example, a three-member cluster tolerates the loss of one member, while a five-member cluster tolerates the loss of two.
The new host requires a fresh RHEL 7 dedicated host. The etcd storage should be located on an SSD disk to achieve maximum performance, and ideally on a dedicated disk mounted in /var/lib/etcd.
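A minimal sketch of preparing such a dedicated disk, assuming a hypothetical /dev/sdb device formatted with XFS:
# mkfs.xfs /dev/sdb
# mkdir -p /var/lib/etcd
# echo "/dev/sdb /var/lib/etcd xfs defaults 0 0" >> /etc/fstab
# mount /var/lib/etcd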
Note: {product-title} version 3.7 ships with an automated way to add a new etcd host using Ansible.
- Before adding a new etcd host, perform a backup of both the etcd configuration and data to prevent data loss.
- Check the current etcd cluster status to avoid adding new hosts to an unhealthy cluster:
# etcdctl --cert-file=/etc/etcd/peer.crt \
    --key-file=/etc/etcd/peer.key \
    --ca-file=/etc/etcd/ca.crt \
    --peers="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379"\
    cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
Or, using the etcd v3 API:
# ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
    --key=/etc/etcd/peer.key \
    --cacert="/etc/etcd/ca.crt" \
    --endpoints="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379" endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
- Before running the scaleup playbook, ensure the new host is registered to the proper Red Hat software channels:
# subscription-manager register \
    --username=*<username>* --password=*<password>*
# subscription-manager attach --pool=*<poolid>*
# subscription-manager repos --disable="*"
# subscription-manager repos \
    --enable=rhel-7-server-rpms \
    --enable=rhel-7-server-extras-rpms
etcd is hosted in the rhel-7-server-extras-rpms software channel.
- Modify the Ansible inventory file: create a new group named [new_etcd], add the new host to it, and add the new_etcd group as a child of the [OSEv3] group:
[OSEv3:children]
masters
nodes
etcd
new_etcd

... [OUTPUT ABBREVIATED] ...

[etcd]
master-0.example.com
master-1.example.com
master-2.example.com

[new_etcd]
etcd0.example.com
- Run the etcd scaleup playbook from the host that executed the initial installation and where the Ansible inventory file is located:
$ ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/scaleup.yml
- Once the playbook has finished, modify the inventory file to reflect the current status by moving the new etcd host from the [new_etcd] group to the [etcd] group:
[OSEv3:children]
masters
nodes
etcd
new_etcd

... [OUTPUT ABBREVIATED] ...

[etcd]
master-0.example.com
master-1.example.com
master-2.example.com
etcd0.example.com
- If using Flannel, modify the flanneld service configuration, located at /etc/sysconfig/flanneld on every {product-title} host, to include the new etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,*https://etcd0.example.com:2379*
- Restart the flanneld service:
# systemctl restart flanneld.service
The following steps can be performed on any etcd member. If using the Ansible installer, the first host provided in the [etcd] Ansible inventory group is used to generate the etcd configuration and certificates stored in /etc/etcd/generated_certs, so perform the next steps on that etcd host.
Steps to be performed on the current etcd cluster
- In order to create the etcd certificates, run the openssl command with the proper values. To make this process easier, create some environment variables:
export NEW_ETCD_HOSTNAME="*etcd0.example.com*"
export NEW_ETCD_IP="*192.168.55.21*"
export CN=$NEW_ETCD_HOSTNAME
export SAN="IP:${NEW_ETCD_IP}"
export PREFIX="/etc/etcd/generated_certs/etcd-$CN/"
export OPENSSLCFG="/etc/etcd/ca/openssl.cnf"
Note: The custom openssl extensions used as etcd_v3_ca_* include the $SAN environment variable as subjectAltName. See /etc/etcd/ca/openssl.cnf for more information.
- Create the directory where the configuration and certificates are stored:
# mkdir -p ${PREFIX}
- Create the server certificate request and sign it:
# openssl req -new -config ${OPENSSLCFG} \
    -keyout ${PREFIX}server.key \
    -out ${PREFIX}server.csr \
    -reqexts etcd_v3_req -batch -nodes \
    -subj /CN=$CN
# openssl ca -name etcd_ca -config ${OPENSSLCFG} \
    -out ${PREFIX}server.crt \
    -in ${PREFIX}server.csr \
    -extensions etcd_v3_ca_server -batch
- Create the peer certificate request and sign it:
# openssl req -new -config ${OPENSSLCFG} \
    -keyout ${PREFIX}peer.key \
    -out ${PREFIX}peer.csr \
    -reqexts etcd_v3_req -batch -nodes \
    -subj /CN=$CN
# openssl ca -name etcd_ca -config ${OPENSSLCFG} \
    -out ${PREFIX}peer.crt \
    -in ${PREFIX}peer.csr \
    -extensions etcd_v3_ca_peer -batch
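Optionally, confirm that the signed certificates carry the expected subjectAltName. This check uses standard openssl options together with the environment variables defined above:
# openssl x509 -in ${PREFIX}server.crt -noout -text | grep -A1 "Subject Alternative Name"
# openssl x509 -in ${PREFIX}peer.crt -noout -text | grep -A1 "Subject Alternative Name"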
- Copy the current etcd configuration and ca.crt files from the current node as examples to be modified later:
# cp /etc/etcd/etcd.conf ${PREFIX}
# cp /etc/etcd/ca.crt ${PREFIX}
- Add the new host to the etcd cluster. Note that the new host is not configured yet, so its status stays as unstarted until the new host is properly configured:
# etcdctl2 member add ${NEW_ETCD_HOSTNAME} https://${NEW_ETCD_IP}:2380
This command outputs the following variables:
ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<CLUSTERMEMBER1_NAME>=https://<CLUSTERMEMBER1_IP>:2380,<CLUSTERMEMBER2_NAME>=https://<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER3_NAME>=https://<CLUSTERMEMBER3_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
- Replace the current values in the sample ${PREFIX}/etcd.conf file with those values. Also, modify the following variables in that file to use the new host IP (${NEW_ETCD_IP} can be used):
ETCD_LISTEN_PEER_URLS
ETCD_LISTEN_CLIENT_URLS
ETCD_INITIAL_ADVERTISE_PEER_URLS
ETCD_ADVERTISE_CLIENT_URLS
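For example, with the ${NEW_ETCD_IP} value used earlier (192.168.55.21), those variables would look similar to the following; port 2380 is used for peer traffic and port 2379 for client traffic, as elsewhere in this section:
ETCD_LISTEN_PEER_URLS=https://192.168.55.21:2380
ETCD_LISTEN_CLIENT_URLS=https://192.168.55.21:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.55.21:2380
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.55.21:2379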
- Modify the ${PREFIX}/etcd.conf file and check for syntax errors or missing IPs, otherwise the etcd service could fail:
# vi ${PREFIX}/etcd.conf
- Once the file has been properly modified, create a tgz file with the certificates, the sample configuration file, and the ca, and copy it to the new host:
# tar -czvf /etc/etcd/generated_certs/${CN}.tgz -C ${PREFIX} .
# scp /etc/etcd/generated_certs/${CN}.tgz ${CN}:/tmp/
Steps to be performed on the new etcd host
The new host must be subscribed to the proper Red Hat software channels, as explained in the prerequisites section above.
- Install iptables-services to provide the iptables utilities needed to open the required ports for etcd:
# yum install -y iptables-services
- Create firewall rules to allow etcd to communicate:
  - Port 2379/tcp for clients
  - Port 2380/tcp for peer communication
# systemctl enable iptables.service --now
# iptables -N OS_FIREWALL_ALLOW
# iptables -t filter -I INPUT -j OS_FIREWALL_ALLOW
# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
# iptables-save | tee /etc/sysconfig/iptables
Note: In this example, a new chain OS_FIREWALL_ALLOW is created, which is the standard naming that the {product-title} installer uses for firewall rules.
Warning: If the environment is hosted on an IaaS provider, modify the security groups for the instance to allow incoming traffic to those ports as well.
- Install the etcd software:
# yum install -y etcd
- Ensure the service is not running:
# systemctl disable etcd --now
- Remove any etcd configuration and data:
# rm -Rf /etc/etcd/*
# rm -Rf /var/lib/etcd/*
- Untar the certificates and configuration files:
# tar xzvf /tmp/*etcd0.example.com*.tgz -C /etc/etcd/
- Restore the owner of the etcd configuration and data:
# chown -R etcd.etcd /etc/etcd/
# chown -R etcd.etcd /var/lib/etcd/
- Start etcd on the new host:
# systemctl enable etcd --now
- Verify that the host has been added to the cluster and check the current cluster health:
# etcdctl --cert-file=/etc/etcd/peer.crt \
    --key-file=/etc/etcd/peer.key \
    --ca-file=/etc/etcd/ca.crt \
    --peers="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379,\
https://*etcd0.example.com*:2379"\
    cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member 8b8904727bf526a5 is healthy: got healthy result from https://192.168.55.21:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
Or, using the etcd v3 API:
# ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
    --key=/etc/etcd/peer.key \
    --cacert="/etc/etcd/ca.crt" \
    --endpoints="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379,\
https://*etcd0.example.com*:2379"\
    endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
https://etcd0.example.com:2379 is healthy: successfully committed proposal: took = 1.498829ms
Steps to be performed on all {product-title} masters
- Modify the master configuration to add the new etcd host to the list of the etcd servers {product-title} uses to store the data, located in the etcdClientInfo section of the /etc/origin/master/master-config.yaml file on every master:
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://master-0.example.com:2379
    - https://master-1.example.com:2379
    - https://master-2.example.com:2379
    *- https://etcd0.example.com:2379*
- Restart the master API service on every master:
# systemctl restart atomic-openshift-master-api
Or, on a single master cluster installation:
# systemctl restart atomic-openshift-master
Warning: The number of etcd nodes must be odd, so at least two hosts must be added.
- If using Flannel, the flanneld service configuration, located at /etc/sysconfig/flanneld on every {product-title} host, must be modified to include the new etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,*https://etcd0.example.com:2379*
- Restart the flanneld service:
# systemctl restart flanneld.service
An etcd host can fail beyond restoration. This section walks through removing the failed etcd host from the cluster.
Important: Ensure the etcd cluster maintains quorum while removing the etcd host by removing only a single host at a time from the cluster.
Steps to be performed on all master hosts
- Edit the failed etcd host out of the /etc/origin/master/master-config.yaml master configuration file on every master:
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://master-0.example.com:2379
    - https://master-1.example.com:2379
    *- https://master-2.example.com:2379* (1)
(1) The host to be removed.
- Restart the master API service on every master:
# systemctl restart atomic-openshift-master-api
Or, if using a single master cluster installation:
# systemctl restart atomic-openshift-master
Steps to be performed on the current etcd cluster
- Remove the failed host from the cluster by running the following on a functioning etcd host:
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
failed to check the health of member 8372784203e11288 on https://192.168.55.21:2379: Get https://192.168.55.21:2379/health: dial tcp 192.168.55.21:2379: getsockopt: connection refused
member 8372784203e11288 is unreachable: [https://192.168.55.21:2379] are all unreachable
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
# etcdctl2 member remove 8372784203e11288
Removed member 8372784203e11288 from cluster
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
Note: The remove command requires the etcd ID, not the hostname.
- To ensure the etcd configuration does not use the failed host when the etcd service is restarted, modify the /etc/etcd/etcd.conf file on all remaining etcd hosts and remove the failed host from the value of the ETCD_INITIAL_CLUSTER variable:
# vi /etc/etcd/etcd.conf
For example:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380
becomes:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380
Note: Restarting the etcd services is not required, because the failed host has been removed using etcdctl on the command line.
- Modify the Ansible inventory file to reflect the current status of the cluster and to avoid issues when running a playbook:
[OSEv3:children]
masters
nodes
etcd

... [OUTPUT ABBREVIATED] ...

[etcd]
master-0.example.com
master-1.example.com
- If using Flannel, modify the flanneld service configuration, located at /etc/sysconfig/flanneld on every host, and remove the etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379
- Restart the flanneld service:
# systemctl restart flanneld.service
To replace an etcd host, first remove the etcd node from the cluster by following the steps in Removing an etcd host, then scale the etcd cluster back up with the new host using the scaleup Ansible playbook or the manual procedure in Scaling etcd.
Warning: The etcd cluster must maintain quorum during the replacement operation, which means that at least one host must be in operation at all times. If the host replacement operation occurs while the etcd cluster maintains quorum, cluster operations are not affected, except when there is a large amount of etcd data to replicate, in which case some operations can be slowed down.
Note: Ensure that a backup of the etcd data and configuration files exists before performing any procedure involving the etcd cluster, so that the cluster can be restored in case of failure.