It is important to have a failover for every system, as it improves availability and reduces data loss. In this article, we describe how to set up near-live synchronization between two Zimbra servers so that one of them is live and the other is kept in a warm, or very warm, standby state. The sync can also run in reverse once the mirror (redundant) server becomes the active server, which allows easy fall-back to the original server after the failover condition is resolved.
Zimbra employs several different databases to store messages, message indexes, metadata, account information and configuration. Although it is possible to synchronise two Zimbra servers at the disk level using DRBD or vSphere, replicating the disk operations from all these databases would probably consume a lot of bandwidth, which may be debilitating or expensive if the two servers are in remote locations.
Fortunately, Zimbra keeps a log of almost all its transactions in the **redolog**. The only changes not logged there are changes to the LDAP database. An incremental backup is made up of an LDAP dump and a collection of redologs. It can be used to bring a backup server up to date, provided the last full backup of the backup server is more recent than the oldest log in the redolog.
Redolog
If the redolog can be piped to a mirror server in real time, then all the mirror server has to do is keep replaying the logs every so often and it will keep the same state as the live server. The only other thing to keep up to date is the LDAP database. Fortunately, the LDAP database doesn't change that often, so it is quite easy to keep it synced at the directory level.
The easiest way to transfer the redologs is to use rsync. The only problem is that rsync does not run continuously. It also won't handle the archiving of redo.log very efficiently: when redo.log is renamed and moved to the archive, rsync deletes it at the remote location and then transfers it all over again to its new location in the archive. If we can catch this move taking place, we can move and rename the file on the mirror server before running rsync. Then rsync has very little to do; in theory, nothing except delete any files that have been purged. Another issue with rsync is that a file may still be being written when it is copied, which results in an incomplete file at the mirror. However, redologs are only ever appended to, so only the last record will be corrupted. Zimbra is designed to be tolerant of redolog corruption; otherwise it would be of limited use as a disaster recovery tool.
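The move-before-rsync idea can be sketched as follows. This is a minimal sketch, not the author's script: `archive_name_from_event` and `mirror_archive_move` are hypothetical helper names, and it assumes inotifywait's default `directory EVENT filename` output and an archive name of the form `redo-...log`.

```shell
#!/bin/bash
# Hypothetical helpers: function names and sample paths are illustrative only.

# Read inotifywait output on stdin and print the archive file name
# (redo-...log) taken from the last event line.
archive_name_from_event() {
    tail -n 1 | grep -Eo 'redo-.*log'
}

# Repeat the live server's rename locally so that a later rsync finds the
# file already in place and has nothing to transfer.
# $1 = local redo.log, $2 = local archive directory, $3 = archive file name
mirror_archive_move() {
    local redo_log=$1 archive_dir=$2 archived_name=$3
    [ -n "$archived_name" ] || return 1
    [ -f "$redo_log" ] || return 1
    mkdir -p "$archive_dir" && mv -f "$redo_log" "$archive_dir/$archived_name"
}
```

If the event line carries no recognisable archive name, the move is skipped and rsync simply transfers the file as usual.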
To keep the redolog live, the `tail -f` command is used over ssh to pipe the file to the mirror. By calling `tail -f -c +0` it tails right back to the zeroth byte of the file, effectively a *copy-then-stream* command.
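The copy-then-stream behaviour can be tried locally on a scratch file. The sketch below is only a demonstration under assumptions: `demo_copy_then_stream` is a hypothetical name, the file is a temporary stand-in for redo.log, and no ssh is involved.

```shell
#!/bin/bash
# Demonstration on a scratch file; nothing here touches a real redo.log.
demo_copy_then_stream() {
    local workdir src dst tail_pid rc
    workdir=$(mktemp -d) || return 1
    src="$workdir/redo.log"
    dst="$workdir/redo.log.mirror"

    # Content that exists before the stream starts
    printf 'record-1\n' >"$src"

    # Copy from the first byte, then keep following the file
    tail -c +0 -f "$src" >"$dst" &
    tail_pid=$!
    sleep 1

    # Content appended while the stream is running
    printf 'record-2\n' >>"$src"
    sleep 2

    kill "$tail_pid" 2>/dev/null || true
    wait "$tail_pid" 2>/dev/null || true

    # The mirror copy should now hold both records
    cmp -s "$src" "$dst"
    rc=$?
    rm -rf "$workdir"
    return $rc
}
```

The function returns 0 when the streamed copy is byte-for-byte identical to the source, which is the property the mirror relies on between log rollovers.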
Redolog purging
If a Network edition is detected and incremental backups are enabled, then the redologs are replayed before any rsync is performed as well as after. This ensures that everything is replayed before the files all disappear to the backup directory.
For the Open Source Edition, or Network Edition with no incremental backups scheduled, the redologs are purged if they are more than a day old and have been replayed. If the mirror server is down then the redologs will just accumulate on the live server to be replayed when the sync process is restarted before being purged.
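An age-based purge that only touches files already covered by a replay could look like the sketch below. It is an illustration under assumptions rather than the script's own logic: `purge_replayed_redologs` is a hypothetical name, and using a marker file touched after each successful replay as the "has been replayed" test is an assumption of this example.

```shell
#!/bin/bash
# Hypothetical helper: remove regular files under a directory that are older
# than a number of days AND not newer than a "last good replay" marker file.
# $1 = directory, $2 = age in days, $3 = marker file (kept outside $1)
purge_replayed_redologs() {
    local dir=$1 days=$2 marker=$3
    # Without a marker there is no evidence of a replay, so purge nothing
    [ -f "$marker" ] || return 1
    find "$dir" -type f -mtime +"$days" ! -newer "$marker" -exec rm -f {} \;
}
```

Files modified after the marker are left alone, so logs that accumulated while the mirror was down survive until they have been replayed.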
LDAP
LDAP stores its data in `/opt/zimbra/data/ldap`. This can be copied using rsync to the mirror as long as no changes take place during the copy. On versions of Zimbra older than 8.0, the directory is monitored for this during the rsync operation and repeated if there was any change during that time. For Zimbra 8+, the directory is monitored as before but changes are transferred using an LDIF export and import. This is necessitated by the long time it takes to rsync the sparse files that LDAP now uses to store data.
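The "copy, then repeat if anything changed meanwhile" pattern can be sketched locally. Note the substitutions: this example uses `cp` instead of rsync and a before/after fingerprint instead of an inotify watch, and both function names are hypothetical. Unlike `rsync --delete`, it does not remove files that have vanished from the source.

```shell
#!/bin/bash
# Hypothetical sketch: copy a directory and retry if the source changed
# while the copy was running.

# Fingerprint built from file names, sizes and modification times.
dir_fingerprint() {
    ( cd "$1" && find . -type f -printf '%p %s %T@\n' | sort | cksum )
}

# $1 = source dir, $2 = destination dir, $3 = maximum attempts (default 5)
copy_until_stable() {
    local src=$1 dst=$2 max=${3:-5} before after attempt
    for (( attempt = 1; attempt <= max; attempt++ )); do
        before=$(dir_fingerprint "$src")
        mkdir -p "$dst" && cp -a "$src/." "$dst/" || return 1
        after=$(dir_fingerprint "$src")
        # Unchanged fingerprint: nothing was modified during the copy
        [ "$before" = "$after" ] && return 0
        sleep 1
    done
    return 1
}
```

Copying into a temporary directory first, as the main script does with its own LDAP staging directory, means the real data directory is only touched once a consistent copy exists.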
Known issues
* If the connection breaks at the very moment that the live stream of the redo.log starts, before the tail command reaches the point where it is tailing instead of cataloging the file, then some of the redo.log will not make it to the mirror resulting in some loss of transactions. Fortunately, this is only ever likely just after the log has rolled over so the worst-case losses should be minimal.
* LDAP is only checked every ten minutes, so some losses are possible if the connection breaks in that time. However, LDAP isn't expected to change very often unless something major, like a batch account migration, is taking place.
* If any redologs go missing, or can't be replayed successfully for any reason, then there will be gaps in synced email events and mail may go missing on the mirror server. Check the *live_sync* logs for any log sequence numbers that appear not to have been transferred and processed. In the case of suspected data loss, stop all services and repeat the initial rsync process.
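Checking for missing sequence numbers can be automated. The helper below is a sketch under an assumption: it expects archived redolog names to end in `-seq<N>.log`, and `missing_redolog_sequences` is a hypothetical name, not part of the scripts in this article.

```shell
#!/bin/bash
# Hypothetical helper: read redolog file names on stdin and print every
# sequence number missing between the lowest and highest one seen.
# Assumes names end in -seq<N>.log.
missing_redolog_sequences() {
    grep -Eo 'seq[0-9]+\.log' | grep -Eo '[0-9]+' | sort -n | uniq |
    awk 'NR > 1 { for (n = prev + 1; n < $1; n++) print n } { prev = $1 }'
}
```

It could be fed with a directory listing, for example `ls /opt/zimbra/redolog/live_sync_archives | missing_redolog_sequences`. Empty output means no gap was found in that listing.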
Preparation
Exactly the same version of Zimbra must be installed on both the live and mirror server. To start with, we work on the live server. There is no need to stop Zimbra for most of the install. Only a short amount of down-time will need to be scheduled later to perform a final rsync operation between the two servers.
The mirror server should ideally have the same operating system as the live server and must have exactly the same version of Zimbra installed. The hostname must also be exactly the same.
Live Server
- Install inotify-tools
inotify tools are required. For RHEL/CentOS, inotify-tools is provided by the epel-release repository.
As user root (CentOS):
yum install epel-release -y
yum install inotify-tools -y
while for Ubuntu, as user root:
apt install inotify-tools -y
- Create log rotation
The script will create a log file which can be handled by logrotate. As user root:
echo '/opt/zimbra/live_sync/log/live_sync.log {
daily
missingok
copytruncate
rotate 7
notifempty
compress
}' >/etc/logrotate.d/zimbra_live_sync
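Because a multi-line `echo` depends on its quoting surviving copy and paste, a quoted here-document is a sturdier way to write the same file. This is a sketch of an alternative, with `write_live_sync_logrotate` as a hypothetical helper name; the stanza itself is the one from this article.

```shell
#!/bin/bash
# Write the live_sync logrotate stanza to the file given as $1.
write_live_sync_logrotate() {
    cat >"$1" <<'EOF'
/opt/zimbra/live_sync/log/live_sync.log {
    daily
    missingok
    copytruncate
    rotate 7
    notifempty
    compress
}
EOF
}
```

As root, `write_live_sync_logrotate /etc/logrotate.d/zimbra_live_sync` creates the file, and `logrotate -d /etc/logrotate.d/zimbra_live_sync` performs a dry run that reports syntax problems without rotating anything.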
- Create application directory
The script will live under the /opt/zimbra directory. As user root:
mkdir /opt/zimbra/live_sync
chown zimbra:zimbra /opt/zimbra/live_sync
- SSH Key
Create the SSH key and just press return every time you are prompted for a passphrase. As user zimbra:
sudo su - zimbra
cd /opt/zimbra/.ssh
ssh-keygen -b 4096 -f live_sync
echo "command=\"/opt/zimbra/live_sync/sync_commands\" $( cat live_sync.pub )" >>authorized_keys
exit
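The line appended to authorized_keys ties the new key to a single forced command, so anything the mirror asks for over ssh is handled by sync_commands. The helper below is a small sketch that builds such a line so it can be inspected before being appended; `forced_command_entry` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print an authorized_keys line that forces one command
# for the given public key.
# $1 = forced command, $2 = public key file
forced_command_entry() {
    local forced_command=$1 pubkey_file=$2
    [ -r "$pubkey_file" ] || return 1
    printf 'command="%s" %s\n' "$forced_command" "$(cat "$pubkey_file")"
}
```

For example, `forced_command_entry /opt/zimbra/live_sync/sync_commands live_sync.pub` prints a line starting with `command="/opt/zimbra/live_sync/sync_commands"` followed by the key type and key material.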
- Main Script
The following script should be saved as live_syncd in the /opt/zimbra/live_sync directory. This should be owned by user zimbra and made executable.
sudo vim /opt/zimbra/live_sync/live_syncd
#!/bin/bash
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
##########################################################################
# Title : live_syncd
# Author : Simon Blandford <simon -at- onepointltd -dt- com>
# Date : 2013-03-12
# Requires : zimbra sync_commands inotify-tools
# Category : Administration
# Version : 2.1.5
# Copyright : Simon Blandford, Onepoint Consulting Limited
# License : GPLv3 (see above)
##########################################################################
# Description
# Keep two Zimbra servers synchronised in near-realtime
##########################################################################
#******************************************************************************
#********************** Constants *********************************************
#******************************************************************************
LOG_LEVEL=5
REDO_LOG_HISTORY_DAYS=10
ERROR_CLEAR_MINUTES=10
LDAP_CHECK_MINUTES_INTERVAL=10
ZIMBRA_DIR=/opt/zimbra
BASE_DIR=$ZIMBRA_DIR/live_sync
LOCKING_DIR=$BASE_DIR/lock
PID_DIR=$BASE_DIR/pid
LOG_DIR=$BASE_DIR/log
LOG_FILE=$LOG_DIR/live_sync.log
LDAP_TEMP_DIR=$BASE_DIR/ldap
LDAP_TEMP_LDIF=$BASE_DIR/ldif.bak
STATUS_DIR=$BASE_DIR/status
SSH_IDENTITY_FILE=$ZIMBRA_DIR/.ssh/live_sync
REDOLOG_DIR=$ZIMBRA_DIR/redolog
REDO_LOG_FILE=$REDOLOG_DIR/redo.log
ARCHIVE_DIR=$REDOLOG_DIR/archive
LIVE_SYNC_ARCHIVE_DIR=$REDOLOG_DIR/live_sync_archives
LDAP_DATA_DIR=$ZIMBRA_DIR/data/ldap/
BACKUP_DIR=$ZIMBRA_DIR/backup
SYNC_COMMANDS_SCRIPT=$BASE_DIR/sync_commands
SSH="ssh -i $SSH_IDENTITY_FILE -o StrictHostKeyChecking=no -o CheckHostIP=no \
-o PreferredAuthentications=hostbased,publickey"
LOCK_STATE_DIR=$LOCKING_DIR/live_sync.lock
STOP_FILE=$STATUS_DIR/live_sync.stop
LAST_GOOD_REDO_REPLAY=$STATUS_DIR/last_good_redo_replay
LAST_GOOD_REDO_SYNC=$STATUS_DIR/last_good_redo_sync
LAST_GOOD_REDO_STREAM=$STATUS_DIR/last_good_redo_stream
LAST_GOOD_LDAP_SYNC=$STATUS_DIR/last_good_ldap_sync
LAST_GOOD_LDAP_START=$STATUS_DIR/last_good_ldap_start
WATCHES_FILE=$STATUS_DIR/watches
PID_FILE_LDAP=$PID_DIR/ldap_live_sync.pid
PID_FILE_REDO=$PID_DIR/redo_log_live_sync.pid
CONF_FILE=$BASE_DIR/live_sync.conf
#******************************************************************************
#********************** Functions *********************************************
#******************************************************************************
#Format for log output with errors and warnings going to >&2
logit () {
logit_1 () {
echo -n "$( date ) : "
#msg_level is set by logit before logit_1 is called
case $msg_level in
1)
echo -n "Error : "
;;
2)
echo -n "Warning : "
;;
3)
echo -n "Info : "
;;
4)
echo -n "Debug : "
;;
esac
echo "$@"
}
local msg_level output_chan
if [ $1 -le $LOG_LEVEL ]; then
msg_level=$1
shift
if [ $msg_level -le 2 ]; then
logit_1 $@ >&2
else
logit_1 $@
fi
fi
}
#Detect HSM
detect_hsm () {
local retval
#LDAP must be running
ldap status &>/dev/null || ldap start &>/dev/null
#MySQL must be running
mysql.server status &>/dev/null || mysql.server start &>/dev/null
#Preserve mailbox running state
zmmailboxdctl status &>/dev/null
prev_zmmailbox_status=$?
zmmailboxdctl start &>/dev/null
zmvolume -l | grep "type: secondaryMessage" >/dev/null
retval=$?
if [ $prev_zmmailbox_status -ne 0 ]; then
zmmailboxdctl stop &>/dev/null
fi
return $retval
}
#Ensure ldap, convertd and mysql servers are running and then replay redo logs
replay_redo_logs () {
local server_failed
ldap status &>/dev/null || ldap start &>/dev/null
mysql.server status &>/dev/null || mysql.server start &>/dev/null
server_failed=0
if ! ldap status &>/dev/null; then
logit 1 Start of local ldap server failed
ldap status >&2
#Return error to trigger a break in while loop
server_failed=1
fi
if ! mysql.server status &>/dev/null; then
logit 1 Start of local mysql server failed
mysql.server status >&2
#Return error to trigger a break in while loop
server_failed=1
fi
if [ x$convertd_enabled == xtrue ]; then
#Make sure indexing works while replaying redo log
zmconvertctl status &>/dev/null || zmconvertctl start &>/dev/null
if ! zmconvertctl status &>/dev/null; then
logit 2 Start of local convertd servers failed
zmconvertctl status >&2
fi
fi
[ $server_failed -eq 1 ] && return 1
logit 3 Replaying redologs...
if ! zmplayredo >/dev/null; then
logit 2 Replay of redolog failed
#No error returned here since break is not necessary
else
#If no errors then archive redo log files
if ! mkdir -p $LIVE_SYNC_ARCHIVE_DIR; then
logit 1 Unable to create directory $LIVE_SYNC_ARCHIVE_DIR
exit 1
fi
mv -f $ARCHIVE_DIR/* $LIVE_SYNC_ARCHIVE_DIR/ 2>/dev/null
touch $LAST_GOOD_REDO_REPLAY
fi
logit 3 Replaying redologs done
return 0
}
#The redo log sync daemon
redo_log_live_sync () {
local stream_pid archived_file i archived_redo_log_file prev_zmmailbox_status secondary_storage
logit 3 Starting redo log live sync process
#Wait for lock directory to be successfully created
while ! mkdir $LOCK_STATE_DIR &>/dev/null; do
sleep 2
done
logit 3 Detecting if HSM used
if detect_hsm; then
logit 3 HSM Detected
secondary_storage=yes
else
logit 3 No HSM Detected
fi
rmdir $LOCK_STATE_DIR
while [ ! -f $STOP_FILE ]; do
while [ ! -f $STOP_FILE ]; do
#Wait for lock directory to be successfully created
while ! mkdir $LOCK_STATE_DIR &>/dev/null; do
sleep 2
done
[ -f $STOP_FILE ] && break
logit 3 Syncing redologs...
#If incremental backups are enabled then gather redo logs from backups and copy
#to local archive directory
redo_sync_fail=false
for archived_redo_log_file in $( echo gather$REDO_LOG_HISTORY_DAYS | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT ); do
if [ -f $LIVE_SYNC_ARCHIVE_DIR/$( basename $archived_redo_log_file ) ]; then
logit 4 Already processed so skipping: $archived_redo_log_file
else
logit 4 Syncing incremental backup file: $archived_redo_log_file
if ! rsync -z -e "$SSH" --size-only $remote_address:$archived_redo_log_file \
$ARCHIVE_DIR/.; then
logit 2 Rsync of a redolog, $archived_redo_log_file, failed
redo_sync_fail=true
fi
fi
done
#Suspend if HSM is running
if which zmhsm >/dev/null && zmhsm -u | grep "Currently running" >/dev/null; then
logit 3 Replaying redologs is suspended while HSM process is active
else
#Mailbox process must not be running now. Preserve state and stop.
zmmailboxdctl status &>/dev/null
prev_zmmailbox_status=$?
if [ $prev_zmmailbox_status -eq 0 ]; then
zmmailboxdctl stop &>/dev/null
fi
sleep 2
if zmmailboxdctl status &>/dev/null; then
logit 1 Unable to stop local Zimbra mailbox service
return 1
fi
logit 4 Syncing $REDO_LOG_FILE
if ! rsync -e "$SSH" -z \
$remote_address:$REDO_LOG_FILE $REDO_LOG_FILE; then
logit 2 Rsync of $REDO_LOG_FILE failed
redo_sync_fail=true
fi
logit 4 Syncing $REDO_LOG_FILE done
if [ x$redo_sync_fail == xfalse ]; then
touch $LAST_GOOD_REDO_SYNC
else
break
fi
logit 4 Syncing redologs done
logit 4 Purging redolog directory and archives
#Purge local redolog directory
find $REDOLOG_DIR -mtime +$REDO_LOG_HISTORY_DAYS -type f -exec rm {} \;
#Purge any interrupted rsync files
find $REDOLOG_DIR -name '.redo*' -type f -exec rm {} \;
logit 4 Purge redolog directory and archives done
replay_redo_logs || break
#Restore mailboxd to previous running state or start if HSM is being used
if [ $prev_zmmailbox_status -eq 0 ] || \
[ x$secondary_storage == xyes ] >/dev/null; then
logit 4 Re-starting Zimbra mailbox service
zmmailboxdctl start &>/dev/null
if ! zmmailboxdctl status &>/dev/null; then
logit 2 Unable to re-start local Zimbra mailbox service
fi
fi
fi
#If there are no incremental backups then remote archive directory will need purging
if [ x$incremental_backups != xtrue ]; then
logit 4 Purging remote redolog directory
echo purge$REDO_LOG_HISTORY_DAYS | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT
logit 4 Purging remote redolog directory done
fi
#Establish copy-and-live-stream of current redo.log file
logit 4 Live streaming redolog
echo stream | \
$SSH $remote_address \
$SYNC_COMMANDS_SCRIPT >$REDO_LOG_FILE &
stream_pid=$!
disown $stream_pid
#Delay as PID was sometimes not being found if checked immediately
sleep 5
#If successfully established stream then sit and wait for move to archive
if ps $stream_pid | grep $SYNC_COMMANDS_SCRIPT &>/dev/null; then
logit 4 Live streaming redolog established
touch $LAST_GOOD_REDO_STREAM
#Remove lock file, this is resting point
rmdir $LOCK_STATE_DIR &>/dev/null
#Wait for name to be passed of new archive file after redo.log is moved on remote server
#This is normal resting point of this process
archived_file=$( echo wait_redo | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT | \
tail -n 1 | grep -Eo 'redo-.*log' )
#Kill stream
kill -KILL $( ps aux | grep $SYNC_COMMANDS_SCRIPT | \
grep -v grep | awk '{print $2}' ) &>/dev/null
#Mirror move operation on local server
if echo $archived_file | grep -E 'redo-.*log' &>/dev/null; then
logit 4 Moving redo.log to $archived_file
mv -f $REDO_LOG_FILE $ARCHIVE_DIR/$archived_file 2>/dev/null
else
logit 2 Archive file name not found
fi
[ -f $STOP_FILE ] && break
else
logit 2 Failed to start redolog streaming, PID=$stream_pid
break
fi
done
rmdir $LOCK_STATE_DIR &>/dev/null
#Wait $ERROR_CLEAR_MINUTES minutes for error to clear
i=0
while [ $(( i++ )) -lt 60 ] && [ ! -f $STOP_FILE ]; do
sleep $ERROR_CLEAR_MINUTES
done
done
logit 3 Ending redo log live sync process
}
#The ldap sync daemon
ldap_live_sync () {
local ldap_wait_pid i last_ldap_success_state
last_ldap_success_state=false
logit 3 Starting ldap live sync process
while [ ! -f $STOP_FILE ]; do
while [ ! -f $STOP_FILE ]; do
#Wait for lock directory to be successfully created
while ! mkdir $LOCK_STATE_DIR &>/dev/null; do
sleep 3
done
if [ $zimbra_version -lt 8 ]; then
logit 3 Syncing ldap using rsync
#Use rsync for Zimbra older than version 8
while [ 1 ]; do
#Check for changes during ldap sync operation
echo wait_ldap | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT &>$WATCHES_FILE &
ldap_wait_pid=$!
disown $ldap_wait_pid
if ! ps $ldap_wait_pid &>/dev/null; then
logit 2 Unable to establish watch on remote LDAP directory, no ldap sync performed
break
fi
#Wait for watches to be established
while ! grep established $WATCHES_FILE &>/dev/null && \
ps $ldap_wait_pid &>/dev/null; do
sleep 1
done
#Echo out status
cat $WATCHES_FILE
rm -f $WATCHES_FILE
#Rsync remote server to temporary local ldap directory
if ! rsync -e "$SSH" -aHz --sparse --force --delete \
$remote_address:$LDAP_DATA_DIR/ $LDAP_TEMP_DIR/; then
logit 2 Rsync of ldap failed
break
else
touch $LAST_GOOD_LDAP_SYNC
fi
ps $ldap_wait_pid &>/dev/null && break
logit 3 Ldap changed during rsync. Re-syncing.
sleep 10
done
kill -KILL $ldap_wait_pid &>/dev/null
else
#Use ldif export for Zimbra 8 and over
logit 3 Syncing ldap using ldif
if ! echo dump_ldap | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT >$LDAP_TEMP_LDIF; then
logit 2 Unable to fetch remote LDIF, no LDAP sync performed
break
else
touch $LAST_GOOD_LDAP_SYNC
fi
fi
if which zmhsm >/dev/null && zmhsm -u | grep "Currently running" >/dev/null; then
logit 3 LDAP update is suspended while HSM process is active
else
#Stop ldap
ldap status &>/dev/null && ldap stop &>/dev/null
if ldap status &>/dev/null; then
logit 1 Unable to stop local ldap server
break
fi
if [ $zimbra_version -lt 8 ]; then
#Use rsync for Zimbra older than version 8
#rsync temporary local ldap directory to real local ldap directory
rsync -aH --sparse $LDAP_TEMP_DIR/ $LDAP_DATA_DIR/
else
#Use LDIF import for Zimbra 8 and over
rm -rf $LDAP_DATA_DIR/mdb && \
mkdir -p $LDAP_DATA_DIR/mdb/db && \
mkdir -p $LDAP_DATA_DIR/mdb/log && \
/opt/zimbra/libexec/zmslapadd $LDAP_TEMP_LDIF
if [ $? != 0 ]; then
logit 2 Unable to import LDIF into local LDAP
break
fi
fi
#Restart ldap
ldap status &>/dev/null || ldap start &>/dev/null
if ! ldap status &>/dev/null; then
logit 1 Unable to restart local ldap server
last_ldap_success_state=false
else
last_ldap_success_state=true
fi
logit 4 Syncing LDAP done
fi
rmdir $LOCK_STATE_DIR &>/dev/null
[ -f $STOP_FILE ] && break
#Wait for change in remote ldap over $LDAP_CHECK_MINUTES_INTERVAL intervals
echo wait_ldap | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT &
ldap_wait_pid=$!
disown $ldap_wait_pid
while [ ! -f $STOP_FILE ]; do
logit 4 Start new LDAP monitor period
#Repeat last ldap success so that no ldap change is not
#interpreted by Nagios as no ldap success.
if [ x$last_ldap_success_state == xtrue ]; then
touch $LAST_GOOD_LDAP_START
fi
#Restart wait for ldap change if required
if ! ps $ldap_wait_pid &>/dev/null; then
echo wait_ldap | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT &
ldap_wait_pid=$!
disown $ldap_wait_pid
fi
#Wait $LDAP_CHECK_MINUTES_INTERVAL minutes
i=0
while [ $(( i++ )) -lt 60 ] && [ ! -f $STOP_FILE ]; do
sleep $LDAP_CHECK_MINUTES_INTERVAL
done
#If wait process is not still running then there was a change
ps $ldap_wait_pid &>/dev/null || break
done
done
rmdir $LOCK_STATE_DIR &>/dev/null
#Wait $ERROR_CLEAR_MINUTES minutes for error to clear
i=0
while [ $(( i++ )) -lt 60 ] && [ ! -f $STOP_FILE ]; do
sleep $ERROR_CLEAR_MINUTES
done
done
logit 3 Ending ldap live sync process
}
get_zimbra_config_globals () {
#Query whether incremental backups are enabled
incremental_backups=$( echo query_incremental | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT )
#Query whether convertd is installed and enabled
ldap status &>/dev/null || ldap start &>/dev/null
if ! ldap status &>/dev/null; then
logit 1 Unable to start local ldap server
exit 1
fi
if [ $( zmprov -l gs `zmhostname` | \
grep -E '(zimbraServiceInstalled|zimbraServiceEnabled):[[:space:]]*convertd' | \
wc -l ) -eq 2 ]; then
convertd_enabled=true
else
convertd_enabled=false
fi
}
kill_everything () {
touch $STOP_FILE
kill -KILL $( head -n 1 $PID_FILE_LDAP 2>/dev/null ) &>/dev/null
kill -KILL $( head -n 1 $PID_FILE_REDO 2>/dev/null ) &>/dev/null
kill -KILL $( ps aux | grep 'live_syncd start' | grep -v grep | awk '{print $2}' ) &>/dev/null
kill -KILL $( ps aux | grep redo_log_live_sync | grep -v grep | awk '{print $2}' ) &>/dev/null
kill -KILL $( ps aux | grep ldap_live_sync | grep -v grep | awk '{print $2}' ) &>/dev/null
kill -KILL $( ps aux | \
grep $SYNC_COMMANDS_SCRIPT | grep -v grep | awk '{print $2}' ) &>/dev/null
kill -KILL $( ps aux | grep rsync | grep -E "$REDOLOG_DIR|$LDAP_DATA_DIR|$BACKUP_DIR" | \
awk '{print $2}' ) &>/dev/null
#Kill redolog playback if running
kill -KILL $( ps aux | grep -E 'zimbra.*java.*PlaybackUtil' | grep -v grep | \
awk '{print $2}' ) &>/dev/null
rm -f $STOP_FILE
rm -f $PID_FILE_LDAP
rm -f $PID_FILE_REDO
rmdir $LOCK_STATE_DIR &>/dev/null
}
quitting () {
echo Quitting
#Kill any hanging processes
kill_everything
trap - INT TERM SIGINT SIGTERM
echo 'kill -KILL $( ps aux | grep live_syncd | grep -v grep | awk '"'"'{print $2}'"'"' ) &>/dev/null' | \
at now && sleep 1 && rmdir $LOCK_STATE_DIR &>/dev/null
exit
}
#******************************************************************************
#********************** Main Program ******************************************
#******************************************************************************
if [ $( whoami ) != zimbra ]; then
echo Must run as zimbra user >&2
exit 1
fi
mkdir -p $LOCKING_DIR
mkdir -p $PID_DIR
mkdir -p $LOG_DIR
mkdir -p $LDAP_TEMP_DIR
mkdir -p $STATUS_DIR
chmod 755 $STATUS_DIR
if [ ! -f $CONF_FILE ]; then
echo Configuration file, $CONF_FILE, not found >&2
exit 1
fi
source $CONF_FILE
#Find all local addresses
server_addresses=$( /sbin/ifconfig |grep inet | grep -E '(([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5})' |awk '//{print $2}' )
#server_addresses=$( /usr/sbin/ip a | grep -Po 'inet \K[\d.]+')
#Check configured server addresses are valid
if ! echo $server1 | \
grep -Ei '([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5}' &>/dev/null; then
echo No valid IP address found for server1 in configuration file >&2
exit 1
fi
if ! echo $server2 | \
grep -Ei '([0-9]+\.){3}[0-9]+|[0-9a-f]+(:[0-9a-f]*){5}' &>/dev/null; then
echo No valid IP address found for server2 in configuration file >&2
exit 1
fi
#Deduce local address and assume other address is remote machine
if echo $server_addresses | grep $server1 &>/dev/null; then
local_address=$server1
remote_address=$server2
else
if echo $server_addresses | grep $server2 &>/dev/null; then
local_address=$server2
remote_address=$server1
else
echo Unable to identify local server address and assume remote address >&2
exit 1
fi
fi
#Check remote server is OK
remote_server_status=$( echo test | \
$SSH $remote_address $SYNC_COMMANDS_SCRIPT )
if [ x$remote_server_status == xbusy ]; then
echo Remote server appears to have live_syncd process running >&2
echo This can not run on both servers >&2
exit 1
fi
if [ x$remote_server_status != xOK ]; then
echo Unable to run commands on remote server >&2
exit 1
fi
#Get major Zimbra version
zimbra_version=$( zmcontrol -v | grep -Eo '[0-9][^\.]*' | head -n 1 )
if [ ${#zimbra_version} -lt 1 ]; then
zimbra_version=0
fi
case $1 in
start)
#Check for processes from this script and also redolog replay. Don't count PID files older than uptime.
if [ -f $PID_FILE_REDO ] && \
[ $(( $( date +%s ) - $( stat -c '%Y' $PID_FILE_REDO ) )) -lt $( cat /proc/uptime | grep -Eo '[0-9]+' | head -n 1 ) ]; then
pid_found=yes
fi
if [ -f $PID_FILE_LDAP ] && \
[ $(( $( date +%s ) - $( stat -c '%Y' $PID_FILE_LDAP ) )) -lt $( cat /proc/uptime | grep -Eo '[0-9]+' | head -n 1 ) ]; then
pid_found=yes
fi
if [ $pid_found ] || \
ps aux | grep -E 'zimbra.*java.*PlaybackUtil' | grep -v grep &>/dev/null; then
echo Process already running
else
echo -n Starting processes...
get_zimbra_config_globals
echo "***************************************" >>$LOG_FILE
logit 3 Starting $( basename $0 ) >>$LOG_FILE
logit 3 Incremental backups enabled : $incremental_backups >>$LOG_FILE
logit 3 Convertd enabled : $convertd_enabled >>$LOG_FILE
ldap_live_sync >>$LOG_FILE 2>&1 &
echo $! >$PID_FILE_LDAP
redo_log_live_sync >>$LOG_FILE 2>&1 &
echo $! >$PID_FILE_REDO
echo done
fi
;;
stop)
touch $STOP_FILE
[ -d $LOCK_STATE_DIR ] && echo Waiting for sync operations to complete...
while [ -d $LOCK_STATE_DIR ]; do
sleep 5
done
rm -f $STOP_FILE
replay_redo_logs
kill_everything
echo done
;;
status)
if ps aux | grep -E 'zimbra.*java.*PlaybackUtil' | grep -v grep &>/dev/null; then
echo redolog is being replayed
replay_stat=0
else
replay_stat=3
fi
if [ -f $PID_FILE_REDO ] && ps $( head -n 1 $PID_FILE_REDO 2>/dev/null ) &>/dev/null; then
echo redo log sync process OK
redo_stat=0
else
echo redolog sync process stopped
redo_stat=3
fi
if [ -f $PID_FILE_LDAP ] && ps $( head -n 1 $PID_FILE_LDAP 2>/dev/null ) &>/dev/null; then
echo ldap sync process OK
ldap_stat=0
else
echo ldap sync process stopped
ldap_stat=3
fi
[ $ldap_stat == 3 ] && [ $redo_stat == 3 ] && [ $replay_stat == 3 ] && exit 3
[ $ldap_stat == 0 ] && [ $redo_stat == 0 ] && exit 0
exit 1
;;
kill)
kill_everything
;;
*)
trap quitting INT TERM SIGINT SIGTERM
if ps aux | grep redo_log_live_sync | grep -v grep &>/dev/null || \
ps aux | grep ldap_live_sync | grep -v grep &>/dev/null || \
ps aux | grep -E 'zimbra.*java.*PlaybackUtil' | grep -v grep &>/dev/null; then
echo Process already running
else
echo Starting processes in realtime
get_zimbra_config_globals
logit 3 Incremental backups enabled : $incremental_backups
logit 3 Convertd enabled : $convertd_enabled
ldap_live_sync &
echo $! >$PID_FILE_LDAP
redo_log_live_sync &
echo $! >$PID_FILE_REDO
while [ 1 ]; do sleep 10; done
fi
;;
esac
sudo chown zimbra:zimbra /opt/zimbra/live_sync/live_syncd
sudo chmod +x /opt/zimbra/live_sync/live_syncd
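The `status` action of live_syncd exits with 0 when both sync processes are running, 3 when neither they nor a redolog replay is running, and 1 otherwise, which makes it usable from a monitoring check. The wrapper below is a sketch for turning that exit code into a message; `describe_live_sync_status` is a hypothetical name and the wording of the messages is this example's own.

```shell
#!/bin/bash
# Hypothetical helper: translate the exit code of "live_syncd status"
# into a one-line summary.
# $1 = exit code
describe_live_sync_status() {
    case $1 in
        0) echo "OK: redolog and ldap sync processes are running" ;;
        3) echo "STOPPED: no sync or replay process is running" ;;
        *) echo "DEGRADED: some processes are missing, or the script reported an error" ;;
    esac
}
```

A typical call, run as user zimbra, would be `/opt/zimbra/live_sync/live_syncd status >/dev/null; describe_live_sync_status $?`. Exit code 1 is also what the script returns for configuration or connection errors, so the last message is deliberately broad.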
- Remote command script
The following script should be saved as sync_commands in the /opt/zimbra/live_sync directory. This should be owned by user zimbra and made executable.
sudo vim /opt/zimbra/live_sync/sync_commands
#!/bin/bash
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
##########################################################################
# Title : sync_commands
# Author : Simon Blandford <simon -at- onepointltd -dt- com>
# Date : 2013-03-12
# Requires : zimbra sync_commands inotify-tools
# Category : Administration
# Version : 2.1.3
# Copyright : Simon Blandford, Onepoint Consulting Limited
# License : GPLv3 (see above)
##########################################################################
# Description
# Keep two Zimbra servers synchronised in near-realtime, local agent
##########################################################################
#******************************************************************************
#********************** Main Program ******************************************
#******************************************************************************
if [ $( whoami ) != zimbra ]; then
echo Must run as zimbra user >&2
exit 1
fi
#Check for rsync of redolog or ldap
if echo $SSH_ORIGINAL_COMMAND | \
grep rsync | \
grep -E '/opt/zimbra/redolog/|/opt/zimbra/data/ldap/|/opt/zimbra/backup/sessions/incr' &>/dev/null; then
case $SSH_ORIGINAL_COMMAND in
*\&*)
echo Rejected
;;
*\(*)
echo Rejected
;;
*\{*)
echo Rejected
;;
*\;*)
echo Rejected
;;
*\<*)
echo Rejected
;;
*\`*)
echo Rejected
;;
rsync\ --server*)
$SSH_ORIGINAL_COMMAND
;;
*)
echo Rejected
;;
esac
else
#Not rsync
case $# in
0) read command
;;
*) command=$1
;;
esac
check_inotify () {
if ! which inotifywait &>/dev/null; then
echo inotifywait not found >&2
echo Please install inotify-tools >&2
exit 1
fi
}
#Extract numeric parameter from command name
param=$( echo $command | grep -Eo '[0-9]+' )
command=$( echo $command | grep -Eio '[a-z_]+' )
case $command in
test)
if ps aux | grep live_syncd | grep -v grep &>/dev/null; then
echo busy
else
echo OK
fi
;;
wait_redo)
#Wait for redo log roll-over
check_inotify
kill -KILL $( ps aux | grep 'inotifywait -r /opt/zimbra/redolog' | \
grep -v grep | awk '{print $2}' ) &>/dev/null
inotifywait -r /opt/zimbra/redolog -e moved_to
;;
wait_ldap)
#Wait for ldap changes. Ignore log changes.
check_inotify
kill -KILL $( ps aux | grep 'inotifywait -r /opt/zimbra/data/ldap' | \
grep -v grep | awk '{print $2}' ) &>/dev/null
inotifywait -r /opt/zimbra/data/ldap -e modify \
-e attrib -e close_write -e moved_to -e moved_from \
--exclude 'logs\/log\.|accesslog' \
-e move -e delete -e delete_self
;;
dump_ldap)
#Extract the LDIF database and stream it
/opt/zimbra/libexec/zmslapcat /tmp/zimbraldif
cat /tmp/zimbraldif/ldap.bak
rm -rf /tmp/zimbraldif
;;
stream)
#Live-stream redolog
#Kill any hanging previous tail commands
kill -KILL $( ps aux | grep 'tail -c +0 -f /opt/zimbra/redolog/redo.log' | \
grep -v grep | awk '{print $2}' ) &>/dev/null
tail -c +0 -f /opt/zimbra/redolog/redo.log
;;
gather)
#Gather list of recent redologs from incremental backups and archive
find '/opt/zimbra/backup/sessions/incr-'*'/redologs' \
'/opt/zimbra/redolog/archive' \
-name 'redo*.log' -type f -mtime -$param -print 2>/dev/null | \
sort
;;
purge)
#Remove old archives
find /opt/zimbra/redolog/archive -type f -mtime +$param -exec rm {} \;
;;
query_incremental)
#Query whether incremental backups are scheduled
if which zmschedulebackup &>/dev/null && \
zmschedulebackup -q | \
grep -Eo 'i([[:space:]]+[0-9\*\-]+){5}' &>/dev/null; then
echo true
else
echo false
fi
;;
*)
rsync
;;
esac
fi
sudo chown zimbra:zimbra /opt/zimbra/live_sync/sync_commands
sudo chmod +x /opt/zimbra/live_sync/sync_commands
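Requests reach sync_commands as a single word on standard input, such as `stream` or `gather10`, and the script splits that word into a command name and an optional number. The sketch below isolates that parsing step using the same two grep expressions; `parse_sync_request` and the `sync_command` and `sync_param` variables are hypothetical names chosen for this example.

```shell
#!/bin/bash
# Hypothetical helper: split a request such as "gather10" into a command
# name and a numeric parameter.
# Sets the global variables sync_command and sync_param.
parse_sync_request() {
    local request=$1
    # Digits anywhere in the request become the parameter
    sync_param=$( echo "$request" | grep -Eo '[0-9]+' )
    # Letters and underscores become the command name
    sync_command=$( echo "$request" | grep -Eio '[a-z_]+' )
}
```

So `parse_sync_request gather10` yields the command `gather` with parameter `10`, while `parse_sync_request wait_redo` yields `wait_redo` with an empty parameter.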
- Configuration file
The configuration file simply contains the IP addresses of the live and mirror servers. The order is not important, since the script works out which is which by checking which IP address is assigned to the local machine. The file is named live_sync.conf, is saved in the /opt/zimbra/live_sync directory, and must be readable by user zimbra. The following is an example; use the real IP addresses of your own live and mirror servers.
server1=192.168.108.10
server2=192.168.108.11
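How the two addresses are turned into a local and a remote role can be sketched as below. This is an illustration, not the script's exact code: `resolve_roles` is a hypothetical name, and it compares whole addresses where the script uses a `grep` on the list of local addresses.

```shell
#!/bin/bash
# Hypothetical helper: decide which configured address belongs to this host.
# $1 = server1, $2 = server2, $3 = whitespace-separated local addresses
# Prints "local_address remote_address", or returns 1 if neither is local.
resolve_roles() {
    local server1=$1 server2=$2 addr
    for addr in $3; do
        [ "$addr" = "$server1" ] && { echo "$server1 $server2"; return 0; }
        [ "$addr" = "$server2" ] && { echo "$server2 $server1"; return 0; }
    done
    return 1
}
```

On a host that owns 192.168.108.11, `resolve_roles 192.168.108.10 192.168.108.11 "127.0.0.1 192.168.108.11"` prints `192.168.108.11 192.168.108.10`. Because the roles follow from the host the script runs on, the same live_sync.conf can be copied unchanged to both servers.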
- Enabling redo.log
For the Network edition, redo logs are already being created and are periodically moved to create incremental backups. For the open source version, redo log archiving must be enabled.
To see the current redo log related settings, type the following as user zimbra:
sudo su - zimbra -c "zmprov gacf | grep RedoLog"
To enable redo log rollover on the open source version, type…
sudo su - zimbra -c "zmprov mcf zimbraRedoLogDeleteOnRollover FALSE"
sudo su - zimbra -c "zmprov mcf zimbraRedoLogEnabled TRUE"
You may also want to make the redo log rotation more frequent to guarantee a file-system consistent redo log on the mirror server at least up to the last, say, thirty minutes. The live-streamed redo.log may not be consistent although it is unlikely this will ever be a problem except with the very last record in the log.
For example, to force rollover every half an hour, type…
sudo su - zimbra -c zmprov mcf zimbraRedoLogRolloverFileSizeKB 1
sudo su - zimbra -c zmprov mcf zimbraRedoLogRolloverMinFileAge 30
This will rollover if the size of the redo log is over 1KB after 30 mins, which is very likely unless the mail server is not sending or receiving any mail at all during this time.
You may want to reduce the zimbraRedoLogRolloverMinFileAge even further while setting and testing this script just so you don\’t have to wait too long to see stuff happening between the servers.
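To confirm that the two archiving settings took effect, the zmprov gacf output can be checked mechanically. The following is a small sketch; the function name check_redolog_settings is hypothetical, and it assumes that zmprov gacf prints one "attribute: value" pair per line.

```shell
#!/bin/bash
# Sketch only: check_redolog_settings is an illustrative helper.

# Reads "zmprov gacf" output on stdin and reports whether redo logging is
# enabled and rolled-over logs are kept rather than deleted.
check_redolog_settings() {
    local enabled="" delete="" key value
    while read -r key value; do
        case "$key" in
            zimbraRedoLogEnabled:) enabled=$value ;;
            zimbraRedoLogDeleteOnRollover:) delete=$value ;;
        esac
    done
    if [ "$enabled" = "TRUE" ] && [ "$delete" = "FALSE" ]; then
        echo "redo log archiving enabled"
    else
        echo "redo log archiving NOT enabled (enabled=$enabled, deleteOnRollover=$delete)"
        return 1
    fi
}

# Usage:
#   sudo su - zimbra -c "zmprov gacf" | check_redolog_settings
```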
Mirror Server
The mirror server should ideally have the same operating system as the live server and must have exactly the same version of Zimbra installed.
The hostname must also be exactly the same.
- Install inotify-tools
The inotify-tools package is required. For RHEL/CentOS, inotify-tools is provided by the EPEL repository, which is enabled by installing the epel-release package.
As user root (CentOS):
yum install epel-release -y
yum install inotify-tools -y
For Ubuntu, as user root:
apt install inotify-tools -y
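inotify-tools matters here because it lets a script notice the moment redo.log is moved out of the redolog directory into the archive, rather than discovering it on the next rsync run. The following sketch shows one way such a move could be detected with inotifywait. It is an illustration only, not the code from live_syncd, and the function names classify_event and watch_redolog are hypothetical.

```shell
#!/bin/bash
# Sketch only: classify_event and watch_redolog are illustrative helpers.

# classify_event EVENT FILENAME
# Prints "rollover" when the event says redo.log was moved out of the
# watched directory, and "ignore" for anything else.
classify_event() {
    local event=$1 file=$2
    case "$event:$file" in
        MOVED_FROM:redo.log) echo "rollover" ;;
        *) echo "ignore" ;;
    esac
}

# watch_redolog [DIR]
# Monitor DIR (default /opt/zimbra/redolog) and report each rollover.
# -m keeps inotifywait running, -q suppresses its start-up messages and
# --format '%e %f' prints "EVENT FILENAME" for each event.
watch_redolog() {
    local dir=${1:-/opt/zimbra/redolog}
    inotifywait -m -q -e moved_from --format '%e %f' "$dir" |
    while read -r event file; do
        if [ "$(classify_event "$event" "$file")" = "rollover" ]; then
            echo "redo.log was rolled over at $(date)"
        fi
    done
}
```

Running watch_redolog on a server with frequent rollover configured is a quick way to check that inotify-tools is installed and working.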
- Create log rotation
The script will create a log file which can be handled by logrotate. As user root:
echo "/opt/zimbra/live_sync/log/live_sync.log {
daily
missingok
copytruncate
rotate 7
notifempty
compress
}" >/etc/logrotate.d/zimbra_live_sync
- First rsync between the servers
We now perform the first copy of the zimbra directory between the live and mirror server. On the mirror server, we must stop Zimbra. We leave Zimbra running on the live server for now to reduce downtime. This is what we call the hot sync.
We're going to copy the whole of /opt/zimbra, so the script, its configuration and Zimbra itself are all copied to the second node (the mirror).
The following rsync command is run on the mirror server. Substitute live_server in the command below with the hostname or IP address of the live server.
Note: Copying the sparse files used by LDAP in Zimbra 8+ takes a very long time even though the files are small.
Generate an SSH key pair without a passphrase on the mirror server, transfer the id_rsa.pub key file to the live server and add it to the /root/.ssh/authorized_keys file before running the rsync command below.
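The key setup can be sketched as follows. The helper append_key_once and the temporary path /tmp/mirror_id_rsa.pub are hypothetical names chosen for illustration; the helper adds the key to authorized_keys only if it is not already there, so it is safe to run more than once.

```shell
#!/bin/bash
# Sketch only: append_key_once and the /tmp path below are illustrative.

# append_key_once PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Append the public key unless that exact line is already present.
append_key_once() {
    local pubkey=$1 auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -x whole line, -F fixed string, -f read the pattern from the key file
    if ! grep -qxF -f "$pubkey" "$auth"; then
        cat "$pubkey" >> "$auth"
    fi
}

# On the mirror server, as root (empty passphrase, default RSA key path):
#   ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
#   scp /root/.ssh/id_rsa.pub live_server:/tmp/mirror_id_rsa.pub
# On the live server, as root:
#   append_key_once /tmp/mirror_id_rsa.pub /root/.ssh/authorized_keys
```

Afterwards, ssh live_server true run as root on the mirror server should return without prompting for a password.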
NB: Remote root login should be enabled on the live_server. As user root:
service zimbra stop
rsync -aHvz -e "ssh -p 22" --force --delete --sparse live_server:/opt/zimbra/ /opt/zimbra/
- Second rsync between servers
This is where we need to stop Zimbra on the live server (cold sync) so that we can copy a consistent /opt/zimbra directory from the live to the mirror server. This is the only downtime required.
On the live server as user root:
service zimbra stop
On the mirror server as user root:
rsync -aHvz -e "ssh -p 22" --force --delete --sparse live_server:/opt/zimbra/ /opt/zimbra/
On the live server as user root:
service zimbra start
On the mirror server as user root (just to make sure we have a viable copy of Zimbra):
service zimbra start
service zimbra status
service zimbra stop
- Running the script
Not only have we copied all the Zimbra data from the live to the mirror server, we have also copied the script and SSH keys. We should now be able to try running the script.
On the mirror server, start the script as user zimbra:
sudo su - zimbra -c "cd /opt/zimbra/live_sync && ./live_syncd start"
All being well, the script has started without any complaints and we can now tail the log file to see that it is syncing as expected.
tail -f /opt/zimbra/live_sync/log/live_sync.log
(CTRL-C to exit tail command)
- Init script
There is also an init script that may be useful for Ubuntu and Redhat/CentOS.
- CentOS/Redhat
Save the script below as /etc/init.d/zimbra_live_sync.
sudo vim /etc/init.d/zimbra_live_sync
#!/bin/bash
#
# ***** BEGIN LICENSE BLOCK *****
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
# ***** END LICENSE BLOCK *****
#
#
# Init file for zimbra live sync
#
# chkconfig: 345 99 01
# description: Zimbra live sync service
#
### BEGIN INIT INFO
# Provides: zimbra_live_sync
# Required-Start: $network $remote_fs $syslog $time nscd cron
# Required-Stop: $network $remote_fs $syslog $time
# Default-Start: 3 5
# Description: Zimbra live sync service
### END INIT INFO
case "$1" in
  gentle-restart)
    su - zimbra -c "live_sync/live_syncd stop"
    su - zimbra -c "live_sync/live_syncd start"
    RETVAL=$?
    ;;
  restart)
    su - zimbra -c "live_sync/live_syncd kill"
    su - zimbra -c "live_sync/live_syncd start"
    RETVAL=$?
    ;;
  start)
    su - zimbra -c "live_sync/live_syncd start"
    RETVAL=$?
    ;;
  gentle-stop)
    su - zimbra -c "live_sync/live_syncd stop"
    RETVAL=$?
    ;;
  stop)
    su - zimbra -c "live_sync/live_syncd kill"
    RETVAL=$?
    ;;
  status)
    su - zimbra -c "live_sync/live_syncd status"
    RETVAL=$?
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|gentle-stop|gentle-restart|status}"
    RETVAL=1
    ;;
esac
exit $RETVAL
On both the live and the mirror server make the script executable and add the script to chkconfig.
CentOS/Redhat
chmod 755 /etc/init.d/zimbra_live_sync
chkconfig --add zimbra_live_sync
On the live server, make sure it doesn't start on boot.
chkconfig zimbra_live_sync off
On the mirror server, ensure that Zimbra doesn't start on boot but the live sync script does.
chkconfig zimbra off
chkconfig zimbra_live_sync on
Ubuntu
chmod 755 /etc/init.d/zimbra_live_sync
On the mirror server, ensure Zimbra doesn't start on boot but the live sync script does.
update-rc.d -f zimbra remove
update-rc.d zimbra_live_sync defaults
- Failover
If the live server fails, then the procedure on the mirror server is simply to stop the live_sync script, and start Zimbra.
sudo su - zimbra
cd /opt/zimbra/live_sync
./live_syncd stop
zmcontrol start
- Fallback
To fail back, simply run the script on the server you are failing back to; the live and mirror roles are now reversed. As user zimbra (on the ex-live server that is to be restored to live):
cd /opt/zimbra/live_sync
./live_syncd start
Once the script has caught up and the two servers are in sync, stop Zimbra on the mirror server.
On the mirror (failover) server:
sudo su - zimbra -c "zmcontrol stop"
On the live (restored) server:
sudo su - zimbra -c "cd /opt/zimbra/live_sync && ./live_syncd stop"
sudo su - zimbra -c "zmcontrol start"
Nagios Integration
The script also generates useful status information for Nagios. The time since the last successful operation is measured, and Nagios can raise an alert if any part of the script appears not to have succeeded for longer than expected.
The Nagios script is to be run on the current live server and can be run as any user that has read access to the status files created by live_syncd. These are in /opt/zimbra/live_sync/status. Save the script as check_zimbra_live_sync in the /opt/zimbra/live_sync directory.
sudo vim /opt/zimbra/live_sync/check_zimbra_live_sync
#!/bin/bash
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
##########################################################################
# Title : check_zimbra_live_sync
# Author : Simon Blandford <simon -at- onepointltd -dt- com>
# Date : 2012-08-25
# Requires : live_syncd
# Category : Administration
# Version : 2.0.0
# Copyright : Simon Blandford, Onepoint Consulting Limited
# License : GPLv3 (see above)
##########################################################################
# Description
# Nagios plug-in for Zimbra live sync script
##########################################################################
#******************************************************************************
#********************** Constants *********************************************
#******************************************************************************
ZIMBRA_DIR=/opt/zimbra
BASE_DIR=$ZIMBRA_DIR/live_sync
STATUS_DIR=$BASE_DIR/status
#Files that need age testing
LAST_GOOD_REDO_REPLAY=$STATUS_DIR/last_good_redo_replay
LAST_GOOD_REDO_SYNC=$STATUS_DIR/last_good_redo_sync
LAST_GOOD_REDO_STREAM=$STATUS_DIR/last_good_redo_stream
LAST_GOOD_LDAP_SYNC=$STATUS_DIR/last_good_ldap_sync
LAST_GOOD_LDAP_START=$STATUS_DIR/last_good_ldap_start
#******************************************************************************
#********************** Functions *********************************************
#******************************************************************************
usage () {
  echo "Usage: check_zimbra_live_sync -w <hours> -c <hours>"
  exit 3
}
#Age of a file in whole hours
file_age () {
  echo $(( ($( date +%s ) - $( stat -c %Y "$1" )) / 3600 ))
}
#Extract name of function being tested from file name
function_id () {
  echo "$1" | grep -Eo "[^_]*_[^_]*$"
}
file_report () {
  local file_under_test
  file_under_test=$1
  #No file returns UNKNOWN status
  if [ ! -f "$file_under_test" ]; then
    if [ ${#affected_function_list} -gt 1 ] && [ $status_code -eq 3 ]; then
      affected_function_list="$affected_function_list,$( function_id "$file_under_test" )"
    else
      affected_function_list=$( function_id "$file_under_test" )
    fi
    status_code=3
    status_msg="Unknown"
    return
  fi
  #Only measure the age once we know the file exists
  age_of_file_under_test=$( file_age "$file_under_test" )
  #Test for files older than the critical number of hours. Only test if no unknown status
  if [ $status_code -lt 3 ]; then
    if [ "$age_of_file_under_test" -ge "$c" ]; then
      if [ ${#affected_function_list} -gt 1 ] && [ $status_code -eq 2 ]; then
        affected_function_list="$affected_function_list,$( function_id "$file_under_test" )"
        affected_function_list="$affected_function_list($age_of_file_under_test hours)"
      else
        affected_function_list="$( function_id "$file_under_test" )($age_of_file_under_test hours)"
      fi
      status_code=2
      status_msg="Critical"
      return
    fi
  fi
  #Test for files older than the warning number of hours. Only test if no unknown or critical status
  if [ $status_code -lt 2 ]; then
    if [ "$age_of_file_under_test" -ge "$w" ]; then
      if [ ${#affected_function_list} -gt 1 ] && [ $status_code -eq 1 ]; then
        affected_function_list="$affected_function_list,$( function_id "$file_under_test" )"
        affected_function_list="$affected_function_list($age_of_file_under_test hours)"
      else
        affected_function_list="$( function_id "$file_under_test" )($age_of_file_under_test hours)"
      fi
      status_code=1
      status_msg="Warning"
      return
    fi
  fi
}
#******************************************************************************
#********************** Main Program ******************************************
#******************************************************************************
status_code=0
status_msg="OK"
affected_function_list="Live sync up to date"
#Parse the -w and -c options; the value follows its option
for i in "$@"; do
  if [ "$get_w" ]; then
    w=$( echo "$i" | grep -Eo "[0-9]+" )
    unset get_w
  fi
  if [ "$get_c" ]; then
    c=$( echo "$i" | grep -Eo "[0-9]+" )
    unset get_c
  fi
  [ "x$i" == "x-w" ] && get_w=1
  [ "x$i" == "x-c" ] && get_c=1
done
if [ ! "$w" ] || [ ! "$c" ]; then
  usage
fi
file_report "$LAST_GOOD_REDO_REPLAY"
file_report "$LAST_GOOD_REDO_SYNC"
file_report "$LAST_GOOD_REDO_STREAM"
file_report "$LAST_GOOD_LDAP_SYNC"
file_report "$LAST_GOOD_LDAP_START"
echo "$status_msg : $affected_function_list"
exit $status_code
Use the following line in /etc/nagios/nrpe.cfg or in the /etc/nagios/nrpe.d directory to allow the script to be called by nrpe.
command[check_zimbra_live_sync]=/usr/lib/nagios/plugins/contrib/check_zimbra_live_sync -w $ARG1$ -c $ARG2$
You may need to adjust the path to the Nagios plugin script depending on where it is actually stored, for example /opt/zimbra/live_sync/check_zimbra_live_sync if you saved it as described above.
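The plugin compares the age of each status file, in whole hours, against the -w and -c thresholds and exits with the usual Nagios codes (0 OK, 1 Warning, 2 Critical, 3 Unknown). The following simplified sketch shows that mapping for a single file. The function names age_hours and nagios_status are hypothetical and are not part of the plugin, which additionally combines the results for all five status files.

```shell
#!/bin/bash
# Sketch only: age_hours and nagios_status are illustrative helpers.

# age_hours NOW MTIME
# Whole hours between two epoch timestamps (integer division rounds down).
age_hours() {
    echo $(( ($1 - $2) / 3600 ))
}

# nagios_status AGE WARN CRIT
# Print the Nagios exit code and label for an age in hours.
nagios_status() {
    local age=$1 w=$2 c=$3
    if [ "$age" -ge "$c" ]; then
        echo "2 Critical"
    elif [ "$age" -ge "$w" ]; then
        echo "1 Warning"
    else
        echo "0 OK"
    fi
}
```

To try the real plugin by hand, run it with example thresholds (2 and 24 hours here are arbitrary) and inspect the exit code: /opt/zimbra/live_sync/check_zimbra_live_sync -w 2 -c 24; echo $?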
Conclusion
Replaying redo logs only requires that the mailbox process is stopped, and the script does this automatically. The script works whether or not Zimbra has been started on the mirror, as it enables or disables services as and when it needs them. Keeping the rest of Zimbra running drastically reduces the time it takes to fail over, although this is only an advantage when access to the server domain can be quickly flipped or has a failover mechanism of its own.