Building a Disaster Recovery Setup for a Production App with MySQL Replication

The Setup

I run a Node.js application for an internal company tool. It lives on a single VPS, serves several hundred thousand orders worth of data, and has been quietly working for years. The setup is simple: Nginx in front, Node app managed by PM2, MySQL 8 storing everything in a single database.

There is just one problem. If that single VPS goes down, the entire app goes down with it. The “backup” was a shell script that supposedly dumped the database to Google Drive every night.

Spoiler: when I checked, it had not run in three years. The OAuth token was expired, the cron job was never installed, and the backup folder was empty. So I did what any reasonable engineer would do: I panicked a little, then started building a real disaster recovery setup.

Why Not Just Fix the Backup?

I considered it. Reinstalling rclone, getting fresh OAuth credentials, scheduling daily dumps to cloud storage. That would have been the easy path. But backups have two problems that bothered me:

They are stale by definition. A daily backup means losing up to 24 hours of orders if the server dies at the wrong moment.
Restoring is slow. Pulling a 1 GB SQL dump from cloud storage, importing it on a fresh server, reconfiguring DNS, that is hours of downtime, not minutes.

What I actually wanted was something different. I wanted a second server that always has the latest data, ready to take over when needed. That is what database replication gives you.

The Plan

I already had another server in a different region, running a handful of WordPress sites and Docker containers. I’ll call the production server Server A and the backup server Server B for the rest of this article. The plan looked like this:

Server A (master)             Server B (slave)
MySQL 8.0.27 :3306     →     Docker MySQL 8.0.27 :3307
production DB (writable)      production DB (read-only mirror)

When Server A goes down:

Promote the slave on Server B to master (now writable)
Point a backup app at the slave
Tell users to use the backup subdomain until Server A recovers
When Server A comes back, run a script that syncs the new data back

No magic, no auto-failover, no DNS tricks. Just a clear manual procedure that takes about 5 minutes and is safe to execute under pressure.

Why Docker for the Slave?

Server B already runs MySQL 8.0.24 managed by a control panel, serving WordPress sites. I had three choices:

Upgrade the existing MySQL to 8.0.27 to match the master, risky, because the panel manages MySQL its own way and an upgrade could break dozens of production sites.
Run the slave as a Docker container on a different port, fully isolated, zero impact on the existing stack.
Use MariaDB or some other compromise, adds complexity without solving the actual problem.

Docker won easily. If the container misbehaves, I just stop it. No collateral damage. I bound it to localhost on a non-default port so it is never exposed to the internet, gave it a hard memory cap so it cannot OOM the host, and pointed its data directory to the SSD-backed storage volume.

Setting Up the Master

The master setup on Server A was the smaller half of the work. Three things needed to change in MySQL:

# /etc/mysql/mysql.conf.d/replication.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
binlog_expire_logs_seconds = 604800
gtid_mode = ON
enforce_gtid_consistency = ON
max_binlog_size = 100M

Then a replication user, scoped to only allow connections from Server B’s IP address:

CREATE USER 'replicator'@'<server-b-ip>'
  IDENTIFIED WITH mysql_native_password BY '...';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<server-b-ip>';

A firewall rule to allow port 3306 only from Server B:

ufw allow from <server-b-ip> to any port 3306 proto tcp

A MySQL restart later, and the master was ready.

The Critical Decision: GTID

This is the part where someone newer to MySQL replication might walk into a trap.

There are two ways to track replication state in MySQL:

Binary log file and position: the old way. You manually note “we are at file mysql-bin.000001 position 12345” when you take the snapshot, and the slave reads from there.
GTID (Global Transaction Identifier): the new way. Every transaction gets a unique ID, and the slave automatically tracks which IDs it has already applied.

The old way works, but switching the direction of replication later (which is exactly what you need to do during failback) is painful. You have to manually map binlog positions across servers, and one wrong number means data corruption.

GTID makes that switch trivial. The slave knows exactly which transactions it has consumed, regardless of which server it is reading from. For any setup where you might need to fail over and back, GTID is non-negotiable.

Setting Up the Slave

The Docker side was straightforward, with one annoying surprise.

docker run -d \
  --name mysql-slave \
  --restart unless-stopped \
  --memory 768m \
  -p 127.0.0.1:3307:3306 \
  -v /storage/mysql-slave/data:/var/lib/mysql \
  -v /storage/mysql-slave/conf/slave.cnf:/etc/mysql/conf.d/slave.cnf:ro \
  -e MYSQL_ROOT_PASSWORD='...' \
  -e MYSQL_DATABASE=<dbname> \
  mysql:8.0.27

The surprise: my first slave config included read_only=1 and super_read_only=1. Those settings are correct for a slave, but they apply during container initialization too, which means MySQL refuses to set the root password during init. The container starts with an empty root password and you cannot fix it because super_read_only=1 blocks the ALTER USER statement.

The fix was to start without those flags, let MySQL initialize properly, and only enable read-only mode after replication is running:

SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;

A small detail, but the kind of thing that costs 30 minutes the first time.

The Initial Sync

The database was around a gigabyte. The two servers do not have direct SSH access to each other, but my local machine has access to both. So I used the laptop as a relay:

ssh user@server-a "mysqldump -u user -p... \
    --single-transaction --set-gtid-purged=ON \
    --routines --triggers --events <dbname> 2>/dev/null | gzip" \
| ssh root@server-b "gunzip | docker exec -i \
    mysql-slave mysql -uroot -p... <dbname>"

That single pipeline streams the dump from Server A, through my laptop, into the Docker container on Server B. No temporary files, no scp dance. Took about three minutes for a gigabyte of data.

The --set-gtid-purged=ON flag is the important one. It tells the dump to include the master’s GTID position, so the slave knows exactly where to start reading the binary log when replication kicks in.

Starting Replication

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='<server-a-ip>',
  SOURCE_PORT=3306,
  SOURCE_USER='replicator',
  SOURCE_PASSWORD='...',
  SOURCE_AUTO_POSITION=1;

START REPLICA;

The first SHOW REPLICA STATUS\G looked promising and concerning at the same time:

Replica_IO_Running: Connecting
Replica_SQL_Running: Yes
Last_IO_Error: error connecting to master ... (110)

Error 110 is connection timeout. The slave could not reach the master. Sound familiar? The firewall was blocking the connection.

Wait, I had already added the rule. So I tested manually from inside the Docker container:

docker exec mysql-slave \
  mysql -h<server-a-ip> -ureplicator -p... -e 'SELECT 1'

That worked. The container could connect just fine. The replication thread had retried so many times that it was now in a backoff loop. A simple STOP REPLICA; START REPLICA; woke it up.

Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0

Replication was alive. To verify, I inserted a test row on the master:

-- on master
CREATE TABLE _repl_test (id INT, msg VARCHAR(100));
INSERT INTO _repl_test VALUES (1, 'hello from master');

Two seconds later on the slave:

+----+-------------------+
| id | msg               |
+----+-------------------+
|  1 | hello from master |
+----+-------------------+

Confirmed.

The Failback Problem

Promoting the slave is easy. Three SQL statements:

STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 0;
SET GLOBAL super_read_only = 0;

But what about the reverse? When Server A comes back online, it has stale data. Meanwhile, Server B has been collecting new orders for hours or days. You cannot just point users back at Server A and start replication in the original direction, you would lose all the new data.

The procedure looks like this:

Stop the app on Server B to freeze writes
Dump the current state of Server B’s database
Drop and recreate the database on Server A
Import the dump into Server A
Set Server A as a temporary slave of Server B to catch up any final writes
Once caught up, switch direction: Server B back to slave, Server A back to master
Restart the app on Server A

That is seven steps, each with its own sub-commands and confirmations. Doing this manually under stress is asking for a mistake. So I wrote a script.

The Script

The result is a single bash script with an interactive menu:

═══════════════════════════════════════════════════════════════
  DISASTER RECOVERY MENU
═══════════════════════════════════════════════════════════════

  1) FAILOVER  : Server A is down, switch to Server B
  2) FAILBACK  : Server A is back online, restore normal flow
  3) STATUS    : Check current replication status
  0) Exit

Each option walks through every step, asks for confirmation before destructive operations, prints colored progress, and reminds the operator (me) what to do next. The script handles:

Promoting the slave with the right MySQL commands
Updating .env files automatically and creating timestamped backups
Restarting PM2 with the correct paths and home directory
Dumping, transferring, and importing data during failback
Setting up replication in either direction with GTID auto-position
Verifying that everything is in the expected state at the end

The whole point is that two months from now, when Server A actually goes down at 3 AM on a weekend, I do not need to remember any of this. I just SSH into Server B, run the script, pick option 1, and the app keeps serving users.

Lessons Learned

Replication is not backup. Replication protects against hardware failure. It does not protect against DROP TABLE, application bugs that corrupt data, or ransomware. You still need actual backups. A provider’s daily snapshot feature, often cheap, is the cheapest insurance you will ever buy.
GTID-based replication is worth the small upfront effort. Switching replication direction without GTID is a nightmare. With GTID, it is mechanical.
Isolate risky changes with Docker. Putting the slave in a container instead of upgrading the existing MySQL on Server B meant zero risk to the dozens of WordPress sites already running there. If the experiment failed, I could just docker rm the container and walk away.
Memory limits are mandatory in shared environments. Server B was already running close to its memory limit. Without a hard memory cap, the slave’s InnoDB buffer pool could have grown until it triggered the OOM killer on something else.
Automate the panic procedures. A clear, tested failover script is worth more than perfect architecture. The best disaster recovery plan is the one you can actually execute when half your monitoring is screaming and you have not had coffee yet.
Test the drill before you need it. This article is partly a drill log. Insert test data, watch it appear on the slave, run the failback flow on a non-critical hour. Discover the rough edges in calm conditions, not during an actual outage.

What I Did Not Build

I deliberately stopped short of two things that would have been overkill for this use case:

Automatic failover. This is an internal company tool, not a public service. If Server A dies, sending a message in a group chat saying “use the backup URL for now” is a perfectly acceptable user experience. Building automatic DNS failover, health checks, and split-brain detection would have tripled the complexity for marginal benefit.
Multi-region replicas. Both servers are in the same general region. If the entire region of two different providers goes down at the same time, the database is the least of my problems.

Disaster recovery should be proportional to risk. Over-engineering it is its own kind of risk.

Closing Thought

The whole setup took about an hour from start to finish, including writing the script and validating the round trip. The only thing that took longer than expected was the Docker initialization quirk with super_read_only, and even that was 30 minutes.

The interesting realization is how little of this is novel. Replication has been a feature of MySQL for over 20 years. GTID has been around since 2013. Docker is a decade old. The only modern part is the bash script wrapper, and that is just plumbing.

What changed is that the cost of doing this properly has dropped to almost zero. A decade ago, this kind of setup would have required a dedicated server, a sysadmin, and a six-figure budget. Today it costs an extra few dollars a month and an hour of focused work.

If you have a single-server production app and no disaster recovery story, you have no excuse.