Back to blog

From Lag to Leading: A Deep Dive into Fixing a 2-Hour PostgreSQL Replication Lag

Jul 23, 2025
5 min read

From Lag to Leading: A Deep Dive into Fixing a 2-Hour PostgreSQL Replication Lag

Every engineer responsible for a large database fears the “big one” — a problem so severe it brings operations to a crawl. We recently…


From Lag to Leading: A Deep Dive into Fixing a 6-Hour PostgreSQL Replication Lag

Every engineer responsible for a large database fears the “big one” — a problem so severe it brings operations to a crawl. We recently faced this crisis: a high-throughput indexer processing millions of daily transactions on a powerful PostgreSQL server had its read replica fall behind by over six hours. Queries were timing out, and our system was on the brink.

This is the story of how we diagnosed the cascading failures and brought the system back from the edge, not just by treating the lag itself, but by fixing the underlying issues that caused it.

Part 1: The First Clue — Checkpoint Hell

Our investigation started on the primary server. It was struggling to keep up, with disk I/O and CPU usage constantly high. The replication lag was a symptom, but the disease was on the primary.

  • The Diagnostic Step: We enabled a crucial logging parameter: log_checkpoints = on. This gave us visibility into one of PostgreSQL's most I/O-intensive operations.
  • The Smoking Gun: The logs were terrifying. We saw messages like **checkpoints are occurring too frequently (11 seconds apart)** and a constant stream of **checkpoint starting: wal**. The database was in "checkpoint hell," desperately trying to flush data to disk because it was running out of transaction log space.
  • The Fix: The root cause was a **max_wal_size** setting that was too small for our write volume. On a powerful server with 128 GB of RAM, the default 1 GB was a drop in the ocean. We increased it to 16 GB.
  • The Result: The effect was immediate. The frantic, size-based checkpoints stopped. The logs shifted to a calm, predictable **checkpoint starting: time**, occurring every 5 minutes. The primary server was stable, but our work wasn't done.

Part 2: The Second Problem — A Bloated Replica

Despite a healthy primary, the replica was still lagged and slow. The bottleneck had shifted. It wasn’t about the primary sending data anymore; it was about the replica’s ability to receive and apply it.

  • The Diagnostic Step: We queried the pg_stat_all_tables view to inspect the health of our tables.
  • The Horrifying Discovery: We found severe table bloat. One critical table had over 44 million dead tuples. The replica was a digital graveyard. Every query and every replicated write had to sift through this mountain of junk data, killing performance.
  • The Quick Check: We also confirmed that **hot_standby_feedback** was enabled on the replica. This is a critical setting that prevents long-running read queries from causing replication conflicts.

Part 3: The Cleanup — Unleashing Autovacuum

The path forward was clear: we had to get rid of the bloat. Instead of a one-time manual VACUUM, we opted to create a sustainable, automated solution by making the autovacuum process hyper-aggressive.

  • Global Tuning: First, we increased the number of workers and made them faster by setting **autovacuum_max_workers = 6** and **autovacuum_vacuum_cost_delay = 2ms**.
  • The Critical Per-Table Fix: The most important change was telling autovacuum when to start. For our massive tables, the default trigger (20% of rows changed) would never happen. We ran the following command on our most bloated tables:

ALTER TABLE your_bloated_table SET (autovacuum_vacuum_scale_factor = 0.001);

  • This forced autovacuum to run after only 0.1% of rows changed, ensuring constant, proactive cleanup.

Part 4: Interpreting the Healing Phase 🩺

After applying the autovacuum settings, we saw a curious pattern: the checkpoint starting: wal messages returned. For a moment, it felt like a step backward, but it was actually a sign of success.

The newly aggressive autovacuum was working so hard cleaning up the bloat that its own operations were generating a massive amount of WAL. This activity was so intense that it temporarily outpaced our 5-minute checkpoint timer. The database wasn't in trouble; it was healing itself. After a few hours, as the worst of the bloat was cleared, the WAL generation subsided, and the checkpoints returned to their stable, time-based schedule.

Conclusion: Our Key Takeaways

This experience reinforced several core database administration principles:

  1. Replication lag is a symptom, not the disease. The root causes were I/O bottlenecks and table bloat.
  2. Tune your primary first. A stable primary is the foundation of healthy replication. Ensure max_wal_size is appropriate for your workload and hardware.
  3. Bloat is a silent killer. For write-intensive systems, aggressive autovacuum isn't just a good idea; it's a requirement. Don't trust the default settings on very large tables.
  4. Monitor everything. You can’t fix what you can’t see. Using tools like log_checkpoints and system views like pg_stat_all_tables is essential for proper diagnosis.
  5. Be patient. After applying a fix, give the system time to react and stabilize. The “healing phase” can look noisy, but it’s often a sign that your solution is working.

#PostgreSQL #DatabaseTuning #ReplicationLag #PerformanceTuning #DevOps #SRE #Database #HighAvailability #SoftwareEngineering #Tech #CaseStudy #Backend #CloudSQL

By Jatin Jain Saraf on July 23, 2025.