How a bulk marketing campaign starved transactional email at Prachyam Studios — and how I fixed it by understanding Postfix queue internals and building a priority queue on top of Mailcow.
It was a Tuesday afternoon when the support tickets started coming in.
"Password reset email never arrived." "Can't log in, OTP not received." A user who'd been waiting forty minutes for a simple reset link. Then another. Then five more. Our Mailcow dashboard showed no errors. Postfix wasn't bouncing anything. The queue had messages in it — a lot of them, actually — and they were moving. Just not the ones that needed to move.
We were two months into running a self-hosted Mailcow stack at Prachyam Studios, a media company with two offices (Pune and Varanasi), 20 people, and an active marketing operation. I'd proposed the setup myself in a cost-reduction meeting: take the marketing email budget we were burning on Mailchimp — conservatively $300,000 for the 6-month campaign we had planned, given our ~350 million-record target dataset sat far above Mailchimp's 200k-contact Premium cap — and replace it with a self-hosted stack running on RackNerd VPS instances. Total infra cost for the campaign: roughly ₹9,000 (~$97). The math was obviously right. The configuration, as it turned out, needed work.
Six RackNerd VPS servers. Twelve custom sending domains, each with SPF, DKIM, and DMARC configured — rotating domains and IPs so no single deliverability incident could take down the whole operation. Mailcow running as a Docker Compose stack on each server: Postfix for the MTA, Dovecot for IMAP, Rspamd for spam filtering, the full bundle. Internal team mail for all 20 people ran through the same servers. So did onboarding emails, password resets, OTPs — every transactional notification the platform sent.
That last part is where it went wrong.
The marketing team had launched a new batch. Nothing unusual — they'd done several runs by this point. This one was a larger send, pushing out into the deeper segments of the dataset. I was focused on something else when the first ticket came in. By the time the third one landed, I was looking at the queue.
The numbers looked fine at first glance. Messages were processing. Rspamd scores were clean. No obvious delivery failures in /var/log/mail.log. I ran postqueue -p on the active sending server and got back several thousand lines. The queue was full — not stuck, just backed up with the marketing batch. That's when the shape of the problem started to form.
I piped the output and looked at the message timestamps:
postqueue -p | grep "^[A-F0-9]" | awk '{print $3, $4, $5, $6}' | head -40
The oldest messages in the queue were from an hour ago. They were transactional — notifications, a password reset, two OTP sends. Sitting behind thousands of marketing messages that had arrived after them but had the same queue priority: zero. Postfix doesn't care what an email contains. It processes the queue roughly in arrival order, with some nuance for connection limits and retry intervals — but there is no concept of "this one matters more than that one" in a default configuration.
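A quick census of who owns the queue makes this kind of starvation obvious. Something like the following groups queued mail by sender domain (the field positions assume the stock postqueue -p layout and short hexadecimal queue IDs):

# First line of each postqueue -p entry:
#   <queue-id>[*!]  <size>  <day mon dd hh:mm:ss>  <sender>
postqueue -p \
  | awk '/^[0-9A-F]/ && NF >= 7 { if (split($7, p, "@") == 2) print p[2] }' \
  | sort | uniq -c | sort -rn | head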
My first wrong assumption was that Rspamd was throttling something. I checked the Rspamd web interface and the greylisting logs. Nothing unusual. My second wrong assumption was that the sending rate limits were too aggressive on that particular server. I checked smtp_destination_rate_delay and smtp_destination_concurrency_limit. They were fine — tuned appropriately for the volume. The marketing mail was moving at a good clip.
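Both are one postconf call away; when they aren't set explicitly, they fall back to the default_destination_* values of 0s and 20:

# Effective per-transport limits for the stock smtp transport
postconf smtp_destination_rate_delay smtp_destination_concurrency_limit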
That was exactly the problem. The marketing mail was moving at a good clip, and it had an hours-long head start.
Postfix manages several queue directories: incoming, active, deferred, hold, corrupt. The active queue is what's actually being delivered — Postfix moves messages from incoming to active up to a configurable limit (qmgr_message_active_limit, default 20,000 messages). Once active is full, new incoming messages wait in incoming regardless of their priority.
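You can watch the stages directly on disk. The paths below assume the stock queue_directory; inside Mailcow, run this in the postfix-mailcow container:

# Count the messages sitting in each queue stage
for q in incoming active deferred; do
  printf '%-9s %s\n' "$q" "$(find /var/spool/postfix/$q -type f | wc -l)"
done

# The admission cap on the active queue
postconf qmgr_message_active_limit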
When a marketing batch drops 50,000 messages into incoming at once and the active queue fills with them, a transactional message that arrives mid-batch joins the back of the incoming line. It won't see the active queue until the batch drains enough to make room — and at the sending rates we were running, that took a long time.
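The qshape(1) tool that ships with Postfix makes the head start visible as an age histogram: one row per domain, one column per age bucket (sender domains with -s, recipient domains by default). During the incident it would have shown one sender domain owning nearly every young bucket while everything else aged:

# Sender-domain age distribution of the incoming and active queues
qshape -s incoming active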
This is not a bug in Postfix. It's doing exactly what it's configured to do: deliver mail. It's just that "deliver mail" and "deliver this mail first" are two different requirements, and the default configuration only satisfies the first one.
The fix Postfix provides for this is transport maps — the ability to route different mail through different delivery "transports," each with its own rate limits, concurrency settings, and queue behavior. Two transports means two queues, independent of each other. A transactional message routed through the internal transport will never compete with a marketing message routed through the bulk transport for the same active queue slot.
The implementation required three pieces working together.
First, a custom transport definition in master.cf — two separate SMTP transports, one for bulk and one for internal/transactional sends:
# /etc/postfix/master.cf (appended)
# service   type  private unpriv  chroot  wakeup  maxproc command
bulk        unix  -       -       n       -       20      smtp
internal    unix  -       -       n       -       50      smtp

Both entries run the stock smtp(8) delivery agent; all that differs is the name and the process limit. The names are the hook: the queue manager looks up each transport's scheduling limits by them, which is piece three below. The intent is that the bulk transport is deliberately rate-limited, with a 2-second delay between deliveries per destination and low concurrency, while the internal transport runs with no artificial delay and higher concurrency, so transactional messages move as fast as the destination will accept them.
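A quick sanity check that master picked both services up (postconf -M queries master.cf entries on any recent Postfix):

# Show the two new service entries as master.cf sees them
postconf -M bulk/unix internal/unix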
Second, a map in main.cf to route mail by sender domain. There's a catch here: transport_maps matches on the recipient, and both streams deliver to arbitrary external mailboxes, so recipient-based routing can't tell them apart. The routing has to key on the envelope sender, which is what sender_dependent_default_transport_maps does:

# /etc/postfix/main.cf
sender_dependent_default_transport_maps = hash:/etc/postfix/transport_maps
# Anything not matched in the map falls through to default_transport
default_transport = internal:

# /etc/postfix/transport_maps
# Marketing sender domains → bulk transport
@campaigns.prachyam.com    bulk:
@marketing.prachyam.com    bulk:

Third — and this is the part that's easy to miss — the per-transport limits have to go where the queue manager can actually see them. The *_destination_* scheduling parameters are read by qmgr(8) from main.cf, keyed on the master.cf service name. Setting them as -o overrides on the bulk and internal entries looks plausible (it's the version most forum posts hand you), but those overrides only configure the smtp(8) delivery agents and never reach the queue manager that does the scheduling:
# /etc/postfix/main.cf
# qmgr(8) reads these here, keyed on the master.cf service names
bulk_destination_rate_delay = 2s
bulk_destination_concurrency_limit = 10
bulk_extra_recipient_limit = 10
internal_destination_rate_delay = 0s
internal_destination_concurrency_limit = 50
# The active-queue cap itself stays global, at its default
qmgr_message_active_limit = 20000

After running postmap /etc/postfix/transport_maps and reloading Postfix, the change was immediate. I watched postqueue -p in one terminal and the mail log in another. The transactional messages that had been sitting in queue for over an hour cleared within minutes. The marketing batch kept going, unaffected — it just no longer owned the whole queue.
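The apply-and-verify loop, roughly as I ran it: postmap -q replays the exact lookup Postfix will make, and the last line of postqueue -p is a running total:

postmap /etc/postfix/transport_maps
postfix reload

# Replay the sender-domain lookup
postmap -q "@campaigns.prachyam.com" hash:/etc/postfix/transport_maps
# should print: bulk:

# Watch the backlog drain; the summary line reads
# "-- NNN Kbytes in NNN Requests."
watch -n 10 'postqueue -p | tail -n 1'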
The starvation problem went away. Not "improved" — went away. Internal mail started delivering within its normal window regardless of what the marketing batch was doing. The two queues were independent, and they stayed that way through the rest of the campaign run.
Across the full 6-month campaign, the stack processed approximately 20 million outbound emails to the marketing dataset, plus internal mail for 20 people. Total infra cost: ₹14,000/year (~$151) for six servers and twelve domains. The Mailchimp alternative for the same volume would have been a custom enterprise quote north of $300,000. Even AWS SES + Listmonk — the credible DIY middle ground — would have run around $7,500 for the campaign run alone.
The priority queue configuration wasn't a huge amount of code. The relevant master.cf and main.cf additions come to about ten lines, and the transport map is straightforward. What took time was understanding why the default configuration didn't handle this, rather than just cargo-culting a solution from a forum post. The queue semantics, the transport abstraction, the way qmgr manages active slots — once those clicked, the fix was obvious.
What would I do differently? The honest answer: separate the transactional and bulk sending paths earlier, before they're sharing a server at all. The priority queue configuration works, and it worked throughout the rest of the campaign, but it's a configuration solution to an architecture problem. If the marketing blast had been running through a dedicated server from the start, the starvation could never have happened — there's nothing to share.
The reason I didn't do that initially was operational: fewer servers means fewer log streams to watch, fewer configs to keep in sync, fewer things to go wrong in unfamiliar ways. For a team without a dedicated DevOps person, that reasoning held up. But the configuration complexity of the priority queue — the transport maps, the per-transport rate limits, the DKIM alignment across queue classes — is its own kind of operational surface. Whether it's cheaper than a second server depends on how much you trust your own Postfix knowledge, which before this incident I'd overestimated.
I'd also automate the DNS verification checks post-DKIM rotation across all twelve domains. The current setup requires manually running dig against each domain to confirm SPF, DKIM, and DMARC records are propagating correctly. A small shell script could catch misconfigurations before they affect deliverability instead of after. That's a gap that hasn't bitten us yet, but only because we've been careful.
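A sketch of that script. The selector and resolver are assumptions (Mailcow's default DKIM selector is dkim; substitute whatever each domain actually uses), and the domain list is abbreviated:

#!/usr/bin/env bash
# Verify SPF, DKIM, and DMARC TXT records for every sending domain.
set -u
SELECTOR="dkim"          # assumed Mailcow default selector
RESOLVER="@1.1.1.1"      # external resolver, avoids stale local caches
DOMAINS="campaigns.prachyam.com marketing.prachyam.com"   # ...all twelve

fail=0
for d in $DOMAINS; do
  spf=$(dig +short "$RESOLVER" TXT "$d" | grep -c 'v=spf1')
  dkim=$(dig +short "$RESOLVER" TXT "$SELECTOR._domainkey.$d" | grep -c 'v=DKIM1')
  dmarc=$(dig +short "$RESOLVER" TXT "_dmarc.$d" | grep -c 'v=DMARC1')
  [ "$spf" -ge 1 ]   || { echo "MISSING SPF:   $d"; fail=1; }
  [ "$dkim" -ge 1 ]  || { echo "MISSING DKIM:  $SELECTOR._domainkey.$d"; fail=1; }
  [ "$dmarc" -ge 1 ] || { echo "MISSING DMARC: _dmarc.$d"; fail=1; }
done
exit $fail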
The mail stack is handed off to the Prachyam team now. The priority queue is running. The cost savings are permanent as long as the servers stay up. But the thing I think about is those transactional messages sitting in the queue for over an hour — the users who couldn't log in, couldn't reset their passwords, were just waiting. A configuration decision I hadn't made yet was the direct cause of that. Getting the fix right required understanding the system, not just fixing the symptom. That's the part worth keeping.