derwiki

Nov 08 2010

Using matplotlib to investigate an SMTP slowdown

At work, I’m on the team that ensures emails get sent out on time. Most of the time our email traffic is quite low, but we’ve got a few mass newsletters that go to many, many more recipients (our sending rate jumps about 3 orders of magnitude during these times). We log timing info every 100 messages and recently noticed that the overall sending throughput across our sending processes had dropped from about 17/s to 3/s — a huge slowdown. The logs indicated that the delay was all in handing messages off to the SMTP server, not any other part of the sending process.

To continue investigation, I made a testing branch and added per message logging in the SMTP library for the time it took the batch emailers to actually hand off the message to the SMTP servers (instead of every 100 messages). On visual inspection, this showed that a flurry of messages would get accepted at about ~.01/s and then increase to about ~1.01/s for a while, then drop back to ~.01/s. I wrote a quick script[1] that would make a pretty graph of the timing delays and started fiddling with some parameters to try and figure out a pattern. For testing purposes I used 4 fake mass emails in the development environment that each had myself as the recipient 2000 times (I also set X-DISCARD-ME in the message headers which prevented Postfix from actually sending them, but used my email address just in case). Since our development environment has its own internal sendmail processes, I had to set up a tunnel through a production machine to our SMTP cluster to send mail exactly as its done in production.

4 senders running in parallel, connecting directly to smtplow.mycorp.com (roundrobin of smtp{1,3,4}low.mycorp.com):
4senders-roundrobin

The slowdowns seem to cluster around each other but nothing jumps out (other than the delay seems to add almost -exactly- 1.0 seconds to the sending time). We sort of ruled out garbage collection as the delay because that should only affect a single message, not a cluster of them.

4 senders running in parallel, connecting directly to smtp{1,3,4}low.mycorp.com respectively:
4senders-smtp1

4senders-smtp3

4senders-smtp4

This is where it started to get interesting — each sender followed -exactly- the same delay pattern on the same machine. Garbage collection is completely ruled out and we start to focus on the SMTP server itself

1 sender, connecting directly to smtp4low.mycorp.com:
1sender-smtp4

Still seeing the cycles, but overall runtime is about half as when we were sending 4 in parallel. At this point, a coworker on the systems team found a Postfix setting that would introduce 1 second delays when the server felt that it was getting overwhelmed (this setting was not in our Postfix config and defaults to 1 second unless specified). 1 second seemed like a very special number, as that was exactly the same delay we saw.

1 sender, connecting directly to smtp4low.mycorp.com with Postfix 1 second delay removed:
1sender-smtp4-nodelay

Right back to where it should be.

4 senders, connecting directly to smtplow.mycorp.com (roundrobin) with Postfix 1 second delay removed:
4senders-smtp4-nodelay

And that wraps it up — everything is back to normal with the standard roundrobin SMTP set up. Debugging takeaways:

  • If you start seeing delays that are multiples of very round numbers (1.0s, 2.0s, etc) it’s probably an arbitrary delay like a sleep statement.
  • Learn how to use matplotlib to make sense of data (thanks to my coworker Jason for showing me). Patterns that aren’t apparent in vim are startlingly apparent when visualized.
  • Do some sort of automated load testing (even if it’s not completely pretty) so you can quickly change options and see the results. This investivation only got serious after I was able to send 4x2000 messages with less than a minute set up time.


[1]

# commands to start senders in a 4 different windows on a screen session
python batch/batch_email_send.py -v --sender_name=mycorp.derwiki.a --email_id=20583090 start 2>&1 | grep -v query | tee sender.a.log
python batch/batch_email_send.py -v --sender_name=mycorp.derwiki.b --email_id=20592382 start 2>&1 | grep -v query | tee sender.b.log
python batch/batch_email_send.py -v --sender_name=mycorp.derwiki.c --email_id=20592342 start 2>&1 | grep -v query | tee sender.c.log
python batch/batch_email_send.py -v --sender_name=mycorp.derwiki.d --email_id=20592528 start 2>&1 | grep -v query | tee sender.d.log

# pull out just "timestamp,smtp_time_spent" from the log lines
grep smtp_spent_time sender.a.log | cut -c49-100 > sender.a.parsed.log
grep smtp_spent_time sender.b.log | cut -c49-100 > sender.b.parsed.log
grep smtp_spent_time sender.c.log | cut -c49-100 > sender.c.parsed.log
grep smtp_spent_time sender.d.log | cut -c49-100 > sender.d.parsed.log

# Python code to make a pretty graph
from __future__ import with_statement
senderlog = dict()
# this could be written a lot pretty to make it trivial to scale up the number of senders
# but seeing as how upping the number of senders meant modifying the other parts of the
# setup, it didn't seem like a worthwhile optimization to make
sendernames = ('a', 'b', 'c', 'd')
for sendername in sendernames:
    with open('sender.%s.parsed.log' % sendername, 'r') as sender:
        senderlog[sendername] = []
        for line in sender:
            senderlog[sendername].append(line.strip().split(',')[1:])

time_start = min(float(timestamp) for timestamp, value in senderlog['a'][:] + senderlog['b'][:] + senderlog['c'][:] + senderlog['d'][:])

# make first timestamp be t0 and offset everything from that
offset_senderlog = dict()
for sendername, timinglist in senderlog.iteritems():
    offset_senderlog[sendername] = [(float(timestamp) - time_start, value) for timestamp, value in timinglist]

# quick and dirty
a = [row[0] for row in offset_senderlog['a']]
b = [row[1] for row in offset_senderlog['a']]
c = [row[0] for row in offset_senderlog['b']]
d = [row[1] for row in offset_senderlog['b']]
e = [row[0] for row in offset_senderlog['c']]
f = [row[1] for row in offset_senderlog['c']]
g = [row[0] for row in offset_senderlog['d']]
h = [row[1] for row in offset_senderlog['d']]

from matplotlib import pyplot as plt
plt.plot(a, b, '-', c, d, '-', e, f, '-', g, h, '-')
plt.show()
Comments (View)
blog comments powered by Disqus