Recently a client was receiving complaints that their busy server hosting both their WordPress sites and their OpenX (( advertising is a necessary evil, right? )) banner delivery was underperforming. Specifically, sites including their banners were seeing page loads hang on them. If you’re in the business of selling banners this is bad news. There were reports of the WordPress sites being slow too, but mostly from administrators (( and that turned out to be a pagination issue )) rather than site visitors.
I sorted out a bunch of request amplification issues but things still weren’t right, so I added a second server to help out. Instead of just chucking the combined traffic at both servers I used HAProxy to separate out the traffic to each, with a view to adding more OpenX servers as necessary.
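For illustration, a split like this might look roughly like the fragment below. The backend names (wordpress, openx) and server names (app01, app02) match the stats that follow; the hostname ads.example.com, the addresses, and the choice to route on the Host header are all assumptions made for the sketch, not the client's actual configuration.

```
# Sketch only: hostname and addresses are hypothetical.
frontend www
    bind *:80
    mode http
    # Send banner traffic to the OpenX backend, everything else to WordPress.
    acl is_openx hdr(host) -i ads.example.com
    use_backend openx if is_openx
    default_backend wordpress

backend wordpress
    mode http
    server app01 192.0.2.10:80 check

backend openx
    mode http
    # Further OpenX servers can be added here as extra "server" lines.
    server app02 192.0.2.20:80 check
```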
Here’s what HAProxy’s stats had to say after some time running the sites split:
wordpress

| Server | Queue Cur | Queue Max | Queue Limit | Session rate Cur | Session rate Max | Session rate Limit | Sessions Cur | Sessions Max | Sessions Limit | Sessions Total | Sessions LbTot | Bytes In | Bytes Out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| app01 | 0 | 0 | – | 1 | 367 | | 2 | 454 | – | 1758026 | 1758026 | 1336541933 | 30062777594 |

openx

| Server | Queue Cur | Queue Max | Queue Limit | Session rate Cur | Session rate Max | Session rate Limit | Sessions Cur | Sessions Max | Sessions Limit | Sessions Total | Sessions LbTot | Bytes In | Bytes Out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| app02 | 0 | 0 | – | 19 | 45 | | 3 | 75 | – | 5588327 | 5588293 | 4216687748 | 11878951168 |
Some of these figures I found unsurprising: WordPress serves a higher volume of data because it is content-heavy compared with banner delivery and related click handling. Conversely, the inbound data volume for OpenX is higher because its requests are loaded with click information.
What’s interesting is that the WordPress sites have a higher maximum concurrent session count (454 against 75), yet the total session count is far higher for the OpenX banners (about 5.6 million against 1.8 million). This illustrates the benefit of separating out different server loads: one server is churning away pushing out fat content, and even when heavily cached this burns enough resource that requests get queued and gum up, whilst the other is fielding quick-in, quick-out requests. When the banner server isn’t contending with its laggard sibling it can get on with its business unhindered.
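As a rough check on the "fat content versus quick-in quick-out" reading, the stats can be reduced to bytes sent per session. This is only a back-of-envelope sketch in Python: the figures are copied from the HAProxy stats above, the dictionary layout and function name are mine, and a session may carry more than one request, so treat the result as indicative.

```python
# Figures copied from the HAProxy stats above.
stats = {
    "wordpress (app01)": {
        "sessions_total": 1758026,
        "sessions_max": 454,
        "bytes_out": 30062777594,
    },
    "openx (app02)": {
        "sessions_total": 5588327,
        "sessions_max": 75,
        "bytes_out": 11878951168,
    },
}


def bytes_out_per_session(row):
    """Average number of bytes sent to clients per session."""
    return row["bytes_out"] / row["sessions_total"]


for name, row in stats.items():
    print(
        f"{name}: {bytes_out_per_session(row):,.0f} bytes out per session, "
        f"max {row['sessions_max']} concurrent sessions"
    )
```

That works out to roughly 17,100 bytes per WordPress session against roughly 2,100 per OpenX session: about an eighth of the payload per session, over three times as many sessions.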
Ultimately the visibility HAProxy affords beats an Apache scoreboard when that Apache is fielding two differently focused workloads.
“Integration points are the number-one killer of systems” – Release It!, Michael Nygard
Last week I had two different web systems fail in a similar way. One was a single box running two busy WordPress sites, the other was a largish multi-tier publishing cluster. Both dropped off air because they didn’t handle the failure of a remote system very well. In particular, both had the webserver waiting on an HTTP request to a foreign site to complete before returning a page to the client.
The architecture issues with this kind of request amplification are reasonably clear, and can sometimes be avoided. Data ingest from other systems can often execute asynchronously in another process. The cluster mentioned has a set of machines dedicated to just this, distributing the data they retrieve from remote systems via database and shared filesystem ready for the webservers to use when building pages.
However, where a webserver really does need to dial out to another system at request time, it’s worth being really paranoid about how that outbound request works, and distrustful of the reply. The particular problem here was lack of timeouts.
Here, both systems were running mpm_prefork Apache. Both were serving a request that made an HTTP call somewhere else before returning the page to the client ((One of the WordPress sites was doing this in every page’s footer. And not caching the result. Better still, it was contacting the other WordPress site on the same server. This almost guarantees the request won’t return when the shared webserver gets busy since there may be no spare server slots to accept the second request. Ouch.)). Last week the remote sites that both these systems contact had outages, leaving Apache waiting for the reply until the request timed out.
However, with no timeout configured that wait is effectively infinite. Long enough for these hanging Apache processes to consume all the available slots of the webserver, resulting in an interesting set of observations:
- New TCP connections to the server just hang.
- Server load is low, because the load measures runnable processes (and, on Linux, uninterruptible sleeping processes), and these hanging processes are just sleeping, waiting on poll(2) or select(2).
- No errors are logged by Apache or the application code because they haven’t failed yet. ((If you don’t run Apache hot, then you’d see a “I’ve reached MaxClients” message in the error log, but not much else.))
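To put rough numbers on how quickly the slots run out, here is a back-of-envelope sketch in Python. The function name and the example values (256 slots, 5 hanging requests per second) are hypothetical and not measurements from either system; the only assumption is that every request reaching the dead remote holds its slot indefinitely.

```python
def seconds_until_wedged(max_clients, hanging_requests_per_second, busy_slots=0):
    """Time until every worker slot is held by a request that never returns.

    Assumes each hanging request occupies one slot forever, which is what
    happens when the outbound call has no timeout.
    """
    if hanging_requests_per_second <= 0:
        raise ValueError("hanging_requests_per_second must be positive")
    free_slots = max(max_clients - busy_slots, 0)
    return free_slots / hanging_requests_per_second


# Hypothetical example: 256 slots, 5 requests per second hitting the dead remote.
print(seconds_until_wedged(256, 5))
```

With those made-up figures the server is full in under a minute, and nothing in the load average or the logs marks the moment it happens.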
Here’s some vmstat output during such a wedge:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0 140580  56528  58168 516184    0    0     0     0 1026   78  0  0 100  0  0
 0  0 140580  56528  58168 516184    0    0     0     0 1032   80  0  0 100  0  0
 0  0 140580  56528  58168 516184    0    0     0     0 1031   78  0  0 100  0  0
“It was quiet. Too quiet.” Hook up gdb and you can see what’s going on:
(gdb) bt
#0 0x00002aafffb7d14f in poll () from /lib64/libc.so.6
#1 0x00002ab0090b3930 in Curl_select () from /usr/lib64/libcurl.so.3
#2 0x00002ab0090ac4bc in Curl_perform () from /usr/lib64/libcurl.so.3
#3 0x00002ab00850591b in zif_curl_exec () from /etc/httpd/modules/libphp5.so
#4 0x00002ab0086663b2 in ?? () from /etc/httpd/modules/libphp5.so
#5 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#6 0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#7 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#8 0x00002ab008688750 in ?? () from /etc/httpd/modules/libphp5.so
#9 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#10 0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#11 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#12 0x00002ab00865c11e in ?? () from /etc/httpd/modules/libphp5.so
#13 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#14 0x00002ab0086395de in zend_execute_scripts () from /etc/httpd/modules/libphp5.so
#15 0x00002ab0085fe697 in php_execute_script () from /etc/httpd/modules/libphp5.so
#16 0x00002ab0086b6ad6 in ?? () from /etc/httpd/modules/libphp5.so
#17 0x00002aaffdbb7a4a in ap_run_handler ()
#18 0x00002aaffdbbaed8 in ap_invoke_handler ()
#19 0x00002aaffdbc578a in ap_internal_redirect ()
#20 0x00002ab0069ffbc0 in ap_make_dirstr_parent () from /etc/httpd/modules/mod_rewrite.so
#21 0x00002aaffdbb7a4a in ap_run_handler ()
#22 0x00002aaffdbbaed8 in ap_invoke_handler ()
#23 0x00002aaffdbc5938 in ap_process_request ()
#24 0x00002aaffdbc2b70 in ?? ()
#25 0x00002aaffdbbecd2 in ap_run_process_connection ()
#26 0x00002aaffdbc9789 in ?? ()
#27 0x00002aaffdbc9a1a in ?? ()
#28 0x00002aaffdbc9ad0 in ?? ()
#29 0x00002aaffdbca7bb in ap_mpm_run ()
#30 0x00002aaffdba4e48 in main ()
Another kink is that a graceful restart of Apache, via SIGUSR1, doesn’t work, since a graceful restart waits for in-flight requests to finish, and these ones aren’t finishing. The apachectl or service(8) scripts will exit but the hung processes remain. These long-running wedged processes are also visible in ps:
apache 580 0.0 0.3 30332 5492 ? S Jan02 0:00 /usr/sbin/httpd
and can be matched up with the hung connections via netstat -anp:
tcp 0 0 172.18.74.113:50129 192.0.32.10:80 ESTABLISHED 580/httpd
This is mostly a long-winded way of pointing out the importance of timeouts on code that executes during a request, particularly when doing something non-local. Fail fast!
To keep the WordPress sites running I ran through the code adding CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT settings to all the cURL calls I could find, and the development team took care of the code running on the cluster. Both have been stable since.
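The same fail-fast idea can be sketched outside PHP. The example below uses Python's standard library rather than the cURL bindings the sites actually use, so treat it as an illustration of the pattern, not the fix that was applied. The function names, the fallback text and the half-second timeout are all hypothetical. One difference worth knowing: urllib's timeout applies to each blocking socket operation, whereas CURLOPT_TIMEOUT caps the whole transfer.

```python
import socket
import threading
import time
import urllib.request


def fetch_with_timeout(url, timeout_s=2.0, fallback=""):
    """Fetch url, returning fallback if the remote is slow or unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return fallback


def start_silent_server():
    """Stand-in for a wedged remote: accepts connections, never replies."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    held = []  # keep accepted sockets open so the client sees no close either

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener.getsockname()[1]


if __name__ == "__main__":
    port = start_silent_server()
    started = time.monotonic()
    body = fetch_with_timeout(
        f"http://127.0.0.1:{port}/", timeout_s=0.5, fallback="(remote unavailable)"
    )
    print(f"{body!r} after {time.monotonic() - started:.1f}s")
```

Without the timeout argument the call would sit in the same poll() shown in the backtrace above; with it, the page can be built with the fallback and the worker slot is released.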
There are plenty of other pitfalls in these integration points between systems. Michael Nygard’s book, linked at top, makes a good survey of them alongside other stability antipatterns in complex systems. It’s a recommended read.
FreeBSD, gettext, libintl and bash
Monday, 10. 25. 2010 – Category: sw
Login surprise!
Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 8.0-RELEASE-p4 (GENERIC) #0: Mon Jul 12 20:22:27 UTC 2010
/libexec/ld-elf.so.1: Shared object "libintl.so.8" not found, required by "bash"
$
What’s happened here is that gettext (which provides libintl.so) has been upgraded without the ports that depend on it being rebuilt (read: I messed up some package maintenance). Those dependent ports are linked against a specific version of this library that’s now gone. One of them is bash, my preferred shell.
Having your shell asplode like that is ungood. If bash had been set as my default shell then I wouldn’t have been able to log in at all! This is a risk of relying on software that isn’t part of FreeBSD’s core to absolutely always work across system maintenance. bash is a third-party port, and it’s prudent to anticipate package management fail ((because all package management sucks at some point, yes?)).
To side-step this risk I tend to set my shell to something that is native to a FreeBSD release (plain ole’ /bin/sh) and add the following to my .profile, to test if bash is indeed working before running it:
/usr/local/bin/bash -c true && exec /usr/local/bin/bash
And to fix all the other ports that are now broken:
$ sudo portmaster -r -R devel/gettext