Post details: The Art of Debugging Apache Segfaults

09/15/05

Permalink 11:20:11 pm, Categories: Linux, 1404 words   English (US)

The Art of Debugging Apache Segfaults

This week I've been doing some consulting. A heavily loaded website (about 300 req/second) was having some issues such as Apache processes going haywire and eating up all CPU, and Apache processes segfaulting continously. However, the site was functioning, and only when there were alot of apaches eating up CPU did the system go to a crawl. The segfaults were of the apache child processes, not the main one.

The systems (load balanced) were hosted at Rackspace, and they pretty much set everything up for the customer, with things such as PHP, Apache, MySQL, etc. So the first thing I looked at are the haywire httpd processing eating up CPU. What could I do? Well an strace on it to start:

strace -p PID

But it didn't show anything, so it's hung somewhere before any other system call is made. Next thing, attach a gdb to that bitch:

gdb /usr/sbin/httpd PID

Then I did a 'bt' to show a backtrace of function calls and end up with:

0x00265383 in malloc_consolidate () from /lib/tls/libc.so.6
(gdb) bt
#0 0x00265383 in malloc_consolidate () from /lib/tls/libc.so.6
#1 0x00264b29 in _int_malloc () from /lib/tls/libc.so.6
#2 0x00263ecd in malloc () from /lib/tls/libc.so.6
#3 0x0040f500 in OpenSSLDie () from /lib/libcrypto.so.4
#4 0x0040fb5c in CRYPTO_malloc () from /lib/libcrypto.so.4
#5 0x006bd86b in ssl3_new () from /lib/libssl.so.4
#6 0x006c41e2 in tls1_new () from /lib/libssl.so.4
#7 0x006c5c66 in SSL_new () from /lib/libssl.so.4
#8 0x00639545 in ssl_engine_disable () from /etc/httpd/modules/mod_ssl.so
#9 0x080722bc in ap_run_pre_connection ()
#10 0x08072476 in ap_process_connection ()
#11 0x08066ac1 in ap_graceful_stop_signalled ()
#12 0x08066c14 in ap_graceful_stop_signalled ()
#13 0x08066eb9 in ap_graceful_stop_signalled ()
#14 0x08067550 in ap_mpm_run ()
#15 0x0806da2f in main ()

Hmm, ok so it's in some malloc call having to do with SSL. Well this site is not even using SSL, so the next thing I did was disable SSL. Not surprisingly, this reduced the frequency of the process going haywire, but did not rid of them entirely. Something else is going on. From the malloc calls, it appears having to do with maybe a memory leak or double-free happening somewhere. The next time it happened, it was hung in some pthread call, but not really related to any apache module. The problem is somewhat similar to this bug.

At this point I really had nowhere to go on this CPU issue. Apache was also segfaulting, and I guessed that the segfault problem is related to the CPU problem, so I decided to look into that. Now it gets really interesting. Trying to debug an Apache server getting 200-300 req/sec is not very fun. First of all you can have virtually no downtime before everyone starts realizing it, so whatever I needed to do, I had to do it with the least impact.

Apache reported to the logfiles something like:

[Thu Sep 15 04:14:26 2005] [notice] child pid 24195 exit signal Segmentation fault (11)

I also saw lines such as:

Allowed memory size of 16777216 bytes exhausted (tried to allocate 10 bytes)
free(): invalid pointer 0x9910020!

But this is not very helpful to me. What was PID 24195 doing? Which file was it serving? I first added PID logging to the logfiles with the '%P' custom log statement, similar to:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" PID: %P" debuglog

Now I could correlate the segfaulted PIDs with the requests. But get this, it told me nothing! All of the crashes seemed to be random, some on .php files, some on even .jpg files! WTF? This told me more and more there is a memory leak somewhere, and the symptom is not showing up until much later. Next step, I want Apache to dump cores. This was done by adding a:

CoreDumpDirectory /tmp/cores

Great, now I have all these core files. I look at a backtrace:

gdb /usr/sbin/httpd core.12345
bt
#0 0x06ca51b0 in _zval_ptr_dtor (zval_ptr=0x53de8f3)
at /usr/src/debug/php-5.0.5/Zend/zend_execute_API.c:395
395 zv->refcount--;
(gdb) bt
#0 0x06ca51b0 in _zval_ptr_dtor (zval_ptr=0x53de8f3)
at /usr/src/debug/php-5.0.5/Zend/zend_execute_API.c:395
#1 0x06cb71f8 in zend_hash_destroy (ht=0x8821c3c)
at /usr/src/debug/php-5.0.5/Zend/zend_hash.c:519
#2 0x06cae8de in _zval_dtor (zvalue=0x881f47c)
at /usr/src/debug/php-5.0.5/Zend/zend_variables.c:52
#3 0x06ca51d8 in _zval_ptr_dtor (zval_ptr=0x53de8f3)
at /usr/src/debug/php-5.0.5/Zend/zend_execute_API.c:397
...

I see pretty much random location crashes, but they are all within PHP shutdown sequences. This is still not telling me much. The next thing I wanted to do was an strace on a working apache process and wait for it to crash. Luckily, the crashes were so frequent, I could strace on literally any apache PID and after a few minutes would be guaranteed a segfault.

Now what I saw was interesting. Almost always before a segfault, I would see some Mysql traffic that involved trying to access a nonexistent table. It turns out some the website's PHP code had some legacy stuff doing this. So I thought, ok a missing table is not supposed to segfault apache, but lets get this code fixed either way. After a note to the developers, it was fixed. And guess what? There was a significant decrease in the number of segfaults per minute. Before it was about 1 every second, now it was 1 every minute. Clearly something was related to mysql, but at this point in my debugging process I just did not realize it.

All of this has been compiled and setup by Rackspace. During all of this, I'm opening tickets on their system and having them look into the problem. They are doing their own debugging, and even late one evening ran apache within a gdb and watched it. They pretty much thought it was bug in PHP, and asked us to report it to the PHP developers. Rackspace did do alot of debugging and I was impressed with their work, but I felt this was just going in the wrong direction. We spent many days trying different debugging techniques, PHP versions, and what not. But still no luck. At one point we found one .php script that ALWAYS segfaulted, and it had to do with XML stuff. We installed stable XML/XSLT libraries and it went away for that script, but still segfaulted elsewhere :(.

Rackspace showed me some straces of their crashes. They were all meaningless except for one thing, I noticed that shortly before each crash, there was a mysql insert statement executed. The fact that a mysql 'table not found' had increased the segfaults got me thinking, could this be something related to mysql or libraries?

I did some searching and came across this bug, which explained that PHP might crash if it was compiled with one mysql library API version, but there is a different version on the server. The next thing I look at is phpinfo(). I see it was compiled with mysql librares 3.23.58, however the mysqld running on the system was 4.1.11. I had a eureka moment. Could a change in data structures between the versions in the API cause some issues? I thought that could likely be the case.

It just so turns out, that when Rackspace compiled PHP, they did so on a 'build host', which had mysql 3.23.58. Then they took this PHP and installed it on our server. I asked them to recompile it with the same mysql version that we have. Well they did so, and it has been about 5 hours and not one segfault. I think we got it. I'll have to watch it longer, but sometimes you just get a feeling that you just know you figured it out.

This all took place over a week. When I should've been studying, instead I was in the FAU library debugging apache crashes :). It's all good though, because I am still learning something in the process. It is fun to do these kinds of things. Now I just need to make my invoice :).

Lesson learned? Build systems are not very good if they differ from the target system.

Update: This did in fact fix the problem.

Comments, Pingbacks:

Comment from: Doug Alcorn [Visitor] · http://blog.lathi.net
Nice work! On the one hand it's a shame you won't be doing more of this type of work. On the other hand, it's good you'll be using that same perseverance and analytical abilities in debugging humans!
Permalink 09/16/05 @ 09:22
Comment from: Bill Boebel [Visitor] · http://www.webmail.us
You're the man! I have a cluster of servers with this exact symptom, though the crashes are less frequent... and it did have an older version of MySQL at one point, which the current PHP was compiled against.

Please post a follow up and let us know if this fixed the issue on your system permanently.

Again... you da man!
Permalink 09/16/05 @ 23:18
Comment from: nitroburn [Visitor] · http://torrentportal.com
Thank you SOOO Much for this article. I had searched exactly for this topic and didn't find mention of the mismatch in mysql libraries. (poor searching?) When I was trying the same idea as you for tracking the pids from the access_log to the error_log, and finding the same thing.. i decided to search again and thus this time searching for the LogFormat and pid information, I found your site.

I should mention my situation was a little different. I was using a remote mysql server and it did the same segfaults. Local server had the mysql4 libraries, remote mysql server was mysql5. I guess almost the same.

Now I can sleep better at night and not wake up hoping, "Is it still up?"

The segfaults had the effect of crashing the httpd after a while.
Permalink 06/11/06 @ 18:38

Leave a comment:


Warning: touch() [function.touch]: Unable to create file /hsphere/local/home/valankar/viraj.org/_tmp/hn_captcha_counter.txt because No such file or directory in /hsphere/local/home/valankar/viraj.org/b2evolution/blogs/skins/hn_captcha.class.x1.php on line 209

Warning: touch() [function.touch]: Unable to create file /hsphere/local/home/valankar/viraj.org/_tmp/hn_captcha_counter.txt because No such file or directory in /hsphere/local/home/valankar/viraj.org/b2evolution/blogs/skins/hn_captcha.class.x1.php on line 209


An ERROR has occured!
Here you might send email-notification to webmaster or something like that.

Unable to safe captcha-image.