I work on an application at work that does all sorts of maintenance on large clusters. Without going into too much detail, it generally keeps an eye on our servers, fixes problems, and performs setup and maintenance. It's a rather important system, and it's all written in Python.
Now for the past few months we've been trying to debug an issue. At random times it would segfault. We had lots of people looking at the core dumps and it seemed to be some sort of memory corruption. We ruled out the machine, as it would crash on multiple machines.
So where did I start? The program is multithreaded, so it usually has lots of processes listed via 'ps'. On a whim (and most of my debugging was based on whims), I ran strace on every process. When I did this, I observed the Heisenberg principle: the program did not crash. Running strace on the application permanently wasn't really a solution, so I had to look into it further.
I'm fairly involved with the development of this system, so I understand a lot of the code. I did not understand why it would crash. I did know one thing: the application uses threads, and it also forks. But that didn't help me, because I had no idea there was a problem mixing the two.
I decided to run tcpdump to see if any strange network traffic might be causing it. After much analysis with ethereal, I noticed something strange. Crashes always seemed to coincide with a web request to the application. The app runs a web interface, and we have monitoring hitting this UI to make sure it is running properly. So it turned out the 'random' crashes were not so random: they were happening when our monitoring system was hitting the app.
But why was it crashing? With any multithreaded app, you need to have proper locking of shared variables. You generally have to worry about multiple threads modifying/accessing the same area of memory at the same time. So I spent weeks analyzing the code, trying different locks in different areas, etc. I made our webserving function simply output nothing, so it didn't access any shared data. To my surprise, we still crashed.
One problem was that I could not predict when it would crash. It could run up to 12 hours without crashing, and most of the monitoring hits of the UI worked fine. So I had to work on reproducing the bug more frequently. After a lot of work using a live debugger that a co-worker built into the app, I was able to set off a sequence of events that crashed it within 2-3 minutes. It generally involved slamming the UI with parallel web requests, and kicking off some internal methods that cause a fork() (for example, SSH'ing to a machine to verify connectivity). OK, that's better for testing. But it also told me something: the crash seemed to involve forking.
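For the "slamming the UI with parallel web requests" half of that recipe, something along these lines would do. This is only a sketch, not the tooling we actually used: the hammer() name, its parameters, and the Python 3 module names are my own assumptions, and it does not trigger the fork() side, which depended on methods internal to our app.

```python
import http.client
import threading


def hammer(host, port, workers=4, requests_per_worker=5, path="/"):
    """Send GET requests from several threads at once.

    Returns how many requests came back with HTTP 200.
    """
    ok = 0
    ok_lock = threading.Lock()

    def run():
        nonlocal ok
        for _ in range(requests_per_worker):
            try:
                conn = http.client.HTTPConnection(host, port, timeout=5)
                conn.request("GET", path)
                status = conn.getresponse().status
                conn.close()
            except (OSError, http.client.HTTPException):
                # A refused, reset, or garbled connection counts as a failure.
                continue
            if status == 200:
                with ok_lock:
                    ok += 1

    threads = [threading.Thread(target=run) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok
```

Running it in a loop against the UI port while something else in the app forks is the general shape of the reproduction.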
Our webserver in this app was using Python's generic BaseHTTPServer with a ThreadingMixIn to handle each request in a new thread. From parallel analysis of the core files by others, it did appear to be crashing in pthread create functions. On another whim, I changed ThreadingMixIn to ForkingMixIn, so the webserver would fork instead. Voila, no crash! So this told me it had something to do with forking and threading.
A co-worker pointed out this GNU C Library page warning about using threads and fork. The clincher was this passage:
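To make the swap concrete, here is roughly what the two variants look like. This is a sketch using the Python 3 module names (socketserver and http.server) rather than the Python 2 ones our app used, and the handler and class names are made up for illustration.

```python
import socketserver
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for our real UI."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


class ThreadedUIServer(socketserver.ThreadingMixIn, HTTPServer):
    """Each request is handled in a new thread of the same process."""
    daemon_threads = True


class ForkedUIServer(socketserver.ForkingMixIn, HTTPServer):
    """Each request is handled in a forked child process (Unix only)."""
```

The mixin has to come first in the base class list so that its process_request() overrides the one in the plain server. Either class is then used the same way, for example ThreadedUIServer(("127.0.0.1", 8000), StatusHandler).serve_forever(), where the address is just a placeholder.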
Because threads are not inherited across fork, issues arise. At the time of the call to fork, threads in the parent process other than the one calling fork may have been executing critical regions of code. As a result, the child process may get a copy of objects that are not in a well-defined state. This potential problem affects all components of the program.
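A small way to see that warning in action from Python: one thread holds a lock while another thread forks. The child gets a copy of the lock in its locked state, but not the thread that would release it. This is a sketch for Unix-like systems written against Python 3; the function and variable names are mine, not anything from our app.

```python
import os
import threading
import time

lock = threading.Lock()
holding = threading.Event()


def worker():
    # Simulates a thread that is inside a critical region when fork() happens.
    with lock:
        holding.set()
        time.sleep(1.0)


def child_sees_lock_held():
    """Fork while another thread holds `lock`; report what the child observes."""
    t = threading.Thread(target=worker)
    t.start()
    holding.wait()  # make sure the worker owns the lock before forking

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody will ever
        # release the copied lock. A blocking acquire() would hang forever.
        try:
            got_it = lock.acquire(blocking=False)
            os.write(write_fd, b"0" if got_it else b"1")
        finally:
            os._exit(0)

    # Parent: collect the child's answer and clean up.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    t.join()
    return answer == b"1"
```

In the parent the lock is released normally once the worker finishes. In the child it stays locked for good, which is the "not in a well-defined state" problem in miniature.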
So indeed we were doing something bad. The problem with changing our webserver to ForkingMixIn is that it removes any possibility of IPC, which we needed via our UI: a web request handled in a forked child can't change the state of the parent program unless we build some specific IPC mechanism. So instead, we created a parent thread which started one BaseHTTPServer without any mixin. This means requests are processed serially, but they can still update shared data. The only cost is low throughput, since only one request is handled at a time. For our app that sufficed, because the number of UI requests we get is low and they are mainly for monitoring.
This fixed our immediate issue, but I have since seen crashes still occurring, just after a much longer run time. The problem is that we are still using threads, not just in our webserving but in other parts of the app as well. The only solution, it seems, is to remove the threading altogether, because we must fork in order to do things like SSH.
What's the lesson here? Do not, I repeat, do not ever design an application that uses threads and also forks. You will have no end of trouble. If using threads, stick with threads only. If using fork, stick with forks only.
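In outline, that arrangement looks something like this. It is a sketch in Python 3 terms (http.server instead of Python 2's BaseHTTPServer), and the state dictionary, handler, and start_ui() helper are invented for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared data that the rest of the app can also read.
state = {"hits": 0}
state_lock = threading.Lock()


class UIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The handler runs inside the main process, so it can update
        # shared state directly; no IPC is needed.
        with state_lock:
            state["hits"] += 1
            body = ("hits=%d\n" % state["hits"]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_ui(port=0):
    """Run one plain HTTPServer (no mixin) in a single background thread."""
    server = HTTPServer(("127.0.0.1", port), UIHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread
```

With no mixin, the server neither spawns a thread nor forks per request: each request is handled to completion in that one background thread before the next is accepted. Passing port=0 lets the OS pick a free port, which is available afterwards as server.server_address[1].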