Harvey Deadlock Issues - Making Docker, uWSGI, and Nginx Play Nice
Exploring the Problem
It’s pretty standard to write up a postmortem after stuff breaks in production. It helps to identify and analyze what went wrong, how it was fixed, and how it can be avoided in the future. Join me on my 1.5-year journey of fixing a nasty set of problems that nearly stopped my homegrown deployment runner project dead in its tracks.
Harvey is a simple Docker Compose deployment system. It gets a webhook from GitHub when code is pushed to the main branch, pulls in the changes, and deploys the project to Docker via Compose. For the longest time I was using Flask’s built-in server to serve the project in production; however, that server is not production ready and using it in production is highly discouraged. Instead, you’ll need a WSGI server to run as the frontend for the underlying Flask app.
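To make the WSGI piece concrete, here is a minimal sketch of the kind of callable a WSGI server (Gunicorn, uWSGI) invokes for each request. It is not Harvey’s actual code: the names `application`, `should_deploy`, and `call_app` are hypothetical, it uses only the standard library rather than Flask, and the only detail taken from GitHub is that push payloads carry a `ref` such as `refs/heads/main`.

```python
import json
from io import BytesIO
from wsgiref.util import setup_testing_defaults


def should_deploy(payload):
    """Hypothetical rule: only pushes to the main branch trigger a deployment."""
    return payload.get("ref") == "refs/heads/main"


def application(environ, start_response):
    """Minimal WSGI callable; the WSGI server calls this once per request."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        payload = None

    if not isinstance(payload, dict):
        status, result = "400 Bad Request", {"error": "invalid JSON payload"}
    elif should_deploy(payload):
        # A real runner would queue the pull + Compose deploy here.
        status, result = "202 Accepted", {"deploy": True}
    else:
        status, result = "200 OK", {"deploy": False}

    body = json.dumps(result).encode("utf-8")
    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def call_app(raw_body):
    """Invoke the app in-process, the way a WSGI server would, for quick checks."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw_body)),
        "wsgi.input": BytesIO(raw_body),
    })
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = application(environ, start_response)
    return captured["status"], b"".join(chunks)
```

Because the server only ever sees this callable, swapping Gunicorn for uWSGI changes how the process is run, not the app itself.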
I initially chose Gunicorn. It had a simple config out of the box and seemed to work for a bit until I started having deployments that would get stuck or time out. This was super odd to me since it only seemed to happen after the app had been running for some time. In January of 2022, I began investigating what the cause might be. I first blindly assumed that maybe the SQLite database was to blame and optimized that by combining databases into a single database with multiple tables, ensuring the package I chose was performant in a multi-threaded environment, etc. I still had problems after this optimization.
I later realized that the problems would crop up after the app had been running for a week or so. As a temporary fix, I would simply restart the app and it would start working again. In March of 2022 I wondered if it might be due to a worker vs thread issue and so I switched from Gunicorn to uWSGI; however, the problem persisted.
To this point, I had been running the app in the terminal. With my deployments and health checks (every 30 seconds) printing to console, I began to wonder if the tens of thousands of output lines in the console were crashing the app. I decided in June of 2022 to switch to running the service as a daemon so it could run unimpeded in the background - but the problem remained.
Somewhere between August and November of 2022 I realized that maybe there was a memory leak; after all, the service would fall over after about a week of being on and needed to be restarted before it would work again. I decided to have uWSGI cannibalize its own workers when they had been running for too long or had reached a certain timeout or buffer size. I played with every knob (and there are MANY) in uWSGI to attempt to rein in this runaway app. I load tested the app with thousands of concurrent requests to no avail: everything kept getting stuck after a few requests. A traditional service should be able to handle thousands of requests in a short amount of time (a few seconds). Mine were taking minutes to complete, and many of the requests were timing out without ever finishing.
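For anyone wanting to reproduce this kind of load test, here is a rough sketch using only the Python standard library. It is not the tool I used: the `/health` path, the `HealthHandler` stand-in server, and the helper names are all assumptions for illustration. In practice you would point `load_test` at your own service’s URL instead of the throwaway local server.

```python
import http.client
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Opener that ignores any proxy settings so requests go straight to the target.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in service that answers every GET with a tiny 200 response."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet during the test


def fetch(url, timeout):
    """Return True if the request finished with HTTP 200 before the timeout."""
    try:
        with OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False  # timeouts, refused connections, and bad responses


def load_test(url, total, concurrency, timeout=5.0):
    """Fire `total` requests with `concurrency` workers and summarize results."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: fetch(url, timeout), range(total)))
    ok = sum(results)
    return {"ok": ok, "failed": total - ok,
            "seconds": time.perf_counter() - started}


def demo(total=100, concurrency=10):
    """Run the load test against a local throwaway server on a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/health"
        return load_test(url, total, concurrency)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The number to watch is `failed` alongside `seconds`: a healthy service should show zero failures and finish in seconds, which is exactly what mine was not doing.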
Solutions
At this point I was about ready to give up on troubleshooting altogether. I had probably put somewhere in the neighborhood of 40-60 hours of research and troubleshooting in over the last 9 months and this problem had eluded me - every link on Google was purple, nothing could be done. Then I noticed that a new Docker update had become available within the last few days. As a Hail Mary attempt, I pushed the Docker update in production and would you believe it, everything started working! Apparently there had been a months-old bug in Docker that was timing out certain requests, which was the root cause of my troubles. I couldn’t believe that after this update, I could start sending through thousands of requests and they were all responding correctly within seconds, not minutes. I was skeptical so I sent through tens of thousands, over and over, to ensure I hadn’t lost my mind. Such a simple fix, however, was only temporary.
A couple of days later, my requests started timing out again. I dove back into the config, looking for something I may have missed, looking to see if any of the knobs I had changed in months past were old and forgotten. Sure enough, I had my uWSGI and nginx configs talking to each other with different protocols. I switched to using nginx’s uwsgi_pass directive and a socket on both sides instead of http so nginx and uWSGI could talk the same protocol. This fix seemed to clear up even more requests, allowing the app to now run for a month or two without issues…
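Why does the protocol matter so much? nginx’s proxy_pass sends plain-text HTTP, while uwsgi_pass sends the binary uwsgi protocol; on the other side, uWSGI’s http and http-socket options expect HTTP while its socket option expects uwsgi packets. The sketch below is only an illustration of how different the two wire formats are (the helper names are made up, and it encodes just the basic packet layout: a 4-byte header of modifier1, a little-endian 16-bit body size, and modifier2, followed by size-prefixed key/value pairs).

```python
import struct

HTTP_METHODS = {b"GET", b"POST", b"PUT", b"PATCH", b"DELETE", b"HEAD", b"OPTIONS"}


def encode_uwsgi_request(variables):
    """Encode WSGI-style variables as a uwsgi packet (modifier1=0, modifier2=0)."""
    body = b""
    for key, value in variables.items():
        k, v = key.encode("utf-8"), value.encode("utf-8")
        # Each key and value is prefixed with its length as a 16-bit LE integer.
        body += struct.pack("<H", len(k)) + k + struct.pack("<H", len(v)) + v
    header = struct.pack("<BHB", 0, len(body), 0)
    return header + body


def encode_http_request(method, path, host):
    """Encode the start of a plain HTTP/1.1 request, as an HTTP proxy would send it."""
    return f"{method} {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def looks_like_http(data):
    """Crude check: does the payload start with a known HTTP method token?"""
    return data.split(b" ", 1)[0] in HTTP_METHODS


if __name__ == "__main__":
    uwsgi_packet = encode_uwsgi_request({"REQUEST_METHOD": "GET", "PATH_INFO": "/"})
    http_request = encode_http_request("GET", "/", "localhost")
    print(uwsgi_packet[:4], looks_like_http(uwsgi_packet))
    print(http_request[:16], looks_like_http(http_request))
```

A listener expecting one format has no sensible way to parse the other, so both ends need to agree, whichever protocol you pick.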
But then a new problem showed up. Now, requests weren’t timing out or being outright rejected; they were instead just disappearing. I’d have a deployment start, the logs would show everything was running, and then suddenly the log entries stopped, the request disappeared, and the deployment never finished. I looked at the underlying Python code to see if I was exiting prematurely or if there was an error I hadn’t accounted for and made some additional optimizations, but nothing did the trick.
Sometime between December 2022 and March of 2023 I found the underlying problem. Buried deep in the uWSGI logs, I found entries for segfaults. Long hours at night spent scouring the web turned up little hope for my third and final problem to get this system working. At long last, on a no-name forum from a decade ago, I found a solution specific to my operating system. macOS apparently doesn’t play nicely with uWSGI since it uses its built-in proxy for requests. I don’t have a need for a proxy - simply disabling it cleared up the remainder of my deployments getting stuck.
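As a hedged illustration (this is an assumption about the general technique, not a copy of the change in my commit): when no proxy environment variables are set, Python’s urllib.request on macOS falls back to asking the operating system for proxy settings, and that lookup has been widely reported to crash in forked worker processes. Setting `no_proxy` to `*` makes the standard library skip the system lookup entirely. The helper name `disable_proxy_lookup` is made up for this sketch.

```python
import os
import urllib.request


def disable_proxy_lookup(environ=os.environ):
    """Tell the stdlib proxy logic to bypass proxies for every host.

    With any *_proxy variable present, urllib.request reads proxy settings
    from the environment instead of querying the operating system.
    """
    environ["no_proxy"] = "*"


if __name__ == "__main__":
    disable_proxy_lookup()
    # The environment now answers the proxy question on its own.
    print(urllib.request.getproxies().get("no"))
    print(bool(urllib.request.proxy_bypass("example.com")))
```

The same variable can also be set outside of Python, in whatever environment launches the service, as long as it is in place before the workers start making requests.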
For a year and a half, I battled these problems, read every possible resource, turned every possible knob, hit every possible square inch of my head against the wall, and finally found not one but three critical fixes to get my service running properly. For a year and a half I thought it was due to the code I had written when really, these issues came down to system-level configuration, updates, and mistakes. Throughout this process, I’ve learned a bunch about systems architecture, running a service at scale, and hosting. Harvey now runs better than ever and has been running for months now, uninterrupted, without issue. If you are looking for a simple Docker deployment system, I’d suggest giving it a try. My months and months of research will thank you.
—
Timeline
Jan 2022 Thought it was due to the database being SQLite and non-performant: https://github.com/Justintime50/harvey/issues/63#issuecomment-1120144898
March 2022 Then thought it was Gunicorn’s multiple workers, eventually switched to uWSGI and had the same problems: https://github.com/Justintime50/harvey/issues/64
June 2022 Thought that the verbose logging in a terminal after a few days was to blame, ran Harvey as a daemon: https://github.com/Justintime50/harvey/issues/66
Aug - Nov 2022 PROBLEM 1: Had uWSGI start cannibalizing workers when they ran for too long/too many requests/too much data; tuned timeouts, number of workers, and buffers. Started load-testing the app with thousands of concurrent requests; eventually patched by Docker pushing an update: https://github.com/Justintime50/harvey/issues/67 SOLUTION: Update Docker
PROBLEM 2: uWSGI was still timing out and wouldn’t respond to requests: https://github.com/Justintime50/harvey/commit/b895f02cb68bade456601252bb6f246337edbec1 SOLUTION: Use uwsgi_pass and socket instead of http so nginx and uWSGI talked the same protocol
Dec 2022 - Mar 2023 PROBLEM 3: Deployments getting stuck without timing out; couldn’t find the problem until I finally found, buried deep in the logs, that it was segfaulting due to a macOS-only proxy issue: https://github.com/Justintime50/harvey/issues/72 SOLUTION: https://github.com/Justintime50/harvey/commit/895fa1e618e29afc489ec3fee919ccc81db49c99