Debugging a slow web app
I got an email today from one of my clients, letting me know that one of his web apps was down. He said he was getting an error and asked me to take a look.
In the end, I was able to fix it and get it running faster than ever. What caused it turned out to be a huge surprise to me. I thought I'd outline the steps I went through here, to try to help others trying to solve similar problems in the future.
See for yourself
The obvious first step is to go see for myself. I loaded up the web site, wondering what kind of error I would find. The site initially loaded fine, but then a spinner would appear, and seemed to get stuck. After a long while, the main content of the site failed to load, and an error appeared about failing to parse JSON.
I opened up dev tools and refreshed, keeping an eye on the network tab. Most things were loading fine, including static assets. I noticed that there were some fetches to the web site API that were taking a long time. They eventually failed with a 504 gateway timeout error.
The web site is behind a load balancer, and I know load balancers generally have a timeout limit around one minute. So all I knew is that these API calls were taking longer than that. I could only assume they would eventually succeed and were simply slow, but I wasn't totally sure.
Try to reproduce
Fortunately, I already had a dev environment for the site set up locally. I didn't write the whole application, but I had recently made some performance improvements to parts of the site. I wasn't very familiar with the API side of things though.
Sure enough, it started up fine, and the data all loaded correctly. So I figured it probably wasn't an issue with the code itself.
I started to wonder what could have happened to break the site all of a sudden, when it had worked fine in the past. Did the server auto-update some dependency that was breaking something? Was the server out of disk space? Was the database out of memory?
Getting close to the metal
My next step was to actually
ssh into one of the servers to try and see what's going on there. Everything seemed ok. I ran
free -m to check on the memory, but the RAM usage was fine. I ran
df -h to check on disk usage, but none of the disks were full. Running
top, the CPU usage looked fine as well. I was a bit stumped.
I turned to look at the database. This site is running on AWS, so I logged on to the RDS admin in the console and checked the graphs in the monitoring tab. Everything seemed fine there too. CPU wasn't too high, there were barely any connections, and the database wasn't out of space.
Still, I knew these API requests were hanging on something. I went back to the code and looked at the API endpoints in question, and all they did was make a database query. At this point I was pretty sure it was database-related somehow.
Going into the database
I decided to log in to the production database using the
mysql command-line tool. The database is not accessible to the public, so only the production web server has access. I'd never gone into there before, so I looked at the config file for the server-side application to find the credentials (hostname, username, password and database name).
This is a MySQL database (MariaDB actually), so once I got in, the first thing I ran was
SHOW PROCESSLIST to see if there was anything there. There were a ton of queries in there, many of them had been running for more than a minute. One had been sitting there for almost an hour!
Finally, I found the problem. All the slow queries were working with a single table. There was a mix of
UPDATE statements, so I figured the table was probably missing indices or something, something to make the queries run slowly.
SHOW CREATE TABLE xyz on the table, to see the structure of it. I was wrong, there were lots of keys on the table. Knowing that MySQL will only use one key per query, my first guess was that the problem was that maybe there were actually too many keys on the table, and the table would benefit by having fewer keys with multiple columns in it instead, targeted at these particularly queries.
I tested my theory by hand writing a simple query to see how slow it would be. It was slow, but it only took about one second, not a minute. So it wasn't that.
I wondered if I was missing something. Calling
SHOW PROCESSLIST shows a summary of queries, but it cuts them off. A quick DuckDuckGo search later, and I found out you can call
SHOW FULL PROCESSLIST to see the entire query.
It was then that I discovered what the problem was. The query was written exclusively using
LIKE statements instead of
SELECT * FROM xyz WHERE thing_id LIKE '12345678' AND status LIKE 'ok'
Even the update statements used
UPDATE xyz SET status = 'ok' WHERE id LIKE '12345678'
I found it unconventional to say the least. But was that really causing the problem?
I changed my hand-written query to use
LIKE instead of
=, and sure enough, I had to Ctrl+C to abort it after a long time of waiting.
I realised that yes, of course, this would slow things down. MySQL must be scanning the entire table, converting the IDs from numbers to strings, and doing pattern matching on each one. No wonder it was running so slowly!
Closer to a solution
I searched the code base for "LIKE" and found the cause. Buried in a custom query builder a past developer had assembled, it would only use either
IN for all query parameters, including in
I'm not totally sure what the developer was thinking here. Were they making it so you could search on any field easily? I'm not sure, because I wasn't able to find an example where fields were actually searched on anywhere. We may never know.
The problem was in the code base
I was surprised the problem actually was in the code base itself. It made sense to me, this isn't the kind of problem that would have shown up locally, or even when the site first launched. It would have grown slowly as the table size grew, and apparently only became a major issue once the table had over 750k records in it.
The solution seemed straightforward to me. I modified the API endpoints that used this table, and rewrote the queries directly instead of using this query builder code. (Side note: I've never liked query builders, and this is an excellent example of why!)
I would have liked to modify the query builder to replace
=, but because I'm not sure if that functionality was needed elsewhere, I thought it best to leave it alone, and migrate away from the query builder instead.
Last step was to commit and push the code, and rolled out an updated version of the system. Shortly after the new version went live, I logged in to the database again and ran
SHOW PROCESSLIST. Nothing but a bunch of idle connections! Perfect!
I went over to the AWS admin panel, and sure enough, the "Read IOPS/second" chart had dropped from a steady 100 down to 0! That was a nice reassurance that things were massively improved.
The site wasn't just working again, it was faster than ever!
There are always lessons to be learned from any outage. Here are a few I learned today:
You should definitely not use
LIKEin your database queries to match numeric, indexed IDs. I've never done this, but now the fact has been ingrained deep in my brain.
You probably shouldn't write your own query builders. Again, I've never liked query builders so this just reaffirmed my belief.
You should maybe think about testing your web apps with a very large amount of dummy data in the database sometimes. I've never done this, as it's slow and seems a bit excessive, but I think I may start trying this in the future, particularly on systems where I expect the tables to grow enormously over time.
SHOW FULL PROCESSLISTis a thing. Okay, that's not so much a lesson as it is something I learned that wasn't already in my tool belt.
All in all, the whole process took about an hour, and I'm glad I was able to get things back up and running quickly. Beyond the lessons learned, I got to know the system a little better, and have a lot of new ideas of ways I can improve and speed things up in the future, so it was definitely a worthwhile experience!