Debugging a slow web app

A watercolour illustration of a robot slowly wading through a swamp

I got an email today from one of my clients, letting me know that one of his web apps was down. He said he was getting an error and asked me to take a look.

In the end, I was able to fix it and get it running faster than ever. What caused it turned out to be a huge surprise to me. I thought I'd outline the steps I went through here, to try to help others trying to solve similar problems in the future.

See for yourself

The obvious first step is to go see for myself. I loaded up the web site, wondering what kind of error I would find. The site initially loaded fine, but then a spinner would appear, and seemed to get stuck. After a long while, the main content of the site failed to load, and an error appeared about failing to parse JSON.

I opened up dev tools and refreshed, keeping an eye on the network tab. Most things were loading fine, including static assets. I noticed that there were some fetches to the web site API that were taking a long time. They eventually failed with a 504 gateway timeout error.

The web site is behind a load balancer, and I know load balancers generally have a timeout limit around one minute. So all I knew is that these API calls were taking longer than that. I could only assume they would eventually succeed and were simply slow, but I wasn't totally sure.

Try to reproduce

Fortunately, I already had a dev environment for the site set up locally. I didn't write the whole application, but I had recently made some performance improvements to parts of the site. I wasn't very familiar with the API side of things though.

Sure enough, it started up fine, and the data all loaded correctly. So I figured it probably wasn't an issue with the code itself.

I started to wonder what could have happened to break the site all of a sudden, when it had worked fine in the past. Did the server auto-update some dependency that was breaking something? Was the server out of disk space? Was the database out of memory?

Getting close to the metal

My next step was to actually ssh into one of the servers to try and see what's going on there. Everything seemed ok. I ran free -m to check on the memory, but the RAM usage was fine. I ran df -h to check on disk usage, but none of the disks were full. Running top, the CPU usage looked fine as well. I was a bit stumped.

I turned to look at the database. This site is running on AWS, so I logged on to the RDS admin in the console and checked the graphs in the monitoring tab. Everything seemed fine there too. CPU wasn't too high, there were barely any connections, and the database wasn't out of space.

Still, I knew these API requests were hanging on something. I went back to the code and looked at the API endpoints in question, and all they did was make a database query. At this point I was pretty sure it was database-related somehow.

Going into the database

I decided to log in to the production database using the mysql command-line tool. The database is not accessible to the public, so only the production web server has access. I'd never gone into there before, so I looked at the config file for the server-side application to find the credentials (hostname, username, password and database name).

This is a MySQL database (MariaDB actually), so once I got in, the first thing I ran was SHOW PROCESSLIST to see if there was anything there. There were a ton of queries in there, many of them had been running for more than a minute. One had been sitting there for almost an hour!

Optimizing queries

Finally, I found the problem. All the slow queries were working with a single table. There was a mix of SELECT and UPDATE statements, so I figured the table was probably missing indices or something, something to make the queries run slowly.

I called SHOW CREATE TABLE xyz on the table, to see the structure of it. I was wrong, there were lots of keys on the table. Knowing that MySQL will only use one key per query, my first guess was that the problem was that maybe there were actually too many keys on the table, and the table would benefit by having fewer keys with multiple columns in it instead, targeted at these particularly queries.

I tested my theory by hand writing a simple query to see how slow it would be. It was slow, but it only took about one second, not a minute. So it wasn't that.

I wondered if I was missing something. Calling SHOW PROCESSLIST shows a summary of queries, but it cuts them off. A quick DuckDuckGo search later, and I found out you can call SHOW FULL PROCESSLIST to see the entire query.

It was then that I discovered what the problem was. The query was written exclusively using LIKE statements instead of =, eg.:

SELECT *
FROM xyz
WHERE thing_id LIKE '12345678'
AND status LIKE 'ok'

Even the update statements used LIKE:

UPDATE xyz
SET status = 'ok'
WHERE id LIKE '12345678'

I found it unconventional to say the least. But was that really causing the problem?

I changed my hand-written query to use LIKE instead of =, and sure enough, I had to Ctrl+C to abort it after a long time of waiting.

I realised that yes, of course, this would slow things down. MySQL must be scanning the entire table, converting the IDs from numbers to strings, and doing pattern matching on each one. No wonder it was running so slowly!

Closer to a solution

I searched the code base for "LIKE" and found the cause. Buried in a custom query builder a past developer had assembled, it would only use either LIKE or IN for all query parameters, including in UPDATE statements.

I'm not totally sure what the developer was thinking here. Were they making it so you could search on any field easily? I'm not sure, because I wasn't able to find an example where fields were actually searched on anywhere. We may never know.

The problem was in the code base

I was surprised the problem actually was in the code base itself. It made sense to me, this isn't the kind of problem that would have shown up locally, or even when the site first launched. It would have grown slowly as the table size grew, and apparently only became a major issue once the table had over 750k records in it.

The solution seemed straightforward to me. I modified the API endpoints that used this table, and rewrote the queries directly instead of using this query builder code. (Side note: I've never liked query builders, and this is an excellent example of why!)

I would have liked to modify the query builder to replace LIKE with =, but because I'm not sure if that functionality was needed elsewhere, I thought it best to leave it alone, and migrate away from the query builder instead.

Ship it!

Last step was to commit and push the code, and rolled out an updated version of the system. Shortly after the new version went live, I logged in to the database again and ran SHOW PROCESSLIST. Nothing but a bunch of idle connections! Perfect!

I went over to the AWS admin panel, and sure enough, the "Read IOPS/second" chart had dropped from a steady 100 down to 0! That was a nice reassurance that things were massively improved.

The site wasn't just working again, it was faster than ever!

Lessons learned

There are always lessons to be learned from any outage. Here are a few I learned today:

You should definitely not use LIKE in your database queries to match numeric, indexed IDs. I've never done this, but now the fact has been ingrained deep in my brain.
You probably shouldn't write your own query builders. Again, I've never liked query builders so this just reaffirmed my belief.
You should maybe think about testing your web apps with a very large amount of dummy data in the database sometimes. I've never done this, as it's slow and seems a bit excessive, but I think I may start trying this in the future, particularly on systems where I expect the tables to grow enormously over time.
SHOW FULL PROCESSLIST is a thing. Okay, that's not so much a lesson as it is something I learned that wasn't already in my tool belt.

All in all, the whole process took about an hour, and I'm glad I was able to get things back up and running quickly. Beyond the lessons learned, I got to know the system a little better, and have a lot of new ideas of ways I can improve and speed things up in the future, so it was definitely a worthwhile experience!

Published on February 23rd, 2023. © Jesse Skinner

Web apps that last

When you're building a new web application, or even a new feature, how can you ensure that you're not creating a nightmare code base that will need to be rewritten completely in a few years?

Some people will say it's hopeless to even try and write code that will last. I've even heard people suggest that you should aim to rewrite all your code every few years. That sounds like a very expensive, wasteful strategy.

In two decades of building web apps, I've seen many codebases start as a shiny new prototype and grow into a huge system. In some cases, they've become old, ugly, painful legacy systems that teams are begging to rewrite and replace. (And often those rewrites will themselves grow into ugly, painful legacy systems!) But sometimes a codebase will remain more or less unchanged a decade later, running smoothly as ever.

I believe there are some decisions you can make when writing code that will help it to last longer, and withstand the test of time.

Change is inevitable

Probably the one thing you can be sure of is that change will come. The goals of a business will change, and the people within a business will change. There will inevitably be features added, and existing features will evolve and be repurposed. The names of the products will almost surely change. So is it even possible to write code that doesn't need to change?

I think the key is in the phrase "If it ain't broke, don't fix it". Code that is fulfilling its task, that is doing what it's supposed to do, and is bug-free, is code that will last a long time.

Nightmare code

To understand how to write code that will last, let's think about the opposite: a nightmare codebase that demands to be rewritten. The worst I've seen is a web server written as a giant single file with thousands of lines of code. A system built like a house of cards, where changing one thing will break everything. Code that is very difficult to read or understand. Code that literally gives developers nightmares.

Unfortunately, this is often the kind of code that comes out of throwing together a quick prototype. A hero developer stays up late one night and churns out a first draft like a stream-of-consciousness. The next morning, the business owner is delighted to see their dreams come to life. Everyone's happy.

Then, they ask to change just one thing. Add this little feature. And this other feature. And now this user needs this other thing. And could you just change that other thing quick?

Months later, and this rough draft has accidentally become the foundation for a web application that continues to grow, held together with digital duct tape.

So how do you prevent this nightmare from unfolding?

Do one thing, and do it well

Modularity is extremely important in writing code that will last. And a good module is a piece of code that does one thing, and does it well. That one thing might be interfacing with a single database table. Or it could be handling HTTP calls on a single URL and passing the data to and from other modules that talk to the database.

I find generally that it works best when each module has zero, one or two major dependencies. With zero dependencies, you have a set of functions that receive input data, process it in some way, and return results. With one dependency, you have a set of functions that act as an abstraction or interface to that dependency. With two dependencies, you're writing code that bridges the gap between the two, acting as an adapter or controller.

More than two major dependencies, and you should ask yourself if there's any way to split things up into smaller pieces that are responsible for fewer things.

A dependency might not always be a program you have to install. Another module in your system is also a dependency. So is your business logic. I think about dependencies as anything that your module "knows about". This could even be the shape of certain data structures that might not have explicit type definitions.

The fewer things a module knows about, the more likely the module will be able to persist unchanged over time, because there will be fewer reasons to change it.

When your web application is built with small, independent modules that only do one thing, the chances are much, much lower that any of those pieces will need to be rewritten. And the chance of the whole application needing to be rewritten all at once drops to nearly zero. Even if you later want to do a major redesign, you'll find it easier to copy over lots of these older, simple modules to reuse in the new system.

Finally, a tangible example

Let's say you need to send out a Forgot Password email. You could do the whole thing in one file, but I would prefer to split it up like this:

A module that knows how to actually send an email using AWS SES or something, but doesn't know the recipient, subject or body of the email. function sendEmail(toAddress, subject, body), for example.
A module that knows about the subject and body of the Forgot Password email, but doesn't know who it's sending to or what the reset URL will be. function sendForgotPasswordEmail(toAddress, resetUrl)
A module for the user table in the database, that has a function to generate a reset code, but doesn't know how the reset code will be used or even whether an email will be sent out. function createResetCode(userEmail)
A module that knows about the URL structure of the site, and has a function that can generate a password reset link from a reset code. function getResetUrlFromCode(code)
A module that ties everything together. It takes an email address, calls createResetCode, uses that to call getResetUrlFromCode, passes that to getForgotPasswordEmail, and sends the recipient address and email body to sendForgotPasswordEmail. forgotPassword(email)
A user interface widget with a form, a text field and a button, so the user can type in their email address and click Send password reset link. When the form is submitted, it tells the user to go wait for the email.
A module that is responsible for the server-side password reset part of the system. It receives the form submission, pulls the email address from the form data, calls forgotPassword, and then sends a success status back to the browser.

Here, only a few modules are likely to change. You'll probably see changes to the sendForgotPasswordEmail function, as well as the user interface widget. All the other modules I've outlined are very reusable, and highly unlikely to change, unless you change your email sending provider, or your database software, or something else major. Even in those situations, the code that needs to change is very isolated and easy to replace without affecting anything else.

You can even improve on this further, by having the contents of the email be database-driven, so that non-technical staff members can change the email templates themselves through an admin interface. But an architecture like this is a good starting point that makes those sorts of changes simpler to make.

A good start

If you get in the habit of writing more modular code, and splitting things up as early as possible, then the next time you're throwing together a quick prototype, you'll be able to lean on those principals in the process.

Instead of a giant ball of tangled dependencies and logic, you'll be building smaller, simpler, reusable components that can be used as solid building blocks. Some of these will be so useful and generic that you'll even be able to reuse them in completely different systems without changing them at all.