The 6 Characters That Could Bring Down Your Rails App

It all started with a ping from one of the team leads who noticed a recurring exception in Sentry. At the same time, DataDog showed the Postgres CPU utilization was spiking.

Based on the Sentry logs, the team lead thought this exception was caused by a specific deploy, but they couldn’t find anything in the most recent commits. Then they asked me to look into it. I was new to this project, so I had the fresh eyes benefit.

The first thing I did was look at the exception in Sentry, which pointed to the problematic line of code, which had been introduced about a year earlier. I immediately knew what the problem was when I saw the infamous 6 characters:

.count

In Rails, calling .count on an ActiveRecord::Relation issues a Postgres COUNT query, which is slow, and gets slooooower the bigger the table.

I confirmed this by noticing statement timeouts for those COUNT queries in New Relic. This was with a statement_timeout set to 30 seconds!

I also double checked by looking at the id of the last record in the table, and it was over 34 million!

The query included a .where and a .where.not, so it wasn't trying to count the entire table, but I presumed even the filtered list was so big that Postgres couldn't count it all in 30 seconds.

Now that I knew this was indeed the line of code causing exceptions, timeouts, and high DB CPU usage, I looked at it more closely. Can you spot the issue?

previous_documents = Document.where(series_id: document.series_id).
  order(:id).
  where.not(vbms_document_id: ids)

if previous_documents.count > 0
  document.copy_metadata_from_document(previous_documents.last)
end

The only reason the previous_documents were being counted was to see if there was at least one of them. We were not at all interested in the actual number of documents that matched the first query, just whether or not the result was empty.

There are more efficient (and more idiomatic) ways to check if a query returned any results. The problem is there are several of them, such as any?, exists? and present?, but they don't behave the same way. On top of that, the behavior depends on the Rails version! It's hard to remember which one to use in which situation, and there aren't any automated tools that I know of to guide us.

Luckily, Nate Berkopec wrote a detailed article explaining it all, and I knew to replace .count > 0 with .any? which uses a SELECT 1 AS one FROM ... LIMIT 1.

As soon as this change was deployed, the CPU utilization was reduced by about 5-10x! 🎉

DataDog screenshot showing CPU usage lowered by 10x

Here's my PR, where I also fixed another related instance of .count > 0.

Once this production issue was resolved, I found other instances of .count > 0 where we only needed to check if the results were not empty, and I fixed those as well.

One question that came up was why did the timeouts happen all of a sudden on that particular day? My best guess was that the documents table was slowly growing over time, and finally reached the size at which Postgres could no longer make the COUNT query within 30 seconds. Perhaps a larger-than-usual amount of documents were created on that day. Unfortunately, that table didn't have timestamps (it does now!), so we couldn't easily tell how many documents were created at any given time. 😥

Lessons Learned

Here are actions you can take today to stand out and improve your project:

P.S.

While the example in this story is based on a Rails application, the concept applies to any app (regardless of language or framework) that uses the Postgres COUNT query where a SELECT 1 AS one FROM ... LIMIT 1 would be more appropriate.