Today I got a Rollbar alert about a timeout from the database connection pool. At first I thought maybe we hit the connection limit for our MariaDB instance, but as it turned out, a few Sidekiq jobs had been stuck for a long time - 40 minutes, actually.
All jobs were related to the same logic - adding a tag to multiple products.
class MassiveTaggingWorker
  include Sidekiq::Worker

  def perform(tag_id, product_ids)
    tag = Tag.find(tag_id)
    # filtering starts
    selected_products = Product.where(id: product_ids)
    products_with_tag = selected_products.joins(:taggings).where('taggings.tag_id = ?', tag_id)
    products = Product.where(id: product_ids - products_with_tag.ids)
    # filtering ends
    tag.product_ids += products.ids
  end
end
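
For context, the call site isn't part of this post, but the jobs get enqueued the standard Sidekiq way, something like:

# hypothetical caller - the actual one isn't shown here
MassiveTaggingWorker.perform_async(tag.id, product_ids)
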
At first I thought it was probably the joins on a big dataset (more on the numbers in a moment). I ran a Rails console on production to check what the filtering part does with production data. It went smoothly, without even a small hiccup. Weird - so it wasn't the filtering of products that was slow.
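
For the record, the console check was nothing fancy - something along these lines, with Benchmark from Ruby's standard library (tag_id and product_ids being the job's arguments):

require "benchmark"

puts Benchmark.measure {
  selected_products = Product.where(id: product_ids)
  products_with_tag = selected_products.joins(:taggings).where("taggings.tag_id = ?", tag_id)
  Product.where(id: product_ids - products_with_tag.ids).ids # force the queries to run
}
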
As you may suspect, three models are involved in this issue: we have tens of millions of Products (some of them inactive), thousands of Tags, and hundreds of millions of Taggings.
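
The associations look more or less how you would expect - here is a sketch of the relevant parts (the real models obviously carry much more than this):

class Product < ApplicationRecord
  has_many :taggings
  has_many :tags, through: :taggings
end

class Tag < ApplicationRecord
  has_many :taggings
  has_many :products, through: :taggings
end

class Tagging < ApplicationRecord
  belongs_to :tag
  belongs_to :product
end
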
We have SQL logs enabled for the production Rails console, so I decided to read them more carefully. I saw a huge SQL query with loads of product ids - way too many compared to the product_ids passed to the job. After adding some puts-debug calls I tracked down the caller of this huge query. It was the last line: tag.product_ids += products.ids. It looks so small and trivial that you wouldn't even notice the complexity.
Think of it: you have around 100 000 products already associated with the tag via Tagging. So to get all of them and provide a value for +=, you have to fetch the products related to those 100 000 taggings. In SQL that would be something like:
SELECT * FROM products WHERE id IN (<LOAD OF PRODUCT IDS>)
Getting hundreds of thousands of products is really slow, especially when there's another process updating products (and taggings) at the same time. That was the bottleneck.
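
To make the cost visible, here is roughly what that one line expands to. This is a simplified sketch based on how I understand the Active Record ids writer, not the actual internals:

# simplified sketch of tag.product_ids += products.ids;
# not the real Active Record internals
current_ids = tag.product_ids # SELECT products.id via taggings (~100 000 rows)
new_ids = current_ids + products.ids
# the product_ids= writer replaces the whole collection, so it first loads
# every record: SELECT * FROM products WHERE id IN (<all ~100 000 ids>)
tag.products = Product.where(id: new_ids)
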
The solution is simple:
class MassiveTaggingWorker
  include Sidekiq::Worker

  def perform(tag_id, product_ids)
    tag = Tag.find(tag_id)
    # filtering starts
    selected_products = Product.where(id: product_ids)
    products_with_tag = selected_products.joins(:taggings).where('taggings.tag_id = ?', tag_id)
    products = Product.where(id: product_ids - products_with_tag.ids)
    # filtering ends
    products.each do |product|
      product.tags << tag
    end
  end
end
We replaced the super short tag.product_ids += products.ids with the slightly longer (but faster) products.each { |product| product.tags << tag }. That way each product handles its taggings on its own, avoiding the huge and costly database query.
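
The loop still issues one INSERT per product, which was fine for us. If even that became too slow, a bulk insert would be an option on Rails 6+ - a hypothetical variant, not what we shipped, and note that insert_all bypasses model callbacks and validations:

# hypothetical bulk variant; insert_all skips Tagging callbacks/validations
rows = products.ids.map { |product_id| { tag_id: tag.id, product_id: product_id } }
Tagging.insert_all(rows) if rows.any?
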
This way I re-learnt an old truth - when working on performance, always measure and follow the findings. Even the shortest line may hide a complexity explosion.