Today I got a Rollbar alert about a timeout from the database connection pool. At first I thought maybe we hit the connection limit for our MariaDB instance, but as it turned out, a few Sidekiq jobs had been stuck for a long time - 40 minutes, actually.
All jobs were related to the same logic - adding a tag to multiple products.
class MassiveTaggingWorker
  include Sidekiq::Worker

  def perform(tag_id, product_ids)
    tag = Tag.find(tag_id)
    # filtering starts
    selected_products = Product.where(id: product_ids)
    products_with_tag = selected_products.joins(:taggings).where('taggings.tag_id = ?', tag_id)
    products = Product.where(id: product_ids - products_with_tag.ids)
    # filtering ends
    tag.product_ids += products.ids
  end
end
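
For context, the call site isn't part of this post, but the jobs get enqueued the standard Sidekiq way, something like:

# hypothetical caller - the actual one isn't shown here
MassiveTaggingWorker.perform_async(tag.id, product_ids)
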
At first I thought it was probably the joins on a big dataset (more on the numbers in a moment). I ran a Rails console on production to check what the filtering part does with production data. It went smoothly, without even a small hiccup. Weird - so it wasn't the filtering of products that was slow.
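
For the record, the console check was nothing fancy - something along these lines, with Benchmark from Ruby's standard library (tag_id and product_ids being the job's arguments):

require "benchmark"

puts Benchmark.measure {
  selected_products = Product.where(id: product_ids)
  products_with_tag = selected_products.joins(:taggings).where("taggings.tag_id = ?", tag_id)
  Product.where(id: product_ids - products_with_tag.ids).ids # force the queries to run
}
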
As you may suspect, three models are involved in this issue: we have tens of millions of Products (some of them inactive), thousands of Tags, and hundreds of millions of Taggings.
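
The associations look more or less how you would expect - here is a sketch of the relevant parts (the real models obviously carry much more than this):

class Product < ApplicationRecord
  has_many :taggings
  has_many :tags, through: :taggings
end

class Tag < ApplicationRecord
  has_many :taggings
  has_many :products, through: :taggings
end

class Tagging < ApplicationRecord
  belongs_to :tag
  belongs_to :product
end
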
We have SQL logs enabled for the production Rails console, so I decided to read them more carefully. I saw a huge SQL query with loads of product ids - way too many compared to the product_ids passed to the job. After adding some puts-debug calls I tracked down the caller of this huge query. It was the last line: tag.product_ids += products.ids. It looks so small and trivial that you wouldn't even notice the complexity.
Think of it: you have around 100 000 products already associated with the tag via Tagging. So to get all of them and provide a value for +=, you have to fetch the products related to those 100 000 taggings. In SQL that would be something like:
SELECT * FROM products WHERE id IN (<LOAD OF PRODUCT IDS>)
Getting hundreds of thousands of products is really slow, especially when there's another process updating products (and taggings) at the same time. That was the bottleneck.
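
To make the cost visible, here is roughly what that one line expands to. This is a simplified sketch based on how I understand the Active Record ids writer, not the actual internals:

# simplified sketch of tag.product_ids += products.ids;
# not the real Active Record internals
current_ids = tag.product_ids # SELECT products.id via taggings (~100 000 rows)
new_ids = current_ids + products.ids
# the product_ids= writer replaces the whole collection, so it first loads
# every record: SELECT * FROM products WHERE id IN (<all ~100 000 ids>)
tag.products = Product.where(id: new_ids)
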
The solution is simple:
class MassiveTaggingWorker
  include Sidekiq::Worker

  def perform(tag_id, product_ids)
    tag = Tag.find(tag_id)
    # filtering starts
    selected_products = Product.where(id: product_ids)
    products_with_tag = selected_products.joins(:taggings).where('taggings.tag_id = ?', tag_id)
    products = Product.where(id: product_ids - products_with_tag.ids)
    # filtering ends
    products.each do |product|
      product.tags << tag
    end
  end
end
We replaced the super short tag.product_ids += products.ids with the slightly longer (but faster) products.each { |product| product.tags << tag }. That way each product handles its taggings on its own, avoiding the huge and costly database query.
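
The loop still issues one INSERT per product, which was fine for us. If even that became too slow, a bulk insert would be an option on Rails 6+ - a hypothetical variant, not what we shipped, and note that insert_all bypasses model callbacks and validations:

# hypothetical bulk variant; insert_all skips Tagging callbacks/validations
rows = products.ids.map { |product_id| { tag_id: tag.id, product_id: product_id } }
Tagging.insert_all(rows) if rows.any?
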
This way I re-learnt an old truth - when working on performance, always measure and follow the findings. Even the shortest line may hide a complexity explosion.