Too many cookies

What if one of your tracking snippets started to create cookies like crazy? I mean a few hundred for power users (and at least a couple per visitor). A few things may happen: the browser will start replacing cookies with other cookies after hitting a limit (180 cookies?). Not very cool, but what about varnish or nginx rejecting the request because the header is too large?

Symptoms

We are happy to have the most engaged users of our website working in-house - they manage content, polish it and create marketing campaigns. A few of our teammates started to see 400 responses, and it turned out (thanks to haproxy logs in Kibana) that a fraction of a percent of our traffic was getting requests rejected with 400, and those requests seemed to have many cookies set. A few clicks on our core feature and it turned out that each time the feature is used, a cookie is created - by a tracking pixel.

First fix

Our infrastructure is not the simplest - we have haproxy, varnish and nginx. Ah, and puma as an app server. Which one responds with 400, hmmm? Haproxy says the request was terminated by varnish, and varnishlog indeed shows 400s and a BogoHeader Header too long error, which makes us think it's varnish. Oh, what to do, what to do next. Increase the header size limit, of course, but that's not enough - it would just delay the disaster. Something had to remove the excess cookies. Fortunately we don't cache core pages at the varnish level, so I decided to use our app's middleware to delete those cookies.

# Rack middleware that deletes the tracking cookies before the request
# reaches the app: each deletion should end up as an expiring Set-Cookie
# header in the response (committed by the ActionDispatch::Cookies middleware).
class CookieRemover
  PREFIX = '_aweful_cookie.'

  def initialize(app, _options = {})
    @app = app
  end

  def call(env)
    request = ActionDispatch::Request.new(env)

    # names of all cookies created by the tracking snippet
    excess_cookies = request.cookie_jar.each.map(&:first).select { |name| name.start_with?(PREFIX) }
    # deleting them from the jar makes Rails expire them in the response
    excess_cookies.each do |cookie_name|
      request.cookie_jar.delete(cookie_name)
    end

    @app.call(env)
  end
end
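
For reference, increasing varnish's header size limit boils down to raising a couple of varnishd runtime parameters - something along these lines (the values are illustrative, not our exact settings, and the other startup flags are elided):

# raise the per-header and whole-request limits
varnishd ... -p http_req_hdr_len=65536 -p http_req_size=131072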

After deploying the change and reconfiguring varnish (with a short downtime) it seemed we could call it a day. The number of 400s decreased drastically, we saw smaller cookie counts in the logs, and so on. We were wrong.

Second fix

The next day (Friday) someone told us the problem remained the same - a new power user reported it. In the logs the 400s turned into 502s. Not all of them - some requests got 200s, but we still saw a small fraction of a percent of our traffic hitting errors because of the cookies. nginx - we saw new errors in its log, so we increased the header limit (once again) - actually the proxy buffer size and related settings. After reconfiguring nginx (with no downtime) it seemed we no longer got 502s, but 503s instead. The last piece of our HTTP infra - the puma app server - was responding with errors.
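
The nginx change was along these lines (directive values are illustrative, not our exact config):

# room for large client request headers (the Cookie header lives here)
large_client_header_buffers 4 32k;
# buffers used for reading the upstream response
proxy_buffer_size 32k;
proxy_buffers 8 32k;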

The last (hot) fix

We failed to reconfigure puma. Maybe there is a way to increase its header limit, but it wasn't obvious and skimming through the docs didn't help. What could we do now?

You have to know that our service is mostly used on weekends and in the evenings. Most of the performance peaks happen when we're not in the office - so without fixing this before the weekend we would potentially serve loads of visitors a 503 while thinking all was good, because health checks wouldn't catch it. We knew we had to fix this really soon.

The problem is a too large HTTP header - the Cookie section of it, to be precise. Thankfully I had some experience working with cookies at the varnish level - we generate the visitor id there - so I knew we could remove these cookies from the request. That would reduce the header size, so the request would reach the app, but the app would not see the cookies (and thus could not remove them). Not removing them would just make the problem worse - we could exceed the browser limit or hit varnish's limit once again, with an even larger header.
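
The varnish part is a few lines in vcl_recv, roughly like this (a sketch, not our exact VCL - it assumes all offending cookies share the prefix used in the middleware above):

sub vcl_recv {
  if (req.http.Cookie) {
    # strip every cookie whose name starts with the offending prefix
    set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)_aweful_cookie\.[^;]*", "");
    set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
    # drop the header entirely if nothing is left
    if (req.http.Cookie ~ "^\s*$") {
      unset req.http.Cookie;
    }
  }
}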

Browsers know all the cookies they use. You can access them via document.cookie in JS and delete them. The last piece was a short JS snippet, added to every page, responsible for the cookie removal. Deploy, varnish reconfiguration (with no downtime - luckily VCL changes are not as dramatic as changing limits). Now we see no 50x, no 400s, and haproxy logs show a decreasing number of cookies.
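
The removal snippet is conceptually just a few lines, something like this (a sketch, not our exact code; it assumes the cookies were set with path=/ on the default domain):

// expire every cookie created by the tracking snippet
document.cookie.split(';').forEach(function (pair) {
  var name = pair.split('=')[0].trim();
  if (name.indexOf('_aweful_cookie.') === 0) {
    document.cookie = name + '=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/';
  }
});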

After the last fix

The client-side cookie removal and the varnish request cookie removal are temporary solutions to make the problem manageable for super power users. We couldn't find a better way to clear their cookies, and now that they carry only a small number of them, for a really short time, we can get back to app-level removal and the default nginx and varnish setup.

The next step is to convince the tracking snippet's owner to stop this disaster at their level. I wouldn't expect the snippet to produce so many cookies.