Last Thursday we launched Spot on Hacker News in a post called Show HN: Free 3D virtual office space in the browser. We are no strangers to the Hacker News community, as we've been members for years and have gone through YC in previous lifetimes. For those not familiar, it is a great technical/developer community for many things, most notably startups. For Spot in particular, one of our goals is to create a fully programmable virtual office, and developers are a core part of our target demographic.
As such, we were pleasantly surprised to see the post quickly rise to the front page of Hacker News. After roughly an hour, we had risen to the #3 spot, with mostly positive comments. We had just turned on self-signups and, to our delight, we were watching the number of Spots created quickly increase.
Shortly thereafter, we noticed a few comments mentioning that email signup was returning an error message: "Unauthorized". Checking our server logs confirmed this:
Digging into the relevant stack traces revealed that this was indeed happening during signup. It quickly dawned on us: each one of the above errors was a failed signup! In total, we lost 431 signups! For a startup at our stage, this was not a trivial number.
We also started to see our position on the front page rapidly decline, perhaps due to frustration around the inability to sign up:
By the time we identified the root cause, we had already lost all traction, a real missed opportunity! Our team is experienced, particularly with operations, so what ultimately went wrong?
The Root Cause
While digging into the stack traces, we noticed that the error was originating from our email delivery provider, SendGrid. In the past we have used Simple Email Service or MailGun, but we chose SendGrid due to a small number of free credits we received (every little bit counts as a startup!).
Signing into the SendGrid dashboard showed no indication that anything was wrong. Everything looked perfectly fine, including our sending limit and message deliverability. Going back to the stack trace, we noticed more messaging: "Credit Limit Exceeded". This was confusing as we had recently upgraded plans and our daily sending limit was far above our actual usage.
We eventually had the following exchange with support:
"It was a bug on our end due to the Startup discount"
Unfortunately, SendGrid had a bug on their backend where our account, despite being ostensibly upgraded to a paid account, was not actually on the correct plan. Although frustrating, the issue is ultimately on our end for putting a third party in the critical path of our product. All products, even those of publicly traded companies, have issues.
This is not news to us. Our team is intimately familiar with operational issues, especially those related to third-party outages, and we could have done better.
Usually, when dealing with external services, the solution is to move the logic out of the critical request path and into a job queue or some other method of calling the logic asynchronously. This is normally how we interact with SendGrid, but our email verification logic was inline. Lesson learned.
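To make the pattern concrete, here is a minimal sketch of moving the verification email out of the request path. This is not our production code: `EmailJob`, `EmailQueue`, `sign_up`, and the `send_fn` callback are hypothetical names invented for illustration, and an in-process thread stands in for a real job queue. The point is only that signup creates the account and returns, while the provider call (and any failure it raises) happens in the background.

```python
import queue
import threading
import time
from dataclasses import dataclass


@dataclass
class EmailJob:
    """One pending email (hypothetical shape, for illustration only)."""
    to_address: str
    subject: str
    body: str
    attempts: int = 0


class EmailQueue:
    """Sends email on a background thread so request handlers never wait on the provider."""

    def __init__(self, send_fn, max_attempts=3, retry_delay=0.0):
        self._send_fn = send_fn          # assumed wrapper around the email provider's client
        self._max_attempts = max_attempts
        self._retry_delay = retry_delay
        self._jobs = queue.Queue()
        self.failed = []                 # jobs that exhausted their retries
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, job):
        self._jobs.put(job)

    def wait_until_idle(self):
        """Block until every queued job has been sent or given up on."""
        self._jobs.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                self._send_fn(job)
            except Exception:
                job.attempts += 1
                if job.attempts < self._max_attempts:
                    time.sleep(self._retry_delay)
                    self._jobs.put(job)      # re-queue for another attempt
                else:
                    self.failed.append(job)  # keep for alerting or manual replay
            finally:
                self._jobs.task_done()


def sign_up(email, users, email_queue):
    """Create the account first, then queue the verification email."""
    users[email] = {"verified": False}
    email_queue.enqueue(
        EmailJob(email, "Verify your email", "Click the link to verify your address.")
    )
    return "created"
```

With this shape, a provider that rejects every request still leaves the signup intact: the account exists, the failed job sits in `failed`, and the email can be replayed once the provider issue is resolved. A production setup would more likely use a durable queue and a separate worker process, so that pending jobs survive a restart.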