How to debug a fire?

Take a deep breath. The calmer you are, the better you can focus on the real problem. Oxygen is necessary to make good decisions.

I don't mean that it's time to start meditating; I just mean "Keep calm and nerd on".

Before we begin, we need to understand:

How does POWR stay on?

  • Our domain is registered with Gandi.net, where the nameservers are configured to point to Cloudflare.
  • Cloudflare is our DNS provider, meaning that all our A, CNAME, SPF and MX records, and everything else that has to do with DNS, are configured within Cloudflare.
  • We also use Cloudflare as a CDN and for some security features.
  • Cloudflare receives every HTTP request to powr.io first and forwards it to Heroku when needed (that is, when the response is not already cached by the CDN).
  • Heroku is where we host our Rails server (among other things).
  • We have also configured Heroku to autoscale.
  • Amazon RDS Postgres is what we use for our database.
  • Compose.com is what we use for Redis.
  • We use Sidekiq for background jobs.
  • We use Heroku's scheduler for cron jobs.
  • We use Sparkpost for our transactional emails (and some marketing emails).
  • Most of our marketing emails are triggered within Hubspot.
  • Braintree processes our internal payments (for Pro/Subscriptions).
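
One way to tell whether a given response was answered from Cloudflare's cache or passed through to Heroku is to look at the `cf-cache-status` response header that Cloudflare adds. The snippet below is only a minimal sketch, not existing POWR tooling: the function name `served_by` and the way the statuses are grouped are assumptions, and any status it is not sure about is reported as "unknown".

```python
# Hypothetical helper (not part of POWR's codebase): classify a response
# by the cf-cache-status header that Cloudflare adds.
CACHE_STATUSES = {"HIT"}                                   # answered from Cloudflare's cache
ORIGIN_STATUSES = {"MISS", "EXPIRED", "BYPASS", "DYNAMIC"} # answered by the origin (Heroku)


def served_by(headers):
    """Return 'cloudflare-cache', 'origin' or 'unknown' for a dict of response headers."""
    # Header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    status = normalised.get("cf-cache-status", "").strip().upper()
    if status in CACHE_STATUSES:
        return "cloudflare-cache"
    if status in ORIGIN_STATUSES:
        return "origin"
    # Missing header or a status this sketch does not classify.
    return "unknown"
```

For example, `served_by({"CF-Cache-Status": "HIT"})` returns `"cloudflare-cache"`, while a response marked `DYNAMIC` is reported as `"origin"`, which means the request reached Heroku.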

Here are a few things you should do:

Identify the problem

  • Is this happening on one app type or on several?
  • Can you reproduce a similar issue on production?
  • Was there a recent production deploy, or a change to the data, that may have caused this issue?
  • Is this caused by one app or a few apps (e.g. a few apps getting too many responses, a spammy user trying to steal bitcoins from random users, or something of that nature)?
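
The first and last questions above come down to checking whether errors or requests are concentrated in a few places. As a rough sketch, assuming you have exported events as a list of dicts (the field names `app_id` and `app_type`, the function `top_offenders` and the 50% default are all hypothetical), you could count them like this:

```python
from collections import Counter


def top_offenders(events, key="app_id", share=0.5):
    """Return (value, count) pairs that account for at least `share` of all events.

    `events` is a list of dicts; `key` is the field to group by
    (hypothetical field names such as "app_id" or "app_type").
    """
    counts = Counter(event[key] for event in events)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() yields the values ordered from most to least frequent.
    return [(value, n) for value, n in counts.most_common() if n / total >= share]
```

If a single app ID comes back from `top_offenders(events)`, the problem is probably caused by that one app; an empty list suggests the problem is spread across many apps. Passing `key="app_type"` answers the same question per app type.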

How to get started?

Note: YOU is highlighted for a reason. It's because YOU can do this, and it's easier than YOU think.

  • Start with Heroku and take a look at the metrics there. Does anything look unusual?

  • Move on to NewRelic -> Summary.
  • Toggle between the last 30 mins, 60 mins, 3 hours, etc.
  • Take a look at web transaction time, throughput and error rate.
  • The more you familiarize yourself with NewRelic, the easier it gets. It's built by developers for developers like YOU.

  • If you notice transaction time above 200 ms, throughput climbing, and/or an error rate above 2.5%, we are about to get into some trouble.
  • If you see all, most or some of the above happening, it's time for YOU to be a firefighter.
  • Try to understand what's happening: what do those graphs mean, and what is causing those errors? A few tabs/sidebar menus that come in handy for understanding this are Transactions and Errors (on the left-hand side).
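
The thresholds above can be written down as a small check. This is only an illustrative sketch: the function name `warning_signs` and its inputs are hypothetical, the numbers are typed in by hand rather than fetched from NewRelic, and "climbing" is assumed to mean that every throughput sample is higher than the one before it.

```python
MAX_TRANSACTION_MS = 200.0  # from the list above: transaction time > 200 ms
MAX_ERROR_RATE_PCT = 2.5    # from the list above: error rate > 2.5%


def warning_signs(transaction_ms, error_rate_pct, throughput_samples):
    """Return the warning signs that apply.

    `throughput_samples` is a list of throughput readings, oldest first.
    """
    signs = []
    if transaction_ms > MAX_TRANSACTION_MS:
        signs.append("slow transactions")
    # Assumption: "climbing" means each sample is strictly higher than the previous one.
    pairs = list(zip(throughput_samples, throughput_samples[1:]))
    if pairs and all(later > earlier for earlier, later in pairs):
        signs.append("throughput climbing")
    if error_rate_pct > MAX_ERROR_RATE_PCT:
        signs.append("high error rate")
    return signs
```

For example, `warning_signs(250, 3.0, [100, 120, 150])` reports all three signs, while `warning_signs(120, 0.5, [100, 90, 95])` returns an empty list.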