🔥 Fire - What to do when server is down or slow

Step 1: Take a deep breath. Oxygen is necessary to make good decisions.​

Step 2: Is this a fire? Verify​

Some common things reported by support

  1. Chunk not loading
  2. Users/me not loading for user x - Does it load for you and QA? It may be an issue with just that user.
  3. How many users are affected by this? For example:
     • Forms are not submitting - Fire
     • Standalone is not loading - Fire
     • Users/me for a single user not loading - not fire
     • Japanese Yen is not working - not fire
     • 1 simple copy is missing - not fire
     • All simple copies missing - Fire
     • PayPal not accepting payments - Fire
     • Instagram feed is down again - Fire

Step 3: Start a Google Meet in the fire channel and alert all engineers who are online.

Quick Meet link: meet.google.com/asz-scvc-faf, or type in `start fire call`

Step 4: Diagnose Issue​

*NOTE: If the team is not able to identify the issue in ~10 minutes and the server is down or unusable, call Ben (+1 610 737 3935) or Puru (628 333-5557). Contact info for other POWr rangers is here

Open Dashboards:

  • Open Heroku (https://dashboard.heroku.com/apps) and especially New Relic
  • It may be useful to watch the live logs: `heroku logs -t -a powr`
  • Did we recently make any pushes that correspond with the error?
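
To check whether a recent push lines up with the error, the release history can also be read from the CLI. This is a minimal sketch: the helper name `previous_release` is hypothetical, and it assumes each release line in the `heroku releases` output starts with a version label such as `v102`.

```shell
# App name taken from the other commands in this doc.
APP="powr"

# Reads `heroku releases` output on stdin and prints the version just before
# the current one. Assumes release lines begin with a label like "v102".
previous_release() {
  grep -oE '^v[0-9]+' | sed -n '2p'
}

# Usage (run by hand during a fire):
#   heroku releases -n 10 -a "$APP"                      # compare timestamps with the incident
#   heroku releases -n 10 -a "$APP" | previous_release   # candidate rollback target
#   heroku rollback v101 -a "$APP"                       # v101 is a placeholder version
```

Compare the release timestamps with the moment the error started before deciding that a push is responsible.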

Lots of timeouts in the logs and high request queuing in New Relic => determine whether the issue is a specific endpoint or an underlying resource (Postgres, Redis, or memory) being overwhelmed.
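
One rough way to tell the two cases apart from the logs is to count router timeouts per path. The sketch below assumes the standard Heroku router log format, where request timeouts appear as `code=H12` and the request path as `path="..."`. The function name `timeouts_by_path` is hypothetical.

```shell
# Reads Heroku log lines on stdin and prints "<count> <path>" for H12
# (request timeout) router errors, busiest path first.
timeouts_by_path() {
  grep 'code=H12' \
    | sed -E 's/.*path="([^"?]*).*/\1/' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Usage:
#   heroku logs -n 1500 -a powr | timeouts_by_path
```

If one path dominates the list, suspect that endpoint. If timeouts are spread across many unrelated paths, an underlying resource is the more likely cause.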

Specific Endpoint Problems​

Look at the transactions in the lower left of the New Relic screen and click on any that are taking a long time, to see whether they correlate directly with the periods when the server is slow.

If a specific endpoint is causing problems, it looks like this:

[Screenshot: Screen Shot 2020-07-01 at 3.10.49 PM.png]

^ Throughput of app form response goes up at the same time as server response time => there's your problem.

Underlying Resource Problems​

If memory is well over 1 GB and swap is in the hundreds of MB, restart the server: `heroku restart -a powr`
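
The same check can be sketched against the logs, assuming the app has Heroku's `log-runtime-metrics` feature enabled (an assumption, since this doc does not say so). That feature emits lines containing `sample#memory_total=...MB` and `sample#memory_swap=...MB`. The function name and the exact thresholds (1024 MB total, 100 MB swap) are illustrative readings of the rule above.

```shell
# Reads Heroku log lines on stdin and prints "<dyno> <total>MB <swap>MB" for
# each sample where memory_total > 1024 MB and memory_swap >= 100 MB.
dynos_needing_restart() {
  awk '
    index($0, "sample#memory_total=") {
      dyno = ""; total = 0; swap = 0
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "source")              dyno  = kv[2]
        if (kv[1] == "sample#memory_total") total = kv[2] + 0
        if (kv[1] == "sample#memory_swap")  swap  = kv[2] + 0
      }
      if (total > 1024 && swap >= 100) print dyno, total "MB", swap "MB"
    }'
}

# Usage:
#   heroku logs -n 1500 -a powr | dynos_needing_restart
#   heroku restart -a powr        # only if dynos are listed
```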

  • Does the Postgres DB have problems?
    • Look at currently running queries: `heroku pg:ps`
    • Look at current DB connections from the Heroku dashboard. If over 400, kill all connections: `heroku pg:killall`
    • Use the built-in diagnostic command: `heroku pg:diagnose`

In a happy DB, queries take a few ms
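
The connection count is also available from the CLI. This sketch assumes `heroku pg:info` prints a line of the form `Connections: <used>/<limit>`. The helper names are hypothetical, and the 400 threshold is the one given above.

```shell
# Reads `heroku pg:info` output on stdin and prints the number of connections
# in use. Assumes a line like "Connections:   412/500".
connection_count() {
  awk '/^Connections:/ { split($2, parts, "/"); print parts[1] }'
}

# Succeeds (exit status 0) when the count is over the 400 threshold.
too_many_connections() {
  [ "$1" -gt 400 ]
}

# Usage:
#   count="$(heroku pg:info -a powr | connection_count)"
#   if too_many_connections "$count"; then heroku pg:killall -a powr; fi
#   heroku pg:ps -a powr          # inspect running queries and their pids
#   heroku pg:diagnose -a powr    # built-in health report
```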

Diagnosing sidekiq issues

  • Visit: https://www.powr.io/sidekiq
  • Are there busy workers that have been taking minutes to hours to run? (Generally, only admin workers should take this long)
    • STOP the workers in the /sidekiq interface.
    • If DB is still slow, you may need to Reset DB Connections (see below)
    • Figure out what code is causing these to take too long
  • Is enqueued list > 10,000?
    • Are busy workers slow? See previous bullet.
    • Is the same task duplicated many times?
    • Are workers not running at all? Visit heroku and see how many instances are running. Increase if necessary (usually up to 10 workers).
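
Checking and increasing the worker count can also be done from the CLI. In this sketch the process type name `worker` is an assumption (use whatever name `heroku ps -a powr` shows for the Sidekiq dynos), and `scale_command` is a hypothetical helper that only prints the command, capped at the usual maximum of 10.

```shell
# Usual upper bound for workers, per the list above.
MAX_WORKERS=10

# Prints the scale command for the requested worker count, capped at
# MAX_WORKERS. It prints rather than runs, so the command can be reviewed first.
scale_command() {
  wanted="$1"
  if [ "$wanted" -gt "$MAX_WORKERS" ]; then
    wanted="$MAX_WORKERS"
  fi
  echo "heroku ps:scale worker=$wanted -a powr"
}

# Usage:
#   heroku ps -a powr       # how many dynos of each type are running
#   scale_command 8         # prints the command to run
```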

Diagnosing Redis issues

There are a few ways to diagnose Redis issues.

New Relic

  • Select the specific Controller#action with a higher response time in New Relic.

[Screenshot: Screen Shot 2020-07-08 at 5.26.39 PM.png]

  • Scroll to the bottom of the page and locate Application Trace, then click on one of the endpoints in the table. You will see something like the screenshot below. If an endpoint has a higher Redis Scan, it usually means far too many Redis cache entries are being accessed, for example Simple Copies or General Copies.

[Screenshot: Screen Shot 2020-07-08 at 5.27.56 PM.png]

Scout APM

  • Our development server already comes with Scout APM preinstalled. You can find a small tooltip toward the left-hand corner of the page.

[Screenshot: Screen Shot 2020-07-08 at 5.31.19 PM.png]

  • Click on it and select the Controller#action with a higher response time. It will usually show you the number of Redis get/set calls.

[Screenshot: Screen Shot 2020-07-08 at 5.35.28 PM.png]

Why is this a problem? Redis is hosted on an external server. Even though Redis is a good thing to have, if we make hundreds of separate Redis queries, each one is a round trip to the remote server, and these add up. Instead, we should combine all the keys we would like to check and make just one query.
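
The difference can be illustrated with `redis-cli`, using the standard Redis `GET` and `MGET` commands. This is only a sketch: the key names in the usage comment are hypothetical, connection flags are left out, and the application code would use its Redis client's equivalent of `MGET` rather than the CLI.

```shell
# Command used to talk to Redis. Overridable, e.g. to add connection flags.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Slow pattern: one network round trip per key.
fetch_one_by_one() {
  for key in "$@"; do
    $REDIS_CLI GET "$key"
  done
}

# Preferred pattern: all keys in a single MGET, so one round trip in total.
fetch_batched() {
  $REDIS_CLI MGET "$@"
}

# Usage (hypothetical key names):
#   fetch_one_by_one copy:title copy:subtitle copy:footer   # 3 round trips
#   fetch_batched    copy:title copy:subtitle copy:footer   # 1 round trip
```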

Step 5: Action

Server Rollback Procedure

Is a major part of the site crashed or unusable? Roll back from the Heroku activity page:

https://dashboard.heroku.com/apps/powr/activity

Reset DB Connections

Are specific connections taking a super long time to run?

  • Find slow DB connections via `heroku pg:ps`
  • Kill specific connections using `heroku pg:kill PROCESS_ID`, where PROCESS_ID is found in the previous step.

Are there too many connections? Or zero connections?

  • Kill ALL connections: `heroku pg:killall`

Step 6: Documentation

  • Alert fire channel with the progress and resolution of the issue
  • Move non-fire communications out of the channel
  • Create a fire doc with the details of what was discovered so people in other timezones can pick up where you left off
