🔥 Fire - What to do when server is down or slow

Step 1: Take a deep breath. Oxygen is necessary to make good decisions.

Step 2: Is this a fire? Verify

Some common things reported by support

Chunk not loading
Users/me not loading for user x - Does this load for you and qa? May be just a user related issue?
How many users are affected by this? eg: Forms are not submitting - Fire Standalone is not loading - Fire Users/me for a single user not loading - not fire Japanese Yen is not working - not fire 1 simple copy is missing - not fire All simple copies missing - Fire Paypal not accepting payments - Fire Instagram feed is down again - Fire

Step 3: Start a google meet in fire channel and alert all engineers that are online.

quick meet link: meet.google.com/asz-scvc-faf or type in start fire call

Step 4: Diagnose Issue

*NOTE: If the team is not able to identify the issue in ~10 minutes and the server is down or unusable, call Ben (+1 610 737 3935) or Puru (628 333-5557). Contact info for other POWr rangers is here

Open Dashboards:

Open heroku (https://dashboard.heroku.com/apps) and especially new relic
May be useful to see the live logs heroku run logs -t -a powr
Did we recently make any pushes that correspond with the error?

Lots of timeouts in the logs and high request queuing in new relic => determine if issue is a specific endpoint or an underlying resource such as postgres, redis, or memory being overwhelmed.

Specific Endpoint Problems

Look at transactions in the lower left of new relic screen and click on any that are taking a long time to see if they are directly correlating with when the server is slow.

If a specific endpoint is causing problems: Screen Shot 2020-07-01 at 3.10.49 PM.png ^ throughput of app form response goes up at same time as server response time => there's your problem

Immediately increase the dynos: https://powr.gitlab.io/docs/engineering/89
Find out if a specific user or app is getting spammed. Eg for form responses AppFormResponse.where('created_at > ?',5.minutes.ago).group(:app_id).count. Figure out how to stop them.
Be aware you can quickly block an IP address is cloudflare: https://powr.gitlab.io/docs/engineering/269.

Underlying Resource Problems

Is memory high? https://dashboard.heroku.com/apps/powr/metrics/web
- Go to Heroku => POWr => Metrics. Look at Memory Usage:

If memory is way over 1GB and swap is in 100s MB, then restart server: `heroku restart -a powr`

Does postgres DB have problems?*
- Look at currently running queries: heroku pg:ps
- Look at current DB connections from heroku dashboard. If over 400, kill all connections heroku pg:killall
- Use built in diagnostic command: heroku pg:diagnose

In a happy DB, queries take a few ms

Diagnosing sidekiq issues

Visit: https://www.powr.io/sidekiq
Are there busy workers that have been taking minutes to hours to run? (Generally, only admin workers should take this long)
- STOP the workers in the /sidekiq interface.
- If DB is still slow, you may need to Reset DB Connections (see below)
- Figure out what code is causing these to take too long
Is enqueued list > 10,000?
- Are busy workers slow? See previous bullet.
- Is the same task duplicated many times?
  - Eg: heroku ps:stop web:32 -a powr in this case.
  - You can find and delete the specific tasks via console: https://www.getoutline.com/doc/deleting-stuff-in-sidekiq-iK7Uqk9Xa7
  - Is code allowing the same task to get called many times? Fix it.
  - Is a user spamming us? TODO: HOW TO BLOCK SPAMMING USERS
- Are workers not running at all? Visit heroku and see how many instances are running. Increase if necessary (usually up to 10 workers).

Diagnosing REDIS issues

There are few ways you can diagnose redis issues

New Relic

Select specific Controller#action with higher response time on new relic
Scroll to the bottom of the page and locate Application Trace, click on one of those endpoint on the table, you will see something like this. If an endpoint has higher Redis Scan, this means there usually is way too many redis cache being accessed, Simple Copies, General Copies for example.

Screen Shot 2020-07-08 at 5.27.56 PM.png

Scout APM

Our development server already comes with Scout APM preinstalled and you can find a little tool tip towards left hand side corner of the page.

Screen Shot 2020-07-08 at 5.31.19 PM.png

Click on it and select the Controller#action with higher response time, it usually will show you number of redis get/set calls

Screen Shot 2020-07-08 at 5.35.28 PM.png

*Why is this a problem? Well, redis is hosted on external server, even though redis is a good thing to have, if we are making 100's of redis queries, 100 different times, it has to connect to remote server and each of these will add up, instead we should combine all the keys we would like to check and make just one query.

Step 4: Action

Server Rollback Procedure Is a major part of the site crashed or unusable?

YES => Instant rollback (see https://www.getoutline.com/doc/install-rollback-how-to-rollback-heroku-with-0-downtime-qtl91V6zdS)
- heroku features:disable -a powr preboot
- heroku rollback -a powr
- heroku features:enable -a powr preboot
NO => Normal rollback:
- heroku rollback -a powr

https://dashboard.heroku.com/apps/powr/activity

Reset DB Connections Are specific connections taking a super long time to run?

Find slow DB connections via heroku pg:ps
Kill specific connections using heroku pg:kill PROCESS_ID, where PROCESS_ID is found from previous step.

Are there too many connections? Or zero connections?

Kill ALL connections: heroku pg:killall

Step 5: Documentation

Alert fire channel with the progress and resolution of the issue
Move non-fire communications out of the channel
Create a fire doc with the details of what was discovered so people in other timezones can pick up where you left off

Screen Shot 2020-07-08 at 5.31.19 PM.png

🔥 Fire - What to do when server is down or slow

Step 1: Take a deep breath. Oxygen is necessary to make good decisions.​

Step 2: Is this a fire? Verify​

Step 3: Start a google meet in fire channel and alert all engineers that are online.​

Step 4: Diagnose Issue​

Lots of timeouts in the logs and high request queuing in new relic => determine if issue is a specific endpoint or an underlying resource such as postgres, redis, or memory being overwhelmed.​

Specific Endpoint Problems​

Underlying Resource Problems​

Diagnosing REDIS issues​

Step 4: Action​

Step 5: Documentation​