#engineering #diagnostics #ops #debugging #investigation

idea

This process focuses on finding the source of a problem:

  1. Understand the problem well. What is the issue, and what is the expected behavior? Collect initial data around the context where it happened: correlation ids, error codes, error messages, screen captures, and screen recordings. Establish who (customer, tenant, one person or everyone; look at the dashboard to gauge impact size), when (timeline, recurrence), where (physical location, browser type, locale, ...), and how (repro steps, software version).
  2. Try to reproduce, and validate that it's not user error. Time-box this exercise. Use the same software version as the customer, on a demo environment. Make sure you reproduce the same problem as the customer, and not something else: error codes and error messages should match. If you can't, try to narrow the configuration down to match the customer's, but don't kill yourself at it. Try to reproduce on lower environments (dev, local); this will make debugging much easier.
  3. Collect data: look into the logs and search for the correlation id. Look into the traces, the exceptions, the request logs, and the dependency (reverse proxy) logs. If you can't find anything by correlation id, search by error code and timespan; failing that, narrow down to the timespan alone.
  4. List potential causes. Starting from the data and your knowledge of the system, identify what could have gone wrong. Go top-down: you see an issue; what is the closest thing that could have gone wrong? Then follow the path. Go for breadth with that list.
  5. Match causes with data: which cause best describes the data you gathered? Follow that one and collect more data to try to prove that it's the right explanation. If the data does not match, strike out the cause and document why, then list/review causes based on the new data, and repeat the process.
  6. Once you have identified a likely cause, dig into the details until you have enough to remediate or fix. You should not stop an investigation at "something's broken with authentication" and wait for someone else to fix it for you.
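
The log search in step 3 can be sketched with grep, assuming plain-text log files; the log directory, line format, correlation id, and error code below are all hypothetical placeholders, with a throwaway directory standing in for the real log location:

```shell
# Sketch of the log search in step 3. The log line format, correlation id,
# and error code are hypothetical; a temp dir stands in for the real logs.
LOGDIR=$(mktemp -d)
cat > "$LOGDIR/app.log" <<'EOF'
2024-05-12T14:03:11Z INFO  corr=3f9c2a1e request accepted
2024-05-12T14:03:12Z ERROR corr=3f9c2a1e code=AUTH_401 token rejected
2024-05-12T14:09:45Z ERROR corr=9b11d002 code=AUTH_401 token rejected
EOF

# First pass: search by the correlation id from the customer report.
grep -r "corr=3f9c2a1e" "$LOGDIR"

# Fallback when there is no correlation id: narrow by error code and timespan.
grep -r "code=AUTH_401" "$LOGDIR" | grep "2024-05-12T14:0"
```

The same two-step narrowing (exact id first, then error code plus timespan) applies whatever the log backend is; only the query syntax changes.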

You should follow this behavior:

  1. Document everything, including: your understanding of the issue, the timeline, the data you found, your hypotheses, and what confirmed or discarded them. Include queries and results, screen captures, etc.
  2. Separate concerns: watch out for overlapping issues. Keep things separate, don't mix issues, don't try to fix two things at once, and be very, very clear about what you are investigating. Sometimes something that works still produces error logs; don't assume that an error log is the cause of a problem, and watch out for irrelevant logs.
  3. Limit assumptions: do not assume that all errors in the logs are related to the issue you are investigating. Do not assume that configuration files are correct, even if they look correct, and even if you checked them. Do not assume that data in the database is correct. Check application versions, check connections, check permissions.
  4. Assume we're the weakest link, and that the issue is coming from us. If there are no declared outages with dependencies, it's probably our problem. Even if nothing seems to have changed, even if it was working before, even if our changes were marginal, assume that the problem is coming from our code or our configuration. It is very unlikely that you found a problem in a widely used framework or OS; 90% of the time or more, the issue is coming from us. Only once we have gathered solid evidence that the problem is in fact not coming from us, start investigating dependencies from the weakest link to the strongest (i.e. devops provider, libraries, frameworks, OS).
  5. When you get blocked, involve someone else. Start internally with the team: diagnose and debug with pair programming. Everyone in the pair needs to participate, look at the logs, and provide ideas. Involve people outside the team only when we're really blocked and outside expertise is required. Provide CLEAR data to the outside help; don't just drop the problem on them.
  6. Stay honest and open. Stay scientific: don't try to force the data you found into the hypothesis you like. Make sure you truly understand your hypothesis. Be honest about what you don't know, and revise wrong assumptions you made.
  7. If it's a customer issue, keep the communication channel open and provide frequent updates; let them know you're working on it and making progress (or not).
  8. Demonstrate ownership. Investigating incidents can be hard and confusing, but it's also a great opportunity to learn and fix things. Assume that you are in charge of fixing the problem; don't wait for someone else to bring the solution to you. It is always OK to ask for help, but the help should be about ideation, or doing something you don't know how to do. It should NEVER be about dropping the issue onto someone else.

links

#investigation

15 minute rule

references

Template for data collection:

Problem

Describe what the problem is.

Timeline

When did the problem start? When were the last changes to code and configuration?

Data

Queries you ran and the results they produced, screen captures. Credentials to the demo environment and ids that reproduce the issue.
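
The timeline question about recent changes can often be answered from version control; a minimal sketch with git, using a throwaway repo in place of the real service repository (the commit, paths, and time window are placeholders):

```shell
# Sketch for the timeline section: list recent changes from version control.
# A throwaway repo stands in for the real service repository.
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "change config"

# In a real investigation, run this in the service repo, with the window
# matched to when the problem started.
git log --since="2 weeks ago" --oneline
```

Cross-check the commit dates against the time the problem started; a change that landed just before the first occurrence is a strong lead.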