#diagnostics #debugging #ops #investigation
Idea
Something is broken in one environment and works in another. Everything is seemingly the same.
There is always something different, even if it's not obvious. It can either be static (code, configuration, infrastructure) or dynamic (bootup, data, load). Walk your way through the differences.
These are potential differences between environments; the list is by no means exhaustive. Validate every assumption you have. Don't pick one difference at random and declare it the cause: work your way through the list and eliminate candidates one by one.
Infrastructure
- Number of nodes: if one environment is smaller than the other, warm-up time may play a role. The issue may occur only in the first few minutes of an instance's life, and so be masked on the smaller environment.
- Infrastructure provisioning: if environments are provisioned differently, one of them may not be provisioned properly. It might be missing configuration keys or secrets.
- Networking: are the environments using the same DNS resolvers, proxies and reverse proxies, and firewalls?
- Latency: account for what can happen when a deployment is spread across the globe.
- Underlying infrastructure is different: you run a self-host locally and it runs in SF in prod. Even the local SF emulator is different from the actual thing.
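The warm-up hypothesis from the first bullet can be checked against data rather than guessed at. The sketch below is a minimal illustration, assuming you can export error events as (instance id, timestamp) pairs and know when each instance started. The function name, the input shapes and the five-minute window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def warmup_error_share(errors, instance_starts, warmup=timedelta(minutes=5)):
    """Return the fraction of errors raised within `warmup` of their instance's start.

    errors: iterable of (instance_id, error_time) pairs (hypothetical export format)
    instance_starts: dict mapping instance_id -> start time
    """
    total = 0
    in_warmup = 0
    for instance_id, error_time in errors:
        started = instance_starts.get(instance_id)
        if started is None:
            continue  # unknown instance: cannot be classified, so skip it
        total += 1
        if timedelta(0) <= error_time - started <= warmup:
            in_warmup += 1
    return in_warmup / total if total else 0.0
```

A share close to 1.0 suggests the errors cluster around instance start-up; a share close to the fraction of time instances spend warming up suggests warm-up is not the difference you are looking for.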
Configuration
- URLs might be different: are other systems, callback URLs, base URLs, audiences and issuers configured properly?
- Are feature flags configured the same way in both environments? Can a particular combination of flags play a role? If there is some tenant restriction, did you allow your tenants in all environments?
- Authentication/Authorization: code might be using different app registrations. Is the configuration done properly?
- Generally, double-check all entries and look at the configurations side by side.
- Do NOT trust configuration. Double-check again: for http instead of https, and whether you actually need the protocol or just the domain name. Check for a backslash instead of a forward slash, for escaped characters, for " instead of ', for an incorrect indentation level in YAML, for incorrect bracket or brace nesting in JSON.
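Looking at configuration side by side is easier when a script does the comparing. The following is a minimal sketch, assuming both configurations have already been loaded into plain Python dicts (for example from JSON). The function names and the two checks in `suspicious_values` are illustrative choices, not a complete linter.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts/lists into {"a.b[0]": value} so entries compare one by one."""
    flat = {}
    if isinstance(config, dict):
        for key, value in config.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = config
    return flat


def diff_configs(left, right):
    """Return {key: (left_value, right_value)} for entries that differ or exist on one side only."""
    a, b = flatten(left), flatten(right)
    missing = "<missing>"
    return {
        key: (a.get(key, missing), b.get(key, missing))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, missing) != b.get(key, missing)
    }


def suspicious_values(config):
    """Flag string values that often hide a typo: plain http, or a backslash."""
    findings = []
    for key, value in flatten(config).items():
        if not isinstance(value, str):
            continue
        if value.startswith("http://"):
            findings.append((key, "http instead of https"))
        if "\\" in value:
            findings.append((key, "contains a backslash"))
    return findings
```

Run `diff_configs(staging, prod)` first and explain every entry it reports, then run `suspicious_values` on each side. Comparing parsed values rather than raw files also sidesteps noise from key ordering and quoting style, although it will not catch a file that fails to parse in the first place.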
Data
- Database is configured differently / is different
- The database in prod has more shards, is distributed, is not colocated with compute, or is replicated