dashboards can be misleading! i found that focusing too much on infrastructure health, like cpu and memory usage, doesn't always tell us what's really happening. users don't care whether the servers look healthy, they care whether the thing they're trying to do actually works.
so my team and i switched gears: we picked 2-3 service level indicators (slis) tied directly to user actions - like checkout success rates or error counts - and set explicit service level objectives (slos) on them instead of just monitoring the servers. it's a huge shift in thinking!
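a rough sketch of what an sli/slo check could look like in python. everything here is an illustrative assumption, not taken from a real setup: the `Slo` class, the `checkout_success` name, the 99.5% target, and the counts are all made up.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for a user-facing SLI (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.995 means 99.5% of attempts should succeed


def success_rate(successes: int, attempts: int) -> float:
    """SLI: fraction of user attempts that succeeded."""
    if attempts == 0:
        # No traffic means nothing failed; treat the window as healthy.
        return 1.0
    return successes / attempts


def meets_slo(slo: Slo, successes: int, attempts: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return success_rate(successes, attempts) >= slo.target


# Assumed numbers: 9,940 successful checkouts out of 10,000 attempts.
checkout_slo = Slo(name="checkout_success", target=0.995)
print(meets_slo(checkout_slo, successes=9_940, attempts=10_000))
```

the point is that the input is counts of user actions (attempted vs. succeeded), not cpu or memory. in this made-up window the rate is 99.4%, which misses a 99.5% target even if every server looks fine.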
we also started alerting not only on infra errors but, more importantly, on how much of our error budget is left (the amount of failure the slo still allows). this gives us a much clearer view of user experience issues.
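for anyone new to error budgets, here's a minimal sketch of the arithmetic, assuming a simple request-based slo. the function names, the 99.9% target, and the paging threshold of 10 are hypothetical choices for illustration, not a recommendation.

```python
def _budget_fraction(slo_target: float) -> float:
    """Share of requests the SLO allows to fail, e.g. 0.001 for 99.9%."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = _budget_fraction(slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the allowed error rate.

    1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if total == 0:
        return 0.0
    return (failed / total) / _budget_fraction(slo_target)


def should_page(rate: float, threshold: float = 10.0) -> bool:
    """Page when budget burns much faster than allowed (threshold is arbitrary)."""
    return rate >= threshold


# Assumed window: 1,000,000 requests, 400 failures, 99.9% target.
print(round(error_budget_remaining(0.999, 1_000_000, 400), 3))
print(round(burn_rate(0.999, 1_000_000, 400), 3))
```

with a 99.9% target, 1,000,000 requests allow about 1,000 failures, so 400 failures leaves roughly 60% of the budget. alerting on burn rate instead of raw error counts means a short blip stays quiet while a sustained problem pages someone.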
another trick: audit your alerts and add synthetic tests for critical flows - these can catch problems before real users hit them. plus, talk with customer success about what broke recently - they can often give you a heads up on trends or actual pain points!
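a synthetic test for a critical flow can be as simple as running the same steps a user would, in order, and reporting the first one that fails. a minimal sketch, assuming each step is a callable that returns true on success. the step names, the stub lambdas, and the 5 second limit are hypothetical; a real probe would make actual requests against a test account.

```python
import time
from typing import Callable, List, NamedTuple, Optional, Tuple


class ProbeResult(NamedTuple):
    ok: bool
    failed_step: Optional[str]  # name of the first failing step, if any
    seconds: float              # wall-clock time the probe took


def run_synthetic_flow(
    steps: List[Tuple[str, Callable[[], bool]]],
    max_seconds: float,
) -> ProbeResult:
    """Run steps in order; stop at the first failure or exception."""
    start = time.monotonic()
    for name, step in steps:
        try:
            passed = step()
        except Exception:
            # A crashing step counts as a failed step, not a crashed probe.
            passed = False
        if not passed:
            return ProbeResult(False, name, time.monotonic() - start)
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        # Every step passed, but too slowly to count as a good experience.
        return ProbeResult(False, "latency", elapsed)
    return ProbeResult(True, None, elapsed)


# Stub steps standing in for real calls (hypothetical flow).
checkout_flow = [
    ("add_to_cart", lambda: True),
    ("submit_payment", lambda: False),  # simulate a broken payment step
    ("confirm_order", lambda: True),
]
result = run_synthetic_flow(checkout_flow, max_seconds=5.0)
print(result.ok, result.failed_step)
```

run something like this on a schedule and feed the pass/fail into the same sli counts as real traffic, so a broken flow shows up even at 3am when nobody is checking out.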
what's working for others out there? any tips i'm missing?
https://hackernoon.com/when-your-metrics-lie-the-illusion-of-observability?source=rss