A global survey of 1,700 IT professionals finds that teams are grappling with a median of 280 hours of downtime a year, and that they spend about 12 hours of a 40-hour work week, or roughly 30% of their time, addressing disruptions.
The survey, conducted by Enterprise Technology Research (ETR) on behalf of New Relic, finds the most common causes of downtime over the past two years have been network failure (35%), third-party or cloud provider service failures (29%) and human error (28%).
The survey also estimates that IT downtime costs businesses $376.66 million annually, with median total annual downtime involving a high-business-impact outage lasting 77 hours, or roughly three days. A full 62% said high-business-impact outages cost their organization at least $1 million per hour of downtime.
Nearly half of organizations (45%) use more than five observability tools to reduce downtime, with a median of $1.85 million in budget dollars allocated.
Conducting root cause analysis (RCA) and post-incident reviews (37%), monitoring DevOps Research and Assessment (DORA) metrics (34%), monitoring golden signals (33%), and tracking, reporting on and influencing the mean time to detect and resolve outages (33%) are the top observability best practices being followed. The median annual value derived from observability is $8.15 million. However, only a quarter (25%) said their organization has achieved full-stack observability.
On average, those with full-stack observability experienced 79% less downtime per year than those without and incurred 92% lower outage costs per year. They also had a 27% lower annual observability budget and were 51% more likely to learn about interruptions through their observability tools.
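To make the detection-and-resolution metric concrete, the sketch below shows one simple way a team might compute mean time to detect (MTTD) and mean time to resolve (MTTR) from incident timestamps. It is a minimal illustration only; the incident records and field names are hypothetical and not drawn from the survey or any specific tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice these would come from an
# incident-management or observability platform's API.
incidents = [
    {"started": "2024-03-01T02:10", "detected": "2024-03-01T02:25", "resolved": "2024-03-01T05:40"},
    {"started": "2024-03-09T14:00", "detected": "2024-03-09T14:05", "resolved": "2024-03-09T15:10"},
]

def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO-8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

# MTTD: average time from the start of an incident until it was detected.
mttd = mean(hours_between(i["started"], i["detected"]) for i in incidents)
# MTTR: average time from the start of an incident until it was resolved.
mttr = mean(hours_between(i["started"], i["resolved"]) for i in incidents)

print(f"MTTD: {mttd:.2f} h, MTTR: {mttr:.2f} h")
```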
Nic Benders, chief technical strategist at New Relic, said the survey makes it clear that organizations need much more than access to log data to prevent and minimize disruptions. Log data by itself is not an especially valuable early warning system, he added.
In general, the adoption of artificial intelligence (AI) technologies and an increased focus on security, governance, risk and compliance, cited by 41% each, are the top drivers of observability spending.
At least half had deployed one or more core observability capabilities, including security monitoring (58%), network monitoring (57%), database monitoring (55%), alerts (55%), dashboards (54%), infrastructure monitoring (54%), log management (51%) and application performance monitoring (APM; 50%). More than a third had deployed digital experience monitoring (DEM) capabilities, including browser monitoring (44%), error tracking (43%), and mobile monitoring (35%).
More than half (51%) of respondents used an open-source solution for at least one observability capability. More than a third (38%) were using Grafana, 23% were using Prometheus and 19% were using OpenTelemetry. Respondents also rely on open-source software for AI monitoring (31%), synthetic monitoring (28%), distributed tracing (28%), application performance monitoring (27%), Kubernetes monitoring (27%) and AIOps capabilities (26%).
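For readers unfamiliar with those projects, the following minimal sketch shows what adopting an open-source capability can look like in code: a function instrumented with the OpenTelemetry tracing API while exposing a Prometheus counter. It assumes the opentelemetry-sdk and prometheus-client Python packages are installed, and the span, metric and service names are invented for the example.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from prometheus_client import Counter, start_http_server

# Wire up a tracer that writes spans to stdout; a real deployment would
# export to a collector or observability backend instead of the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")  # name is illustrative

# A Prometheus counter exposed on an HTTP endpoint for scraping.
ORDERS = Counter("orders_processed_total", "Orders processed")

def process_order(order_id: str) -> None:
    # Each call becomes one trace span with the order id attached.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        ORDERS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    process_order("demo-123")
```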
Overall, 42% of respondents noted their organization is already applying artificial intelligence (AI) to monitoring, with 29% employing machine learning and 24% having embraced AI for IT operations (AIOps) platforms. Roughly another third expect to deploy AIOps capabilities (39%), AI monitoring (36%) and machine learning (ML) model monitoring (34%) in the next year.
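The survey does not describe how respondents apply AI to monitoring, but a common starting point is statistical anomaly detection over a metric stream. The sketch below, a purely illustrative stand-in for the far more sophisticated models AIOps platforms use, flags latency samples that fall more than three standard deviations from a rolling baseline.

```python
from collections import deque
from statistics import mean, stdev

class LatencyAnomalyDetector:
    """Toy detector that flags samples far outside a rolling baseline."""

    def __init__(self, window: int = 50, threshold: float = 3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, latency_ms: float) -> bool:
        """Return True if the sample looks anomalous against the baseline."""
        anomalous = False
        if len(self.samples) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(latency_ms - mu) > self.threshold * sigma:
                anomalous = True
        self.samples.append(latency_ms)
        return anomalous

detector = LatencyAnomalyDetector()
for value in [120, 118, 125, 122, 119, 121, 117, 123, 120, 118, 900]:
    if detector.observe(value):
        print(f"anomaly detected: {value} ms")
```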
A total of 40% have also adopted business observability as part of an effort to correlate business outcomes with telemetry data in real time. On average, respondents who had deployed business observability spent 25% less time addressing disruptions than those who hadn't, the survey finds.
Finally, the survey finds a 2-to-1 preference for a single, consolidated platform over multiple tools that need to be integrated. In addition, 41% said they plan to consolidate tools in the next year.
One thing that is clear is that observability that goes beyond merely monitoring a set of predefined metrics remains a work in progress. The challenge, of course, remains turning all the telemetry data available into actionable intelligence.