In the late 2010s, MELT was all the rage, with vendors and practitioners extolling the virtues of metrics, events, logs and traces as the 'four pillars of observability', much to the chagrin of purists on the traces or logs front. Confusion around the definition of events versus logs eventually saw the former slowly fade away, leaving the industry to argue about which is the best signal of them all - logs or traces.
Metrics are a foundational element of any observability strategy, providing the ability to track changes over time and answer the question, "Did something change?" When the answer is "yes," we rely on logs and traces to uncover what changed, and who or what caused it. In a comprehensive observability approach -- whether managing containerized, monolithic, hybrid cloud, multi-cloud or bare metal environments -- logs often serve as the primary signal. However, when you need deeper insight into how applications responded to the changes that metrics indicate, traces are essential for telling the full story.
There is still a missing link in this story, and that is where events come in.
An event resembles a log entry, typically featuring a timestamp, a body of text and a source identifier. However, events go beyond simple logging by adding critical context to logs, metrics and traces. This context can range from system configuration changes and outage notifications to product launches or even external factors like weather or geopolitical events. Events are crucial for answering the key question, "Why does this change matter?" This helps reduce unnecessary notifications and lightens the cognitive load during investigations, streamlining the troubleshooting process.
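To make the shape of an event concrete, here is a minimal Python sketch; the Event class, field names and attribute keys are illustrative assumptions rather than any standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    """A hypothetical event: a log-like record (timestamp, body, source)
    plus attributes that carry the 'why does this change matter?' context."""
    timestamp: datetime
    body: str
    source: str
    attributes: dict = field(default_factory=dict)

# Example: a configuration change that explains a latency shift seen in metrics.
deploy_event = Event(
    timestamp=datetime.now(timezone.utc),
    body="Feature flag 'new-checkout' enabled for all users",
    source="config-service",
    attributes={"category": "configuration-change", "initiated_by": "release-team"},
)
print(deploy_event)
```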
Metrics might show that application latency has increased, logs could confirm there are no system errors and traces can pinpoint the affected container and specific calls. However, it is events that provide the crucial context, revealing that your product launch has gone viral, causing a 100x spike in traffic. In this scenario, you'll be glad to have Kubernetes in place, with your application automatically scaling to handle the surge! Events tie everything together, helping you understand not just what's happening but why it matters.
'In-context' is currently the holy grail of observability, with 'logs-in-context' being the most frequently cited example. It refers to leveraging shared context -- whether from an application, host or other sources -- to enable faster navigation between logs, metrics and traces during issue investigation. This approach is especially valuable when exploring the 'unknown-unknowns', a core principle of observability and a major contributor to the 28% compound annual growth rate in telemetry data. By connecting these signals, teams can resolve issues more efficiently and uncover insights that might otherwise remain hidden.
The OpenTelemetry project defines a concept called 'resource', a set of attributes that describe the source of logs, metrics or traces, enabling faster grouping and filtering. When these logs or metrics originate from an application, sharing the span and trace IDs creates precise correlations between the telemetry data and the underlying code. This span and trace ID context is critical for distributed tracing, offering visibility across disparate systems even when they share few or no common resource attributes. As containerization has rapidly expanded, so has the demand for distributed tracing. But what about visibility into the network layer?
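As a rough illustration of shared resource attributes and trace/span ID correlation, here is a minimal sketch using the OpenTelemetry Python SDK (assuming the opentelemetry-api and opentelemetry-sdk packages are installed); the service name and log message are hypothetical, and no exporter is configured for brevity.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes describe the source of the telemetry, enabling grouping and filtering.
resource = Resource.create({"service.name": "checkout", "service.version": "1.4.2"})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("example")

with tracer.start_as_current_span("charge-card") as span:
    ctx = span.get_span_context()
    # Stamping log lines with the active trace and span IDs lets a backend
    # correlate this log record with the exact trace that produced it.
    print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x} "
          "msg='payment accepted'")
```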
Navigating between signals emitted by the same set of entities is a well-established practice, with distributed tracing considered the gold standard. While this approach works effectively for the application layer and signals across shared resources -- using OpenTelemetry terminology -- it still leaves a gap in visibility at the network layer. Achieving the same level of contextual insight for network performance remains a challenge, and without it, full-stack observability is incomplete.
Cross-Layer Telemetry (CLT), a project built by Justin Iurman, brings In Situ Operations, Administration and Maintenance (IOAM) to OpenTelemetry to unlock visibility from L2 to L7. While RFC 9197 defines the IOAM data fields and RFC 9378 describes the deployment model, the Cross-Layer Telemetry project brings it all together in an amazing proof of concept.
Chaining together network performance insights via CLT with a specific trace and span ID gives deep network context to traces in OpenTelemetry format. Discovering that intermediary network devices are causing performance degradation, as demonstrated in Iurman's demo video, will continue to break down the silos between application and infrastructure teams.
The vision for 2030 centers on two key advancements: expanding telemetry to include events from diverse sources (both inside and outside the tech stack) and bridging the network-application divide. To realize this future of observability, we must decouple telemetry collection from telemetry analysis. As OpenTelemetry continues to evolve and becomes the dominant signal format for application-focused observability, translating non-OTLP formats to OTLP will be essential for enabling a single agent to support multiple visualization and analysis experiences. Converting between signal types, such as logs-to-metrics (and vice versa) and traces-to-metrics, will reduce the operational burden of managing multiple agents while ensuring seamless analysis, visualization and correlation of telemetry data.
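To make 'logs-to-metrics' conversion concrete, here is a small, hypothetical Python sketch that aggregates already-parsed log records into counter series; in practice this kind of conversion would typically run inside a telemetry pipeline or agent rather than in application code.

```python
from collections import Counter

# Hypothetical log records, already parsed into structured fields by an agent.
logs = [
    {"service": "checkout", "severity": "ERROR", "body": "payment gateway timeout"},
    {"service": "checkout", "severity": "INFO", "body": "order accepted"},
    {"service": "checkout", "severity": "ERROR", "body": "payment gateway timeout"},
]

# Logs-to-metrics: collapse individual records into a count per (service, severity),
# the kind of derived metric a single agent could emit alongside the raw logs.
counts = Counter((r["service"], r["severity"]) for r in logs)
for (service, severity), value in counts.items():
    print(f'log.count{{service="{service}",severity="{severity}"}} {value}')
```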
When asked which backend I prefer for observability, my answer is straightforward: I choose a solution that treats visualization and analysis of OpenTelemetry signals on par with those from its native agents. Swapping agents at the edge is a challenging task for enterprise-scale deployments and often impedes the adoption of new visualization and analytics tools due to incompatible telemetry formats. By 2030, deploying a new agent should not be a prerequisite for introducing a new visualization experience or sharing telemetry across multiple teams. For example, generating network flow metrics for IT operations and simultaneously delivering raw flows to security and network operations shouldn't require three separate configurations.
To break down telemetry barriers across various technologies and deployment models, we must decouple agents from the platforms they serve, allowing greater flexibility and interoperability.
Wrapping up, the future of observability lies in breaking down silos across systems, incorporating events for deeper context and advancing cross-layer visibility. By decoupling telemetry collection from its analysis, practitioners can ensure flexibility and scalability across diverse platforms and teams. As OpenTelemetry becomes the standard, we will see more unified experiences that allow for seamless troubleshooting and insight generation, driving more efficient, context-rich observability that meets the evolving demands of complex, hybrid and multi-cloud environments in 2030 and beyond.