OpenTelemetry 101

December 5, 2024

History

Metrics ruled the world

Time series databases for metrics: InfluxDB and Prometheus

Logs: Logstash

OpenTracing & OpenCensus

Need for one standard to rule them all - OpenTelemetry (formed by merging OpenTracing and OpenCensus)

What is OpenTelemetry?

A standard for emitting telemetry signals.

Protocol (OTLP) - the most important aspect - lets our application emit telemetry without knowing where the data will be sent - no vendor or backend knowledge required - defines the data model and API contract for each signal

Many vendors have emerged in the observability space because the protocol is stable and open source

SDKs - instrument applications with the open source libraries provided for each language

Core Signals
    -   Traces, Metrics, Logs, Profiles, RUM
    - Profiles - Protocol finalized, SDKs coming
    - RUM - Real User Monitoring (in design)


Trace
    -   Debugging
    -   Complex system overview
    -   Trace waterfalls, service graphs
    -   Focus on causality - something happened because of something else
    -   Focus on performance
    -   High cardinality and high dimensionality

Span
    -   We do not create a trace; we create spans - a group of spans sharing the same trace ID forms a trace
    -   A span is basically a structured blob of data (see the sketch below):
        -   UniqueId (SpanId)
        -   CorrelationId (TraceId)
        -   Start Time, End Time
        -   CausalityId (ParentSpanId) - allows us to link all the spans together
        -   Attributes - semantic names plus custom IDs based on the use case
    -   Spans can be thought of as fancy structured logs
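
For example, a minimal sketch with the OpenTelemetry Python SDK (the service and span names below are made up) shows that we only ever create spans; the SDK fills in the trace ID, span IDs, and parent links, and spans sharing a trace ID form the trace:

```python
# A minimal sketch using the OpenTelemetry Python SDK; "checkout-service",
# "process-order", and the attribute values are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process-order") as parent:
    parent.set_attribute("order.id", "A-1042")          # custom attribute
    with tracer.start_as_current_span("charge-card") as child:
        # child gets the same TraceId, and the parent's SpanId as its ParentSpanId
        child.set_attribute("payment.retries", 0)
```
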
Log
    -   Point in time data
    -   Designed for humans (message templates)
    -   Ideal for Local debugging
    -   Useful for 
        -   Startup and Crashes
        -   Debugging tracing
    -   Easy to do badly

Metric
    -   Metric vs metric
        -   Metric (the signal) - a time series aggregation with labels - a data point in OTel - bucketing/aggregation
        -   metric (the measurement) - anything we can measure
    -   Relatively cheap to store and query
    -   Lacks deep context
    -   Low cardinality/dimensionality
    -   If we know the dimensions upfront - i.e. we know what we want to measure - metrics are a good fit (see the sketch below)
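
As a rough illustration, a counter with dimensions chosen upfront might look like this with the Python metrics SDK (the meter, instrument, and label names are assumptions, loosely following semantic conventions):

```python
# A minimal sketch with the OpenTelemetry Python metrics SDK. The labels are
# low-cardinality and known upfront - no user IDs or request IDs.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
request_counter = meter.create_counter(
    "http.server.request.count", unit="1", description="Handled HTTP requests"
)

# Each distinct label combination becomes one time series in the backend.
request_counter.add(1, {"http.request.method": "GET", "http.response.status_code": 200})
```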

What is Propagation?

    -   The real magic - how different services connect together - the idea of correlation
    -   Transmits state between services via the W3C Trace Context headers - TraceId, ParentSpanId, sampling decision (see the sketch below)
    -   Baggage (W3C Baggage) - the footgun we never wanted - additional context passed between services
        -   Not multi-span attributes - it does not appear on every span
        -   Carried over to all downstream calls, even into third-party SDKs - should generally be avoided
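
A rough sketch of what propagation looks like in code (Python; the header dict here stands in for real HTTP headers, and the baggage key/value is invented):

```python
# A minimal sketch of context propagation. By default the Python SDK installs
# the W3C Trace Context and W3C Baggage propagators.
from opentelemetry import baggage
from opentelemetry.propagate import extract, inject

# Outgoing request: write the current trace state into the carrier
# (a "traceparent" header carrying trace ID, parent span ID and sampling flag).
headers = {}
inject(headers)   # adds a "traceparent" entry when a span is active

# Baggage rides along on every downstream call - including calls made by
# third-party SDKs - which is exactly why it is easy to misuse.
ctx_with_baggage = baggage.set_baggage("tenant.id", "acme")   # hypothetical key/value

# Incoming request on the next service: rebuild the context from the headers
# so that spans started there join the same trace.
incoming_ctx = extract(headers)
```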

How does it work?

  • Auto Instrumentation - agent based, similar to APM agents - codeless, configured via environment variables, sideloaded - good for getting started quickly - becomes quite verbose and difficult to tweak the data or add more context/attributes - can’t do what makes OTel amazing - makes storing everything expensive

  • Coded Instrumentation - targeted instrumentation - full control - decide which spans/context would be useful and keep only those - being intentional about observing what is important to you (see the sketch below)
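
For instance, a hedged sketch of targeted instrumentation in Python (the service, span, and attribute names are invented for illustration):

```python
# A sketch of being intentional: one deliberate span around the operation that
# actually matters, carrying the attributes we know we will query on.
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")

def charge(customer_id: str, amount_cents: int) -> None:
    with tracer.start_as_current_span("billing.charge") as span:
        span.set_attribute("customer.id", customer_id)        # high cardinality, but useful
        span.set_attribute("charge.amount_cents", amount_cents)
        try:
            ...  # call the payment gateway here
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise
```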

What is the Collector?

OTel Collector can be deployed as both an agent on a host and a service for collecting, transforming, and exporting telemetry data. The OTel Collector is a vendor-agnostic proxy that can receive telemetry data in multiple formats, transform and process it, and export it in multiple formats to be consumed by multiple backends (such as Jaeger, Prometheus, other open source backends, and many proprietary backends).

Receivers: Push- or pull-based components for collecting data.

Processors: Responsible for transforming and filtering data.

Exporters: Push- or pull-based components for exporting data.

The collector is all-powerful, and nobody should run without it. Collectors are deployed as a proxy: applications send data to the collector, which then forwards it to the backend. Everything sends its data to the collector first. The collector can also fan out to multiple backends, which is useful for multi-vendor management or evaluation (see the sketch after the list below).

  • Config/Exporter centralization can be achieved
  • No need to give backend API keys to applications (the API keys stay with the collector)
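
As a sketch of that setup (Python; the collector address is an assumption, and the OTLP exporter package is installed separately), the application only ever points at the collector:

```python
# A minimal sketch: the app knows only the local collector's OTLP endpoint.
# Which backend(s) the data ends up in, and the API keys for them, live in the
# collector's config, not in the application.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # Hypothetical in-cluster collector address; 4317 is the OTLP gRPC default port.
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
```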

Use the collector to secure egress. The collector has redaction and filtering capabilities, plus centralized enrichment - details like pod, node, and cluster are added automatically. Happy security team, happy developers.

Treat the collector like an actual application - make it autoscale

What is Sampling?

There is a cost crisis in observability when we try to store all tracing data. Sampling is about keeping costs low while preserving debugging power.

  • Head sampling (in the application) - the decision is made when a trace starts, with access to only limited information about the span context (a head-sampling sketch follows this list)

  • Tail sampling (in the collector) - everyone should do this at scale - runs after a delay following the first span of a trace - has access to all spans in the trace, so errors and retries can inform the decision - delays sending spans to the backend - requires applications to send all spans - can get super expensive across availability zones

  • Retain context

  • Reduce storage/ingest
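
A minimal head-sampling sketch with the Python SDK (the 10% ratio is an arbitrary example); tail sampling, by contrast, is configured in the collector rather than in application code:

```python
# Head sampling happens in the application, before anything is exported:
# keep roughly 10% of new traces, but always follow the parent's decision
# when one exists, so traces stay complete.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```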

With OTel, you should continuously evaluate what to store and what not to store - what to sample and what not to sample.
