Docs

Install otel-k8s-graph, feed it trace spans, and query the graph over REST or from Claude. Three small Go binaries, coordinating only through Redis.

Quickstart

The Helm chart bundles a single-replica Redis by default, so a fresh install is self-contained. image.registry is the only required value.

install
# Build + push all three images (versioned + latest), then helm upgrade --install:
REGISTRY=<your-registry> ./deploy.sh

# Or with helm directly against pre-built images:
helm upgrade --install graph helm/graph \
  --namespace default --create-namespace \
  --set image.registry=<your-registry>

Already have Redis? Disable the bundled instance with --set redis.internal.enabled=false and supply the connection details, either inline or from an existing Secret:

external redis (inline)
helm ... --set redis.internal.enabled=false \
  --set redis.host=my-redis --set redis.password=s3cret

external redis (existing secret)
kubectl create secret generic redis-creds \
  --from-literal=REDIS_HOST=my-redis \
  --from-literal=REDIS_PASSWORD=s3cret

helm ... --set redis.internal.enabled=false \
  --set redis.existingSecret=redis-creds

Helm knobs

The values you'll reach for most often. See the chart README for the complete list.

| Key | Default | Description |
| --- | --- | --- |
| image.registry | (required) | Container registry for all three images, no trailing slash. The image ref is <registry>/<imageName>:<tag>. |
| image.tag | latest | Shared image tag for all components. deploy.sh pins the exact version at install time. |
| graph.keyPrefix | graph | GRAPH_REDIS_KEY_PREFIX, which namespaces every Redis key. All three components MUST agree on this to share one graph. |
| redis.internal.enabled | true | Deploy a bundled single-replica Redis. Set to false to bring your own. |
| redis.host | "" | REDIS_HOST for external Redis (used when internal.enabled is false and existingSecret is empty). |
| redis.password | "" | REDIS_PASSWORD for external Redis. Empty = no auth. |
| redis.existingSecret | "" | Source host/username/password from an existing Secret (via secretKeyRef). |
| redis.internal.persistence.enabled | false | Persist the bundled Redis in a PVC. Off by default, because the graph is rebuilt from the K8s API and OTel data after a restart. |
| redis.internal.persistence.size | 1Gi | PVC size when persistence is enabled. |
| graphOtel.replicas | 2 | Replica count. graph-otel is horizontally scalable (idempotent full-flush writes + reaper), but requires the collector to shard spans by trace ID to the headless Service (see the loadbalancing exporter under OTel Collector setup). |
| graphOtel.service.otlpPort | 4317 | OTLP gRPC port. Point the collector's traces exporter at graph-otel-otlp:<otlpPort>. |
| graphOtel.config.flushInterval | 60s | GRAPH_FLUSH_INTERVAL: how often the whole in-memory set is written to Redis (idempotent HSET/SADD with no deletes, so it is safe across replicas). |
| graphOtel.config.expiryTtl | 100s | GRAPH_EXPIRY_TTL: drop in-memory entities/edges not seen for this long. MUST exceed flushInterval. |
| graphOtel.config.reapInterval | 5m | GRAPH_REAP_INTERVAL: how often the reaper deletes Redis entities not refreshed within reapTtl. 0 disables it; all replicas reap (idempotent). |
| graphOtel.config.reapTtl | 48h | GRAPH_REAP_TTL: delete graph-otel-owned Redis entities not refreshed by any replica for this long. (FLOW_REAP_INTERVAL / FLOW_REAP_TTL do the same for flows.) |
| graphK8s.config.resyncPeriod | 5m | WATCH_RESYNC_PERIOD: how often informers re-list every object (self-heal). |
| graphRead.replicas | 1 | Replica count for the query API (stateless reader; safe to raise). |
| graphRead.ingress.enabled | false | Expose the query API via Ingress. The API has NO auth and includes the destructive POST /prune, so add auth or restrict IPs first. |
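
The table states one hard constraint between two knobs: expiryTtl MUST exceed flushInterval. A minimal sketch of a pre-install sanity check is shown here, assuming the values are Go-style duration strings such as 60s or 5m. The helper names parse_duration and check_timing are hypothetical and not part of the chart.

```python
import re

# Seconds per unit for the duration suffixes used in the values table.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Parse a Go-style duration such as '60s', '5m' or '1h30m' into seconds."""
    if text == "0":
        return 0.0
    pos, total = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # unparsable characters between tokens
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def check_timing(flush_interval: str, expiry_ttl: str) -> list[str]:
    """Return a list of problems; empty means the documented constraint holds."""
    problems = []
    if parse_duration(expiry_ttl) <= parse_duration(flush_interval):
        problems.append(
            "graphOtel.config.expiryTtl must exceed graphOtel.config.flushInterval"
        )
    return problems
```

With the chart defaults, check_timing("60s", "100s") returns an empty list; lowering expiryTtl to 30s would produce one problem entry.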

OTel Collector setup

graph-otel consumes trace spans directly, so no spanmetrics connector or metrics pipeline is required. Fan the spans your apps already emit to graph-otel alongside your existing trace backend. The kubernetesAttributes preset enables the k8sattributes processor, which adds the k8s.* attributes graph-otel needs to attach relationships to the right container.

otel-collector-values.yaml
# otel-collector-values.yaml
mode: deployment

image:
  repository: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib

# Adds the k8sattributes processor (+ RBAC) so spans carry
# k8s.namespace.name / k8s.pod.name / k8s.container.name.
presets:
  kubernetesAttributes:
    enabled: true

config:
  exporters:
    # graph-otel reconstructs whole traces in memory (for flows), so every
    # span of a trace must reach the SAME replica. The loadbalancing exporter
    # shards by trace id across the graph-otel pods, discovered via the
    # headless Service (returns one A record per pod).
    loadbalancing:
      routing_key: traceID
      resolver:
        dns:
          hostname: graph-otel-otlp-headless.default.svc.cluster.local
          port: 4317
      protocol:
        otlp:
          tls:
            insecure: true

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loadbalancing]

install collector
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml

Already exporting traces? Add the loadbalancing exporter alongside your existing one in the traces pipeline (exporters: [your-backend, loadbalancing]). The trace-id sharding keeps each trace whole across graph-otel replicas, which its flow assembly requires.
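
To illustrate why routing by trace ID keeps each trace whole, here is a simplified sketch of deterministic sharding. It is not the loadbalancing exporter's actual algorithm (the exporter uses its own hashing over the resolved backends); the function pick_replica and the replica names are hypothetical. The point is only that the choice depends on the trace ID alone, so every span of a trace lands on the same graph-otel pod.

```python
import hashlib


def pick_replica(trace_id: str, replicas: list[str]) -> str:
    """Map a trace ID to one replica, independent of span order or sender."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    # Sort so every collector instance agrees regardless of DNS answer order.
    return sorted(replicas)[index]


if __name__ == "__main__":
    pods = ["graph-otel-0", "graph-otel-1"]  # hypothetical pod names
    trace = "4bf92f3577b34da6a3ce929d0e0e4736"
    # Every span carrying this trace ID is routed to the same pod.
    print(pick_replica(trace, pods))
```

A plain modulo like this reshuffles most traces when the replica count changes, which is one reason a production exporter would prefer a consistent-hashing scheme.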

MCP / Claude Code

The same graph-read binary runs an MCP server in mcp mode. It's a thin REST client of the query API — set GRAPH_BASE_URL to point it at the API. Expose the query API with the chart's built-in Ingress so the MCP server can run anywhere:

expose via ingress
helm ... \
  --set graphRead.ingress.enabled=true \
  --set graphRead.ingress.className=nginx \
  --set graphRead.ingress.hosts[0].host=graph.example.com \
  --set graphRead.ingress.hosts[0].paths[0].path=/ \
  --set graphRead.ingress.hosts[0].paths[0].pathType=Prefix

The query API has no authentication and includes the destructive POST /prune — add auth at the ingress (basic-auth / oauth2-proxy annotations) or restrict source IPs before exposing it.

Then register it with Claude Code / Claude Desktop, pointing GRAPH_BASE_URL at the ingress host:

mcp config (json)
{
  "mcpServers": {
    "graph": {
      "command": "graph-read",
      "args": ["mcp"],
      "env": { "GRAPH_BASE_URL": "https://graph.example.com" }
    }
  }
}

The six tools

  • search: Find entities by keyword across IDs, names, metadata values, and edge actions.
  • get_entity: Full detail for one entity by exact ID: kind, metadata, and all outbound edges.
  • list_entities: Enumerate every entity of a given kind.
  • get_subgraph: Every entity reachable from a root via BFS (blast radius / call neighborhood).
  • list_flows: Distinct trace flows (request shapes collapsed from many traces).
  • get_flow: The full canonical structure of one trace flow by root hash.
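
get_subgraph is described as a BFS from a root entity. A minimal sketch of a depth-limited BFS over outbound edges follows, assuming the graph is held as a plain adjacency mapping. How graph-read actually stores edges, counts depth, or treats counterpart edges is not specified here, so the function subgraph and the sample IDs are illustrative only.

```python
from collections import deque


def subgraph(edges: dict[str, list[str]], root: str, max_depth: int) -> list[str]:
    """Return entity IDs reachable from root within max_depth hops, in BFS order."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order


if __name__ == "__main__":
    sample = {
        "deployment:default/web": ["pod:default/web-0"],
        "pod:default/web-0": ["container:default/web-0/app"],
        "container:default/web-0/app": ["database:postgresql/db"],
    }
    print(subgraph(sample, "deployment:default/web", 2))
```

The seen set is what keeps the traversal finite on a graph where every edge has a counterpart in the other direction.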

REST API

Served by graph-read on :8080. The MCP tools mirror these endpoints.

| Method | Path | Description |
| --- | --- | --- |
| GET | /search?q=<text>&kind=<kind>&limit=N | Case-insensitive substring search across IDs, names, metadata values, and edge actions. |
| GET | /entities?kind=<kind> | List every entity of a kind. |
| GET | /entity/{id...} | One entity with its edges + metadata (IDs may contain /). |
| GET | /subgraph/{id...}?max_depth=N | BFS reachable set from an entity (blast radius / call neighborhood). |
| GET | /flows?limit=N | Trace-flow summaries (abstract request structures), by occurrence count. |
| GET | /flow/{hash} | One flow's canonical structure (collapsed Merkle tree with per-node metadata). |
| POST | /prune?older_than=<dur> | Drop entities not seen for <dur> (e.g. 5m); returns the count. |
| GET | /healthz | Liveness/readiness. |

The query API has no authentication and includes the destructive POST /prune. Put auth in front (ingress basic-auth / oauth2-proxy) or restrict source IPs before exposing it.
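
Because entity IDs contain : and /, building request URLs by hand is easy to get wrong. The sketch below assembles URLs for three of the endpoints listed above. It assumes the API is reachable at http://localhost:8080 (for example through a port-forward), that kind and limit are optional on /search, and that : and / may be left unescaped in the path; none of these are confirmed by the API description, and the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: graph-read is reachable locally on its documented port.
BASE_URL = "http://localhost:8080"


def search_url(text, kind=None, limit=None, base=BASE_URL):
    """Build a /search URL; optional filters are added only when given."""
    params = {"q": text}
    if kind is not None:
        params["kind"] = kind
    if limit is not None:
        params["limit"] = limit
    return f"{base}/search?{urlencode(params)}"


def entity_url(entity_id, base=BASE_URL):
    """Build an /entity/{id...} URL, keeping ':' and '/' in the ID readable."""
    return f"{base}/entity/{quote(entity_id, safe='/:')}"


def subgraph_url(entity_id, max_depth=None, base=BASE_URL):
    """Build a /subgraph/{id...} URL with an optional max_depth."""
    url = f"{base}/subgraph/{quote(entity_id, safe='/:')}"
    if max_depth is not None:
        url += "?" + urlencode({"max_depth": max_depth})
    return url
```

The resulting strings can be passed to any HTTP client; for instance, entity_url("pod:default/web-0") yields http://localhost:8080/entity/pod:default/web-0.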

Graph schema

Entity types

Every node kind, where it comes from, and the signal that creates it.

| Kind | ID format | Source | How it's derived |
| --- | --- | --- | --- |
| namespace | namespace:<name> | K8s Namespace | Namespace objects (also implied by every pod). Metadata: label.* |
| node | node:<name> | K8s Node | Node objects + each pod's spec.nodeName. Metadata: node label.* |
| zone | zone:<name> | K8s Node labels | From topology.kubernetes.io/zone (legacy failure-domain.beta.kubernetes.io/zone). Emitted only if the label is set. Containment chain: region → zone → node. |
| region | region:<name> | K8s Node labels | From topology.kubernetes.io/region (legacy fallback). Emitted only when both zone and region labels are present. |
| deployment | deployment:<ns>/<name> | K8s Deployment | Deployment objects. Metadata: label.* |
| statefulset | statefulset:<ns>/<name> | K8s StatefulSet | StatefulSet objects. Metadata: label.* |
| daemonset | daemonset:<ns>/<name> | K8s DaemonSet | DaemonSet objects. Metadata: label.* |
| job | job:<ns>/<name> | K8s Job | Job objects; linked to its CronJob via ownerRef. Metadata: label.* |
| cronjob | cronjob:<ns>/<name> | K8s CronJob | CronJob objects. Metadata: label.*, cronjob.schedule |
| rollout (Argo Rollouts) | rollout:<ns>/<name> | Argo Rollouts CRD (argoproj.io/v1alpha1) | Watched via dynamic informer; skipped if the CRD is absent. Metadata: label.*, rollout.strategy (canary/blueGreen) |
| pod | pod:<ns>/<name> | K8s Pod | Pod objects. Metadata: label.*, k8s.node.name, k8s.pod.uid |
| container | container:<ns>/<pod>/<name> | K8s Pod spec | From each pod's spec.containers. Metadata: container.image.name |
| hpa | hpa:<ns>/<name> | K8s HPA (autoscaling/v2) | HorizontalPodAutoscaler objects. Metadata: hpa.min_replicas, hpa.max_replicas, hpa.target.kind/name |
| scaledobject (KEDA) | scaledobject:<ns>/<name> | KEDA CRD (keda.sh/v1alpha1) | Dynamic informer; skipped if the CRD is absent. Metadata: keda.target.*, keda.min/max_replicas, keda.triggers, keda.scaling_policy |
| endpoint | endpoint:<service>/<METHOD>/<route> | OTel HTTP & RPC spans | span.kind SERVER or CLIENT. RPC is matched first, via rpc.method (or a gRPC-shaped http.route / url.full path): the ID is endpoint:<rpc.system>/<rpc.service>/<rpc.method>, which is host-independent, so a client's CALLS and the server's EXPOSES converge on one entity (e.g. grpc /oteldemo.CartService/GetCart). Otherwise HTTP: SERVER → http.route + service.name; CLIENT → url.full path + peer.service (else server.address). HTTP routes collapse :id/{id}/digit segments to {n}; RPC names are kept verbatim. |
| topic | topic:<name> | OTel messaging spans | span.kind PRODUCER or CONSUMER. Name from messaging.destination.name (fallback: first token of span.name). Used verbatim. |
| database | database:<system>/<host>[:<port>] | OTel DB-client spans | db.system set + span.kind CLIENT + server.address. Port from server.port (optional). |
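
The endpoint row says HTTP routes collapse :id, {id}, and digit segments to {n}. A rough sketch of that collapsing step is given below. The exact matching rules used by graph-otel are not spelled out, so this assumes that any colon-prefixed segment, any brace-wrapped segment, and any all-digit segment counts as a parameter. normalize_route is a hypothetical name.

```python
def normalize_route(path: str) -> str:
    """Collapse parameter-like path segments to '{n}' so similar routes merge."""

    def collapse(segment: str) -> str:
        if segment.isdigit():  # e.g. /users/123
            return "{n}"
        if segment.startswith(":") and len(segment) > 1:  # e.g. /users/:id
            return "{n}"
        if segment.startswith("{") and segment.endswith("}") and len(segment) > 2:
            return "{n}"  # e.g. /users/{id}
        return segment

    return "/".join(collapse(segment) for segment in path.split("/"))


if __name__ == "__main__":
    print(normalize_route("/users/123/orders/:orderId"))  # /users/{n}/orders/{n}
    print(normalize_route("/cart/{id}/items"))            # /cart/{n}/items
```

Collapsing this way keeps the number of endpoint entities bounded by the number of route shapes rather than the number of distinct IDs seen in traffic.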

Edge types

Each edge has a counterpart in the other direction. Span-derived edges anchor on the emitting container (container preferred over pod).

| Edge / counterpart | Source | Meaning | How it's derived |
| --- | --- | --- | --- |
| CONTAINS / RUNS_IN | K8s | Structural containment (A contains B). | namespace → pod, node → pod, pod → container (pod spec); zone → node, region → zone (node topology labels). |
| MANAGES / MANAGED_BY | K8s | A workload owns/controls its children. | deployment/statefulset/daemonset/job/rollout → pod (pod ownerRef); cronjob → job (Job ownerRef); scaledobject → hpa (HPA ownerRef = ScaledObject). |
| SCALES / SCALED_BY | K8s | An autoscaler scales a workload. | hpa → target (HPA spec.scaleTargetRef); scaledobject → target (KEDA spec.scaleTargetRef, defaults to Deployment). Targets: Deployment/StatefulSet/Rollout; other kinds recorded in metadata only. |
| EXPOSES / EXPOSED_BY | Spans | A container serves an HTTP or RPC endpoint. | SERVER span: HTTP (http.request.method + http.route + service.name) or RPC (rpc.method); see the endpoint entity for ID details. |
| CALLS / CALLED_BY | Spans | A container calls an HTTP or RPC endpoint. | CLIENT span: HTTP (http.request.method + url.full; target = peer.service / server.address) or RPC (rpc.method). |
| QUERIES / QUERIED_BY | Spans | A container queries a database. | DB CLIENT span: db.system + span.kind=CLIENT + server.address. Carries action = templatized span.name (the SQL op); one edge per distinct operation. |
| PUBLISHES / PUBLISHED_BY | Spans | A container publishes to a topic. | Messaging PRODUCER span. |
| CONSUMES / CONSUMED_BY | Spans | A container consumes from a topic. | Messaging CONSUMER span. |
| EMITS / EMITTED_BY | Spans (rollup) | Workload-level rollup of everything its containers touch: one hop to answer “what does this deployment talk to?” | deployment/statefulset/… → every endpoint/topic/database (the union of its containers' CALLS/EXPOSES/QUERIES/PUBLISHES/CONSUMES), hoisted onto the stable workload id. graph-otel resolves container → pod → workload via the pod's MANAGED_BY edge at flush time. |
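
The EMITS rollup can be pictured as a union over a workload's containers. The sketch below is an illustration under assumptions, not graph-otel's implementation: edges are modelled as (source, type, target) tuples, the owning pod is derived from the container ID format container:<ns>/<pod>/<name>, and the function names are hypothetical.

```python
# Span-derived edge types that the rollup unions together.
SPAN_EDGES = {"CALLS", "EXPOSES", "QUERIES", "PUBLISHES", "CONSUMES"}


def pod_id_of(entity_id: str) -> str:
    """Resolve the pod for a span anchor (a container ID, or a pod ID as-is)."""
    if entity_id.startswith("container:"):
        ns, pod, _name = entity_id[len("container:"):].split("/", 2)
        return f"pod:{ns}/{pod}"
    return entity_id


def rollup_emits(edges: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Return workload ID -> set of endpoint/topic/database IDs it touches."""
    # pod -> workload, taken from the pod's MANAGED_BY edge
    workload_of = {
        src: dst
        for src, kind, dst in edges
        if kind == "MANAGED_BY" and src.startswith("pod:")
    }
    emits: dict[str, set[str]] = {}
    for src, kind, dst in edges:
        if kind not in SPAN_EDGES:
            continue
        workload = workload_of.get(pod_id_of(src))
        if workload is not None:  # unmanaged pods have nothing to hoist onto
            emits.setdefault(workload, set()).add(dst)
    return emits
```

Anchoring the result on the workload ID means the answer survives pod restarts, since pod and container IDs change with every rollout while the deployment ID does not.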

Redis schema (prefix configurable, default graph)

redis keys
<prefix>:entity:<id>            HASH  id, kind, name, last_seen_at_ms
<prefix>:entity:<id>:metadata   HASH  arbitrary string key/values
<prefix>:entity:<id>:edges      SET   JSON-encoded Edge objects
<prefix>:by_kind:<kind>         SET   entity IDs of the given kind
<prefix>:ids                    SET   all entity IDs
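
For anyone inspecting Redis directly, the key layout above can be reproduced with a few string helpers. This is a small sketch; the function names are hypothetical, and the default prefix mirrors the graph.keyPrefix default of graph.

```python
def entity_keys(entity_id: str, prefix: str = "graph") -> dict[str, str]:
    """Return the three Redis keys that hold one entity."""
    base = f"{prefix}:entity:{entity_id}"
    return {
        "entity": base,                   # HASH: id, kind, name, last_seen_at_ms
        "metadata": f"{base}:metadata",   # HASH: arbitrary string key/values
        "edges": f"{base}:edges",         # SET: JSON-encoded Edge objects
    }


def by_kind_key(kind: str, prefix: str = "graph") -> str:
    """Key of the SET listing every entity ID of one kind."""
    return f"{prefix}:by_kind:{kind}"


def ids_key(prefix: str = "graph") -> str:
    """Key of the SET listing all entity IDs."""
    return f"{prefix}:ids"
```

With a Redis client such as redis-py, these keys would typically be read with SMEMBERS for the sets and HGETALL for the hashes.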

Full source and chart values on GitHub.