Monitoring & Observability

Aria's health and performance are monitored through four layers: Hospital Service (health checks), Aria Telemetry (metrics), Aegis Watchtower (hallucination detection), and Cloud Monitoring (infrastructure).

Health Monitoring (Hospital Service)

The hospital-service provides service health monitoring and self-healing capabilities. It tracks heartbeats across all services and can trigger automatic recovery procedures.
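As a rough illustration of heartbeat tracking, the sketch below records the last heartbeat per service and reports the ones that have gone quiet. The `HeartbeatTracker` class, the 30-second timeout, and the service names used in the example are assumptions for illustration only; they are not taken from the hospital-service code.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Hypothetical helper: remembers when each service last reported in."""

    timeout_s: float = 30.0  # assumed staleness threshold, not from the source
    last_seen: dict[str, float] = field(default_factory=dict)

    def beat(self, service: str, now: float) -> None:
        """Record a heartbeat for `service` at time `now` (seconds)."""
        self.last_seen[service] = now

    def stale(self, now: float) -> list[str]:
        """Return services whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=30.0)
tracker.beat("aria-soul", now=0.0)
tracker.beat("aria-telemetry", now=25.0)
print(tracker.stale(now=40.0))  # → ['aria-soul']
```

A stale entry is the kind of signal that could feed a recovery procedure; how the real service decides on staleness is not documented here.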

Capability              Description
Heartbeat Tracking      30,711 shard heartbeats tracked
Soul Charge Monitoring  6,315 charge history entries with delta tracking
Self-Healing            Automatic service restart on failure detection
Circuit Breakers        Prevent cascading failures (10 min cooldown)
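The circuit-breaker row can be pictured with a minimal sketch like the one below. Only the 10-minute cooldown comes from the table; the `CircuitBreaker` class, the failure threshold of 3, and the simplified open/closed logic (no half-open probing) are assumptions, and the actual implementation may differ.

```python
class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, closes after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=600.0):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # 10 minutes, as in the table
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def record_failure(self, now):
        """Count a failure; open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now

    def record_success(self):
        """A success resets the consecutive-failure counter."""
        self.failures = 0

    def allow(self, now):
        """Return True if a call may proceed at time `now` (seconds)."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return True
        return False


breaker = CircuitBreaker()
for t in (0, 1, 2):
    breaker.record_failure(now=t)
print(breaker.allow(now=300))  # still inside the cooldown window
print(breaker.allow(now=602))  # 600 s after the breaker opened at t=2
```

While the breaker is open, callers skip the failing dependency instead of piling on retries, which is what keeps one failure from cascading.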

Telemetry

aria-telemetry and the Telemetry Emitter collect and emit metrics from all services:

  • LLM call metrics: Latency, tokens, cost per model route
  • Cognitive metrics: Coherence scores, soul charge levels, Mizan readings
  • Service metrics: Error rates, response times, throughput
  • Memory metrics: Embedding generation latency, Qdrant query times
  • Guard metrics: Policy violations, blocking events, override counts
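To make the "LLM call metrics" category concrete, here is a small sketch that groups call records by model route and sums latency, tokens, and cost. The `LLMCallMetric` record, its field names, and the route labels are hypothetical; the real aria-telemetry schema is not shown in this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCallMetric:
    """Hypothetical record for one LLM call."""

    route: str         # model route label (assumed naming)
    latency_ms: float
    tokens: int
    cost_usd: float


def summarize_by_route(metrics):
    """Aggregate call count, mean latency, total tokens, and total cost per route."""
    grouped = defaultdict(list)
    for metric in metrics:
        grouped[metric.route].append(metric)
    return {
        route: {
            "calls": len(items),
            "avg_latency_ms": sum(m.latency_ms for m in items) / len(items),
            "total_tokens": sum(m.tokens for m in items),
            "total_cost_usd": sum(m.cost_usd for m in items),
        }
        for route, items in grouped.items()
    }


samples = [
    LLMCallMetric("route-a", 100.0, 500, 0.25),
    LLMCallMetric("route-a", 300.0, 1500, 0.75),
    LLMCallMetric("route-b", 50.0, 100, 0.5),
]
summary = summarize_by_route(samples)
print(summary["route-a"])
```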

Aegis Watchtower

The aegis-watchtower provides hallucination detection by grounding claims against the codebase and manifold:

  • Detects fabricated references to nonexistent symbols/files
  • Validates behavior claims against the hologram
  • Identifies self-inconsistent edits
  • Non-LLM: uses embedding grounding + symbol resolution
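The symbol-resolution half of this check can be sketched without any LLM. The example below uses Python's `ast` module as a stand-in for whatever symbol index the watchtower maintains; the function names, the backtick convention for references, and the sample code are all assumptions, and embedding grounding is not covered.

```python
import ast
import re


def defined_symbols(source: str) -> set[str]:
    """Collect function and class names defined in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }


def fabricated_references(claim: str, known: set[str]) -> list[str]:
    """Return backtick-quoted identifiers in `claim` that are not known symbols."""
    referenced = re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)`", claim)
    return sorted({name for name in referenced if name not in known})


# Hypothetical codebase snippet used only for this example.
codebase = '''
def load_config(path):
    return {}

class Router:
    def dispatch(self, msg):
        pass
'''

known = defined_symbols(codebase)
claim = "`Router` calls `load_config` and then `flush_cache` before `dispatch`."
print(fabricated_references(claim, known))  # → ['flush_cache']
```

Because the check is a plain set lookup, it is deterministic and cheap, which fits the non-LLM design noted above.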

Cloud Monitoring (GCP)

  • Cloud Monitoring — Infrastructure metrics (CPU, memory, request latency)
  • Application Insights — Application-level performance
  • Custom Dashboards — Business metrics (deal pipeline, venture revenue)
  • Real-time Alerts — Critical issue notification

Common Health Checks

# Aria Soul health
curl http://aria-soul.aria.svc.cluster.local:3000/health

# Hive status
curl http://aria-soul.aria.svc.cluster.local:3000/hive/status

# Manual heartbeat check (.mjs files are Node.js ES modules, so run them with node)
node scripts/check_heartbeats.mjs

# Full system doctor
node scripts/doctor.mjs
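The two HTTP checks above can also be scripted. The following is a minimal sketch using only the Python standard library; it assumes that an HTTP 200 response means "healthy", which this document does not confirm, and the `http_status` and `check_endpoints` helpers are hypothetical names.

```python
import urllib.error
import urllib.request

BASE = "http://aria-soul.aria.svc.cluster.local:3000"
ENDPOINTS = ("/health", "/hive/status")


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, timeout, ...


def check_endpoints(fetch=http_status):
    """Map each endpoint path to True when it returns 200 (assumed healthy)."""
    return {path: fetch(BASE + path) == 200 for path in ENDPOINTS}


# From a pod inside the cluster:
#   print(check_endpoints())
```

The `fetch` parameter exists so the logic can be exercised with a stub outside the cluster, where the service hostname does not resolve.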

Self-Healing

When issues are detected, automated recovery procedures execute:

bash scripts/self_healing.sh           # General self-healing
bash scripts/self-heal-imagepull.sh    # Image pull failures
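One way to picture how a detected failure might be routed to one of these scripts is sketched below. It assumes the services run on Kubernetes (suggested by the `svc.cluster.local` hostnames) and that the trigger is a container waiting reason; `ErrImagePull` and `ImagePullBackOff` are standard Kubernetes reasons, but the mapping itself and the function names are illustrative assumptions, not the documented dispatch logic.

```python
# Kubernetes container waiting reasons that indicate an image pull problem.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def pick_recovery_script(waiting_reason: str) -> str:
    """Choose a recovery script for a failure reason (hypothetical mapping)."""
    if waiting_reason in IMAGE_PULL_REASONS:
        return "scripts/self-heal-imagepull.sh"
    return "scripts/self_healing.sh"  # general fallback


def recovery_command(waiting_reason: str) -> list[str]:
    """Build the argv list that would run the chosen script with bash."""
    return ["bash", pick_recovery_script(waiting_reason)]


print(recovery_command("ImagePullBackOff"))
print(recovery_command("CrashLoopBackOff"))
```

The command is returned rather than executed so that a caller can log it, apply rate limits, or check a circuit breaker before running anything.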