Monitoring & Observability
Aria's health and performance are monitored at four layers: Hospital Service (health checks), Aria Telemetry (metrics), Aegis Watchtower (hallucination detection), and Cloud Monitoring (infrastructure).
Health Monitoring (Hospital Service)
The hospital-service provides service health monitoring and self-healing capabilities. It tracks heartbeats across all services and can trigger automatic recovery procedures.
| Capability | Description |
|---|---|
| Heartbeat Tracking | 30,711 shard heartbeats tracked |
| Soul Charge Monitoring | 6,315 charge history entries with delta tracking |
| Self-Healing | Automatic service restart on failure detection |
| Circuit Breakers | Prevent cascading failures (10-minute cooldown) |
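To make the circuit breaker and heartbeat rows more concrete, here is a minimal sketch of how a cooldown-based breaker and a stale-heartbeat check could work. It is not the hospital-service implementation: the class name, the function name, and the `failure_threshold` default are assumptions for illustration, and the breaker is simplified (it has no half-open state). Only the 10-minute cooldown comes from the table.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

COOLDOWN_SECONDS = 10 * 60  # the 10-minute cooldown from the capability table


@dataclass
class CircuitBreaker:
    """Opens after repeated failures and rejects calls until the cooldown ends."""

    failure_threshold: int = 3  # assumed value, not taken from the source
    cooldown_seconds: float = COOLDOWN_SECONDS
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        """Reset the consecutive-failure counter."""
        self.failures = 0

    def record_failure(self) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def stale_services(last_seen: Dict[str, float], now: float, max_age: float) -> List[str]:
    """Return names of services whose last heartbeat is older than max_age seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)
```

A caller would check `breaker.allow()` before contacting a dependency, then report the outcome with `record_success()` or `record_failure()`. While the breaker is open, calls are skipped so that a failing service is not hit repeatedly.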
Telemetry
aria-telemetry and the Telemetry Emitter collect and emit metrics from all services:
- LLM call metrics: Latency, tokens, cost per model route
- Cognitive metrics: Coherence scores, soul charge levels, Mizan readings
- Service metrics: Error rates, response times, throughput
- Memory metrics: Embedding generation latency, Qdrant query times
- Guard metrics: Policy violations, blocking events, override counts
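As an illustration of the LLM call metrics listed above, the sketch below models one call record and aggregates latency, tokens, and cost per model route. The record type, its field names, and the JSON-line output are hypothetical. The actual aria-telemetry schema is not documented here.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LlmCallMetric:
    """One LLM call. All field names are hypothetical."""

    model_route: str
    latency_ms: float
    tokens: int
    cost_usd: float


def summarize_by_route(metrics: Iterable[LlmCallMetric]) -> Dict[str, Dict[str, float]]:
    """Aggregate call count, mean latency, total tokens, and total cost per route."""
    totals = defaultdict(lambda: {"calls": 0, "latency_ms": 0.0, "tokens": 0, "cost_usd": 0.0})
    for m in metrics:
        t = totals[m.model_route]
        t["calls"] += 1
        t["latency_ms"] += m.latency_ms
        t["tokens"] += m.tokens
        t["cost_usd"] += m.cost_usd
    return {
        route: {
            "calls": t["calls"],
            "mean_latency_ms": t["latency_ms"] / t["calls"],
            "total_tokens": t["tokens"],
            "total_cost_usd": t["cost_usd"],
        }
        for route, t in totals.items()
    }


def to_json_line(metric: LlmCallMetric) -> str:
    """Serialize a metric as one JSON line, a common shape for log-based emitters."""
    return json.dumps(asdict(metric), sort_keys=True)
```

Grouping by route keeps per-model latency and cost comparable, which matches the "per model route" breakdown in the list.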
Aegis Watchtower
The aegis-watchtower provides hallucination detection by grounding claims against the codebase and manifold:
- Detects fabricated references to nonexistent symbols/files
- Validates behavior claims against the hologram
- Identifies self-inconsistent edits
- Runs without an LLM: relies on embedding grounding and symbol resolution
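The symbol-resolution part of this check can be sketched as follows. This is a simplified illustration, not the aegis-watchtower code: it assumes references appear in backticks and that sets of known symbols and files are already available. The function names are hypothetical, and embedding grounding is left out.

```python
import re
from typing import Iterable, List, Set

# Matches backticked references in a claim, such as `parse_config` or `src/app.py`.
REFERENCE_PATTERN = re.compile(r"`([^`\s]+)`")


def extract_references(claim: str) -> List[str]:
    """Return every backticked token in the claim, in order of appearance."""
    return REFERENCE_PATTERN.findall(claim)


def unresolved_references(
    claim: str,
    known_symbols: Iterable[str],
    known_files: Iterable[str],
) -> List[str]:
    """Return references that match neither a known symbol nor a known file."""
    known: Set[str] = set(known_symbols) | set(known_files)
    return [ref for ref in extract_references(claim) if ref not in known]
```

Any name returned by `unresolved_references` is a candidate fabricated reference. Because the check is a plain set lookup, it is deterministic and needs no model call.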
Cloud Monitoring (GCP)
- Cloud Monitoring — Infrastructure metrics (CPU, memory, request latency)
- Application Insights — Application-level performance
- Custom Dashboards — Business metrics (deal pipeline, venture revenue)
- Real-time Alerts — Critical issue notification
Common Health Checks
# Aria Soul health
curl http://aria-soul.aria.svc.cluster.local:3000/health
# Hive status
curl http://aria-soul.aria.svc.cluster.local:3000/hive/status
# Manual heartbeat check
node scripts/check_heartbeats.mjs
# Full system doctor
node scripts/doctor.mjs
Self-Healing
When issues are detected, automated recovery procedures execute:
bash scripts/self_healing.sh # General self-healing
bash scripts/self-heal-imagepull.sh # Image pull failures
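The link between health checks and recovery could be automated along these lines. This is a minimal sketch under stated assumptions: it treats an HTTP 200 from the endpoints listed under Common Health Checks as "healthy" (the response contract is not documented here), and it assumes that running `scripts/self_healing.sh` is an acceptable reaction to any failed check. The function names and endpoint labels are hypothetical.

```python
import subprocess
import urllib.error
import urllib.request
from typing import Callable, Dict

# URLs are the documented health endpoints; the labels are illustrative.
ENDPOINTS: Dict[str, str] = {
    "aria-soul health": "http://aria-soul.aria.svc.cluster.local:3000/health",
    "hive status": "http://aria-soul.aria.svc.cluster.local:3000/hive/status",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def check_all(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each endpoint label to True when it answers with HTTP 200 (assumed healthy)."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


def heal_if_unhealthy(results: Dict[str, bool], run=subprocess.run) -> bool:
    """Run the general self-healing script if any check failed. Return True if it ran."""
    if all(results.values()):
        return False
    run(["bash", "scripts/self_healing.sh"], check=False)
    return True
```

Calling `heal_if_unhealthy(check_all(ENDPOINTS))` from inside the cluster would perform one check-and-recover pass. The `fetch` and `run` parameters exist so the logic can be exercised without network access or side effects.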