Monitoring Starburst Enterprise (Trino) Clusters with REST APIs
Practical ways to track health, state, and performance through coordinator endpoints
Modern data analytics platforms rely on distributed SQL engines like Starburst (Trino) to process large-scale data with speed and flexibility. While these engines excel at query execution, cluster health and observability are critical to maintaining performance and reliability. The coordinator node exposes REST APIs that provide real-time insight into cluster state, workload, and system health.
This guide is for Data Engineers, DevOps, and Site Reliability Engineers who want to maximize uptime and optimize Starburst clusters using native APIs combined with monitoring tools and automation.
Starburst (Trino) Architecture: Coordinator and Worker Roles
Coordinator: Acts as the control plane — receiving queries, planning execution, scheduling tasks, and aggregating results.
Workers: Perform the actual data processing — scanning, filtering, joining, and aggregating in parallel.
Monitoring focuses primarily on the coordinator’s APIs, which expose cluster-wide state and workload metrics.
Coordinator State Endpoint: /v1/info/state
Request:
curl http://coordinator_host:8080/v1/info/state
Response Example:
"ACTIVE"
Possible States and Operational Meaning:
ACTIVE → Node fully initialized and serving queries; safe to route production queries.
STARTING → Node initialization in progress; hold traffic until ready.
SHUTTING_DOWN → Node gracefully stopping; no new queries, prepare for maintenance.
FAILED / INACTIVE → Node encountered errors or is down; requires operator action or automated restart.
Best Practice: Use /v1/info/state as the readiness probe in Kubernetes. Unexpected transitions should trigger alerts.
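The state check above can be sketched as a small Python poller. This is a minimal sketch, assuming a hypothetical coordinator address; adjust the host, port, and any authentication for your deployment.

```python
import json
import urllib.request

# Hypothetical coordinator address; replace with your deployment's host/port.
COORDINATOR_URL = "http://coordinator_host:8080"

def coordinator_state(base_url: str) -> str:
    """Return the coordinator state string, e.g. 'ACTIVE'.

    The endpoint responds with a JSON-encoded string such as "ACTIVE".
    """
    with urllib.request.urlopen(f"{base_url}/v1/info/state", timeout=5) as resp:
        return json.loads(resp.read())

def is_ready(state: str) -> bool:
    """Per the table above, only ACTIVE nodes should receive production queries."""
    return state == "ACTIVE"

# Usage (requires a reachable coordinator):
#   state = coordinator_state(COORDINATOR_URL)
#   print(state, is_ready(state))
```

A scheduler (cron, a sidecar, or an alerting agent) can run this on an interval and page an operator whenever `is_ready` flips to False unexpectedly.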
Cluster Metrics Endpoint: /ui/api/stats
Request:
curl http://coordinator_host:8080/ui/api/stats
Sample Response:
{
"runningQueries": 3,
"blockedQueries": 1,
"queuedQueries": 2,
"activeWorkers": 5,
"runningDrivers": 15,
"totalAvailableProcessors": 20,
"reservedMemory": 5120000000,
"totalInputRows": 1500000,
"totalInputBytes": 6000000000,
"totalCpuTimeSecs": 4500
}
Metric Deep Dive:
runningQueries / blockedQueries / queuedQueries → Query execution pipeline.
activeWorkers → Number of connected, healthy workers.
runningDrivers → Execution threads currently active.
totalAvailableProcessors → Total CPU capacity across cluster.
reservedMemory → Memory allocated to queries (bytes).
totalInputRows / totalInputBytes → Cluster throughput since startup.
totalCpuTimeSecs → Total CPU time consumed by queries.
Operational Tips:
Alert when queuedQueries or blockedQueries spike.
Track activeWorkers to detect node failures.
Use reservedMemory to monitor memory pressure.
Combine runningDrivers with CPU metrics to spot bottlenecks.
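The operational tips above can be expressed as a simple threshold check over the stats JSON. This is a sketch only: the coordinator URL is a placeholder and the thresholds are illustrative assumptions, not Trino defaults; tune them to your workload.

```python
import json
import urllib.request

def fetch_stats(base_url: str) -> dict:
    """Fetch cluster-wide stats from the coordinator (placeholder host)."""
    with urllib.request.urlopen(f"{base_url}/ui/api/stats", timeout=5) as resp:
        return json.loads(resp.read())

def check_stats(stats: dict,
                max_queued: int = 10,          # illustrative threshold
                max_blocked: int = 5,          # illustrative threshold
                min_workers: int = 3,          # illustrative threshold
                max_reserved_bytes: int = 8 * 1024**3) -> list:
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    if stats["queuedQueries"] > max_queued:
        alerts.append(f"queued queries spiked: {stats['queuedQueries']}")
    if stats["blockedQueries"] > max_blocked:
        alerts.append(f"blocked queries spiked: {stats['blockedQueries']}")
    if stats["activeWorkers"] < min_workers:
        alerts.append(f"worker count dropped: {stats['activeWorkers']}")
    if stats["reservedMemory"] > max_reserved_bytes:
        alerts.append(f"memory pressure: {stats['reservedMemory']} bytes reserved")
    return alerts
```

Feeding the sample response above through `check_stats` yields no alerts; raising `min_workers` above 5 would flag a worker-count drop.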
Readiness and Liveness Probes: /v1/info & /v1/status
Sample /v1/info Response:
{
"nodeVersion": { "version": "365" },
"environment": "prod",
"coordinator": true,
"starting": false,
"uptime": "3h45m"
}
Sample /v1/status Response:
{
"status": "OK",
"uptime": "3h45m"
}
Usage Recommendations:
/v1/info → Best for readiness checks; ensure coordinator=true and not starting.
/v1/status → Ideal for liveness checks; restart pods if unresponsive.
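The readiness rule above reduces to two fields of the /v1/info response. A minimal sketch, using the field names shown in the sample response:

```python
def coordinator_ready(info: dict) -> bool:
    """Readiness check over a parsed /v1/info response.

    A node is ready only when it reports itself as the coordinator
    and is no longer starting up. Missing fields are treated as
    not-ready, which fails safe.
    """
    return bool(info.get("coordinator")) and info.get("starting", True) is False
```

Wiring this into a readiness gate (rather than relying on HTTP reachability alone) avoids routing queries to a coordinator that is still initializing.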
Security Considerations
By default, Trino/Starburst APIs may be reachable without authentication inside the network.
Best practices:
Restrict API access to trusted IPs or VPNs.
Enable TLS at ingress/load balancer.
Monitor API logs for unusual access.
Integrations and Tooling
Prometheus Exporters: Scrape /ui/api/stats for long-term monitoring.
Grafana Dashboards: Visualize queries, worker counts, CPU load, and memory usage.
Kubernetes Probes: Automate restart and readiness checks with /v1/info/state and /v1/status.
Automation Scripts: Scale clusters or trigger alerts based on metrics.
Incident Response: Use APIs in playbooks to assess cluster state quickly during outages.
Example: Kubernetes Probes
livenessProbe:
httpGet:
path: /v1/status
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
readinessProbe:
httpGet:
path: /v1/info/state
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
This ensures pods only serve traffic when ready, and restart automatically when unhealthy.
Monitoring Best Practices
Poll /ui/api/stats continuously for real-time metrics.
Watch activeWorkers as a simple cluster health indicator.
Track query distribution (running vs queued vs blocked).
Use uptime and version to track rolling upgrades.
Conclusion
Starburst (Trino) coordinator REST APIs provide operators with a lightweight but powerful way to monitor clusters. By integrating endpoints like /v1/info/state, /ui/api/stats, /v1/info, and /v1/status into dashboards, probes, and automation pipelines, teams can:
Detect anomalies early,
Optimize resources,
Automate scaling and failover,
Maintain service reliability.
In short, these APIs transform Starburst clusters from black boxes into transparent, observable systems — a foundation for running modern, production-grade analytics at scale.