Monitoring Starburst Enterprise (Trino) Clusters with REST APIs
Practical ways to track health, state, and performance through coordinator endpoints
Modern data analytics platforms rely on distributed SQL engines like Starburst (Trino) to process large-scale data with speed and flexibility. While these engines excel at query execution, cluster health and observability are critical to maintaining performance and reliability. The coordinator node exposes REST APIs that provide real-time insight into cluster state, workload, and system health.
This guide is for Data Engineers, DevOps, and Site Reliability Engineers who want to maximize uptime and optimize Starburst clusters using native APIs combined with monitoring tools and automation.
Starburst (Trino) Architecture: Coordinator and Worker Roles
Coordinator: Acts as the control plane — receiving queries, planning execution, scheduling tasks, and aggregating results.
Workers: Perform the actual data processing — scanning, filtering, joining, and aggregating in parallel.
Monitoring focuses primarily on the coordinator’s APIs, which expose cluster-wide state and workload metrics.
Coordinator State Endpoint: /v1/info/state
Request:
curl http://coordinator_host:8080/v1/info/state
Response Example:
"ACTIVE"
Possible States and Operational Meaning:
ACTIVE → Node fully initialized and serving queries; safe to route production queries.
STARTING → Node initialization in progress; hold traffic until ready.
SHUTTING_DOWN → Node gracefully stopping; no new queries, prepare for maintenance.
FAILED / INACTIVE → Node encountered errors or is down; requires operator action or automated restart.
Best Practice: Use /v1/info/state as the readiness probe in Kubernetes. Unexpected transitions should trigger alerts.
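The state check above can be sketched as a small Python poller. This is a minimal sketch, assuming a hypothetical coordinator address; adjust the host, port, and any authentication for your deployment.

```python
import json
import urllib.request

# Hypothetical coordinator address; replace with your deployment's host/port.
COORDINATOR_URL = "http://coordinator_host:8080"

def coordinator_state(base_url: str) -> str:
    """Return the coordinator state string, e.g. 'ACTIVE'.

    The endpoint responds with a JSON-encoded string such as "ACTIVE".
    """
    with urllib.request.urlopen(f"{base_url}/v1/info/state", timeout=5) as resp:
        return json.loads(resp.read())

def is_ready(state: str) -> bool:
    """Per the table above, only ACTIVE nodes should receive production queries."""
    return state == "ACTIVE"

# Usage (requires a reachable coordinator):
#   state = coordinator_state(COORDINATOR_URL)
#   print(state, is_ready(state))
```

A scheduler (cron, a sidecar, or an alerting agent) can run this on an interval and page an operator whenever `is_ready` flips to False unexpectedly.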
Cluster Metrics Endpoint: /ui/api/stats
Request:
curl http://coordinator_host:8080/ui/api/stats
Sample Response:
{
"runningQueries": 3,
"blockedQueries": 1,
"queuedQueries": 2,
"activeWorkers": 5,
"runningDrivers": 15,
"totalAvailableProcessors": 20,
"reservedMemory": 5120000000,
"totalInputRows": 1500000,
"totalInputBytes": 6000000000,
"totalCpuTimeSecs": 4500
}
Metric Deep Dive:
runningQueries / blockedQueries / queuedQueries → Query execution pipeline.
activeWorkers → Number of connected, healthy workers.
runningDrivers → Execution threads currently active.
totalAvailableProcessors → Total CPU capacity across cluster.
reservedMemory → Memory allocated to queries (bytes).
totalInputRows / totalInputBytes → Cluster throughput since startup.
totalCpuTimeSecs → Total CPU time consumed by queries.
Operational Tips:
Alert when queuedQueries or blockedQueries spike.
Track activeWorkers to detect node failures.
Use reservedMemory to monitor memory pressure.
Combine runningDrivers with CPU metrics to spot bottlenecks.
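The operational tips above can be expressed as a simple threshold check over the stats JSON. This is a sketch only: the coordinator URL is a placeholder and the thresholds are illustrative assumptions, not Trino defaults; tune them to your workload.

```python
import json
import urllib.request

def fetch_stats(base_url: str) -> dict:
    """Fetch cluster-wide stats from the coordinator (placeholder host)."""
    with urllib.request.urlopen(f"{base_url}/ui/api/stats", timeout=5) as resp:
        return json.loads(resp.read())

def check_stats(stats: dict,
                max_queued: int = 10,          # illustrative threshold
                max_blocked: int = 5,          # illustrative threshold
                min_workers: int = 3,          # illustrative threshold
                max_reserved_bytes: int = 8 * 1024**3) -> list:
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    if stats["queuedQueries"] > max_queued:
        alerts.append(f"queued queries spiked: {stats['queuedQueries']}")
    if stats["blockedQueries"] > max_blocked:
        alerts.append(f"blocked queries spiked: {stats['blockedQueries']}")
    if stats["activeWorkers"] < min_workers:
        alerts.append(f"worker count dropped: {stats['activeWorkers']}")
    if stats["reservedMemory"] > max_reserved_bytes:
        alerts.append(f"memory pressure: {stats['reservedMemory']} bytes reserved")
    return alerts
```

Feeding the sample response above through `check_stats` yields no alerts; raising `min_workers` above 5 would flag a worker-count drop.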
Readiness and Liveness Probes: /v1/info & /v1/status
Sample /v1/info Response:
{
"nodeVersion": { "version": "365" },
"environment": "prod",
"coordinator": true,
"starting": false,
"uptime": "3h45m"
}
Sample /v1/status Response:
{
"status": "OK",
"uptime": "3h45m"
}
Usage Recommendations:
/v1/info → Best for readiness checks; ensure coordinator=true and not starting.
/v1/status → Ideal for liveness checks; restart pods if unresponsive.
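The readiness rule above reduces to two fields of the /v1/info response. A minimal sketch, using the field names shown in the sample response:

```python
def coordinator_ready(info: dict) -> bool:
    """Readiness check over a parsed /v1/info response.

    A node is ready only when it reports itself as the coordinator
    and is no longer starting up. Missing fields are treated as
    not-ready, which fails safe.
    """
    return bool(info.get("coordinator")) and info.get("starting", True) is False
```

Wiring this into a readiness gate (rather than relying on HTTP reachability alone) avoids routing queries to a coordinator that is still initializing.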
Security Considerations
By default, Trino/Starburst APIs may be reachable without authentication inside the network.
Best practices:
Restrict API access to trusted IPs or VPNs.
Enable TLS at ingress/load balancer.
Monitor API logs for unusual access.
Integrations and Tooling
Prometheus Exporters: Scrape /ui/api/stats for long-term monitoring.
Grafana Dashboards: Visualize queries, worker counts, CPU load, and memory usage.
Kubernetes Probes: Automate restart and readiness checks with /v1/info/state and /v1/status.
Automation Scripts: Scale clusters or trigger alerts based on metrics.
Incident Response: Use APIs in playbooks to assess cluster state quickly during outages.
Example: Kubernetes Probes
livenessProbe:
httpGet:
path: /v1/status
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
readinessProbe:
httpGet:
path: /v1/info/state
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
This ensures pods only serve traffic when ready, and restart automatically when unhealthy.
Monitoring Best Practices
Poll /ui/api/stats continuously for real-time metrics.
Watch activeWorkers as a simple cluster health indicator.
Track query distribution (running vs queued vs blocked).
Use uptime and version to track rolling upgrades.
Conclusion
Starburst (Trino) coordinator REST APIs provide operators with a lightweight but powerful way to monitor clusters. By integrating endpoints like /v1/info/state, /ui/api/stats, /v1/info, and /v1/status into dashboards, probes, and automation pipelines, teams can:
Detect anomalies early,
Optimize resources,
Automate scaling and failover,
Maintain service reliability.
In short, these APIs transform Starburst clusters from black boxes into transparent, observable systems — a foundation for running modern, production-grade analytics at scale.