Alerting

Basic alerting

Rules for offline agents, CPU/memory/disk thresholds, uptime drift and update age, with alert lifecycle states.

What this feature delivers

Alerting is designed to surface problems early without overwhelming operators. Instead of a noisy stream, you define signals that match your environment: availability, resource pressure, uptime drift and patch-age indicators. Alerts have a clear lifecycle (open, acknowledged, resolved) so teams can triage quickly and keep accountability visible.

Rule-based alerts

Alert rules let you translate platform telemetry into operational signals. Typical examples include offline agents, CPU/memory/disk/network thresholds, uptime drift and patch-age indicators. The goal is simple: detect deviation early enough to respond before it becomes an incident.

Rules work best when they are aligned with how your estate behaves. Instead of setting extreme “red line” values, you can start with sensible thresholds, observe baselines, and then tighten over time.
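A minimal sketch of how such a rule evaluation might look, assuming a hypothetical Rule shape (metric, threshold, severity) and a per-host telemetry sample; neither is the platform's actual schema:

  # Illustrative threshold evaluation; field names are assumptions, not the real schema.
  from dataclasses import dataclass

  @dataclass
  class Rule:
      metric: str        # e.g. "cpu_percent", "disk_used_percent", "update_age_days"
      threshold: float   # fire when the observed value exceeds this
      severity: str      # "info", "warning" or "critical"

  def evaluate(rules: list[Rule], sample: dict[str, float]) -> list[tuple[Rule, float]]:
      """Return the rules whose threshold is exceeded by the latest sample."""
      fired = []
      for rule in rules:
          value = sample.get(rule.metric)
          if value is not None and value > rule.threshold:
              fired.append((rule, value))
      return fired

  # Start with lenient thresholds, observe baselines, then tighten over time.
  rules = [Rule("cpu_percent", 90, "warning"), Rule("disk_used_percent", 85, "critical")]
  print(evaluate(rules, {"cpu_percent": 95.0, "disk_used_percent": 60.0}))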

Lifecycle & triage

Alerts use clear lifecycle states so teams can keep ownership visible: open when detected, acknowledged when someone takes ownership, and resolved when the issue is fixed. This structure reduces “alert limbo” and makes handovers easier.
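A minimal sketch of that lifecycle as a small state machine, using the three states from the text; the transition rules, Alert class and owner field are illustrative assumptions:

  from enum import Enum

  class AlertState(Enum):
      OPEN = "open"
      ACKNOWLEDGED = "acknowledged"
      RESOLVED = "resolved"

  # Allowed transitions keep the lifecycle unambiguous.
  ALLOWED = {
      AlertState.OPEN: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED},
      AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
      AlertState.RESOLVED: set(),
  }

  class Alert:
      def __init__(self, rule_name: str):
          self.rule_name = rule_name
          self.state = AlertState.OPEN
          self.owner = None

      def transition(self, new_state: AlertState, owner: str | None = None) -> None:
          if new_state not in ALLOWED[self.state]:
              raise ValueError(f"cannot move from {self.state.value} to {new_state.value}")
          self.state = new_state
          if new_state is AlertState.ACKNOWLEDGED:
              self.owner = owner  # keeps ownership visible for handovers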

Operational value

When the lifecycle is applied consistently, you can measure response times, track recurring issues, and identify where preventive remediation (patching, capacity, hardening) will reduce future noise.

Keep noise low and focus on actionable alerts.

Noise control

High alert volume is usually a symptom of missing guardrails. Keeping the signal high is straightforward: choose thresholds that match real workloads, add short evaluation windows so transient spikes do not fire alerts, and suppress repeat alerts while an issue is already acknowledged.
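A minimal sketch of two of these guardrails, assuming a consecutive-sample evaluation window and an in-memory set of acknowledged alert keys (both illustrative choices, not platform configuration):

  from collections import deque

  class WindowedRule:
      """Fire only when the condition holds for a full window of consecutive samples."""
      def __init__(self, metric: str, threshold: float, window: int = 3):
          self.metric = metric
          self.threshold = threshold
          self.samples = deque(maxlen=window)

      def should_fire(self, value: float) -> bool:
          self.samples.append(value)
          return (len(self.samples) == self.samples.maxlen
                  and all(v > self.threshold for v in self.samples))

  def should_notify(alert_key: str, acknowledged: set[str]) -> bool:
      """Suppress repeat notifications while the same alert is already acknowledged."""
      return alert_key not in acknowledged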

Another effective approach is severity levels: informational (heads-up), warning (degraded) and critical (act now). That helps operators triage faster and prevents “everything is urgent” fatigue.

Routing & notifications

Alerting becomes much more useful when it reaches the right person at the right time. Many teams start with in-app triage and then expand to delivery channels such as email or chat (Teams/Slack/Telegram) for critical signals or after-hours escalation.

A solid next step is routing by scope: per tenant, by group/tag, or by system role (for example: database, web, domain controllers). That keeps ownership clear and avoids “who is handling this?” loops.
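A minimal sketch of scope-based routing, assuming a hypothetical route table matched on tenant, tag or role; the field names and channel names are illustrative:

  def route(alert: dict, routes: list[dict]) -> str:
      """Return the channel of the first matching route; fall back to a default queue."""
      for r in routes:
          if "tenant" in r and r["tenant"] != alert.get("tenant"):
              continue
          if "tag" in r and r["tag"] not in alert.get("tags", []):
              continue
          if "role" in r and r["role"] != alert.get("role"):
              continue
          return r["channel"]
      return "default-triage"

  routes = [
      {"tenant": "acme", "role": "database", "channel": "dba-oncall"},
      {"tag": "domain-controller", "channel": "identity-team"},
  ]
  print(route({"tenant": "acme", "role": "database", "tags": []}, routes))  # -> "dba-oncall"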

Context & correlation

Alerts are most actionable when they include context. Correlating an alert with inventory, recent changes, patch state, pending reboot, and current health metrics dramatically reduces time-to-diagnose.
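A minimal sketch of such enrichment, assuming placeholder lookups for inventory, patch state and recent changes; your platform's actual data sources will differ:

  def enrich(alert: dict, inventory: dict, patch_state: dict, recent_changes: list[dict]) -> dict:
      """Attach correlated context so the operator sees the likely cause alongside the alert."""
      host = alert["host"]
      alert["context"] = {
          "inventory": inventory.get(host, {}),          # hardware / OS facts
          "patch_state": patch_state.get(host, {}),      # missing updates, pending reboot
          "recent_changes": [c for c in recent_changes if c.get("host") == host][-5:],
      }
      return alert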

What you can add next

If you want to extend alerting beyond “signal” into “guided response”, these additions typically deliver strong ROI:

  • Runbooks: attach suggested actions and checks per alert type.
  • Auto-remediation: optionally trigger a safe task after approvals/guards.
  • Maintenance windows: pause or downgrade non-critical alerts during patching.
  • Deduplication: group identical alerts across many hosts into one incident view (see the sketch after this list).
  • SLO reporting: measure MTTA/MTTR per tenant/team and improve over time.
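As an example of the deduplication idea above, a minimal sketch that groups identical alerts into one incident keyed by rule and severity; the grouping key and incident shape are illustrative assumptions:

  from collections import defaultdict

  def group_into_incidents(alerts: list[dict]) -> dict:
      """Collapse identical alerts across many hosts into one incident per (rule, severity)."""
      incidents = defaultdict(lambda: {"hosts": set(), "count": 0})
      for a in alerts:
          key = (a["rule"], a["severity"])
          incidents[key]["hosts"].add(a["host"])
          incidents[key]["count"] += 1
      return dict(incidents)

  alerts = [{"rule": "disk_full", "severity": "critical", "host": h} for h in ("web1", "web2", "web3")]
  print(group_into_incidents(alerts))  # one incident covering three hosts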