Azure Monitor is Microsoft’s unified platform for collecting, analyzing, and acting on telemetry from Azure resources, applications, and connected environments. When people say they “use Azure Monitor,” they often mean a set of services working together: metrics and platform logs from Azure resources, Log Analytics workspaces for queryable logs, alerts and action groups for notifications and automation, and application-level observability through Application Insights. The most useful way to approach Azure Monitor as an IT administrator or system engineer is to treat it as an operating model rather than a single toggle—one that aligns telemetry collection, retention, alerting, and access controls across teams.
This guide focuses on leveraging Azure Monitor for comprehensive application insights in the practical sense: understanding what is happening inside your application (requests, dependencies, exceptions, performance), correlating that with infrastructure signals (CPU, memory, networking, scaling events), and turning those signals into reliable detection and response workflows. Along the way, you’ll see how to design a workspace strategy, configure collection for common compute targets, write actionable KQL (Kusto Query Language) queries, and implement alerting patterns that reduce noise.
Understanding the Azure Monitor data model: metrics, logs, and traces
Azure Monitor revolves around two primary telemetry types: metrics and logs. Metrics are numeric time-series values collected at regular intervals (for example, CPU percentage or HTTP server errors) and are optimized for near real-time charting and alerting. Logs are records (events) that can carry rich context and are queried using KQL in Log Analytics.
Application Insights extends this model with application performance monitoring (APM) signals: request telemetry (incoming operations), dependency telemetry (outgoing calls like SQL/HTTP), exceptions, custom events, and distributed traces. In modern deployments, the recommended approach is increasingly based on OpenTelemetry (OTel), but Application Insights SDKs and auto-instrumentation are still widely used. The key operational point is that Application Insights data is queryable as logs (in a Log Analytics workspace when using workspace-based Application Insights), enabling correlation across application and infrastructure.
It helps to define “comprehensive” observability in concrete terms. For most production applications, you want to answer four classes of questions quickly: Are users being impacted (availability/latency/error rate)? What changed (deployments, configuration, scale, dependency health)? Where is the bottleneck (application code, database, network, downstream service)? And what is the blast radius (which regions, instances, or customer segments)? Azure Monitor can address these if you intentionally collect and correlate signals.
Choosing a workspace strategy and why it matters
Most operational pain with Azure Monitor comes from inconsistent design: multiple teams create workspaces ad hoc, retention varies by environment, and alert rules are duplicated and noisy. Before turning on instrumentation, decide how you want to segment telemetry.
A Log Analytics workspace is the container for log data and the billing boundary for many log-based features. A common approach is one workspace per environment (prod/non-prod) per region, or one workspace per landing zone. Another approach is one workspace per platform team, with role-based access control (RBAC) boundaries implemented via Azure RBAC on the workspace and table-level controls where applicable. The right answer depends on your regulatory requirements, separation of duties, and whether you need to isolate billing.
Workspace-based Application Insights is the modern default because it stores Application Insights telemetry in the workspace’s tables. That unlocks cross-resource queries, unified retention, and a simpler operational model (alerts and workbooks can query one place). It also reduces the “two consoles” problem where engineers bounce between Application Insights and Log Analytics without a consistent correlation strategy.
When you plan workspace layout, think about three practical dimensions:
First, retention and cost. High-volume tables (for example, container logs or verbose application traces) can dominate ingestion. Decide what data you truly need in hot storage for interactive queries versus what can be archived, sampled, or reduced.
Second, access boundaries. Operators need read access; engineers may need deeper access for investigations; security teams often need broad query rights. Align workspaces with who needs access and how you audit it.
Third, query and alerting simplicity. If a service spans multiple subscriptions, a centralized workspace can simplify “single pane of glass” queries, but you may still need multi-workspace queries using Azure Monitor Logs query scope.
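If you do centralize, cross-workspace KQL is straightforward with the workspace() function. The sketch below assumes two hypothetical production workspaces (law-prod-weu and law-prod-neu); note that at workspace scope, Application Insights data is exposed through tables such as AppRequests rather than the classic requests table:
kusto
// Cross-workspace view of request failures (workspace names are placeholders).
// At workspace scope, use AppRequests and columns such as TimeGenerated, AppRoleName, Success.
union
    workspace("law-prod-weu").AppRequests,
    workspace("law-prod-neu").AppRequests
| where TimeGenerated > ago(1h)
| summarize total=count(), failures=countif(Success == false) by AppRoleName
| order by failures desc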
Designing your telemetry: what to collect and how to keep it useful
Comprehensive monitoring does not mean “collect everything.” It means collecting the right signals with enough context to be actionable. A practical design starts with user-impact signals (availability, latency, errors) and expands outward to dependencies, then infrastructure.
At the application layer, prioritize request telemetry (duration, response code, operation name), dependency calls (SQL, HTTP, storage), and exceptions with stack traces. Add custom dimensions for tenant, region, deployment ring, or feature flag state if those are key to triage, but be deliberate—high-cardinality fields can increase storage and reduce query performance.
At the infrastructure layer, ensure you have baseline metrics (CPU, memory, disk, network) and platform logs for key resources. For example, App Service has HTTP logs and platform metrics; AKS has control-plane logs (via Azure resource logs) and node/container insights; Azure SQL has performance and audit signals; Key Vault has access logs useful for both operations and security.
A helpful mental model is “golden signals” (latency, traffic, errors, saturation) combined with change events (deployments, scaling, configuration updates). Azure Monitor supports this: metrics and logs provide the golden signals; Activity Log and resource logs provide change and control-plane context.
Implementing Application Insights with workspace-based ingestion
Application Insights can be enabled through the Azure portal, ARM/Bicep, Terraform, or through application instrumentation and configuration. In most modern setups you will create an Application Insights resource connected to a Log Analytics workspace.
In a workspace-based configuration, Application Insights telemetry lands in dedicated workspace tables (AppRequests, AppDependencies, AppExceptions, AppTraces), which the Application Insights query experience also exposes under the classic names requests, dependencies, exceptions, and traces. This is crucial for correlation because you can join application telemetry with infrastructure logs in the same query.
Instrumentation can be done in several ways:
If you control application code, you can use an Application Insights SDK or OpenTelemetry SDK/exporter. If you prefer minimal code changes for certain stacks, you can use auto-instrumentation options where supported (for example, some App Service/Azure Functions scenarios) but you still need to validate what telemetry is emitted and whether it includes the dimensions you need.
Regardless of approach, aim for consistent cloud role naming. In Application Insights, cloud_RoleName and cloud_RoleInstance are the primary fields for grouping and slicing telemetry. In a microservices estate, set cloud_RoleName per service (not per environment) and use separate dimensions for environment and region.
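A quick way to audit role naming is to list what your telemetry actually reports; the sketch below uses the classic Application Insights tables and surfaces inconsistent or unexpected role names at a glance:
kusto
// Inventory of cloud role names seen in the last 24 hours.
// Inconsistent casing or per-environment names show up immediately.
union requests, dependencies, exceptions, traces
| where timestamp > ago(24h)
| summarize items=count(), roleInstances=dcount(cloud_RoleInstance) by cloud_RoleName
| order by cloud_RoleName asc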
Example scenario: stabilizing a microservice after a deployment
Consider a payments API running on Azure Kubernetes Service (AKS). After a deployment, latency increases and timeouts appear sporadically. Without application tracing, operators might only see that CPU is normal and pods are healthy. With Application Insights (or OpenTelemetry exporting to it), you can identify that the POST /authorize operation has increased dependency time to an external fraud service and that timeouts correlate with a new retry policy.
The operational value here is not simply “more logs.” It’s the ability to pivot from a user-facing symptom (latency) to a specific dependency and then to a change event (deployment) with timestamps that line up.
Collecting platform logs and resource logs with diagnostic settings
Azure resources emit two broad categories of logs: subscription-level Activity Log entries and resource-level logs (sometimes called resource logs or diagnostic logs). The Activity Log records control-plane events such as write operations, service health incidents, and administrative actions. Resource logs record data-plane or resource-specific events, such as Key Vault access logs or Application Gateway access logs.
Diagnostic settings are how you route resource logs and metrics to destinations like a Log Analytics workspace, Event Hubs, or storage accounts. For operations teams, routing to Log Analytics is usually the default because it enables alerting and correlation. You can also route to storage for long-term retention or compliance needs.
As you enable diagnostic settings, be explicit about which log categories are needed. Some categories are high volume. Enabling everything across an estate without a plan can create cost and query noise.
The following Azure CLI example shows the pattern of creating a diagnostic setting that sends logs to a workspace. The exact categories vary by resource type, so treat this as a template rather than a copy-paste for all resources.
bash
# Variables
RESOURCE_ID="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>"
WORKSPACE_ID="/subscriptions/<sub>/resourcegroups/<rg>/providers/microsoft.operationalinsights/workspaces/<law>"
# Create diagnostic setting (example categories for Key Vault)
az monitor diagnostic-settings create \
--name "send-to-law" \
--resource "$RESOURCE_ID" \
--workspace "$WORKSPACE_ID" \
--logs '[
{"category":"AuditEvent","enabled":true}
]' \
--metrics '[
{"category":"AllMetrics","enabled":true}
]'
Once resource logs land in the workspace, they appear either in the shared AzureDiagnostics table (the legacy destination used by many resource types) or in dedicated resource-specific tables, depending on the resource type and the destination table mode you select. When possible, prefer the dedicated tables because they are easier to query and have a more stable schema.
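As a quick check after the Key Vault example above, you can confirm where the audit events landed. The sketch below assumes the legacy destination mode, where events arrive in the shared AzureDiagnostics table; if you chose resource-specific mode, query the dedicated table instead:
kusto
// Recent Key Vault audit events routed by the diagnostic setting above
// (legacy "Azure diagnostics" mode; column names vary by category).
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.KEYVAULT" and Category == "AuditEvent"
| project TimeGenerated, OperationName, ResultSignature, CallerIPAddress, Resource
| order by TimeGenerated desc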
Using Data Collection Rules (DCRs) for consistent ingestion
Data Collection Rules (DCRs) define what data is collected and where it is sent for certain sources, particularly for Azure Monitor Agent (AMA) on virtual machines and some PaaS scenarios. If you manage many VMs (Azure or Arc-enabled), DCRs help standardize Windows Event Logs, performance counters, and syslog collection.
DCRs matter for two reasons. First, they prevent drift: you can apply the same collection policy to a set of machines via Azure Policy or automation. Second, they help control ingestion volume by collecting only what you need.
For Windows servers, a typical operational baseline includes security and system event logs (as required by your organization), plus performance counters for CPU, memory, disk, and networking. For Linux, syslog and performance metrics are common.
If your goal is application insights in the broader sense, VM-level logs are often still relevant because many enterprise applications run on IaaS, and issues often manifest in OS signals (thread exhaustion, disk latency, DNS failures) that are not visible in request telemetry alone.
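Whichever collection baseline you choose, verify that associated machines actually report. A minimal check against the standard Heartbeat table (run similar checks against Perf or Syslog for the counters and facilities you expect):
kusto
// Confirm that machines associated with the DCR are reporting, and via which agent.
Heartbeat
| where TimeGenerated > ago(1h)
| summarize lastHeartbeat=max(TimeGenerated), heartbeats=count() by Computer, Category
| order by lastHeartbeat desc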
Correlating signals with KQL: practical query patterns
KQL is the core skill that turns Azure Monitor from a dashboard into an investigation tool. The key is to write queries that are stable, readable, and optimized for common questions.
A good operational query usually starts with a time window, filters to a service or resource, and then summarizes a metric. For example, request failure rate by operation name over the last hour:
kusto
requests
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api"
| summarize total=count(), failures=countif(success == false) by name
| extend failureRate = todouble(failures) / todouble(total)
| order by failureRate desc
From there, you can pivot into dependencies to see whether failures correlate with outbound calls:
kusto
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api"
| summarize avgDuration=avg(duration), p95=percentile(duration, 95), count() by target, name, resultCode
| order by p95 desc
The real power comes from correlation identifiers. Application Insights emits operation_Id and related fields that tie together requests, dependencies, traces, and exceptions for a single end-to-end transaction. When you see an elevated error rate, you can sample failing operations and follow the chain.
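The sketch below shows that pivot, reusing the hypothetical payments-api role name from the earlier examples: it joins failed requests to their exceptions via operation_Id.
kusto
// Attach exception details to failed requests through the shared operation_Id.
requests
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api" and success == false
| join kind=inner (
    exceptions
    | where timestamp > ago(1h)
    ) on operation_Id
| project timestamp, name, resultCode, type, outerMessage
| order by timestamp desc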
Moving from aggregated alerts to “why” queries
Alerts often trigger on aggregates (for example, 5xx rate above a threshold). Your investigation query should rapidly answer what changed. One useful pattern is to compare two time windows: “current” versus “baseline.”
kusto
let baseline = requests
| where timestamp between (ago(2h) .. ago(1h))
| where cloud_RoleName == "payments-api"
| summarize baselineP95 = percentile(duration, 95) by name;
let current = requests
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api"
| summarize currentP95 = percentile(duration, 95) by name;
current
| join kind=fullouter baseline on name
| extend name = coalesce(name, name1), baselineP95 = coalesce(baselineP95, 0.0), currentP95 = coalesce(currentP95, 0.0)
| extend delta = currentP95 - baselineP95
| order by delta desc
This isn’t about statistical perfection; it’s about giving on-call engineers a fast way to see which operations regressed.
Metrics vs logs for alerting: building a reliable strategy
Azure Monitor supports metric alerts (near real-time, low overhead) and log alerts (query-based, flexible). The common mistake is to use log alerts for everything. Log alerts are powerful but can be slower, more expensive at scale, and easier to misconfigure if queries are not stable.
Use metric alerts for conditions that are directly represented as metrics and need quick detection: CPU saturation, HTTP 5xx rates on supported resources, queue depth, or availability tests. Use log alerts when you need richer logic: “error rate for a specific endpoint exceeds baseline,” “specific exception type appears,” or “deployment succeeded and error rate increased within 10 minutes.”
When you build alerts for application insights, decide whether you are alerting on symptoms (user impact) or causes (resource pressure). Symptoms should page someone; causes should often create a ticket or notify a channel. This separation reduces alert fatigue.
Action groups and alert routing as an operational contract
Action groups define what happens when an alert fires: email, SMS, push, voice, webhook, ITSM connector, Logic Apps, or Azure Functions. Treat action groups as reusable primitives. For example, you might have action groups per team (Payments On-Call, Platform On-Call) and per severity.
A reliable model is: severity 0/1 pages the on-call rotation; severity 2 posts to a team channel and creates a ticket; severity 3 is informational. Azure Monitor’s alert rule severity and action group selection should reflect that model.
Here is an Azure CLI example that creates an action group with an email receiver and a webhook receiver, which is common when integrating with incident systems.
bash
az monitor action-group create \
--resource-group <rg> \
--name "ag-payments-s1" \
--short-name "payS1" \
--action email oncall-email oncall@contoso.com \
--action webhook incident-webhook https://example.invalid/alerts
In production, consider using managed identities and authenticated endpoints for webhooks, or route through a secure intermediary such as Logic Apps with appropriate access controls.
Building application-level alert rules that avoid noise
An alert that triggers frequently without clear action will be ignored. For application insights, noise often comes from two sources: transient dependency failures (for example, timeouts that are retried) and broad thresholds that don’t respect diurnal traffic patterns.
Start with user-impact indicators: availability and error rate. For HTTP services, a 5xx rate alert is often a better pager than CPU. Pair that with latency (p95 or p99) for key operations.
If you operate a multi-tenant or multi-region service, make your alerts dimension-aware. A global error rate might hide a regional outage; conversely, per-instance alerts might be too granular. A practical middle ground is per-region per-service alerting, with a roll-up notification if multiple regions are affected.
A log alert can implement this. For example, alert if the 5xx rate in the last 10 minutes exceeds 2% and there are at least N requests (to avoid false positives in low-traffic periods):
kusto
let window = 10m;
let minRequests = 200;
requests
| where timestamp > ago(window)
| where cloud_RoleName == "payments-api"
| summarize total=count(), failures=countif(toint(resultCode) between (500 .. 599)) by client_CountryOrRegion
| where total >= minRequests
| extend failureRate = todouble(failures) / todouble(total)
| where failureRate > 0.02
This example uses country/region as a dimension, but you can use client_IP, cloud_RoleInstance, or a custom dimension representing Azure region. The key point is to choose dimensions that map to operational ownership and remediation.
Example scenario: reducing alert fatigue for a legacy IIS application
Imagine a legacy .NET Framework application hosted on Windows VMs behind an Azure Load Balancer. Initially, the team uses log alerts on every exception, resulting in hundreds of alerts per day due to handled exceptions and expected client disconnects. By shifting to symptom-based alerts (5xx rate, p95 latency) and adding a daily report query for top exception types, the on-call load drops significantly while still preserving forensic data.
This is a recurring pattern: keep raw exception logs for investigations, but page on conditions that correlate strongly with user impact.
Instrumentation and distributed tracing across services
Modern applications are rarely a single process. A request can traverse an API gateway, multiple microservices, a message queue, and a database. Distributed tracing is the mechanism that ties these spans together using trace context (typically W3C Trace Context: traceparent and tracestate).
Application Insights supports distributed tracing concepts and can visualize end-to-end transactions. If you instrument services consistently (or use OpenTelemetry), you can follow a request from ingress to dependency calls and see where time is spent.
For system engineers, the practical requirement is consistency: ensure trace context is propagated across HTTP boundaries and messaging boundaries. For messaging, additional work is often needed because you must add and extract trace context from message headers.
If you operate hybrid services, Azure Monitor can still be the central store when using Azure Arc or when exporting telemetry from non-Azure environments into the same workspace. That helps keep cross-environment investigations consistent.
Monitoring common Azure application hosting targets
How you collect telemetry depends on where the application runs. The same observability goals apply, but the mechanics differ.
Azure App Service and Azure Functions
App Service provides platform metrics (CPU, memory, requests, HTTP 5xx) and can integrate with Application Insights for application telemetry. For many .NET and Java workloads, enabling Application Insights is straightforward, but you should still validate sampling settings, role names, and whether dependency tracking is enabled.
For Functions, Application Insights is central to visibility. Cold starts, execution duration, and trigger failures are typical signals. Functions often use queues or event streams, so dependency telemetry and custom logs are valuable.
As you transition from basic monitoring to comprehensive application insights, ensure you also collect App Service diagnostics logs when needed, but do it selectively. For example, you might enable HTTP logs temporarily during an incident window and rely on request telemetry for steady-state.
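If you do route HTTP logs during an incident window, a quick slice by status code and path helps separate platform-observed behavior from what request telemetry reports. This sketch assumes the dedicated AppServiceHTTPLogs table is populated and uses its IIS-style column names:
kusto
// HTTP status distribution from App Service platform logs
// (assumes the HTTP logs category is routed to the workspace; column names may vary).
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| summarize hits=count(), avgTimeTakenMs=avg(TimeTaken) by ScStatus, CsUriStem
| order by hits desc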
AKS and Container Insights
AKS introduces a layered telemetry landscape: cluster control plane logs, node-level metrics, pod/container logs, and application-level telemetry. Azure Monitor’s Container Insights can collect performance and inventory data. Application Insights (or OpenTelemetry) still needs to be configured at the application level for traces and dependencies.
The practical challenge in AKS is cardinality. Pod names change, containers scale, and naive alerting on per-pod logs becomes noisy. Anchor your monitoring on service-level identifiers (Kubernetes labels like app, component, namespace) and aggregate where possible.
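For example, an error query anchored on namespace and container name stays stable as pods churn. The sketch below assumes Container Insights is writing to the ContainerLogV2 table:
kusto
// Error-like log lines aggregated by namespace and container (not pod),
// so results stay stable across pod restarts and scale events.
ContainerLogV2
| where TimeGenerated > ago(30m)
| where tostring(LogMessage) has "error" or tostring(LogMessage) has "exception"
| summarize errorLines=count(), pods=dcount(PodName) by PodNamespace, ContainerName
| order by errorLines desc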
If you run multiple clusters, decide whether each cluster reports to its own workspace or to a centralized workspace. Centralizing simplifies cross-cluster queries, but ensure RBAC and cost allocation are addressed.
Virtual machines and Arc-enabled servers
For VM-hosted applications, combine OS-level signals via Azure Monitor Agent (performance counters, event logs, syslog) with application-level instrumentation via Application Insights SDK or OpenTelemetry.
This is where DCRs are especially valuable: they let you standardize what you collect from fleets. Avoid collecting every Windows event channel or all syslog facilities by default; start with what you need for operational questions.
Example scenario: diagnosing intermittent DNS failures on Linux VMs
A customer-facing API runs on Linux VMs in a scale set. Users report intermittent 502s. Application telemetry shows dependency failures to a third-party endpoint with "name resolution failed." By correlating application dependency telemetry with syslog entries from systemd-resolved (collected via AMA and a DCR), the team identifies that DNS timeouts spike during a scheduled network appliance maintenance window. The fix is to adjust DNS timeout/retry settings and reschedule the maintenance, but the key point is that Azure Monitor provided both the symptom (dependency failures) and the infrastructure evidence (resolver timeouts) in one queryable place.
Visualizing telemetry with Workbooks and dashboards
Once you have reliable collection and alerting, visualization becomes the layer that accelerates human understanding. Azure Monitor Workbooks are particularly useful because they support parameterized queries, multiple visualizations, and interactive drill-down.
A workbook can serve as an operational runbook: start with top-level indicators (availability, latency, error rate), then provide links or tabs for dependencies, exceptions, and infrastructure metrics. If you align workbook parameters with your resource naming conventions (service name, environment, region), you can make a single workbook reusable across many services.
Workbooks are also a way to institutionalize investigation patterns. If your on-call engineers repeatedly run the same KQL queries during incidents, turn them into workbook sections with clear labels and defaults. This reduces time-to-diagnosis and reduces reliance on tribal knowledge.
When building workbooks for comprehensive application insights, keep a clear hierarchy:
Start with user impact signals and SLO-style views (even if you do not formally implement SLOs). Then show top regressions (operations with biggest latency deltas). Then show dependencies and their p95/p99. Finally, show infrastructure saturation metrics.
This ordering mirrors how incidents unfold: you detect user impact, narrow to failing operations, identify the bottleneck, then confirm whether infrastructure constraints contributed.
Integrating Service Health, Resource Health, and change tracking
Application incidents are not always caused by your code. Azure outages, platform maintenance, and configuration drift can all produce symptoms that look like application problems.
Azure Service Health provides information about Azure service incidents and planned maintenance that may affect your subscriptions. Resource Health provides resource-specific health states, which is useful for understanding VM availability issues or platform-level degradation.
In parallel, the Activity Log is critical for change correlation. If latency spiked at 10:07, you want to know whether a scale operation, deployment, key rotation, or NSG rule update happened at 10:05. This is where collecting and querying Activity Log data (and ensuring you have appropriate permissions) becomes an operational necessity.
A simple Activity Log query pattern looks like this:
kusto
AzureActivity
| where TimeGenerated > ago(4h)
| where ResourceGroup == "rg-payments-prod"
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceId, Properties
| order by TimeGenerated desc
If you use deployment pipelines (Azure DevOps, GitHub Actions), also consider emitting deployment markers to Application Insights as custom events. That gives you a direct “release happened here” annotation that you can overlay on charts.
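If you adopt that pattern, the marker becomes queryable next to request telemetry. The sketch below assumes a hypothetical custom event named Deployment emitted by the pipeline; adjust the event name and dimensions to whatever you actually send:
kusto
// Compare failure rate before and after the most recent hypothetical "Deployment" event.
let lastDeploy = toscalar(
    customEvents
    | where timestamp > ago(24h)
    | where name == "Deployment" and cloud_RoleName == "payments-api"
    | summarize max(timestamp));
requests
| where timestamp > ago(24h)
| where cloud_RoleName == "payments-api"
| extend phase = iff(timestamp >= lastDeploy, "after-deploy", "before-deploy")
| summarize failures=countif(success == false), total=count() by phase
| extend failureRate = todouble(failures) / todouble(total)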
Managing sampling, retention, and cost without losing operational value
Comprehensive telemetry can be expensive if unmanaged. The goal is to reduce cost while preserving investigative power.
Sampling reduces the volume of telemetry, typically by retaining a representative subset of requests/traces. Adaptive sampling can automatically adjust based on traffic volume. The operational risk is that rare events can be sampled out. For high-volume endpoints, sampling is usually acceptable; for critical failure signals, you may want to ensure exceptions are retained or implement rules that keep all failed requests.
Retention controls how long data stays in interactive query storage. Many teams keep 30–90 days hot for operations and longer in archive or storage for compliance. Decide retention per table where possible, because not all data has equal value. For example, high-level request telemetry may be valuable longer than verbose trace logs.
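To decide where retention and sampling changes will pay off, start from ingestion volume per table. The Usage table reports billable volume (Quantity is in MB):
kusto
// Billable ingestion by table over the last 30 days, largest first.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize ingestedGB = sum(Quantity) / 1024.0 by DataType
| order by ingestedGB desc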
Another cost lever is the granularity of logs. If you log full request/response bodies, you can create both cost and security exposure. Keep logs focused on identifiers, durations, codes, and correlation IDs. Store sensitive payloads in secure systems, not in general-purpose logs.
Securing access and protecting sensitive telemetry
Telemetry often contains sensitive information: user identifiers, IP addresses, sometimes tokens if logging is careless. Treat Azure Monitor as part of your security posture.
Start with RBAC. Restrict who can read from Log Analytics workspaces and Application Insights resources. Use least privilege: many operators need read-only access, while only a small group needs write access to alert rules and diagnostic settings.
Be deliberate about data minimization. Avoid logging secrets or full payloads. When adding custom dimensions, ensure they do not include PII (personally identifiable information) unless you have a clear policy, legal basis, and access controls.
If you need to integrate with SIEM or long-term archival, exporting logs to Event Hubs or storage can be appropriate, but ensure the downstream systems have equivalent protections. Managed identities and private endpoints can reduce exposure when integrating within Azure.
Automating Azure Monitor configuration with Infrastructure as Code
Manual configuration does not scale. As your environment grows, you want repeatable templates for:
Application Insights creation (workspace-based), diagnostic settings for key resources, action groups, alert rules, and standard workbooks.
Whether you use Bicep, Terraform, or ARM templates, the operational principle is the same: treat monitoring configuration as part of the platform baseline. This prevents the common failure mode where new resources are deployed without logs or alerts.
Azure Policy can enforce or deploy diagnostic settings at scale. For example, you can assign policies that require Key Vault audit logs to be sent to a specified workspace. This is especially useful in regulated environments where you must prove collection.
Here is a compact Azure CLI example that shows the pattern of deploying an alert rule. Alert rule creation is detailed and varies by signal type, so in practice you’ll typically use IaC modules to standardize it.
bash
# Create a metric alert for App Service 5xx errors (template pattern; adjust resource ID and metric)
az monitor metrics alert create \
--name "appservice-5xx-high" \
--resource-group <rg> \
--scopes /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<app> \
--description "High server errors" \
--condition "total Http5xx > 50" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 1 \
--action-group <actionGroupResourceId>
Use this as a conceptual example: validate the metric name and aggregation for your resource type, because metric schemas differ across providers.
Building a cohesive incident workflow: from alert to diagnosis to remediation
Azure Monitor becomes truly effective when alerts, queries, and visualizations are designed as one workflow.
Start with a small set of high-confidence alerts that indicate user impact. Each alert should point to a workbook or saved query that answers the first investigative questions: which operation is affected, in which region, and since when. That workbook should in turn link to deeper views: dependency performance, exceptions, and infrastructure saturation.
This workflow should also integrate change context. If you can annotate charts with deployments and configuration changes, you reduce time spent guessing. Even simple additions—like including the last deployment timestamp in a workbook—can help.
Over time, refine alerts using post-incident reviews. If an alert fired but did not correspond to user impact, adjust thresholds or add suppression conditions (for example, minimum request count). If an incident happened without an alert, identify which signal would have detected it earlier and implement it.
The key is continuity: instrument consistently, collect with intention, alert sparingly but effectively, and provide prebuilt investigative paths.
Advanced correlation: joining application telemetry with infrastructure signals
Once you have both application and infrastructure telemetry in the same workspace, you can write cross-domain queries.
For example, you can correlate request latency spikes with node CPU saturation in AKS (assuming you ingest node metrics/logs into the workspace). Or correlate increased exceptions with a specific VM instance after a scale-out event.
A practical pattern is to extract a set of “bad” operation IDs and then pull all related telemetry:
kusto
let badOps = requests
| where timestamp > ago(30m)
| where cloud_RoleName == "payments-api"
| where success == false
| top 50 by timestamp desc
| project operation_Id;
union requests, dependencies, exceptions, traces
| where timestamp > ago(30m)
| where operation_Id in (badOps)
| project timestamp, itemType, cloud_RoleName, operation_Id, name, resultCode, duration, message
| order by operation_Id, timestamp asc
This kind of query is useful during live incidents because it presents a stitched timeline. You can then layer in infrastructure data by joining on cloud_RoleInstance or by time-binning.
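Time-binning is the simplest correlation technique when there is no shared key. The sketch below lines up per-minute request latency with per-minute CPU, assuming VM performance counters are collected into the Perf table and reusing the hypothetical payments-api role name:
kusto
// Per-minute p95 latency next to per-minute average CPU (time-binned join).
let latency = requests
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api"
| summarize p95DurationMs = percentile(duration, 95) by minute = bin(timestamp, 1m);
let cpu = Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avgCpu = avg(CounterValue) by minute = bin(TimeGenerated, 1m);
latency
| join kind=inner cpu on minute
| project minute, p95DurationMs, avgCpu
| order by minute asc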
Operating at scale: standardization, naming, and governance
Azure Monitor at small scale can be improvised; at enterprise scale it must be governed.
Standardize naming conventions for resources, role names, and environments because queries depend on them. If one service reports cloud_RoleName=payments and another reports PaymentsAPI, you will spend time normalizing instead of diagnosing.
Standardize tags and use them to drive alert scoping and workbook parameters. For example, tag resources with service, environment, and owner. While tags aren’t always present in telemetry tables directly, they are valuable for inventory, cost management, and policy.
Govern diagnostic settings with policy where possible. Ensure that critical resources (Key Vaults, Application Gateways, load balancers, databases) emit the logs you need. Make exceptions explicit.
Finally, decide on an operational baseline per service tier. A tier-0 service might require multi-region availability tests, strict latency SLO monitoring, and 24/7 paging. A tier-3 internal service might only require basic metrics and business-hours notifications. Azure Monitor supports both, but only if you encode the difference.
Real-world operational pattern: migrating from reactive logging to observability
A common journey is a team that starts with ad hoc logging and gradually builds observability.
In one scenario, a team running an e-commerce application on App Service initially relies on IIS logs and sporadic manual checks. After several incidents where problems were detected by customers first, they enable workspace-based Application Insights, implement availability tests, and build a workbook that shows checkout success rate, p95 latency, and top dependency failures.
The biggest improvement comes when they add a small number of well-designed alerts: checkout 5xx rate, checkout latency regression, and dependency timeout rate to the payment gateway. They route severity-1 alerts to the on-call action group and severity-2 to a Teams channel via webhook automation. Within a month, mean time to detection drops because issues are detected within minutes, and mean time to resolution drops because engineers have a consistent drill-down path.
This pattern highlights the intended use of Azure Monitor: it is not only a telemetry sink, but a system for operational feedback loops.
Workload-specific considerations: databases, messaging, and gateways
Applications rarely fail in isolation; dependencies are often the root cause. Azure Monitor can cover many common dependencies, but you need to enable the right signals.
For Azure SQL, combine platform metrics (DTU/vCore utilization, storage, deadlocks) with query performance insights where appropriate. For messaging (Service Bus, Event Hubs), monitor queue depth, incoming/outgoing messages, throttling, and dead-letter counts. For gateways (Application Gateway, Front Door), use access logs and WAF logs to distinguish between blocked traffic, backend failures, and client disconnects.
As you add these signals, tie them back to application telemetry. If request failures correlate with Service Bus dead-letter spikes, that points you toward message processing issues rather than HTTP handling.
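If you route those platform metrics to the workspace via diagnostic settings, they become queryable on the same timeline as request telemetry. The sketch below is assumption-heavy: it presumes Service Bus metrics land in the AzureMetrics table and that the metric of interest is named DeadletteredMessages; verify metric names for your resources.
kusto
// Dead-lettered message counts over time, for correlation with request failures.
// Assumes Service Bus "AllMetrics" is routed to this workspace via diagnostic settings.
AzureMetrics
| where TimeGenerated > ago(4h)
| where ResourceProvider == "MICROSOFT.SERVICEBUS"
| where MetricName == "DeadletteredMessages"
| summarize maxDeadLettered = max(Maximum) by bin(TimeGenerated, 15m), ResourceId
| order by TimeGenerated asc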
Using availability tests and synthetic monitoring appropriately
Availability tests (synthetic monitoring) provide an external perspective: can a user reach the endpoint and get a successful response? They are especially useful for detecting DNS issues, TLS certificate problems, and region-level failures.
Use them as a complement to internal signals. An application might be “healthy” from the perspective of CPU and error logs but still unreachable due to networking or DNS. Conversely, an availability test might fail due to a synthetic agent issue; that’s why you should alert with redundancy (multiple locations) and avoid paging on a single probe failure.
Availability tests also help validate your alerting: if your error-rate alert fires but availability tests remain green, you may be alerting on non-user-impacting errors (for example, background endpoints).
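In a workspace-based setup, availability test results are queryable alongside the rest of your telemetry, which makes the "alert fired but probes are green" comparison easy to automate. A sketch using the classic availabilityResults table:
kusto
// Availability test success rate by test and probe location over the last hour.
// The success field is recorded as 0/1, so toint() keeps the comparison robust.
availabilityResults
| where timestamp > ago(1h)
| summarize total=count(), passed=countif(toint(success) == 1) by name, location
| extend successRate = todouble(passed) / todouble(total)
| order by successRate asc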
Practical KQL for operational dashboards and on-call runbooks
As you turn your monitoring into a repeatable practice, saved queries and workbook panels become a lightweight runbook.
A useful “top errors” query groups exceptions by type and operation:
kusto
exceptions
| where timestamp > ago(6h)
| where cloud_RoleName == "payments-api"
| summarize count() by type, outerMessage, problemId
| order by count_ desc
A useful “dependency health” query highlights timeouts and failures:
kusto
dependencies
| where timestamp > ago(1h)
| where cloud_RoleName == "payments-api"
| summarize total=count(), failed=countif(success == false), p95=percentile(duration, 95) by target, name
| extend failRate = todouble(failed)/todouble(total)
| order by failRate desc, p95 desc
A useful “who is impacted” query slices by client geography or by authenticated role (depending on what you collect):
kusto
requests
| where timestamp > ago(30m)
| where cloud_RoleName == "payments-api"
| summarize total=count(), failed=countif(success == false) by client_CountryOrRegion
| extend failRate = todouble(failed)/todouble(total)
| order by failRate desc
In practice, you’ll tailor these to your identity model and data minimization requirements.
Keeping the narrative cohesive: turning telemetry into operational outcomes
At this point, the building blocks should fit together. You start by designing a workspace strategy so data is centralized (or segmented) in a controlled way. You instrument applications with Application Insights or OpenTelemetry so you have request, dependency, and exception visibility. You enable diagnostic settings and DCRs so infrastructure and platform logs are available for correlation. Then you use KQL to query across signals, build workbooks that mirror investigation workflows, and implement alerts that page only when user impact is likely.
What makes Azure Monitor effective for comprehensive application insights is not any single feature; it’s the consistency of the operational model. If the same service is always identifiable by cloud_RoleName, if logs for key dependencies are always routed to the workspace, and if every severity-1 alert links to the same investigative workbook, your on-call experience changes from “search and guess” to “follow a known path.”
The final piece is ongoing governance: manage retention, control sampling, review alert performance, and standardize configuration with policy and infrastructure as code. That’s how Azure Monitor becomes a durable, scalable observability foundation rather than a collection of disconnected charts.