Agent Connectivity Troubleshooting for IT Admins: A Step-by-Step Workflow


Agent connectivity problems are rarely “mysterious.” They are usually the predictable result of one of a few breakpoints in the path between an endpoint and a vendor or management service: name resolution fails, routing/NAT breaks, a firewall blocks, a proxy changes behavior, TLS inspection alters certificates, time drift invalidates certificates, or the local agent process cannot reach the network. The challenge for IT administrators is that the symptoms tend to look the same—“agent offline,” “no heartbeat,” “cannot enroll,” “telemetry delayed”—even though the underlying causes differ.

This article lays out a practical, step-by-step workflow you can apply to most endpoint agents (EDR, monitoring agents, log forwarders, configuration management agents, vulnerability scanners, etc.). It avoids product-specific features and focuses on fundamentals you can validate with standard tools on Windows and Linux. The goal is repeatability: you should be able to start with minimal assumptions, confirm what is true, and narrow the problem quickly.

Throughout the guide, you’ll see real-world scenarios integrated into the workflow. These are not isolated anecdotes; they show how the same steps lead to different fixes depending on the environment.

Define the communication model before you test

Before running commands, clarify what “connectivity” means for your agent. Different agents use different patterns, and the tests you choose should match the agent’s actual behavior.

Most enterprise agents use outbound connections initiated from the endpoint to a vendor cloud or internal management plane. That model is attractive because it avoids inbound firewall rules and simplifies NAT. However, some agents also require inbound access for management actions, local webhooks, or peer-to-peer components. Others use message queues, long-lived WebSockets, mutual TLS (mTLS), or certificate pinning. If you guess the model incorrectly, you may “prove” network connectivity with ICMP ping while the agent still fails due to proxy authentication, blocked SNI, or TLS inspection.

Start by answering four questions using your vendor documentation and what you can observe on a healthy host:

First, what are the destination hostnames and ports? Prefer FQDNs and service tags over IPs, since many cloud services rotate IP ranges frequently. If your security policy requires IP allowlists, you will need a process to keep them current; otherwise, plan for FQDN-based rules.

Second, does the agent require an HTTP(S) proxy? If so, is it system-wide (WinHTTP / environment variables) or application-specific? Some agents only respect WinHTTP, others only read environment variables, and some support explicit proxy settings in their own configuration.

Third, what does “connected” look like? Agents often maintain more than one channel: an enrollment channel, a command/heartbeat channel, and a telemetry upload channel. You may see partial failures where enrollment succeeds but telemetry fails due to size limits or MTU/fragmentation issues.

Fourth, what identity and trust does the agent use? Many agents authenticate with an API token issued at install time, a machine certificate, or a device identity registered during enrollment. TLS (certificate chain validation) is a frequent point of failure in corporate environments with inspection proxies.

If you do not have a vendor port list handy, you can still proceed by observing a healthy system’s outbound connections and DNS queries, then validating those dependencies on a failing host.

Establish a baseline on a known-good host

A reliable baseline prevents you from chasing false positives. Pick a host with the same OS family and network segment as the failing system if possible. Your objective is to capture: which processes connect, to where, over what ports, and whether a proxy is involved.

On Windows, you can identify active connections and the owning process using PowerShell. Run this on a healthy endpoint while the agent is expected to be active:

powershell

# Show established outbound TCP connections with owning process

Get-NetTCPConnection -State Established |
  Select-Object LocalAddress,LocalPort,RemoteAddress,RemotePort,OwningProcess |
  ForEach-Object {
    $_ | Add-Member -NotePropertyName ProcessName -NotePropertyValue (Get-Process -Id $_.OwningProcess -ErrorAction SilentlyContinue).ProcessName -PassThru
  } |
  Sort-Object RemoteAddress,RemotePort,ProcessName |
  Format-Table -AutoSize

This is not perfect—some agents use short-lived connections or UDP—but it gives you a starting point. If you need deeper visibility, Windows Resource Monitor (resmon.exe) or ETW-based tools can help, but start simple.

On Linux, ss can show active connections:

bash

# Show established outbound TCP connections with process info

sudo ss -tpn state established

To capture DNS dependencies, query the local resolver cache (where available) and observe which domains are requested when the agent starts. Even without a DNS cache, you can review logs on your DNS servers or use a short packet capture on the healthy host (discussed later) to see what names are resolved.
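
For example, on a healthy Linux host you could record DNS queries for a minute while the agent restarts, then list the names it looked up. This is a sketch rather than a vendor procedure; the interface choice ("any"), the 60-second window, and the unit name are assumptions to adapt.

bash

# Capture DNS queries for 60 seconds; restart the agent from another
# terminal while this runs

sudo timeout 60 tcpdump -i any -nn udp port 53 -w /tmp/agent-dns.pcap

# Afterwards, list the names that were queried

tcpdump -r /tmp/agent-dns.pcap -nn 2>/dev/null | grep -E ' A\? | AAAA\? '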

This baseline becomes your comparison point: the failing host should look broadly similar in terms of destinations and proxy usage. If it doesn’t, that difference often explains the failure.

Scenario 1: “It works on Wi‑Fi but not on Ethernet”

A common early clue is that the same laptop reports the agent online when on a guest Wi‑Fi or hotspot, but offline when connected to corporate Ethernet/VPN. This immediately suggests the agent and OS are fundamentally functional, and the break is in the corporate network path: a proxy requirement, SSL inspection behavior, firewall policy, or DNS split-horizon difference.

The baseline host on corporate Ethernet will show what the agent needs in that environment. If the baseline host also fails intermittently, your “baseline” may not be stable, and you should choose another host or segment.

Step 1: Verify the local agent service and time synchronization

Although this is a connectivity guide, start local. A surprising number of “connectivity” tickets are actually service health or clock issues that surface as TLS failures.

On Windows, confirm the agent service exists, is running, and is not crash-looping. Use the Services snap-in for a quick view, but PowerShell is better for repeatable output:

powershell

# Replace with the actual service name if known

Get-Service | Where-Object {$_.Name -match "agent|sensor|forwarder"} | Format-Table -AutoSize

# If you know the exact service name

Get-Service -Name "YourAgentServiceName" | Format-List *

If the service is running but the agent is still offline, check whether it can reach the network under its service account context. Agents running as LocalSystem typically have broad access, but if an agent runs under a restricted account, it may lack proxy settings or certificate access.

On Linux, identify the systemd unit and status:

bash

# List likely agent units

systemctl list-units --type=service | egrep -i 'agent|sensor|forward|collector'

# Check a specific unit

sudo systemctl status your-agent.service --no-pager

Now validate system time. TLS relies on valid time ranges for certificates. If NTP is blocked or the system clock is drifting, the agent may fail certificate validation even though the network path is open.

On Windows:

powershell
w32tm /query /status
w32tm /query /peers

On Linux (systemd-timesyncd or chrony environments differ):

bash
timedatectl status

# If chrony is used

chronyc tracking 2>/dev/null || true

If time is off by minutes or hours, fix that first. It’s a foundational dependency for the TLS checks you’ll do later.

Step 2: Confirm basic IP configuration and routing

Once the agent is confirmed running and the clock is sane, validate the host’s basic networking. Agents that are “offline” often sit on endpoints with misconfigured default gateways, missing routes to proxy subnets, or stale VPN routes.

On Windows, capture IP configuration and route table:

powershell
ipconfig /all
route print
Get-NetIPConfiguration
Get-NetRoute | Sort-Object -Property InterfaceIndex,DestinationPrefix | Select-Object -First 50

On Linux:

bash
ip addr
ip route
resolvectl status 2>/dev/null || cat /etc/resolv.conf

You’re looking for a valid IP, correct subnet mask, correct default gateway, and expected DNS servers. For split-tunnel VPNs, verify that routes to the management plane are present (or absent, depending on design). For example, if your agent’s destinations are internal addresses reachable only via VPN, a split-tunnel configuration that does not include those routes will break connectivity.
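
On Linux, ip route get shows which route, source address, and egress interface the kernel would choose for a specific destination, which is a quick way to spot missing or stale VPN routes. The addresses below are placeholders; substitute the resolved IPs of your agent's actual destinations.

bash

# Which route and interface would be used for a given destination?

ip route get 203.0.113.10

# Compare against an internal destination that should go via the VPN

ip route get 10.20.30.40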

At this stage, avoid relying on ping alone. ICMP is often blocked, and many vendor endpoints won’t respond. Instead, validate that you can route to the destination network and that the correct egress interface is used. On Windows, Test-NetConnection is more relevant for TCP services:

powershell

# Example: test TCP 443 to a known endpoint

Test-NetConnection -ComputerName example.vendor.com -Port 443

On Linux, prefer curl or nc for TCP reachability:

bash

# TCP connect test

nc -vz example.vendor.com 443

If the TCP connect fails, you have a path problem (routing, firewall, proxy requirement, DNS, or TLS interception depending on how you tested). If the TCP connect succeeds but the agent still fails, you’ll move deeper into proxy and TLS validation.

Step 3: Validate DNS resolution in the same way the agent uses it

DNS issues are a top cause of agents showing offline, especially in networks with split-horizon DNS, internal DNS forwarders, or content filtering. A key detail is that “DNS works” is not binary: you need the right answer from the right resolver, and you need it consistently.

Start by identifying which DNS servers the host is using and whether they are the intended ones. Then check whether the agent’s destination names resolve, and whether they resolve to expected IP ranges.

On Windows:

powershell
Get-DnsClientServerAddress | Format-Table -AutoSize
Resolve-DnsName example.vendor.com

On Linux:

bash

# systemd-resolved environments

resolvectl query example.vendor.com 2>/dev/null || true

# generic

dig +short example.vendor.com
nslookup example.vendor.com

If resolution fails intermittently, compare answers across resolvers:

bash
dig @1.1.1.1 example.vendor.com +short
dig @8.8.8.8 example.vendor.com +short

Be careful here: testing public resolvers may not reflect your corporate environment. The point is to detect whether your corporate DNS is returning NXDOMAIN, a walled-garden IP, or a filtered response.

Also check for IPv6. Some agents will prefer AAAA records if available; if your network advertises IPv6 but does not route it properly, endpoints may try IPv6 and fail. In that case, you’ll see AAAA responses but broken connectivity.

On Windows:

powershell
Resolve-DnsName example.vendor.com -Type AAAA

On Linux:

bash
dig AAAA example.vendor.com +short

If you confirm IPv6 is the culprit, address it by fixing IPv6 routing or disabling IPv6 where appropriate per your standards (and only with consideration for broader impact). A quick mitigation for testing is to force IPv4 in your HTTP tests (curl -4) to see if the path works.
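
A quick comparison is to run the same request forced over each address family; if the IPv6 attempt stalls while the IPv4 one completes, you have likely found the culprit. The hostname is a placeholder.

bash

# Force IPv4

curl -4 -sv https://example.vendor.com/ -o /dev/null --max-time 15

# Force IPv6 - does it hang or fail while IPv4 works?

curl -6 -sv https://example.vendor.com/ -o /dev/null --max-time 15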

Scenario 2: Split-horizon DNS sends agents to an internal sinkhole

A real pattern in enterprises is split-horizon DNS that returns internal “block” IPs for certain categories of domains. A security team may block newly registered domains, CDNs, or “unknown” categories. If your vendor uses a CDN, that CDN hostname might fall into a filtered category.

In this scenario, Resolve-DnsName on the failing host returns an internal RFC1918 address, while a healthy host (or a host on a different DNS policy) returns public IPs. The agent is not actually failing at TLS or authentication; it is connecting to the wrong destination. The fix is a DNS policy exception for the required hostnames, not a reinstall of the agent.

Step 4: Determine whether a proxy is required, and which proxy settings apply

After DNS and routing, proxy behavior is the next most common breakpoint. Many corporate networks require outbound HTTP/HTTPS traffic to traverse a proxy, and direct-to-internet connections on TCP/443 are blocked.

The tricky part is that “proxy configured” depends on the stack. On Windows, there are two common proxy configurations:

WinINET proxy settings, used by interactive user applications (e.g., Internet Options). These may be set via Group Policy or user settings.

WinHTTP proxy settings, used by system services and many non-interactive components. Many agents running as services use WinHTTP.

You must check both and then validate what your agent actually uses.

On Windows, review WinHTTP proxy:

powershell
netsh winhttp show proxy

To view the current user’s WinINET proxy (useful for comparison, even if the agent runs as a service):

powershell
reg query "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v ProxyEnable
reg query "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v ProxyServer
reg query "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v AutoConfigURL

If you use PAC files (Proxy Auto-Config), remember that services typically do not execute PAC logic unless the application explicitly supports it. An agent may need an explicit proxy host:port rather than “AutoDetect” behavior.

On Linux, proxy settings often come from environment variables (http_proxy, https_proxy, no_proxy) and/or application configuration files. Systemd services may have a different environment than interactive shells.

bash

# Interactive environment

env | egrep -i 'http_proxy|https_proxy|no_proxy'

# For a systemd service, inspect its environment (if permitted)

systemctl show your-agent.service -p Environment -p EnvironmentFile

With proxy expectations clarified, validate connectivity through the proxy explicitly. For HTTP(S), curl is usually the most direct test because it shows proxy negotiation errors.

bash

# Direct connection test (no proxy)

curl -vk https://example.vendor.com/ --max-time 15

# Explicit proxy test

curl -vk -x http://proxy.corp.local:8080 https://example.vendor.com/ --max-time 15

On Windows, if curl is available (Windows 10/11 include a curl binary), you can run the same tests. Otherwise, PowerShell’s Invoke-WebRequest can help, but it hides some TLS details. curl -v is generally preferable for diagnosing proxy and TLS.

If the environment requires proxy authentication (e.g., NTLM/Kerberos/Basic), your agent must support it. Some agents can’t do interactive auth and require a proxy allow rule or an authentication bypass for the agent’s destinations.
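
If you suspect proxy authentication specifically, curl can make the distinction visible: an unauthenticated request that returns 407 confirms auth is required, and a retry with credentials shows whether the scheme is accepted. The proxy address, credentials, and hostname below are placeholders, and NTLM/Negotiate support depends on how your curl build was compiled.

bash

# Unauthenticated attempt - a 407 response confirms the proxy demands auth

curl -v -x http://proxy.corp.local:8080 https://example.vendor.com/ --max-time 15

# Retry with explicit proxy credentials (Basic by default)

curl -v -x http://proxy.corp.local:8080 --proxy-user 'DOMAIN\user:password' https://example.vendor.com/ --max-time 15

# Or attempt Kerberos/Negotiate with the current identity (build-dependent)

curl -v -x http://proxy.corp.local:8080 --proxy-user : --proxy-negotiate https://example.vendor.com/ --max-time 15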

A common failure mode is that direct TCP/443 is blocked and a proxy is required, but the agent is not configured to use the proxy, so it never reaches the vendor.

Step 5: Validate TCP reachability to the correct ports (without assuming ICMP)

With DNS and proxy in hand, you can do port-level tests that reflect how the agent connects.

If the agent connects directly to vendor endpoints over 443, test TCP 443 without going through a proxy. If the environment mandates proxy usage, test reaching the proxy on its port and then test the proxy’s ability to reach the upstream destination.

On Windows, Test-NetConnection helps distinguish DNS resolution and TCP connection:

powershell
Test-NetConnection -ComputerName example.vendor.com -Port 443 -InformationLevel Detailed

# If a proxy is required, first confirm you can reach the proxy itself

Test-NetConnection -ComputerName proxy.corp.local -Port 8080 -InformationLevel Detailed

On Linux:

bash

# Direct TCP

nc -vz example.vendor.com 443

# Proxy TCP

nc -vz proxy.corp.local 8080

If these tests succeed, you have basic L3/L4 reachability. That does not guarantee the agent can authenticate or complete TLS, but it rules out obvious firewall and routing blocks.

If the tests fail, capture the failure type. A connection timeout points to routing or firewall drops. A connection refused suggests the host is reachable but nothing is listening on that port (common when testing the wrong port or when a transparent proxy is not behaving as expected). A DNS failure points back to Step 3.

At this point, you should be able to state with evidence whether the endpoint can reach the vendor/proxy at the required ports. If it can, the next likely category is TLS and certificate validation.

Step 6: Validate TLS behavior and certificate trust end-to-end

Most agents communicate over TLS (HTTPS) and rely on OS certificate stores to validate server certificates. Corporate TLS inspection (also called SSL inspection) can break agents in several ways:

If the proxy replaces the server certificate with a corporate-issued certificate, the agent must trust the corporate root CA. If the corporate root CA is not installed in the correct trust store (or not available to the service account), TLS will fail.

If the agent uses certificate pinning (it expects a specific certificate/public key), inspection will cause failures even if the root CA is trusted.

If TLS versions or cipher suites are restricted on either side, negotiation can fail.

If system time is incorrect (Step 1), certificates may appear expired or not yet valid.

You want to answer two questions: what certificate chain is presented to the endpoint, and does the endpoint trust it?

On Linux and macOS (or anywhere openssl is available), openssl s_client is a direct way to view the presented certificate chain:

bash

# Direct to server (no proxy)

openssl s_client -connect example.vendor.com:443 -servername example.vendor.com -showcerts </dev/null

Look at the issuer, the chain, and whether verification succeeds (Verify return code: 0 (ok)). If you see an issuer that matches your corporate inspection CA, you are not talking directly to the vendor; you are seeing a substituted certificate.
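
To make the issuer comparison quick, you can pipe the presented leaf certificate into openssl x509 and print only the fields that matter (the hostname is a placeholder):

bash

# Print issuer, subject, and validity dates of the presented leaf certificate

openssl s_client -connect example.vendor.com:443 -servername example.vendor.com </dev/null 2>/dev/null |
  openssl x509 -noout -issuer -subject -dates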

For proxied environments, testing TLS “through” the proxy is different because the proxy typically uses HTTP CONNECT to tunnel TLS. curl -v is often simpler here because it shows CONNECT success/failure and then the TLS handshake.

bash
curl -vk -x http://proxy.corp.local:8080 https://example.vendor.com/ --max-time 15

In the verbose output, you’ll see whether CONNECT succeeds, what certificate is presented, and whether the client accepts it.

On Windows, openssl is not always present, but you can still inspect the certificate chain using curl -v and validate trust stores through the Certificates MMC snap-in or PowerShell. To check whether a specific root CA is present:

powershell

# Search LocalMachine Root store for a CA subject match

Get-ChildItem -Path Cert:\LocalMachine\Root |
  Where-Object { $_.Subject -like "*Your Corporate Root CA*" } |
  Select-Object Subject, Thumbprint, NotAfter

If the corporate root is missing, installing it via Group Policy or your endpoint management platform typically resolves inspection-related trust failures. If the corporate root is present but the agent still fails, consider service account context: some agents use custom trust stores or bundle CA certificates.

Also validate revocation checking behavior. Some environments block outbound access to Certificate Revocation List (CRL) distribution points or OCSP responders, causing TLS validation delays or failures. This is especially common on servers with restricted egress. You might see timeouts that look like “agent slow to connect” rather than a clean failure.

In those cases, allowing CRL/OCSP traffic or configuring revocation checking policy appropriately can be required. The precise change depends on your security policy; the key is recognizing the symptom: TLS connects but stalls during verification.
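
One way to confirm this symptom is to extract the revocation URLs from the certificate the endpoint actually receives and test whether those URLs are reachable over plain HTTP. The responder URL at the end is a placeholder; use whatever your vendor's chain publishes.

bash

# Save the presented leaf certificate

openssl s_client -connect example.vendor.com:443 -servername example.vendor.com </dev/null 2>/dev/null |
  openssl x509 -outform PEM > /tmp/leaf.pem

# Extract the OCSP responder and CRL distribution point URLs

openssl x509 -in /tmp/leaf.pem -noout -ocsp_uri
openssl x509 -in /tmp/leaf.pem -noout -text | grep -A 2 'CRL Distribution'

# Test reachability of the responder (substitute the URL printed above)

curl -sv --max-time 10 http://ocsp.example-ca.invalid/ -o /dev/null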

Scenario 3: SSL inspection breaks enrollment after a proxy change

An enterprise rolls out a new secure web gateway and enables TLS inspection for “unknown” categories. Endpoints that previously enrolled successfully now fail during enrollment, while already-enrolled agents show intermittent heartbeats.

Applying the workflow:

You validate Step 4 and confirm the proxy is now mandatory for outbound 443.

In Step 6, curl -vk through the proxy shows the server certificate issuer is the corporate inspection CA, not the vendor CA.

On affected endpoints, the corporate root CA is missing from the LocalMachine Root store because those devices are in a staging OU that didn’t receive the new GPO.

The fix is not reinstalling the agent; it’s ensuring the inspection root CA is deployed to the correct store and that the proxy policy excludes the vendor domains if the agent uses pinning.

The important operational lesson is that proxy/TLS changes should be tested against agent behaviors explicitly, because “web browsing works” does not guarantee “non-interactive agents trust the chain.”

Step 7: Check local firewall, EDR self-protection, and OS platform controls

Even when the network path is open, the endpoint itself can block agent traffic. Local firewall rules, endpoint security controls, and OS hardening can prevent a process from opening sockets or reaching certain destinations.

On Windows, Windows Defender Firewall rules can be scoped by program, service, profile (Domain/Private/Public), and direction. If your environment uses local firewall policies via GPO or MDM, rules can vary by OU or device compliance state.

Check firewall profile state:

powershell
Get-NetFirewallProfile | Format-Table Name, Enabled, DefaultInboundAction, DefaultOutboundAction

If outbound is default-allow (common), program-specific blocks are more relevant. To find blocks for a specific executable, you may need to search firewall rules by program path, but you must know where the agent binary resides. If you do, you can query rules:

powershell

# Example: search for rules with a program path containing 'agent'

Get-NetFirewallRule -PolicyStore ActiveStore |
  Where-Object { $_.Enabled -eq 'True' } |
  ForEach-Object {
    $app = (Get-NetFirewallApplicationFilter -AssociatedNetFirewallRule $_ -ErrorAction SilentlyContinue).Program
    if ($app -and $app -match 'agent') { $_ | Select-Object DisplayName, Direction, Action, Enabled }
  } | Format-Table -AutoSize

On Linux, check the host firewall (nftables/iptables/firewalld/ufw). The exact tooling varies, but you can at least confirm whether an outbound policy exists:

bash

# firewalld

sudo firewall-cmd --state 2>/dev/null || true
sudo firewall-cmd --list-all 2>/dev/null || true

# nftables

sudo nft list ruleset 2>/dev/null | head

# ufw

sudo ufw status verbose 2>/dev/null || true

Also consider EDR “self-protection” and application control. Some agents embed kernel drivers or protect their own processes; similarly, other security tooling can interfere with network calls by injecting into processes or enforcing allowlists. If the connectivity issue coincides with a new endpoint protection rollout, isolate by correlating install timelines and checking whether the agent process is being constrained.

A practical way to validate whether the problem is “process-specific” is to test connectivity from the same host using a generic tool (curl) and compare to what the agent process experiences. If curl can reach the endpoint but the agent cannot, the issue is likely local policy, process restrictions, or agent configuration rather than network.
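
On Linux, one way to approximate the agent's context rather than your interactive session is to run the same curl test as the agent's service account, with and without the proxy variables its unit would see. The account name, proxy, and hostname are placeholders.

bash

# Run the test as the agent's service account (placeholder account name)

sudo -u agent-svc curl -sv https://example.vendor.com/ -o /dev/null --max-time 15

# Repeat with the proxy variable the unit would see (value is a placeholder)

sudo -u agent-svc env https_proxy=http://proxy.corp.local:8080 curl -sv https://example.vendor.com/ -o /dev/null --max-time 15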

Step 8: Confirm the agent is using the expected egress path (NAT, VPN, and identity)

Many vendor services enforce policy based on source IP, geolocation, or tenant allowlists. Similarly, some internal management planes only accept connections from specific NAT pools or VPN concentrators. If your endpoint’s egress path changes, the agent may be blocked even though it can “connect.”

Common causes include:

A new NAT gateway for a subnet, with the old public IP allowlisted at the vendor.

A split-tunnel VPN change, sending agent traffic out the local ISP rather than corporate egress.

A cloud egress firewall or secure web gateway applying different policy based on source subnet tags.

To validate egress identity, check the public IP as seen by external services (if policy permits):

bash
curl -s https://ifconfig.me

On Windows (PowerShell):

powershell
(Invoke-RestMethod -Uri "https://ifconfig.me/ip").Trim()

In regulated environments you may not be allowed to query external “what is my IP” services. In that case, use firewall/NAT logs from your perimeter, secure web gateway logs, or cloud NAT gateway metrics to map the internal host IP to a public egress IP.

If the vendor restricts source IPs, you must update the allowlist to include the new egress. If your architecture expects all agent traffic to go through a central egress, ensure routing and policy routing send it there.

This step ties back to earlier tests: you may have proven the endpoint can reach the vendor, but if it’s reaching from an unexpected egress, the vendor may silently drop or reject requests at the application layer.

Step 9: Check MTU and path fragmentation when symptoms are intermittent

MTU (Maximum Transmission Unit) problems are a classic cause of “it connects but doesn’t work” issues, especially for VPNs, tunnels, and environments with PPPoE or GRE. Agents may establish TCP connections but fail on larger uploads, resulting in delayed telemetry, stuck updates, or periodic disconnects.

A key sign is that small HTTPS requests succeed (e.g., simple heartbeat) but larger payloads fail (e.g., uploading logs, downloading updates). Another sign is that the issue only occurs on a specific network path—often over VPN.

On Linux, you can test path MTU discovery behavior with ping using the “do not fragment” flag. Note that some networks block ICMP fragmentation-needed messages, which itself can cause MTU black holes.

bash

# Example: test MTU with DF set (adjust size as needed)

# 1472 bytes payload + 28 bytes IP/ICMP header = 1500

ping -M do -s 1472 example.vendor.com

On Windows:

powershell

# 1472 payload is typical for 1500 MTU, adjust as needed

ping example.vendor.com -f -l 1472

If packets at 1472 fail but smaller succeed, you likely have a lower MTU path. The fix is usually to set the correct MTU on the VPN interface/tunnel or ensure ICMP fragmentation-needed is permitted so PMTUD can function. Be cautious about making endpoint-wide MTU changes without understanding the path; coordinate with network teams.
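
On Linux, tracepath offers a complementary view: it reports the path MTU it discovers toward the destination, which you can compare against the local interface MTU. The hostname and interface name are placeholders.

bash

# Report the discovered path MTU (look for the "pmtu" value in the output)

tracepath -n example.vendor.com

# Compare against the local interface MTU

ip link show dev eth0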

Even if ICMP is blocked, you can infer MTU issues by capturing packets and observing retransmissions and stalled TLS record transmissions, which leads into the next step.

Step 10: Use targeted packet capture to confirm what is actually happening

When earlier steps don’t yield a clear answer, packet capture provides ground truth. The goal is not to capture everything; it’s to capture enough to answer a specific question: is the endpoint attempting to connect, is it reaching the destination, and what fails (DNS, TCP handshake, TLS handshake, proxy CONNECT, application reset)?

On Windows, you can use pktmon (built-in) or Wireshark/Npcap if permitted. pktmon can capture and then convert to pcapng for analysis.

powershell

# Start capture

pktmon start --capture --comp nics --pkt-size 0 --file-name C:\Temp\agent.pktmon

# Reproduce the issue (restart agent service or wait for heartbeat)

# Stop capture

pktmon stop

# Convert to pcapng for Wireshark

pktmon pcapng C:\Temp\agent.pktmon -o C:\Temp\agent.pcapng

On Linux, tcpdump is the standard tool. Focus on the relevant hostnames/IPs and ports. If you don’t know IPs, capture DNS and 443 temporarily.

bash

# Capture DNS and TLS traffic (adjust interface)

sudo tcpdump -i eth0 -nn 'port 53 or port 443' -w /tmp/agent.pcap

Then analyze:

If you see repeated SYNs with no SYN-ACK, it’s a path/firewall drop (Step 5).

If TCP establishes but TLS ClientHello is followed by an alert or reset, it’s typically TLS policy/inspection/ciphers (Step 6).

If you see CONNECT requests to a proxy returning 407 (Proxy Authentication Required), it’s a proxy auth issue (Step 4).

If DNS responses return unexpected IPs, it’s a DNS policy issue (Step 3).

Packet capture also helps validate whether the agent is attempting IPv6 first, whether SNI is present (modern TLS), and whether the proxy is intercepting.
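
If tshark (Wireshark's command-line companion) is available, a couple of display filters answer these questions directly from the capture written earlier. The field names below reflect recent Wireshark releases; older builds used ssl.* instead of tls.*.

bash

# Which hostnames (SNI) did the endpoint try to reach over TLS?

tshark -r /tmp/agent.pcap -Y 'tls.handshake.type == 1' -T fields -e ip.dst -e ipv6.dst -e tls.handshake.extensions_server_name

# Were there unanswered SYNs? Repeated retransmissions suggest a silent drop

tshark -r /tmp/agent.pcap -Y 'tcp.flags.syn == 1 && tcp.flags.ack == 0'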

Because captures can contain sensitive data, follow your organization’s policy: limit duration, capture only relevant traffic, store securely, and redact before sharing.

Step 11: Validate agent-specific configuration without assuming defaults

By this point you’ve validated the platform and the network path. If everything looks correct yet the agent remains offline, the remaining causes tend to be agent configuration and identity: incorrect tenant ID, wrong enrollment token, blocked device identity, stale certificates, or a proxy setting mismatch.

Even though this article avoids vendor-specific commands, there are universal checks you can apply:

Confirm the agent is pointed at the correct environment (prod vs staging). In multi-tenant setups, it’s easy to enroll into the wrong tenant and then “disappear” from the expected console.

Confirm that any required enrollment token is present and not expired. Some agents store tokens in config files, registry keys, or secure stores.

Check whether the agent is configured for proxy use explicitly and whether it supports PAC, auth, or mTLS.

Confirm that the agent’s local certificate store or identity cache is intact. Some agents generate a device certificate on enrollment; if the local store is corrupted or permissioned incorrectly, re-enrollment may be required.

On Windows, it’s often useful to check the Application and System Event Logs for the agent’s source name. Many agents write clear TLS or proxy errors there. On Linux, check journalctl for the unit.

Windows:

powershell

# Filter recent events that mention proxy/TLS/connectivity (generic approach)

Get-WinEvent -LogName Application -MaxEvents 200 |
  Where-Object { $_.Message -match 'proxy|tls|ssl|certificate|connect|handshake' } |
  Select-Object TimeCreated, ProviderName, Id, LevelDisplayName, Message |
  Format-List

Linux:

bash
sudo journalctl -u your-agent.service --since "-2 hours" --no-pager

If logs show HTTP 401/403 errors, the agent may be reaching the service but failing authentication. That’s not a firewall problem; it’s a token/identity/tenant policy problem. If logs show certificate validation errors, return to Step 6 and confirm which trust store the agent uses.

A disciplined approach here is to avoid “reinstalling as a test” until you can explain what reinstall would change. Reinstall can help if the agent’s local identity store is broken, but it also adds noise and may mask a network policy issue that will recur.

Step 12: Correlate with network security tooling logs (proxy, firewall, SWG, ZTNA)

Endpoint-side testing is necessary but sometimes insufficient, especially in modern environments where traffic passes through secure web gateways (SWG), cloud firewalls, zero-trust network access (ZTNA) connectors, and DLP tools.

If you have access to these logs, they can shorten time-to-root-cause dramatically. The earlier steps help you ask precise questions of those systems:

Do you see the endpoint’s source IP initiating sessions to the vendor FQDNs on the expected ports?

Is the SWG categorizing the domain in a blocked category?

Is TLS inspection applied to these domains, and is the inspection policy different for certain subnets/users?

Is the proxy returning 407 (auth) or 503 (upstream issue)?

Are connections allowed but uploads blocked due to size or DLP rules?

The key is to line up timestamps. When you run curl -vk or restart the agent, note the time and the endpoint IP. Then search the proxy/firewall logs for that timestamp and destination. If your tooling supports it, filter by SNI (Server Name Indication) for TLS traffic, which often corresponds to the hostname even when IPs are shared.
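
A small habit that makes this correlation reliable is to print a UTC timestamp immediately before and after the test, then search the gateway logs for that window. The proxy and hostname are placeholders.

bash

# Bracket the test with UTC timestamps for log correlation

date -u +%Y-%m-%dT%H:%M:%SZ
curl -vk -x http://proxy.corp.local:8080 https://example.vendor.com/ -o /dev/null --max-time 15
date -u +%Y-%m-%dT%H:%M:%SZ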

This correlation is particularly important for intermittent problems. If agents go offline at the same time each day, it may be a proxy policy refresh, certificate rotation, or scheduled network change.

Putting it together: a repeatable decision workflow

The steps above are intended to be run in sequence because each one reduces uncertainty. You can think of it as narrowing from “is the agent even running?” to “is the network path open?” to “is TLS trusted?” to “is the agent authorized and correctly configured?”

In practice, experienced administrators often jump directly to proxy or firewall checks, but that can backfire if the agent service is stopped or the system clock is wrong. Conversely, focusing on the agent logs without validating DNS and proxy behavior can waste hours.

A practical way to operationalize this is to capture a minimal evidence set whenever an agent is reported offline:

Record host details: OS, network segment, whether on VPN, current DNS servers.

Record local checks: agent service status, system time status.

Record network checks: DNS resolution for required hostnames, TCP reachability (direct or to proxy), and a curl -vk test that matches the expected path.

Record trust checks: whether TLS is intercepted, whether corporate root CA is installed.

Record correlation: proxy/firewall log entries matching your test time.

This evidence set is small enough to be repeatable and provides enough signal to engage the right team (desktop, network, security, vendor) with concrete findings.

Additional mini-case: Servers in a restricted subnet cannot validate certificates

A frequent enterprise pattern is that user subnets have broad internet egress via an SWG, but server subnets have restrictive egress with only explicit allow rules. Agents deployed to servers show “offline,” while the same agent on workstations is fine.

Following the workflow:

Steps 2 and 5 show TCP/443 is allowed to the vendor endpoint.

Step 6 reveals TLS handshake delays and occasional failures.

Reviewing the server subnet firewall logs shows outbound HTTP to OCSP/CRL endpoints is blocked. The servers cannot reach certificate revocation infrastructure, so TLS validation stalls.

The remediation is to allow OCSP/CRL endpoints required by the vendor’s certificate chain (or to adjust revocation checking policy in line with your security standard), rather than opening broad internet access.

This case illustrates why “port 443 is open” is not sufficient for modern TLS-dependent software.

Additional mini-case: A PAC file works for browsers but agents fail silently

In many organizations, proxy configuration is delivered via PAC (AutoConfigURL). Browsers and user applications function, but an agent running as a system service remains offline.

Steps 4 and 5 identify that direct-to-internet is blocked and the proxy is required.

WinINET settings show a PAC URL configured, but netsh winhttp show proxy reports “Direct access (no proxy server).” The agent, running as a service and using WinHTTP, never uses the PAC.

Setting WinHTTP proxy explicitly (via GPO, netsh winhttp set proxy, or the agent’s own config mechanism) resolves the issue.

This is a classic mismatch between proxy configuration stacks and is easy to miss if you only validate connectivity in a logged-in user context.

Commands and checks you can standardize in runbooks

As you mature your operational process, you can standardize a small set of commands that are safe and broadly applicable. The goal is consistency across tickets and teams.

On Windows endpoints, a typical runbook bundle includes:

Checking service status and recent restarts, confirming system time, showing WinHTTP proxy, testing TCP 443 to the destination, and performing a curl -vk to validate TLS.

On Linux, the equivalent includes:

Checking systemd status and logs, confirming time sync, validating DNS with dig, testing TCP with nc, validating TLS with curl -vk and openssl s_client, and optionally capturing with tcpdump for a short window.
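
A minimal sketch of such a Linux runbook bundle, assuming a unit named your-agent.service and a destination of example.vendor.com (both placeholders), might look like this:

bash

#!/usr/bin/env bash
# Minimal agent-connectivity evidence collector (all names are placeholders)
set -u

UNIT="your-agent.service"
DEST="example.vendor.com"
PORT=443
OUT="/tmp/agent-evidence-$(date -u +%Y%m%dT%H%M%SZ).txt"

{
  echo "== collected $(date -u) on $(hostname) =="
  echo "== service status =="
  systemctl status "$UNIT" --no-pager || true
  echo "== time sync =="
  timedatectl status || true
  echo "== dns =="
  dig +short "$DEST" A || true
  dig +short "$DEST" AAAA || true
  echo "== tcp reachability =="
  nc -vz -w 5 "$DEST" "$PORT" 2>&1 || true
  echo "== tls handshake =="
  curl -svk "https://$DEST/" -o /dev/null --max-time 15 2>&1 || true
} | tee "$OUT"

echo "Evidence written to $OUT"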

The value of standardization is not the commands themselves; it’s the ability to compare outputs across a healthy host and a failing host and to hand off consistent evidence.

Practical guidance for remediation without introducing fragility

Fixes for agent connectivity should minimize long-term fragility. Some changes “work” but create future outages when IPs rotate, certificates renew, or proxy policies change.

Prefer FQDN-based allow rules where your firewall/proxy supports them, because cloud services and CDNs change IPs.

If you must use IP allowlists, establish a documented process to update them based on vendor IP range publications and test changes before enforcement.

Treat TLS inspection carefully. If the agent supports inspection and you can ensure corporate root CA deployment to all relevant trust stores, inspection can be compatible. If the agent uses pinning or strict validation, implement bypass rules for the vendor domains.

Ensure proxy settings are applied in the context the agent actually runs in (service vs user). On Windows that often means WinHTTP; on Linux that often means systemd unit environment.

Avoid disabling security controls broadly (turning off firewall, disabling inspection globally). Use narrow exceptions tied to vendor domains/ports and document them.

Finally, whenever you resolve an issue, feed it back into your baseline knowledge: record the required endpoints, the proxy behavior, the certificate chain expectations, and the validated tests. Over time, agent connectivity stops being “ad hoc debugging” and becomes an operationally predictable process.