Network troubleshooting is most effective when it’s treated as a repeatable investigative workflow rather than a grab bag of commands. In production environments you rarely get clean symptoms: users report “the internet is down,” an application team blames the firewall, and monitoring shows a few ambiguous spikes. The fastest path to restoration is disciplined scoping, careful hypothesis testing, and capturing just enough evidence to prove where the fault actually is.
This guide presents a step-by-step approach you can apply to common enterprise networking issues—loss of connectivity, intermittent slowness, packet loss, DNS failures, VPN instability, and application timeouts. It assumes you’re an IT administrator or system engineer with access to endpoints, network devices, and (ideally) centralized monitoring. Throughout, you’ll see real-world scenarios woven into the workflow, along with practical commands for Windows, Linux/macOS, and common network environments.
Start with impact and scope before you touch the network
The first minutes of network troubleshooting determine whether you spend the next hour making progress or chasing noise. Before running tests, establish what “broken” means in measurable terms and who is affected. This is not bureaucracy; it’s how you avoid fixing the wrong problem.
Begin by defining the symptom precisely. “Slow” should become “HTTP requests to app.example.com intermittently exceed 5 seconds from the NYC office between 09:00–10:00.” “Down” should become “clients in VLAN 30 cannot reach the default gateway or any off-subnet destination.” You want a statement you can verify and falsify.
Next, determine scope using three axes: users (who), locations (where), and services (what). Ask whether it affects one device, one subnet, one site, a region, or the entire organization. Identify whether only one application is impacted or multiple unrelated services fail. Scope quickly suggests which layers are likely involved. If only one SaaS domain fails from all sites, DNS or egress filtering might be implicated. If one subnet fails while others are fine, the problem is likely local to that broadcast domain: VLAN tagging, gateway availability, DHCP, or access-layer switching.
Finally, confirm whether this is a change-related incident. Recent firewall updates, switch firmware changes, routing policy adjustments, certificate rollovers, or ISP maintenance are frequent root causes. If you have change management tooling, correlate timestamps. If you don’t, at least ask the people who might have touched relevant systems.
Create a quick working model of the path
Once you know the scope, build a simplified mental map of the traffic path. For most enterprise client-to-service issues, the path includes: endpoint network stack, local link (Wi‑Fi/Ethernet), access switch/AP, VLAN and default gateway (SVI or router interface), internal routing, firewall/NAT/proxy, WAN/ISP, and then external DNS and service endpoints.
You don’t need a perfect topology diagram to start, but you do need to know where key boundaries are: the default gateway, the first routed hop, the egress firewall, and any overlay networks (VPN, SD‑WAN, GRE/IPsec tunnels). Those boundaries are where misconfigurations and failures commonly appear.
This model also helps you choose vantage points. Testing only from the affected client is not enough; you want at least two: one inside the affected segment and one from a known-good segment or site. Differences between vantage points are often the most valuable signal you’ll get.
Collect evidence systematically: what changed, what fails, and what still works
Evidence collection is not “run every command.” It’s selecting a minimal set of facts that lets you isolate layers. Keep a running log: timestamps, commands, and results. When multiple responders are involved, this prevents duplicated work and makes handoffs possible.
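If you want that running log without extra tooling, a small shell wrapper that timestamps each command and appends its output to a file is usually enough. The sketch below is one possible approach; the log path and example targets are placeholders:

```bash
#!/usr/bin/env bash
# Append timestamped command output to a simple incident log.
LOG="./incident-$(date +%Y%m%d).log"

log_cmd() {
  {
    printf '\n=== %s UTC | %s ===\n' "$(date -u '+%Y-%m-%d %H:%M:%S')" "$*"
    "$@" 2>&1
  } >> "$LOG"
}

# Example usage during an incident (targets are placeholders):
log_cmd ip route
log_cmd ping -c 3 192.0.2.1
log_cmd dig app.example.com +short
```

The same log makes handoffs and post-incident reviews easier, because the timeline is already written down.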
A practical early checklist is:
- What works from the affected host? (local network, gateway, DNS, internal services, external IPs)
- What works from a known-good host? (same tests)
- What changed recently? (device configs, policy, updates, ISP events)
- What do monitoring systems show? (interface errors, CPU, memory, link flaps, BGP state, DHCP scope utilization)
Even if you lack centralized monitoring, you can still inspect critical metrics directly on devices (interface counters, logs) once you have a hypothesis.
Verify the endpoint: link, IP configuration, and local stack
Even when the incident is “the network,” endpoints frequently contribute. Bad Wi‑Fi roaming behavior, stale DNS caches, incorrect proxy settings, and VPN split-tunnel rules can look like infrastructure outages. Verifying endpoint basics early avoids hours of chasing an upstream ghost.
Start with link state and IP configuration. On Windows, confirm the active adapter, IPv4/IPv6 status, DNS servers, and default gateway.
```powershell
ipconfig /all
Get-NetIPConfiguration
Get-DnsClientServerAddress
```
On Linux, verify interface state, addresses, routes, and DNS resolver configuration. On systems using systemd-resolved, be aware that /etc/resolv.conf may be a stub and resolvectl is more authoritative.
```bash
ip link
ip addr
ip route
resolvectl status 2>/dev/null || cat /etc/resolv.conf
```
On macOS, ifconfig, netstat -rn, and scutil --dns are useful.
```bash
ifconfig
netstat -rn
scutil --dns
```
You are looking for obvious mismatches: a host on the wrong VLAN (unexpected IP range), missing default gateway, incorrect DNS servers, duplicate IP symptoms, or a stale static configuration that bypasses DHCP.
A common anti-pattern is to jump straight to ping 8.8.8.8. That’s useful, but first confirm whether the host can reach its own gateway. If it can’t, nothing beyond the local segment matters yet.
Check local firewall and VPN interactions
Host-based firewalls and VPN clients can selectively block traffic and create confusing partial failures. If only one application fails, or only certain ports fail, verify local policy before escalating.
On Windows, you can quickly inspect firewall profiles and recent rules:
```powershell
Get-NetFirewallProfile
Get-NetFirewallRule -Enabled True | Select-Object -First 20 DisplayName,Direction,Action
```
If a VPN client is in use, confirm whether traffic is meant to go through the tunnel and whether split tunneling is configured. A default route pointed through a down or misbehaving tunnel often manifests as "internet down" even though the local LAN still works.
Validate the local segment: gateway reachability and ARP/ND behavior
If IP configuration looks correct, test reachability to the default gateway. A successful gateway ping suggests the access layer is working and L2 adjacency exists (at least for ICMP), but don’t over-interpret it; ICMP may be allowed even when other traffic isn’t.
```bash
# Linux/macOS
ping -c 3 <default-gateway>
arp -an | head

# Windows
ping -n 3 <default-gateway>
arp -a | more
```
If the gateway is unreachable, ARP (IPv4) or Neighbor Discovery (IPv6) may be failing. Look for an incomplete ARP entry or no MAC resolution.
On Linux:
```bash
ip neigh show
```
If multiple hosts in the same VLAN show incomplete ARP/ND for the gateway, suspect a gateway outage, VLAN mismatch, STP issue, or an upstream L2 problem. If only one host fails, suspect a local NIC issue, port security, or duplicate IP.
Real-world scenario 1: “Internet down” in one conference room
A common on-call incident: several users in a conference room report no internet, while the rest of the floor is fine. Scope suggests a localized access-layer issue. On an affected laptop, ipconfig /all shows an IP from the correct DHCP scope and the right DNS servers. However, pinging the default gateway fails and arp -a shows no resolved MAC for the gateway.
At this point the working model points to L2 adjacency. You check another device in the room: same symptoms. That rules out a single endpoint. The next action is to inspect the access switch port(s) feeding that room or the AP uplink if they’re on Wi‑Fi. On the switch, you find the uplink interface has moved into an incorrect VLAN during a recent template push, so clients are in VLAN 30 but the AP uplink is tagged for VLAN 20. Fixing the trunk VLAN list restores ARP resolution and connectivity. The key was not “run more internet tests,” but proving the failure was local and L2.
Test name resolution separately from connectivity
DNS problems are among the most misdiagnosed causes of “network issues” because they can make everything appear down while IP connectivity is fine. Treat DNS as a distinct dependency and test it independently.
Start by testing raw IP connectivity to a known-good target. Ideally use something under your control (a public IP you expect to be reachable, or a cloud VM you manage). Then test name resolution for the same target.
From a client:
```bash
# Connectivity to IP
ping -c 3 1.1.1.1

# DNS query
nslookup app.example.com
# or
dig app.example.com +short
```
On Windows, Resolve-DnsName provides detailed output.
```powershell
Resolve-DnsName app.example.com
Resolve-DnsName app.example.com -Server <dns-server-ip>
```
If IP connectivity works but DNS queries time out or return wrong answers, you’ve isolated the problem to resolver reachability, DNS server health, forwarding, or incorrect records.
Also check whether the client is using the DNS servers you expect. A VPN client, captive portal, or rogue DHCP server can supply alternate DNS and cause selective failures.
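One quick way to check is to query the same name against each resolver the client reports and compare the answers; differences point to split DNS, VPN-pushed resolvers, or interception. A minimal sketch, assuming dig is available and using placeholder resolver addresses:

```bash
# Compare answers from each resolver the client is configured to use.
NAME="app.example.com"
for resolver in 10.0.0.53 10.0.1.53 192.0.2.53; do   # placeholder resolver IPs
  echo "--- ${resolver} ---"
  dig @"${resolver}" "${NAME}" +short +time=2 +tries=1
done
```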
Distinguish between “can’t reach DNS server” and “DNS server returns bad data”
If DNS queries time out, test whether you can reach the DNS server IP on the network (ICMP is a quick check, but UDP/TCP 53 is what matters). Many environments block ICMP, so don’t rely on ping alone.
From Linux:
```bash
# Test TCP 53 (useful when UDP is blocked or responses are large)
nc -vz <dns-server-ip> 53
```
From PowerShell:
```powershell
Test-NetConnection -ComputerName <dns-server-ip> -Port 53
```
If the DNS server is reachable but returns SERVFAIL, NXDOMAIN, or unexpected results, look at server-side recursion, forwarding, DNSSEC validation issues, or stale zone data.
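One differentiator worth knowing: if a query returns SERVFAIL normally but succeeds when validation is bypassed with dig's +cd (checking disabled) flag, DNSSEC validation on the resolver is the likely culprit (expired signatures or a broken chain of trust). A quick sketch against a placeholder resolver:

```bash
# SERVFAIL without +cd but a normal answer with +cd points to a DNSSEC validation failure.
dig @10.0.0.53 app.example.com A
dig @10.0.0.53 app.example.com A +cd
```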
Confirm routing: local routes, default route, and asymmetric paths
Once L2 and DNS are validated (or ruled out), routing is often the next culprit. Routing issues show up as inability to reach non-local subnets, intermittent reachability, or one-way traffic.
On the endpoint, verify the routing table and confirm the default route points to the correct gateway. On Windows:
```powershell
route print
Get-NetRoute -AddressFamily IPv4 | Sort-Object -Property RouteMetric | Select-Object -First 20
```
On Linux:
```bash
ip route
ip rule
```
If a host has multiple interfaces (Ethernet + Wi‑Fi, or LAN + VPN), you may see multiple default routes with different metrics. A mis-set metric can send traffic out the wrong interface, which is especially common when a VPN adapter registers a lower metric than the physical NIC.
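On Linux, you don't have to infer the winner by eye: ip route get shows which route and interface the kernel actually selects for a given destination, which makes metric and VPN-override problems obvious. For example:

```bash
# Show the route the kernel actually selects for a destination
ip route get 8.8.8.8
# List all default routes with their metrics and interfaces
ip route show default
```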
Beyond the endpoint, test the path with traceroute. Use tools appropriate for your environment: tracert on Windows, traceroute on Linux/macOS, and consider mtr for continuous path quality.
```bash
# Linux/macOS
traceroute app.example.com
mtr -rw app.example.com

# Windows
tracert app.example.com
```
Remember that traceroute depends on ICMP/UDP responses from intermediate hops, which may be filtered. Use it to spot gross path changes and obvious blackholes, but don't assume that an asterisk (*) at a hop equals failure.
Asymmetric routing and stateful firewalls
Asymmetric routing occurs when traffic from A to B takes a different path than traffic from B to A. In routed enterprise networks, this can happen after partial WAN failures, misconfigured ECMP (equal-cost multipath), or incorrect route redistribution. Stateful firewalls and NAT devices often require seeing both directions of a flow; asymmetry can cause sessions to establish and then fail or appear intermittently broken.
Clues include: SYN packets leaving but no SYN-ACK returning, or flows that work from some subnets but not others due to different return paths. Packet capture (covered later) is the most reliable way to confirm asymmetry.
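When you suspect asymmetry, a narrowly filtered capture showing SYNs leaving with no SYN-ACK returning is strong evidence on its own. A minimal tcpdump sketch, run on the client or near the egress, with a placeholder server address:

```bash
# Capture only packets with the SYN flag set (covers SYN and SYN-ACK) for one host.
# Outbound SYNs with no matching SYN-ACKs suggest a drop or an asymmetric return path.
sudo tcpdump -i any -nn 'host 203.0.113.10 and tcp[tcpflags] & tcp-syn != 0'
```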
Measure the symptom: latency, loss, jitter, and throughput
“Network performance problems” are not one thing. Latency (delay), packet loss, jitter (variance in delay), and throughput (rate) each point to different causes. Measuring them clarifies what to fix.
Ping is a basic latency/loss tool, but don’t stop at a few packets. Use longer runs and consider packet sizes. For example:
```bash
# Linux/macOS
ping -c 50 -i 0.2 <target>

# Windows
ping -n 50 <target>
```
For path-aware loss, mtr is helpful because it shows loss and latency by hop over time:
```bash
mtr -rwzbc 100 app.example.com
```
Throughput is best tested with iperf3 between controlled endpoints, not by downloading random files. If you can deploy a temporary iperf3 server in the affected site or a nearby cloud region, you can differentiate WAN congestion from LAN issues.
```bash
# Server
iperf3 -s

# Client
iperf3 -c <server-ip> -P 4
```
Be cautious: throughput tests can load links and worsen conditions. Use them when you already suspect capacity issues and need evidence.
Separate application failure from network failure
Many tickets labeled “network” are actually application-layer failures: expired certificates, overloaded backends, broken auth, misconfigured proxies, or API gateways returning errors. Your job is not to prove the app team wrong; it’s to isolate the failing layer quickly.
Start by checking whether TCP connections can be established to the target host and port. If TCP can’t connect, focus on routing/firewalls. If TCP connects but the application fails, gather HTTP/TLS evidence and collaborate with the application owners.
From PowerShell:
```powershell
Test-NetConnection app.example.com -Port 443
```
From Linux/macOS:
```bash
nc -vz app.example.com 443
```
For HTTPS services, test TLS negotiation and HTTP response separately:
```bash
# TLS handshake details
openssl s_client -connect app.example.com:443 -servername app.example.com -brief

# HTTP-level response
curl -vkI https://app.example.com/
```
A curl response with a fast TCP handshake but a slow server response suggests server-side performance, not the network. A timeout during connect suggests network path or firewall.
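curl's --write-out timers make that distinction concrete by splitting a request into name lookup, TCP connect, TLS handshake, and time to first byte. For example:

```bash
# Break down where the time goes: DNS, TCP connect, TLS, first byte, total.
curl -o /dev/null -sS \
  -w 'dns:%{time_namelookup}s connect:%{time_connect}s tls:%{time_appconnect}s ttfb:%{time_starttransfer}s total:%{time_total}s\n' \
  https://app.example.com/
```

A large gap between connect and time to first byte points at the server or application; a slow or failing connect points back at the network path.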
Identify common failure domains and test them in a controlled order
At this point you should have enough context to focus. A productive way to structure network troubleshooting is by failure domain, tested in an order that minimizes dependencies:
- Endpoint stack and local configuration
- Local link and L2 adjacency (ARP/ND)
- IP addressing and DHCP (if dynamic)
- DNS resolution
- Routing and path (intra-site, inter-site, internet)
- Firewall/NAT/proxy and policy
- MTU/fragmentation issues
- WAN/ISP and upstream dependencies
This ordering prevents you from diagnosing “WAN problems” when the real issue is a missing default route or DNS misconfiguration.
DHCP and IP addressing: leases, scopes, and conflicts
Dynamic addressing failures can present as “no network,” “limited connectivity,” or intermittent access after lease renewals. Even when clients have an IP, the DHCP-provided options (default gateway, DNS servers, search domains) may be wrong.
On Windows, check whether the address is APIPA (169.254.0.0/16), which indicates DHCP failure. Then inspect lease details:
```powershell
ipconfig /all
ipconfig /renew
# Per-interface DHCP enablement and connection state
Get-NetIPInterface -AddressFamily IPv4 | Select-Object InterfaceAlias, Dhcp, ConnectionState
```
On Linux, DHCP client details vary, but you can often view logs via journalctl and confirm the address and routes:
```bash
ip addr show
journalctl -u NetworkManager --since "30 min ago" 2>/dev/null | tail -n 100
journalctl -u systemd-networkd --since "30 min ago" 2>/dev/null | tail -n 100
```
From the server side, confirm scope utilization and whether addresses are exhausted. DHCP exhaustion causes new clients to fail while existing clients keep working until their leases expire, which creates time-based incident patterns.
Also consider rogue DHCP servers. If a few clients show the wrong gateway or DNS servers while others are fine, capture a DHCP offer/ack exchange on a span port or on the client with packet capture and identify the server IP/MAC.
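A short capture of the DHCP exchange is usually enough to unmask a rogue server, because the OFFER/ACK packets carry the responding server's address. A sketch using tcpdump (trigger a lease renewal or reconnect a client while it runs):

```bash
# Watch DHCP traffic and note which server IP/MAC answers the DISCOVER/REQUEST.
sudo tcpdump -i any -nn -e -v 'udp and (port 67 or port 68)'
```

The -e flag prints link-layer addresses, so you can trace the offending MAC back to a switch port.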
Switching and Wi‑Fi: VLANs, trunks, and layer-2 instability
Layer-2 problems are often noisy: flapping MAC addresses, STP topology changes, VLAN mismatches, and broadcast storms. They can also be deceptively quiet, affecting only one VLAN due to a trunk misconfiguration.
If your earlier tests showed failures reaching the default gateway or inconsistent ARP resolution, move to access-layer validation. Look for interface errors (CRC, input drops), duplex mismatches (less common now but still possible on edge links), and link flaps.
On managed switches, typical checks include interface counters and logs around the incident time. The exact commands differ by vendor; focus on concepts rather than memorizing syntax: errors, discards, speed/duplex negotiation, PoE events for APs/phones, and STP state.
If Wi‑Fi is involved, confirm whether the issue is RF-related (poor SNR, high retries) or upstream (AP uplink, controller issues). Wi‑Fi problems often manifest as high latency and packet loss only for wireless clients, while wired clients are fine.
A practical tactic is to compare a wired and wireless client on the same VLAN. If only wireless fails, focus on WLAN infrastructure. If both fail, it’s likely VLAN/gateway/routing.
Real-world scenario 2: intermittent VoIP quality after a switch upgrade
A mid-sized office upgrades an access switch stack overnight. The next morning, VoIP calls sound choppy and drop occasionally, but general browsing “seems fine.” Ping tests show sporadic packet loss and jitter to the call manager from IP phones, but not from servers in the data center.
Following the workflow, you first validate scope: the issue is limited to the office and primarily impacts real-time traffic. You then measure jitter/loss from a phone VLAN host and compare it to a server VLAN host. The difference points to a QoS (Quality of Service) marking/trust issue specific to the voice VLAN.
On the new switch stack, the voice VLAN ports are no longer configured to trust DSCP (Differentiated Services Code Point) markings from phones, and the uplink policy is reclassifying voice into best-effort. Under load, voice packets queue behind bulk traffic, causing jitter. Restoring the correct QoS policy fixes the symptom without touching the WAN or firewall. The important move was translating “choppy calls” into measurable jitter/loss and then narrowing the failure domain to a specific VLAN/policy change.
Firewall, NAT, and proxy: validate policy with targeted tests
Stateful firewalls, NAT gateways, and explicit proxies are frequent sources of partial connectivity: some sites can reach an app, some ports work but others fail, or traffic works for a while and then breaks due to session table limits.
Start by proving whether the issue is port-specific. If ICMP works but TCP 443 fails, you need a port test. If TCP connects but HTTP fails, the firewall may be doing TLS inspection or the proxy path may be misbehaving.
From clients:
```powershell
Test-NetConnection -ComputerName app.example.com -Port 443
Test-NetConnection -ComputerName app.example.com -Port 80
```

```bash
nc -vz app.example.com 443
nc -vz app.example.com 80
```
If you use an explicit proxy, test both with and without it (where policy allows). Misconfigured PAC (Proxy Auto-Config) files or proxy outages can look like internet failures even while direct IP connectivity is fine.
For environments with transparent proxying or TLS inspection, certificate errors are a major clue. Users may see browser warnings while ping works. Validate with curl -v and examine the presented certificate chain.
Also consider NAT pool exhaustion for large egress populations. If many internal clients share a small public IP pool, long-lived or high-connection workloads (Windows updates, container registries) can consume ports. Symptoms are intermittent connection failures to external services while internal traffic remains fine. Firewall logs and session counts help confirm this; from the client side you’ll see connect timeouts that resolve after a short wait.
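From the client side, a repeated connect test makes the intermittent pattern visible: most attempts complete quickly while a few hit the timeout. A simple loop, with the target URL as a placeholder:

```bash
# Repeat the connection and record how long each attempt takes; a mix of fast
# connects and occasional timeouts is consistent with NAT/session exhaustion.
for i in $(seq 1 20); do
  curl -o /dev/null -sS --connect-timeout 5 \
    -w "attempt ${i}: connect=%{time_connect}s total=%{time_total}s\n" \
    https://app.example.com/ \
    || echo "attempt ${i}: failed (timeout or connection error)"
done
```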
MTU and fragmentation: diagnose “works for small things” failures
MTU (Maximum Transmission Unit) problems are notorious because basic tests can pass while real applications fail. For example, small pings work but large HTTPS responses stall, or VPN users can reach some sites but not others.
Path MTU Discovery (PMTUD) relies on ICMP “Fragmentation Needed” messages (for IPv4) or IPv6 Packet Too Big messages. If those ICMP messages are blocked, endpoints may keep sending packets that are too large, leading to blackholed traffic.
A practical way to test is to send progressively larger packets with the “don’t fragment” bit set (IPv4). On Windows:
```powershell
# 1472 bytes of payload + 28 bytes of IP/ICMP headers ~= 1500-byte MTU
ping -f -l 1472 <target>
```
On Linux:
```bash
ping -M do -s 1472 <target>
```
If smaller sizes work but larger sizes fail, you may have an MTU bottleneck along the path (common with tunnels like IPsec, GRE, or some PPPoE links). The fix is usually to adjust MTU/MSS (Maximum Segment Size) clamping on tunnel interfaces or firewalls, and to ensure ICMP needed for PMTUD is permitted.
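To find roughly where the ceiling is, step the DF-bit ping down through common tunnel-adjusted sizes; the largest payload that still passes, plus 28 bytes of headers, approximates the usable path MTU. A Linux sketch with a placeholder target:

```bash
# Find the largest ICMP payload that passes with DF set.
# payload + 28 bytes (IP + ICMP headers) ~= usable path MTU.
TARGET="203.0.113.10"
for size in 1472 1452 1436 1400 1380 1350; do
  if ping -M do -c 1 -W 2 -s "${size}" "${TARGET}" > /dev/null 2>&1; then
    echo "payload ${size} passed (path MTU >= $((size + 28)))"
    break
  fi
  echo "payload ${size} blocked or lost"
done
```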
Use packet capture when the symptom is unclear or disputed
When simple tests don’t explain behavior—or when different teams disagree on where the fault lies—packet capture provides ground truth. The goal is not to “capture everything,” but to capture the smallest trace that answers a specific question: Are SYNs leaving? Are SYN-ACKs returning? Is DNS responding? Are resets (RST) being injected? Is retransmission happening due to loss?
You can capture at three strategic points:
- The client endpoint (easy, shows what the client believes)
- A server endpoint (useful for client-to-server issues)
- A network device vantage point (SPAN port, firewall capture), which helps identify drops in transit
On Linux, tcpdump is the standard tool. Capture only relevant host/port pairs and keep files small.
```bash
# Capture DNS and HTTPS traffic to a specific host
sudo tcpdump -i any -nn -s 0 -w incident.pcap '(host <server-ip> and (port 53 or port 443))'
```
On Windows, modern environments can use pktmon (built-in) or Wireshark/Npcap depending on policy. With pktmon:
```powershell
pktmon filter remove
pktmon filter add -p 443
pktmon start --etw
# reproduce the issue, then:
pktmon stop
pktmon format PktMon.etl -o incident.txt
```
If you can use Wireshark, capture filters should still be tight, for example host <server-ip> and tcp port 443.
When analyzing, focus on the TCP three-way handshake first. If SYNs go out and no SYN-ACK returns, suspect routing, firewall drops, or upstream reachability. If SYN-ACK returns but the client never ACKs, suspect client-side firewall or asymmetric routing. If the handshake completes but data stalls with retransmissions, suspect loss, MTU blackhole, or middlebox interference.
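If a GUI isn't practical, the same first-pass questions can be answered from the command line against the saved capture. A sketch assuming tshark is installed, using the incident.pcap file from the earlier tcpdump example:

```bash
# List the handshake packets: SYNs and SYN-ACKs for the flows in the capture.
tshark -r incident.pcap -Y 'tcp.flags.syn == 1'

# Surface retransmissions, duplicate ACKs, and zero-window events.
tshark -r incident.pcap -Y 'tcp.analysis.retransmission or tcp.analysis.duplicate_ack or tcp.analysis.zero_window'

# Check whether DNS queries are actually getting answers.
tshark -r incident.pcap -Y 'dns' -T fields -e frame.time_relative -e dns.qry.name -e dns.flags.response -e dns.a
```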
Real-world scenario 3: “The firewall is blocking our app” during a SaaS migration
During a SaaS migration, an internal app is updated to call a new API endpoint over HTTPS. The application team reports intermittent timeouts and insists the firewall is blocking traffic. From a client subnet, Test-NetConnection to the SaaS FQDN on port 443 succeeds, but the app still times out. DNS resolves consistently, and traceroute shows a stable path.
At this stage, packet capture is the shortest route to truth. A capture on the affected app server shows that TCP handshakes complete reliably, but the TLS handshake sometimes stalls after ClientHello. The server retransmits the same TLS records multiple times, then gives up. A simultaneous capture on the egress firewall shows large outbound TLS packets leaving, but the return traffic includes ICMP fragmentation-needed messages that are being rate-limited or dropped by an intermediate device.
This is an MTU/PMTUD issue introduced by a new IPsec tunnel path used for SaaS egress from that subnet. Enabling MSS clamping on the tunnel interface and allowing necessary ICMP resolves the intermittent TLS stalls. The key was using capture to differentiate “blocked” from “blackholed,” and to tie the symptom to packet size and path.
Correlate with device health: interfaces, CPU, memory, and queues
When you’ve narrowed the problem to a segment or device role (gateway, firewall, WAN edge), validate device health. Network devices under resource pressure behave like “random network issues”: control plane delays, slow routing convergence, dropped packets due to queue exhaustion, or management-plane unresponsiveness.
Look for interface errors and discards first. Errors (CRC, frame errors) often indicate physical layer issues—bad cables, optics, or interference. Discards and drops often indicate congestion or mis-sized buffers. If you see rising drops on an uplink during the incident window, you have evidence pointing to capacity or QoS problems.
Then check CPU and memory. Sustained high CPU on a firewall can cause session setup failures and increased latency. On routers, CPU spikes can impact routing protocols (BGP/OSPF) and cause flaps that look like intermittent outages.
Queueing is particularly relevant for “it’s slow at 9 AM” patterns. If you can inspect QoS policies, check whether a critical class (voice, interactive) is being starved or misclassified. If you can’t access network device QoS, you can still infer congestion when latency increases with load and packet loss appears during peak periods.
Work from inside out: compare tests across vantage points
A powerful technique in network troubleshooting is differential testing: run the same test from different points and compare results. Rather than asking “does it work,” ask “where does it stop working.”
For example, if clients in Site A can’t reach 10.20.30.40 but a server in the data center can, the fault is somewhere between Site A and the data center. If Site A can reach the site WAN router but not the data center, the issue is WAN routing or tunnel. If Site A can’t reach its own gateway, it’s local switching/VLAN.
Make this explicit by building a test matrix. You don’t need a spreadsheet, but you do need to be consistent: same destination, same protocol/port, same DNS server if testing DNS.
A minimal matrix might include:
- Affected client (Site A VLAN X)
- Known-good client (Site A different VLAN)
- Known-good host (data center)
- Cloud VM (external)
Run: gateway ping, internal IP ping, external IP ping, DNS query, TCP 443 connect.
This approach prevents you from overfitting to a single symptom and helps you isolate whether the issue is per-VLAN, per-site, or global.
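To keep the matrix repeatable rather than ad hoc, a small script that runs the same probes from each vantage point keeps the comparison honest. A minimal sketch; the gateway, internal host, and service names are placeholders:

```bash
#!/usr/bin/env bash
# Run an identical probe set from each vantage point so results are comparable.
GATEWAY="10.20.0.1"            # placeholder: local default gateway
INTERNAL_IP="10.20.30.40"      # placeholder: internal service address
EXTERNAL_IP="1.1.1.1"          # well-known external IP
FQDN="app.example.com"         # placeholder: service under test

probe() {                      # probe <label> <command...>
  if "${@:2}" > /dev/null 2>&1; then
    printf '%-18s OK\n' "$1"
  else
    printf '%-18s FAIL\n' "$1"
  fi
}

probe "gateway ping"   ping -c 3 "$GATEWAY"
probe "internal ping"  ping -c 3 "$INTERNAL_IP"
probe "external ping"  ping -c 3 "$EXTERNAL_IP"
probe "dns query"      dig +time=2 +tries=1 "$FQDN"   # fails only if no resolver responds
probe "tcp 443"        nc -z -w 3 "$FQDN" 443
```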
Internet and WAN issues: validate upstream dependencies without guessing
When scope suggests a WAN/ISP issue—multiple subnets affected, internal services reachable but internet failing, or a single site isolated—avoid the temptation to declare “ISP problem” without evidence. ISPs will ask for specifics, and you’ll want them anyway to decide whether to fail over.
Start by validating:
- Can you reach the WAN edge router from inside the site?
- Is the default route present and pointing where expected?
- Are tunnels (IPsec/SD‑WAN) up and passing traffic?
- Are public DNS resolvers reachable by IP?
If you have dual uplinks, compare behavior across them. If one path works and the other doesn’t, you can often mitigate by failing over while investigating root cause.
For BGP-based internet edges, symptoms of route issues include reachability to some prefixes but not others, or a sudden shift in path latency. Your edge monitoring (BGP session state, route counts) is the primary evidence. From the client side, you’ll see traceroutes that diverge sharply from baseline.
If you operate SD‑WAN, be careful interpreting overlay health scores. A tunnel may be “up” while a specific underlay is dropping packets. Use synthetic probes or SLA measurements if available, and corroborate with endpoint mtr.
Authentication and access control dependencies (802.1X, NAC, captive portals)
In enterprise networks, access is often conditional: 802.1X (port-based access control), NAC (Network Access Control), and captive portals can place clients into remediation VLANs or block traffic based on posture. These systems create failure modes where a client has an IP and can reach some resources but not others.
If a user reports “I can connect to Wi‑Fi but nothing works,” check whether they are being placed in a guest VLAN or quarantine VLAN. IP addressing might look valid but differ from normal ranges. DNS may resolve to captive portal addresses. HTTP may redirect.
From the endpoint, evidence includes:
- IP/subnet not matching expected VLAN
- Default gateway different than usual
- DNS responses that point to portal IPs
- HTTP 302 redirects to login pages
From the network side, authentication logs (RADIUS) and NAC policy decisions are key. The remediation is usually not “fix routing,” but correcting certificates, supplicant configuration, or NAC policy rules.
IPv6 considerations: dual-stack surprises and preference issues
Many environments are dual-stack (IPv4 and IPv6). Problems can arise when IPv6 is partially deployed: clients prefer IPv6, but IPv6 routing/DNS/firewalling is incomplete. The result is intermittent failures depending on which address family an application selects.
If a host has an IPv6 address and default route, test both IPv4 and IPv6 explicitly.
```bash
# Linux (on macOS, use ping for IPv4 and ping6 for IPv6)
ping -4 -c 3 app.example.com
ping -6 -c 3 app.example.com

# DNS queries for A and AAAA records
dig A app.example.com +short
dig AAAA app.example.com +short
```
On Windows:
```powershell
ping -4 app.example.com
ping -6 app.example.com
Resolve-DnsName app.example.com -Type A
Resolve-DnsName app.example.com -Type AAAA
```
If IPv6 fails but IPv4 succeeds, you’ve found a strong clue. Fixing it might mean updating IPv6 firewall policy, enabling RA (Router Advertisement) correctly, or ensuring upstream IPv6 routing is present. Disabling IPv6 at endpoints is rarely the right long-term fix, but it can be an interim mitigation in controlled environments.
Log correlation: align timestamps across systems
As you narrow root cause, timestamps become critical. A firewall log might show denies at 10:02, while a switch log shows a trunk flap at 10:01:58. If system clocks are skewed, you’ll mis-correlate events.
Ensure NTP is working across infrastructure. When investigating, record the timezone and timestamp format used in each log source. If you can, convert times to UTC during analysis.
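Before trusting a cross-device timeline, a quick sanity check that the systems agree on the time is worth the minute it takes. On Linux hosts, for example (which command applies depends on the time-sync daemon in use):

```bash
# Is this host synchronized, and to what source?
timedatectl status
chronyc tracking 2>/dev/null      # if chrony is the NTP client
ntpq -p 2>/dev/null               # if classic ntpd is the NTP client

# Print the current time in UTC for easy comparison across devices
date -u
```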
Correlate:
- Endpoint event logs (network adapter resets, VPN reconnects)
- DHCP logs (lease offers/declines)
- DNS logs (SERVFAIL spikes, forwarder timeouts)
- Firewall logs (denies, drops, NAT port allocation failures)
- Routing logs (BGP/OSPF neighbor changes)
- Switch logs (STP changes, port up/down)
The goal is not to read every log line, but to validate whether an observed symptom aligns with a concrete event.
Build and test hypotheses: change one variable at a time
Network troubleshooting fails when it becomes random action. Once you have evidence—say, “clients can reach gateway, DNS times out, TCP 443 connects to IP but not name”—form a hypothesis: “DNS server 10.0.0.53 is unreachable from VLAN 30 due to ACL change.” Then design the smallest test that confirms or refutes it.
Good tests isolate one variable:
- Query the DNS server directly by IP from affected and unaffected VLANs.
- Temporarily use an alternate DNS server (if policy allows) to see if the symptom disappears.
- Test the same destination by IP vs FQDN.
- Test the same service from two sites to see if it’s localized.
Avoid making multiple changes at once (e.g., changing firewall rules and switching VLANs and rebooting devices). If the issue resolves, you won’t know what fixed it, and it will return.
Remediation patterns: fix classes of issues, not just symptoms
Once you’ve isolated root cause, choose a remediation that addresses the underlying class of failure and reduces recurrence.
If the issue is a VLAN/trunk mismatch, the durable fix is configuration management: consistent templates, pre-change validation, and post-change tests for expected VLAN tagging. If the issue is DNS fragility, the fix may include redundant resolvers per site, health checks, and tight monitoring of forwarder latency. If the issue is MTU blackholing, fix ICMP policies and implement MSS clamping at tunnel boundaries.
Document the incident in terms of evidence and decision points: what symptoms were observed, which tests narrowed the failure domain, and what change corrected it. This becomes your internal playbook and reduces mean time to restore in future events.
Operationalizing the workflow: baselines, monitoring, and runbooks
The difference between “heroic troubleshooting” and operational excellence is preparation. A repeatable network troubleshooting workflow becomes dramatically faster when you have baselines and instrumentation.
Baselines include normal latency between sites, typical DNS response times, expected traceroute paths, and standard interface utilization. Without baselines, you can still troubleshoot, but you’ll argue about whether 40 ms is “bad” and whether a path change is “normal.”
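Baselines don't require a full monitoring stack to get started; even a scheduled probe that appends path and DNS measurements to a file gives you something concrete to compare against during an incident. A minimal sketch, with placeholder targets and paths, intended to run from cron:

```bash
#!/usr/bin/env bash
# Append a timestamped path-quality and DNS-latency sample to a baseline log.
TARGET="app.example.com"
RESOLVER="10.0.0.53"                  # placeholder: site resolver
BASELINE="/var/log/net-baseline.log"

{
  printf '=== %s UTC ===\n' "$(date -u '+%Y-%m-%d %H:%M:%S')"
  mtr -rwzbc 20 "$TARGET"             # per-hop loss and latency snapshot
  dig @"$RESOLVER" "$TARGET" | grep 'Query time'
} >> "$BASELINE"
```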
Monitoring should cover at least:
- Interface up/down, errors, discards
- WAN tunnel health and loss/latency
- DNS resolver reachability and query latency
- DHCP scope utilization and failures
- Firewall session counts and drops
Runbooks should not be vendor-specific command dumps. They should encode the logic you followed in this guide: scope, isolate layers, test from multiple vantage points, and escalate with packet evidence when needed.
Reference command set (use sparingly, but consistently)
When you’re in the middle of an incident, you want a small, reliable set of commands that answer specific questions. The following are commonly safe defaults.
For Windows endpoints, focus on configuration, routes, DNS, and port tests:
```powershell
ipconfig /all
route print
Resolve-DnsName app.example.com
Test-NetConnection app.example.com -Port 443
```
For Linux endpoints, focus on addresses, routes, DNS, and captures:
```bash
ip addr
ip route
dig app.example.com +short
nc -vz app.example.com 443
sudo tcpdump -i any -nn host <server-ip>
```
For path quality over time, prefer mtr where available:
```bash
mtr -rwzbc 100 <target>
```
And for MTU suspicion, use DF pings carefully:
```bash
ping -M do -s 1472 <target>
```
Keeping your toolkit small reduces cognitive load and makes comparisons between hosts meaningful.