How to Optimize Network Performance in Ubuntu Server (Practical Admin Guide)


Network performance in Ubuntu Server is rarely a single “turn this knob” problem. Throughput, latency, jitter, and connection scalability are the result of a pipeline: the NIC and its driver, PCIe, interrupts, CPU scheduling, kernel networking, socket buffers, routing and firewalling, and finally whatever the workload is doing at the application layer. If you tune one stage without measuring and understanding the bottleneck, you can easily trade one problem for another—for example, improving peak throughput while increasing tail latency or CPU consumption.

This article lays out a practical, measurement-first process to optimize network performance in Ubuntu Server. The goal is not to apply every tuning option, but to build a repeatable workflow: establish baselines, identify limiting factors, apply changes in small steps, and validate with realistic traffic patterns. Along the way, you’ll see how choices differ for common server roles such as web/API servers, storage nodes, virtualization hosts, and high-throughput data movers.

Start with a measurement-first workflow

Before changing kernel parameters or NIC settings, you need to know what “better” means for your system. Network performance can mean any of the following, depending on the service:

Latency: request/response time and tail latency under load.

Throughput: sustained bits per second, especially for bulk transfers.

Connection scale: how many concurrent connections or packets per second (PPS) the host can sustain.

CPU efficiency: how many CPU cycles per Gbit/s (or per million packets) the system burns.

A useful workflow is to capture a baseline under a representative load, apply one change, then re-run the same test. This helps you avoid attributing improvements to tuning when they’re actually due to test variance, caching, or changes in traffic mix.

Establish an accurate baseline

Baseline measurement should include both network-side metrics and CPU/interrupt behavior. Start with a clear description of the host and path: NIC model, link speed, switch port configuration, VLANs, MTU, and whether the traffic traverses a firewall, load balancer, overlay network, or VPN.

On the Ubuntu Server side, collect:

Kernel and OS version:

bash
uname -a
lsb_release -a 2>/dev/null || cat /etc/os-release

NIC identification and driver:

bash
ip -br link
lspci -nn | egrep -i 'ethernet|network'
sudo ethtool -i <iface>

Link state and negotiated speed/duplex:

bash
sudo ethtool <iface>

Interface counters (errors, drops):

bash
ip -s link show <iface>

Socket statistics and protocol-level drops:

bash
ss -s
nstat -az

CPU and interrupt distribution (to spot single-core bottlenecks):

bash
mpstat -P ALL 1 5
cat /proc/interrupts | egrep -i '<iface>|mlx|ixgbe|i40e|ena|virtio|bnxt'

If you’re running on a VM, also record hypervisor type and the virtual NIC model (virtio-net, vmxnet3, etc.). Virtualization layers and vSwitch policies can dominate behavior, and changes inside the guest may not help if the bottleneck is elsewhere.

Use the right test tools for your traffic profile

You can’t meaningfully tune TCP and UDP with only one benchmark. Use at least one throughput tool and one latency tool, and prefer tests that approximate your packet sizes and concurrency.

For throughput, iperf3 is a common baseline:

bash

# On a peer host

iperf3 -s

# On the Ubuntu Server under test

iperf3 -c <peer_ip> -P 4 -t 30

For UDP capacity and loss, use iperf3 -u, but interpret results carefully because UDP tests can saturate queues and show loss that TCP would avoid via congestion control:

bash
iperf3 -c <peer_ip> -u -b 5G -l 1200 -t 30

For latency, you need more than ping. ping is still useful to catch MTU issues and basic RTT changes, but application latency often depends on queueing and contention. Tools like wrk (HTTP), hey (HTTP), redis-benchmark, netperf, or workload-specific client generators are better.
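
For example, a short wrk run against an HTTP endpoint (wrk is not installed by default; the URL, thread count, and connection count here are placeholders to adapt to your workload) reports latency percentiles alongside throughput:

bash

# 4 threads, 100 connections, 30 seconds, with a latency distribution at the end

wrk -t4 -c100 -d30s --latency http://<host>/<endpoint>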

A practical baseline might look like this:

1) iperf3 throughput with several parallel streams.

2) A service-level test (e.g., wrk to your NGINX endpoint).

3) CPU and interrupt distribution during the test (mpstat, /proc/interrupts).

With that baseline, you can start isolating bottlenecks.

Check the physical link and L2 path first

Many “performance” issues are actually link negotiation problems, duplex mismatches, mis-sized MTUs, or switch-side features (LACP hashing, buffer policies, storm control) that interact poorly with your traffic. If you skip these checks and jump straight into sysctl tuning, you can waste hours.

Confirm speed, duplex, autonegotiation, and flow control

Use ethtool to confirm the negotiated link mode and whether there are signs of a bad physical link (CRC errors, symbol errors, interface resets).

bash
sudo ethtool <iface>
sudo ethtool -S <iface> | head

Pay attention to:

“Speed” and “Duplex”: if you expect 10GbE and see 1GbE, nothing in Linux tuning will recover the missing bandwidth.

RX/TX errors and drops: rising error counters usually indicate cabling, optics, or switch port issues.

Pause frames (flow control): flow control can stabilize loss on congested links but can also propagate congestion in ways that hurt latency-sensitive workloads. Whether it’s desirable depends on your environment.
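
To see whether pause frames are enabled or negotiated on the interface, query its pause parameters:

bash
sudo ethtool -a <iface>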

If the NIC is bonded (LACP), understand that a single TCP flow typically hashes to a single physical member; you won’t see aggregate bandwidth on a single transfer unless you use multiple flows.
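
If the host uses bonding, the kernel reports the bond mode, transmit hash policy, and member link state under /proc (shown here assuming a bond named bond0):

bash
cat /proc/net/bonding/bond0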

Validate MTU end-to-end before enabling jumbo frames

Jumbo frames (commonly MTU 9000) can reduce CPU overhead for high-throughput workloads by decreasing packets per second for a given bandwidth. But jumbo frames require consistent MTU across the entire L2 path (host NIC, switch ports, trunks, any intermediate devices). A partial MTU mismatch can cause fragmentation (if allowed) or blackholing (if blocked), both of which look like performance issues.

Start by confirming current MTU:

bash
ip link show <iface>

Then validate path MTU using “do not fragment” pings. For IPv4:

bash

# 8972 payload + 28 bytes ICMP/IP = 9000 MTU

ping -M do -s 8972 <peer_ip>

For IPv6, routers never fragment packets; an oversized probe instead triggers an ICMPv6 “Packet Too Big” message. Because the IPv6 header is 40 bytes and the ICMPv6 echo header is 8 bytes, a 9000-byte MTU corresponds to an 8952-byte payload:

bash
ping -6 -M do -s 8952 <peer_ipv6>

If jumbo frames fail, do not force them on only the servers. Fix the path, or stick with MTU 1500.

Scenario: 10GbE upgrade that “didn’t get faster”

A common case in mixed environments is a server upgraded to 10GbE where throughput stays near 940 Mbit/s. Baseline checks show ethtool reports 1000Mb/s because the switch port is hard-set or the optic/cable doesn’t support 10Gb. In another variation, the server is 10Gb but traffic is traversing a 1Gb uplink due to a mispatched switch port.

The key takeaway is that link-level validation is the first “optimization.” If you don’t see the expected negotiated rate and clean counters, fix that before touching kernel knobs.

Understand the Ubuntu networking stack you’re configuring

Ubuntu Server can be configured using different tools depending on the version and install choices:

Netplan is the declarative configuration layer.

The renderer is typically systemd-networkd on servers, though some installs use NetworkManager.

The kernel networking stack is common regardless of renderer.

Why this matters: persistent tuning depends on where you apply it. Interface settings like MTU and offloads are generally applied via ethtool or renderer hooks; sysctl parameters are kernel-level and applied via /etc/sysctl.d/*.conf or sysctl -w for temporary changes.

When you change something, confirm it persists across reboot and interface resets.
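
As a sketch of renderer-level persistence, an MTU change can be declared in Netplan instead of being set ad hoc with ip link. The file name and interface below are placeholders; in practice you would merge the mtu key into your existing Netplan configuration, and only after validating the path end-to-end:

bash
sudo tee /etc/netplan/99-mtu.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    <iface>:
      mtu: 9000
EOF

sudo netplan apply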

Tune NIC offloads and understand when to disable them

NIC offloads push certain packet processing tasks from the CPU to the NIC. Offloads can significantly improve throughput and CPU efficiency, but they can also complicate packet captures, interfere with some virtual switching features, or cause issues with certain drivers/firmware combinations.

Inspect current offload configuration

Use ethtool -k:

bash
sudo ethtool -k <iface>

Common offloads you’ll see:

TSO (TCP Segmentation Offload): NIC segments large TCP buffers into MSS-sized packets.

GSO/GRO (Generic Segmentation/Receive Offload): kernel aggregates or segments packets.

LRO (Large Receive Offload): older receive aggregation; often discouraged in routing/bridging scenarios.

Checksum offload: NIC computes/validates checksums.

For many servers (especially “endpoint” hosts like web servers and clients), leaving offloads enabled is usually beneficial.

When disabling offloads can help

There are situations where disabling specific offloads is reasonable:

If the host is acting as a router, firewall, or doing traffic shaping, large aggregated packets can reduce the effectiveness of per-packet policies and distort queueing.

If you’re troubleshooting packet corruption or odd drops and suspect driver/firmware issues.

If you rely heavily on packet capture accuracy on the host (GRO can merge packets, making captures appear different).

Changes can be applied temporarily:

bash
sudo ethtool -K <iface> gro off gso off tso off

Avoid blanket “turn everything off” policies. Disabling offloads can raise CPU usage dramatically at high bandwidth, shifting the bottleneck from NIC to CPU.

Scenario: latency-sensitive trading app on a shared host

Consider a low-latency UDP feed handler running on a host that also performs large TCP backups. GRO and large receive aggregation can increase burstiness and queueing inside the kernel, worsening tail latency for small packets during backup windows. In such a scenario, selectively disabling GRO on the feed-facing interface (or isolating traffic to separate NICs/queues) can improve latency consistency, even if it slightly reduces maximum throughput.

This kind of tradeoff should be validated with latency measurements under mixed load, not assumed.

Increase parallelism: queues, RSS, RPS/XPS, and interrupt affinity

On modern NICs, network processing is parallelized using multiple RX/TX queues. Receive Side Scaling (RSS) uses a hash of packet headers to distribute flows across queues, which are typically mapped to different CPU cores via interrupts. If all interrupts land on one core, the system can become CPU-bound on that core long before the link is saturated.

Check queue count and RSS configuration

Start by checking how many channels (queues) the NIC and driver expose:

bash
sudo ethtool -l <iface>

Some drivers allow changing channels:

bash

# Example: set combined RX/TX queues to 8 (if supported)

sudo ethtool -L <iface> combined 8

Not all NICs support changing queues, and some environments (notably certain cloud instances) restrict it.

Inspect IRQ distribution

Check which CPUs handle NIC interrupts:

bash
grep -i <iface> /proc/interrupts

If you see most interrupts incrementing on CPU0 only, you’re likely hitting an interrupt affinity bottleneck.

Ubuntu often runs irqbalance, which spreads interrupts across CPUs. Verify it is installed and running:

bash
systemctl status irqbalance

irqbalance is a good default for general-purpose servers. For specialized, latency-sensitive systems, you may prefer explicit pinning of IRQs and application threads, but that becomes an architecture exercise (NUMA locality, CPU isolation, etc.).

NUMA awareness for multi-socket servers

On dual-socket systems, NIC PCIe locality matters. If the NIC is attached to NUMA node 0 but your workload threads run mostly on node 1, you can incur cross-socket memory traffic and higher latency.

Check NUMA node of the NIC:

bash

# Identify PCI address first via ethtool -i or lspci

lspci -vv -s <pci_addr> | grep -i numa -n
cat /sys/bus/pci/devices/0000:<pci_addr>/numa_node

Then check CPU topology:

bash
lscpu
numactl --hardware

If you’re optimizing a high-throughput, CPU-heavy dataplane (e.g., software load balancer, IDS/IPS, packet broker), align IRQs and worker threads with the NIC’s NUMA node.
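
One way to keep a dataplane process on the NIC-local socket is to launch it under numactl. Node 0 below is only an example; use the node reported for your NIC, and treat <command> as a placeholder:

bash

# Restrict CPU scheduling and memory allocation to NUMA node 0

numactl --cpunodebind=0 --membind=0 <command>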

RPS/XPS: software steering when hardware RSS is limited

RPS (Receive Packet Steering) and XPS (Transmit Packet Steering) are kernel mechanisms to distribute packet processing across CPUs even if hardware queues are limited. These can help on NICs with few queues (or on certain virtio setups), but they also add overhead.

You can inspect and set RPS CPU masks per RX queue under /sys/class/net/<iface>/queues/.

Example (illustrative) approach:

bash

# Show existing RPS settings

for f in /sys/class/net/<iface>/queues/rx-*/rps_cpus; do echo "$f: $(cat $f)"; done

If you decide to change RPS/XPS, do so deliberately and document the CPU mask chosen. In many general server cases, enabling or changing RPS is not necessary if RSS and irqbalance are functioning.
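
If you do enable RPS, each queue's rps_cpus file takes a hexadecimal CPU bitmap. A minimal sketch (mask f allows CPUs 0-3; choose a mask that matches your topology and NUMA layout):

bash

# Allow CPUs 0-3 to process received packets for queue rx-0

echo f | sudo tee /sys/class/net/<iface>/queues/rx-0/rps_cpus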

Scenario: 25GbE storage node stuck at ~8Gb/s

A storage node moving large objects over TCP shows throughput capping around 8Gb/s with one CPU core pegged at 100% softirq. ethtool -l reveals only 4 combined queues, and /proc/interrupts shows most activity on a single core due to static affinity. Enabling irqbalance (or manually distributing IRQs across cores local to the NIC’s NUMA node), and increasing queues to 8 if the driver supports it, often unlocks near-line-rate throughput.

This is a classic case where the bottleneck is not TCP buffer size, but parallelism in packet processing.

Kernel and sysctl tuning: focus on the constraints you actually hit

Linux defaults are conservative and aim to behave safely across many environments. For high-bandwidth or high-concurrency servers, you may need to adjust socket buffer limits, backlog queues, and TCP behavior. The most common mistake is copying a “sysctl tuning list” without understanding what each parameter affects.

The safest approach is:

1) Change a small set of parameters.

2) Validate with your baseline tests.

3) Confirm there are no regressions in latency, memory use, or connection behavior.

On Ubuntu, persist sysctl settings with files under /etc/sysctl.d/, for example /etc/sysctl.d/99-net-performance.conf.

Increase socket buffer limits for high bandwidth-delay products

High-throughput TCP over links with non-trivial latency requires socket buffers large enough to cover the bandwidth-delay product (BDP). If the TCP window cannot grow to the BDP, you won’t fill the pipe.
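
As a quick sanity check, BDP in bytes is bandwidth divided by 8, multiplied by the round-trip time. For a hypothetical 10 Gbit/s path with a 20 ms RTT:

bash

# BDP (bytes) = bandwidth (bit/s) / 8 * RTT (s)

echo $(( 10 * 10**9 / 8 * 20 / 1000 ))   # 25000000 bytes, roughly 25 MB of window needed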

Key parameters:

net.core.rmem_max, net.core.wmem_max: maximum socket receive/send buffer sizes.

net.ipv4.tcp_rmem, net.ipv4.tcp_wmem: min/default/max TCP buffer sizes.

A common pattern is to raise maxima to allow scaling under load, while leaving defaults reasonable.

Example configuration (treat as a starting point, not a universal truth):

bash
sudo tee /etc/sysctl.d/99-net-performance.conf >/dev/null <<'EOF'

# Allow larger per-socket buffers for high-throughput links

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# TCP autotuning ranges: min, default, max (bytes)

net.ipv4.tcp_rmem = 4096 1048576 134217728
net.ipv4.tcp_wmem = 4096 1048576 134217728
EOF

sudo sysctl --system

You should validate memory impact. Large maxima don’t mean every socket uses that much; Linux autotuning grows buffers based on need. But on connection-heavy systems, permissive maxima can still contribute to higher memory usage under certain traffic patterns.
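
To see how much memory TCP is actually consuming relative to the kernel's global thresholds (both reported in pages), compare the socket statistics with tcp_mem:

bash
cat /proc/net/sockstat
sysctl net.ipv4.tcp_mem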

Queueing and backlog parameters for burst handling

If your host receives bursts (many packets arriving faster than the kernel can process momentarily), backlog sizing matters.

net.core.netdev_max_backlog controls how many packets can be queued on the input side when the kernel is overwhelmed.

net.core.somaxconn caps the listen backlog for TCP sockets (interacts with application listen() backlog).

Example:

bash
sudo tee /etc/sysctl.d/99-net-queues.conf >/dev/null <<'EOF'

# Larger backlog for bursty traffic (balance against memory and latency)

net.core.netdev_max_backlog = 250000

# Allow larger listen backlogs for high connection rates

net.core.somaxconn = 4096
EOF

sudo sysctl --system

Raising backlogs can reduce drops during bursts, but it can also increase latency by allowing deeper queues. For latency-sensitive workloads, you may prefer to keep queues smaller and address root causes (interrupt distribution, application accept rates, or upstream load shaping).

TCP congestion control and queue discipline

TCP congestion control algorithms govern how aggressively TCP increases its sending rate and reacts to loss/ECN. Linux supports multiple algorithms; modern Ubuntu kernels commonly use CUBIC by default. BBR is available on many kernels and can improve throughput and latency on some paths, but it is not universally better, and its fairness characteristics can matter in shared networks.

Check current congestion control and qdisc:

bash
sysctl net.ipv4.tcp_congestion_control
sysctl net.core.default_qdisc

If you consider experimenting, do it in a controlled environment and validate with realistic traffic and co-tenancy. A common combination for general latency/throughput balance is fq (Fair Queuing) as the default qdisc:

bash
sudo tee /etc/sysctl.d/99-net-qdisc.conf >/dev/null <<'EOF'
net.core.default_qdisc = fq
EOF

sudo sysctl --system

Changing congestion control is a bigger decision; if you do, capture before/after metrics and ensure you’re not impacting other tenants or upstream devices.
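
If you do experiment, first confirm which algorithms the running kernel offers; BBR, for example, depends on the tcp_bbr module being available. A temporary, reversible test could look like this:

bash

# Algorithms the kernel can currently use

sysctl net.ipv4.tcp_available_congestion_control

# Switch to BBR for a test window (revert by setting cubic again)

sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr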

TCP keepalives and time-wait behavior (connection scale)

If your server handles millions of short-lived connections (e.g., API gateways, reverse proxies), you may run into resource pressure around ephemeral ports and TIME_WAIT sockets. Tuning here can help, but it’s easy to break protocol correctness.

Safer first steps are:

Ensure applications use keep-alive where appropriate (HTTP keep-alive, connection pooling).

Scale fs.file-max and per-process limits appropriately.

Only then consider TCP-level adjustments.

You can inspect TIME_WAIT counts:

bash
ss -Hant state time-wait | wc -l

If ephemeral port exhaustion is possible, review the port range:

bash
sysctl net.ipv4.ip_local_port_range

Widening the ephemeral port range can help at high connection rates:

bash
sudo tee /etc/sysctl.d/99-net-ports.conf >/dev/null <<'EOF'
net.ipv4.ip_local_port_range = 10240 65535
EOF

sudo sysctl --system

Avoid risky settings you may find online (for example, disabling TIME_WAIT handling in ways that can break NAT scenarios). If you’re fronting services behind load balancers or NAT gateways, correctness and reuse timing matter.

MTU, MSS clamping, and overlay networks

If you run Ubuntu Server in environments with overlays—VXLAN, Geneve, WireGuard, IPsec, or Kubernetes CNI overlays—effective MTU becomes central to performance and reliability. Encapsulation adds overhead, reducing the maximum payload size that can be transmitted without fragmentation.

Choose MTU based on the full encapsulation stack

For example:

VXLAN adds 50 bytes of overhead on an IPv4 underlay (inner Ethernet header plus outer IP, UDP, and VXLAN headers; more with VLAN tags or an IPv6 underlay).

WireGuard adds roughly 60 bytes of overhead over an IPv4 underlay (about 80 bytes over IPv6).

If your underlay MTU is 1500 and you add VXLAN, your overlay MTU must be smaller (often around 1450) to avoid fragmentation.

Inconsistent MTU across nodes frequently manifests as:

Good performance for small responses but stalls/hangs for larger transfers.

Intermittent gRPC/HTTP2 resets.

Strange TCP retransmissions without obvious link errors.

Validate with DF pings at the effective size between endpoints inside the overlay.
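
For example, with a 1450-byte overlay MTU, an IPv4 DF probe should carry 1450 - 28 = 1422 bytes of ICMP payload:

bash

# 1422 payload + 28 bytes ICMP/IP = 1450 overlay MTU

ping -M do -s 1422 <overlay_peer_ip>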

MSS clamping for routed/firewalled paths

If you control firewall rules, TCP MSS clamping can prevent oversized TCP segments when path MTU is smaller than endpoints assume. This is more common on routers/firewalls than on pure endpoints, but Ubuntu servers acting as gateways may need it.

On Linux with iptables/nftables, MSS clamping is possible, but implement it only if you understand your path MTU problem and can validate it. Overusing MSS clamping can hide MTU issues that should be fixed at the source.
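
As an illustrative sketch for an Ubuntu host acting as a gateway, iptables can clamp the MSS of forwarded SYN packets to the discovered path MTU (nftables offers an equivalent rule); apply it only after you have confirmed the path MTU problem:

bash

# Clamp TCP MSS on forwarded connections to the path MTU

sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu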

Reduce CPU overhead in the data path

Once fundamentals and parallelism are addressed, CPU overhead often becomes the limiting factor—especially at 10GbE and above, or when PPS is high (small packets). Ubuntu can move a lot of traffic, but you need to ensure the CPU isn’t spending unnecessary cycles on extra layers.

Minimize expensive firewall rules on hot paths

If the server is not intended to be a firewall, keep host firewall rules minimal and efficient. Long rule sets with many matches can be costly at high PPS. Prefer stateful rules that match early and avoid per-packet logging.

On Ubuntu, nftables is common (via ufw or directly). Regardless of tooling, the principle is to simplify and ensure the majority of traffic hits fast paths.

If you suspect firewall overhead, measure CPU during a controlled iperf3 run with firewall enabled vs. disabled (in a safe maintenance window) to quantify impact.

Avoid unnecessary bridging and packet mangling

Bridging (Linux bridge), NAT, and heavy use of conntrack can add overhead. If your host is a virtualization node or container host, consider whether traffic can be kept on an accelerated path (e.g., virtio + vhost-net, SR-IOV, or moving routing/NAT to an edge).

Conntrack table pressure can also create drops and latency. You can monitor conntrack usage:

bash
sudo sysctl net.netfilter.nf_conntrack_count 2>/dev/null
sudo sysctl net.netfilter.nf_conntrack_max 2>/dev/null

If conntrack is used heavily (Kubernetes nodes with NodePort, NAT gateways, etc.), plan capacity appropriately. Increasing conntrack tables without memory planning can cause system-wide pressure.

Scenario: Kubernetes node with periodic packet drops

A Kubernetes worker node runs many services and uses kube-proxy in iptables/nft mode. During traffic spikes, you see intermittent drops and increased latency. Basic link checks look clean, but nstat shows TCP retransmits rising, and CPU shows spikes in softirq and netfilter paths.

In this type of environment, the optimization path often involves a combination of:

Ensuring RSS/queues and IRQ distribution are healthy.

Reviewing conntrack sizing and timeouts based on expected connection counts.

Reducing unnecessary service hops (for example, preferring cluster-internal load balancing that avoids extra NAT where possible).

The key is that “network tuning” may actually be “packet processing pipeline” tuning, where netfilter and conntrack are major actors.

Apply interface-level tuning safely and persistently

Some of the most effective improvements come from NIC and interface settings that live outside sysctl. The operational challenge is persistence: settings applied with ethtool can revert on reboot or link flap unless you hook them into your network configuration management.

Ring buffers: reduce drops under burst

NIC ring buffers (RX/TX) buffer packets between the NIC and the kernel. If the RX ring is too small, bursts can overflow and drop packets even though average throughput is fine.

Check current and maximum ring sizes:

bash
sudo ethtool -g <iface>

If the driver supports it, increase RX (and sometimes TX) rings cautiously:

bash
sudo ethtool -G <iface> rx 4096 tx 4096

Bigger rings can reduce drops during bursts, but they can also increase latency by allowing more packets to queue before processing. For latency-sensitive services, don’t blindly maximize ring sizes; test with your real traffic.

Coalescing: trade CPU for latency

Interrupt coalescing controls how frequently the NIC raises interrupts. Aggressive coalescing reduces CPU overhead but can increase latency (packets wait longer before being processed).

Inspect current coalescing:

bash
sudo ethtool -c <iface>

If you tune coalescing, do so with clear goals. For high-throughput batch transfer nodes, more coalescing may be acceptable. For real-time systems, you may want less coalescing.

Because coalescing settings are highly NIC/driver-specific, avoid copying values from unrelated hardware. Measure: change one parameter, re-test throughput and tail latency.
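
As an illustrative starting point only (supported keywords and sensible values vary by driver, so check the ethtool -c output first):

bash

# Let the driver adapt RX interrupt moderation to load, if supported

sudo ethtool -C <iface> adaptive-rx on

# Or set a fixed RX interrupt delay in microseconds

sudo ethtool -C <iface> rx-usecs 64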

Persisting ethtool settings on Ubuntu

How you persist depends on your renderer and automation.

If you use systemd-networkd, you can use .link files for some settings (like MTU) and .network for others, but ethtool-style tuning (rings/coalescing/offloads) may still require scripting.

Common approaches include:

A systemd oneshot service that runs after network is up.

A networkd-dispatcher script.

Cloud-init or configuration management (Ansible) that applies settings at boot.

Example systemd unit (adapt it and test carefully):

bash
sudo tee /etc/systemd/system/ethtool-tune@.service >/dev/null <<'EOF'
[Unit]
Description=Apply ethtool tuning for %i
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -G %i rx 4096 tx 4096
ExecStart=/usr/sbin/ethtool -K %i gro on gso on tso on

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ethtool-tune@<iface>.service

Keep this kind of persistence under version control, and make it explicit in your runbooks. “Someone ran ethtool once” is not a sustainable operational state.

Optimize routing, ARP/ND, and neighbor behavior for scale

On busy servers, routing and neighbor cache behavior can become a hidden limiter—especially for L3-heavy hosts, gateways, or nodes with many peers.

Validate routing and avoid asymmetric paths

Asymmetric routing (packets in and out using different paths) can cause confusing behavior when stateful firewalls, conntrack, or load balancers are involved. Even without statefulness, asymmetry can increase latency and cause uneven load.

Check routing tables and policy routing:

bash
ip route
ip rule
ip route show table all | head -n 50

If you use multiple interfaces, verify that return traffic leaves the interface you expect, especially when source-based routing or VRFs are involved.

Neighbor cache sizing and garbage collection

If a host talks to many L2 neighbors (for example, a node that communicates with thousands of pods or VMs on the same subnet), ARP (IPv4) and ND (IPv6) caches can churn.

You can inspect neighbor entries:

bash
ip neigh show | head
ip -6 neigh show | head

If you see frequent neighbor resolution delays, you may need to review neighbor cache thresholds. This area is sensitive; changing it without understanding can lead to memory bloat or stale entries. It’s usually more effective to address L2 design (subnet sizing, use of L3 boundaries) than to stretch neighbor caches indefinitely.
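
You can at least inspect the current garbage-collection thresholds before deciding whether they are a factor (IPv6 has equivalent settings under net.ipv6.neigh):

bash
sysctl net.ipv4.neigh.default.gc_thresh1
sysctl net.ipv4.neigh.default.gc_thresh2
sysctl net.ipv4.neigh.default.gc_thresh3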

Align application behavior with the tuned network stack

Even perfect kernel tuning won’t help if the application is inefficient: too few worker threads, small read/write buffers, frequent reconnects, or Nagle/Delayed ACK interactions for small messages.

Ensure services can accept connections fast enough

For TCP servers under high connection rates, the accept loop and backlog matter. If your application is single-threaded in accept handling, somaxconn increases won’t help. You’ll see SYNs arriving, retransmissions, and an increasing queue.
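
Accept-queue pressure is visible in the kernel counters and in the Recv-Q/Send-Q columns of listening sockets; rising overflow counters point at the application, not the network:

bash

# Counters for listen queue overflows and drops

nstat -az | egrep -i 'listenoverflows|listendrops'

# For sockets in LISTEN state, Recv-Q is the current accept backlog and Send-Q its limit

ss -ltn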

For web stacks, confirm that:

The reverse proxy (NGINX/HAProxy) has enough worker processes and file descriptor limits.

The application server is not blocking on slow upstream calls.

Keep-alive and connection reuse are enabled where appropriate.

Match buffer sizes to request patterns

For bulk transfer services, larger read/write buffers can reduce syscall overhead. For small-message RPC, latency might be more sensitive to scheduling and queueing than to buffer size.

In other words, network tuning and application tuning are coupled. When you re-run benchmarks after changing sysctl or NIC settings, also ensure application settings remain constant.

Scenario: NGINX reverse proxy saturates CPU before NIC

An Ubuntu Server with 10GbE handles TLS termination and proxying to upstream services. iperf3 shows the NIC can push 9–10Gb/s, but real HTTPS traffic tops out at 3–4Gb/s with high CPU. NIC tuning doesn’t help because the bottleneck is TLS and user-space proxying, not the kernel datapath.

In that case, the optimization path is different: enable TLS session reuse, choose appropriate ciphers, ensure OpenSSL uses CPU features, scale worker processes, and consider load distribution across more nodes. The earlier baseline step prevents you from misdiagnosing this as a kernel network issue.

Validate improvements with controlled experiments

After each tuning change, go back to the same baseline tests and record results. At minimum, capture:

Throughput and retransmits (for TCP).

Packet loss (for UDP, if applicable).

Latency percentiles at the application layer.

CPU usage broken down by user/system/softirq.

Interrupt distribution.

A good validation pattern is A/B testing:

1) Apply one change.

2) Run tests multiple times.

3) If results are consistently better and no regressions appear, keep it.

4) Otherwise revert.

Where possible, validate using production-like traffic. Synthetic tools can hide issues like head-of-line blocking in HTTP/2, request size variance, or connection churn patterns.
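
A minimal capture sketch, assuming <iface> is filled in and the sysstat tools are installed, helps keep runs comparable (run it alongside each test and file the output with your notes):

bash
ts=$(date +%Y%m%d-%H%M%S)
{
  ip -s link show <iface>
  nstat -az | egrep -i 'retrans|drop|overflow'
  mpstat -P ALL 1 5
  grep -i <iface> /proc/interrupts
} > net-baseline-$ts.txt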

Practical tuning bundles by server role (use as starting points)

At this stage, you should have an idea of what your bottleneck is. Rather than prescribing one “best” configuration, it’s more useful to tie tuning to intent.

High-throughput bulk transfer node

If your server primarily moves large files/objects (backups, replication, data pipelines), you typically want:

Healthy RSS/queues and balanced IRQs.

Offloads enabled.

Larger socket buffer maxima for high BDP paths.

Potentially fq qdisc to reduce queueing issues.

Possibly jumbo frames if the entire path supports it.

Validate with multiple parallel TCP streams and monitor CPU softirq saturation.

Latency-sensitive service node

If your priority is consistent low latency (trading, real-time control, VoIP backends), you typically want:

Controlled queue depths (avoid extremely large rings/backlogs unless proven beneficial).

Careful interrupt and CPU placement (NUMA locality, possible explicit IRQ pinning).

Avoiding excessive coalescing.

Possibly disabling GRO on specific interfaces if it demonstrably improves tail latency.

Validation should focus on p95/p99 latency under load, not just peak throughput.

Virtualization host or container node

If the host forwards a lot of traffic between vNICs, bridges, overlays, and NAT, you typically want:

Efficient vSwitching path (virtio/vhost, or SR-IOV where appropriate).

Awareness of conntrack and netfilter overhead.

Queue and IRQ distribution that accounts for multiple busy interfaces.

MTU planning for overlays.

Here, “network performance” is often limited by packet processing overhead rather than raw NIC bandwidth.

Operational safeguards: change control, observability, and rollback

Network tuning changes can have outsized impact, and some are hard to diagnose after the fact. Treat changes as you would any performance-related change: staged rollouts, observability, and rollback plans.

Record what you changed and why

Maintain a simple record of:

Kernel parameters changed (file name under /etc/sysctl.d/).

Interface-level settings applied (rings, offloads, coalescing, MTU).

The baseline and post-change metrics.

The workload context (traffic pattern, concurrency, peer location).

This prevents “configuration drift by folklore,” where settings persist for years without anyone knowing if they still help.

Monitor the right signals continuously

Even after successful tuning, keep an eye on:

Interface error and drop counters (ip -s link).

TCP retransmits and resets (nstat).

Softirq CPU saturation (mpstat, sar -n DEV, sar -n SOCK where available).

Application latency percentiles.

A tuning change that improves throughput today might harm latency next month when traffic mix changes or when the kernel is updated.

Plan for kernel and driver updates

Ubuntu kernel updates can change default behavior (for example, qdisc defaults, NIC driver updates, offload handling). For high-performance systems, track kernel versions and NIC firmware/driver updates as part of capacity management.

Before rolling a kernel update fleet-wide, re-run a small subset of your baseline tests on a canary node. This is especially important for systems pushing high PPS or using advanced NIC features.