TCP Tuning on a Lossy Backbone

Two anycast POPs connected over a tunnel. POP-A is a 1 vCPU Debian VM. POP-B is a 4-core Arch router. RTT between them is 30 ms. The link carries iBGP, transit for an anycasted /24 and /47, and client traffic.

Single-stream iperf3 from POP-A to POP-B:

[  5] 0.00-10.00 sec  146 MBytes  122 Mbits/sec  216 retr

Reverse direction was 535 Mbit/s with the same config. The problem was only on sends from POP-A.
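
The forward and reverse runs map to plain iperf3 invocations along these lines; the hostname is a placeholder, not the real endpoint:

# forward: POP-A sends to POP-B
iperf3 -c pop-b.example.net -t 10

# reverse: POP-B sends back to POP-A over the same control connection
iperf3 -c pop-b.example.net -t 10 -R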

Tunnel swap: WireGuard to GRE

First test was replacing WireGuard with ip6gre. No crypto, lower per-packet overhead. Reverse direction results:

Tunnel      Reverse Mbit/s   Retransmits
WireGuard   392              6,079
GRE         535              1,903
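
For reference, the ip6gre side of the swap is only a few iproute2 commands per box; the addresses and interface name below are placeholders, not the real config:

# on POP-A; POP-B mirrors local/remote
ip link add gre-popb type ip6gre local 2001:db8:a::1 remote 2001:db8:b::1
ip addr add 10.90.0.1/30 dev gre-popb
ip link set gre-popb up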

Forward direction after the swap: still 120 Mbit/s. POP-A load during the test was 0.00. The tunnel was not the constraint.

Sender-side defaults

POP-A vs POP-B TCP stack:

Setting              POP-A        POP-B
Congestion control   cubic        bbr
Default qdisc        pfifo_fast   fq
rmem_max             208 KB       32 MB
wmem_max             208 KB       16 MB
tcp_mtu_probing      0            1
BBR module           not loaded   loaded

Cubic is loss-based. On a path with any meaningful random loss it cuts cwnd on every drop and never recovers. BBR paces at the measured delivery rate and ignores loss as a congestion signal, but it requires fq for pacing and adequate socket buffers to fill the BDP.
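
Checking where each box stands takes a few commands; eth0 is a placeholder for the real interface:

sysctl net.ipv4.tcp_congestion_control            # algorithm used for new sockets
sysctl net.ipv4.tcp_available_congestion_control  # bbr only shows up once tcp_bbr is loaded
sysctl net.core.default_qdisc
tc qdisc show dev eth0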

At 30 ms and 500 Mbit/s, BDP is about 1.87 MB. A 208 KB wmem_max caps the send buffer well below that.
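
The arithmetic, with awk standing in as a calculator:

# BDP = bandwidth x RTT, in bytes
awk 'BEGIN { printf "%.3f MB\n", 500e6 * 0.030 / 8 / 1e6 }'   # 1.875 MB, roughly 9x the 208 KB cap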

Tuning applied to POP-A

# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 32768 33554432
net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 5000

Plus tcp_bbr in /etc/modules-load.d/bbr.conf, and tc qdisc replace on the live interfaces.
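
Applied live, it comes down to roughly this; the interface name is a placeholder:

modprobe tcp_bbr
sysctl --system                     # re-reads /etc/sysctl.d/*.conf
tc qdisc replace dev eth0 root fq   # default_qdisc only applies to interfaces created after the change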

Same test after the change:

[  5] 0.00-15.00 sec  1.12 GBytes  644 Mbits/sec  84400 retr

644 Mbit/s, cwnd climbed to 5.8 MB, 84,000 retransmits. Roughly 10% of packets were lost and throughput did not collapse.
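
cwnd on the sender can be watched live with ss while the test runs; 5201 is iperf3's default port:

watch -n1 'ss -tin dport = :5201'   # shows bbr, cwnd (in MSS-sized segments, not bytes), pacing rate, retrans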

Path characterization

UDP iperf3 at fixed rates:

Offered       Delivered     Loss
200 Mbit/s    197 Mbit/s    1.2%
400 Mbit/s    363 Mbit/s    9.4%
700 Mbit/s    635 Mbit/s    8.7%
1000 Mbit/s   640 Mbit/s    7.4%

Delivered rate plateaus near 640 Mbit/s regardless of offered rate. That profile matches a policer enforcing a hard rate ceiling, not a congested link queueing and dropping under load. Most likely the VPS uplink allocation on POP-A.
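
The sweep is reproducible with fixed-rate UDP runs of this shape; the hostname is a placeholder:

for rate in 200M 400M 700M 1000M; do
    iperf3 -c pop-b.example.net -u -b $rate -t 10
done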

Full results

Test                Direction   CC      Tunnel   Mbit/s
Initial             A to B      cubic   WG       157
Tunnel swap         A to B      cubic   GRE      120
+ BBR/fq/buffers    A to B      bbr     GRE      644
+ 4 streams         A to B      bbr     GRE      790
Reverse             B to A      bbr     WG       392
Reverse             B to A      bbr     GRE      535
No tunnel, direct   A to B      bbr     none     671

Tunnel overhead is within noise. Multi-stream gets closer to the policer ceiling because loss distributes across flows. GRE gave a real 30-35% win in reverse direction where POP-A’s single vCPU was WireGuard-bound on decrypt. In forward direction, GRE was secondary to the TCP tuning.
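
The four-stream row maps to something like iperf3's parallel option; hostname again a placeholder:

iperf3 -c pop-b.example.net -t 15 -P 4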

Configuration reference

# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 32768 33554432

net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 5000

# /etc/modules-load.d/bbr.conf
tcp_bbr

Next

POP-A is being replaced with a CachyOS x86-64-v3 image. CachyOS ships BBR and fq by default, runs a newer kernel with BBR v3 and other networking patches, and uses an x86-64-v3-optimized glibc. Comparison numbers after that swap will follow in a later post.

