TCP Tuning on a Lossy Backbone
Two anycast POPs connected over a tunnel. POP-A is a 1 vCPU Debian VM.
POP-B is a 4-core Arch router. RTT between them is 30 ms. The link
carries iBGP, transit for an anycasted /24 and /47, and client traffic.
Single-stream iperf3 from POP-A to POP-B:
[ 5] 0.00-10.00 sec 146 MBytes 122 Mbits/sec 216 retr
Reverse direction was 392 Mbit/s with the same config. The problem was
only on sends from POP-A.
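For reference, the forward test was a plain single-stream run along
these lines; the peer address here is a placeholder, not the real
endpoint:

# Hypothetical invocation; 198.51.100.2 stands in for POP-B's tunnel address.
iperf3 -c 198.51.100.2 -t 10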
Tunnel swap: WireGuard to GRE
First test was replacing WireGuard with ip6gre. No crypto, lower
per-packet overhead. Reverse direction results:
| Tunnel | Reverse Mbit/s | Retransmits |
|---|---|---|
| WireGuard | 392 | 6,079 |
| GRE | 535 | 1,903 |
Forward direction after the swap: still 120 Mbit/s. POP-A load during
the test was 0.00. The tunnel was not the constraint.
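The replacement tunnel was built roughly like this; the 2001:db8::
endpoints, the gre1 name, and the inner addressing are placeholders,
not the exact commands used:

# Sketch of the ip6gre setup.
ip link add gre1 type ip6gre local 2001:db8::1 remote 2001:db8::2
ip link set gre1 up
ip addr add 10.0.0.1/30 dev gre1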
Sender-side defaults
POP-A vs POP-B TCP stack:
| Setting | POP-A | POP-B |
|---|---|---|
| Congestion control | cubic | bbr |
| Default qdisc | pfifo_fast | fq |
| rmem_max | 208 KB | 32 MB |
| wmem_max | 208 KB | 16 MB |
| tcp_mtu_probing | 0 | 1 |
| BBR module | not loaded | loaded |
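The values were read straight off each box; a check of this shape
reproduces the comparison (eth0 is a placeholder interface name):

# Dump the settings compared above.
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc \
       net.core.rmem_max net.core.wmem_max net.ipv4.tcp_mtu_probing
tc qdisc show dev eth0
lsmod | grep tcp_bbr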
Cubic is loss-based. On a path with meaningful random loss it backs off
cwnd on every drop and never sustains a window near the BDP. BBR paces
at the measured delivery rate and does not treat loss as a congestion
signal, but it needs fq for pacing and socket buffers large enough to
cover the BDP.
At 30 ms and 500 Mbit/s, BDP is about 1.87 MB. A 208 KB wmem_max
caps the send buffer well below that.
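The arithmetic, spelled out:

# BDP = bandwidth * RTT = 500 Mbit/s * 30 ms
# (500e6 / 8) bytes/s * 0.030 s = 1,875,000 bytes, about 1.87 MB
echo $(( 500000000 / 8 * 30 / 1000 ))    # prints 1875000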
Tuning applied to POP-A
# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 32768 33554432
net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 5000
Plus tcp_bbr in /etc/modules-load.d/bbr.conf, and tc qdisc replace
on the live interfaces.
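Applied live, no reboot, roughly as follows; interface names are
placeholders:

# Load BBR, reload sysctls, swap the root qdisc per interface.
modprobe tcp_bbr
sysctl --system
tc qdisc replace dev eth0 root fq
tc qdisc replace dev gre1 root fq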
Same test after the change:
[ 5] 0.00-15.00 sec 1.12 GBytes 644 Mbits/sec 84400 retr
644 Mbit/s, cwnd climbed to 5.8 MB, 84,000 retransmits. Roughly 10% of packets were lost and throughput did not collapse.
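The cwnd figure was read off the live socket with ss, which also
confirms the congestion controller in use (peer address again a
placeholder):

# -t TCP, -i internal TCP info, -n numeric; shows bbr, cwnd, pacing rate.
ss -tin dst 198.51.100.2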
Path characterization
UDP iperf3 at fixed rates:
| Offered | Delivered | Loss |
|---|---|---|
| 200 Mbit/s | 197 | 1.2% |
| 400 Mbit/s | 363 | 9.4% |
| 700 Mbit/s | 635 | 8.7% |
| 1000 Mbit/s | 640 | 7.4% |
Delivered rate plateaus near 640 Mbit/s regardless of offered rate.
That profile matches a policer, not a congested link with queue-overflow drops.
Most likely the VPS uplink allocation on POP-A.
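The sweep was a series of fixed-rate UDP runs of this shape (placeholder
address again):

# Offered-rate sweep; -u selects UDP, -b sets the target bitrate.
for rate in 200M 400M 700M 1000M; do
    iperf3 -c 198.51.100.2 -u -b $rate -t 10
done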
Full results
| Test | Direction | CC | Tunnel | Mbit/s |
|---|---|---|---|---|
| Initial | A to B | cubic | WG | 157 |
| Tunnel swap | A to B | cubic | GRE | 120 |
| + BBR/fq/buffers | A to B | bbr | GRE | 644 |
| + 4 streams | A to B | bbr | GRE | 790 |
| Reverse | B to A | bbr | WG | 392 |
| Reverse | B to A | bbr | GRE | 535 |
| No tunnel, direct | A to B | bbr | none | 671 |
Tunnel overhead is within noise. Multi-stream gets closer to the policer
ceiling because the loss distributes across flows. GRE gave a real
30-35% win in the reverse direction, where POP-A’s single vCPU was
WireGuard-bound on decrypt. In the forward direction, GRE was secondary
to the TCP tuning.
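The four-stream figure came from a parallel run, something like:

# Four parallel streams; per-flow loss is lower, so each keeps a larger cwnd.
iperf3 -c 198.51.100.2 -P 4 -t 15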
Configuration reference
# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
net.ipv4.tcp_wmem = 4096 32768 33554432
net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 5000
# /etc/modules-load.d/bbr.conf
tcp_bbr
Next
POP-A is being replaced with a CachyOS x86-64-v3 image. CachyOS ships
BBR and fq by default, runs a newer kernel that carries BBRv3 and other
networking patches, and uses an x86-64-v3-optimized glibc. Comparison
numbers after that swap will follow in a later post.