I noticed that requests to one of our systems were slower than requests to the others. The data shows this started two weeks ago, when we added a new node to that system to handle the extra load.
I've traced it to one of the TCP offload parameters, which had been tweaked by hand on the existing nodes but never added back to configuration or documentation, so the new node was built without it.
I've tested a fix and staged it. I believe the data shows that once it's applied in production, we'll have enough headroom to handle our scaling challenges.
Data from testing is available on the ticket and pipelines are green. I'd appreciate a code review, but we can have this live in 30 minutes.
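
To keep this kind of drift from creeping back in, here's a rough sketch of a check that compares a node's offload settings against what configuration says they should be. It assumes the settings are read with `ethtool -k <iface>`. The interface name, the feature names in `EXPECTED`, and their values are placeholders for illustration, not the actual parameter from the ticket.

```python
# Hypothetical desired state; in practice this would come from our config repo.
EXPECTED = {
    "tcp-segmentation-offload": "on",
    "generic-receive-offload": "on",
}


def parse_offload_features(ethtool_output):
    """Parse `ethtool -k <iface>` style text into {feature_name: state}."""
    features = {}
    for line in ethtool_output.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        value = value.strip()
        if not value:
            # Header line such as "Features for eth0:" has nothing after the colon.
            continue
        # Keep only the state, dropping trailing notes like "[fixed]".
        features[name.strip()] = value.split()[0]
    return features


def find_drift(expected, actual):
    """Return {feature: (expected_state, actual_state)} for every mismatch."""
    drift = {}
    for name, want in expected.items():
        got = actual.get(name, "missing")
        if got != want:
            drift[name] = (want, got)
    return drift


# Made-up sample output standing in for a real node.
SAMPLE_OUTPUT = """Features for eth0:
tcp-segmentation-offload: off
generic-receive-offload: on
\ttx-checksum-ipv4: off [fixed]
"""

drift = find_drift(EXPECTED, parse_offload_features(SAMPLE_OUTPUT))
for name, (want, got) in drift.items():
    print(f"{name}: expected {want}, found {got}")
```

Running something like this against each node at provisioning time would flag a node whose settings don't match configuration before it takes traffic.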