TCP connection issues after linux-5.4.0-1039-aws Update

Asked by Seb Mel on 2021-03-19

Hi,

I am seeing some odd behaviour in an AWS environment on an Ubuntu 18.04 instance (t3a.large). The Linux kernel was updated from 5.4.0-1038-aws to 5.4.0-1039-aws, and since then I have been getting errors when proxying connections in nginx.
The setup looks like this:
Requests are received via HTTPS on port 443 by nginx. These are forwarded to an AWS NLB via
proxy_pass https://my-nlb;

SSL termination for that connection is done on the destination host.

When sending requests through this setup at a steady rate (4-5 req/s), I see that nginx on the destination host logs 499 (client closed request) errors quite early (after 0.1-0.3 s). On the host in question (the one with the updated kernel), the affected requests to the destination get "stuck" and run into the proxy timeout after 60 s. So it seems that the TCP connection between nginx and the NLB destination breaks at some point.
Over the course of 3k requests, maybe 2-3 break in each of our tests.
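
For anyone who wants to reproduce the pattern described above, here is a minimal sketch of a paced request loop that counts stalled and failed requests. It is an illustration only, not the test harness actually used here: the function names, the URL and all numeric parameters are placeholders that need to be adapted.

```python
import http.client
import time
import urllib.request


def timed_request(url, timeout):
    """Issue one GET and return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    error = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        error = repr(exc)
    return time.monotonic() - start, error


def run(url, count, rate, timeout):
    """Send `count` sequential requests, paced at roughly `rate` per second."""
    interval = 1.0 / rate
    samples = []
    for _ in range(count):
        started = time.monotonic()
        samples.append(timed_request(url, timeout))
        # Sleep for whatever is left of this request's time slot.
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
    return samples


def summarize(samples, stall_after):
    """Count requests that failed or took at least `stall_after` seconds."""
    stalled = sum(1 for elapsed, _ in samples if elapsed >= stall_after)
    failed = sum(1 for _, error in samples if error is not None)
    return {"total": len(samples), "stalled": stalled, "failed": failed}
```

Something like `summarize(run(url, count=3000, rate=5, timeout=65), stall_after=60)` would then give a count to compare between the two kernels. Note that this loop is sequential, so a single stalled request blocks the run until it times out.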

I downgraded the kernel to 5.4.0-1038-aws and everything is working fine again now.
The changelog for 5.4.0-1039-aws does not give me any hint as to which change might have caused this, and it is a bit hard for me to pin it down.

How can I debug this further? I would be very grateful for any hint, thanks in advance!
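
One possible starting point, offered as an assumption rather than a confirmed diagnosis, is to snapshot the kernel's TCP counters in /proc/net/netstat before and after a test run on each kernel and compare which counters moved. The sketch below relies only on the layout of that file, which alternates a line of counter names with a line of values; the helper names are made up for illustration.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {"TcpExt.Name": value}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: a header line of names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }


def snapshot(path="/proc/net/netstat"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as fh:
        return parse_netstat(fh.read())
```

Taking `before = snapshot()` ahead of a test run and `delta(before, snapshot())` afterwards on both kernels would show whether counters such as TcpExt.TCPTimeouts behave differently on 5.4.0-1039-aws.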

Some more info on the setup:
cat /proc/version_signature
Ubuntu 5.4.0-1039.41~18.04.1-aws 5.4.94

lspci -vnvn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237]
 Subsystem: Amazon.com, Inc. 440FX - 82441FX PMC [Natoma] [1d0f:1237]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0

00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
 Physical Slot: 1
 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:01.3 Non-VGA unclassified device [0000]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 08)
 Physical Slot: 1
 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Interrupt: pin ? routed to IRQ 9

00:03.0 VGA compatible controller [0300]: Amazon.com, Inc. Device [1d0f:1111] (prog-if 00 [VGA controller])
 Physical Slot: 3
 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Region 0: Memory at fe400000 (32-bit, prefetchable) [size=4M]
 Expansion ROM at 000c0000 [disabled] [size=128K]

00:04.0 Non-Volatile memory controller [0108]: Amazon.com, Inc. Device [1d0f:8061] (prog-if 02 [NVM Express])
 Subsystem: Amazon.com, Inc. Device [1d0f:0000]
 Physical Slot: 4
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 11
 NUMA node: 0
 Region 0: Memory at febf0000 (32-bit, non-prefetchable) [size=16K]
 Capabilities: [70] Express (v2) Endpoint, MSI 00
  DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
   ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
   MaxPayload 128 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
  LnkCap: Port #0, Speed unknown, Width x0, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
   ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
    EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
 Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
  Vector table: BAR=0 offset=00002000
  PBA: BAR=0 offset=00003000
 Kernel driver in use: nvme

00:05.0 Ethernet controller [0200]: Amazon.com, Inc. Elastic Network Adapter (ENA) [1d0f:ec20]
 Physical Slot: 5
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Region 0: Memory at febf4000 (32-bit, non-prefetchable) [size=16K]
 Capabilities: [70] Express (v2) Endpoint, MSI 00
  DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
   ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
   MaxPayload 128 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
  LnkCap: Port #0, Speed unknown, Width x0, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
   ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
  LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
    EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
 Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
  Vector table: BAR=0 offset=00002000
  PBA: BAR=0 offset=00003000
 Kernel driver in use: ena
 Kernel modules: ena

Question information

Language: English
Status: Expired
For: Ubuntu
Assignee: No assignee

Bernard Stafford (bernard010) said : #1

https://www.nginx.com/blog/troubleshooting-application-performance-and-slow-tcp-connections-with-nginx-amplify/
This describes a test setup for measuring TCP connection time using Wireshark and NGINX Amplify.
https://docs.nginx.com/nginx-app-protect/troubleshooting/
This troubleshooting guide has a section on nginx debug logs.
The logging configuration file is located at /etc/app_protect/bd/logger.cfg
https://docs.nginx.com/nginx-ingress-controller/troubleshooting/
This covers troubleshooting the nginx ingress controller, including Events of Virtual Server and Virtual Server Route resources.

Seb Mel (sebmelc) said : #2

Hi Bernard,

Thanks for your reply, but I am not quite sure how these resources are supposed to help me, as the issue happens on a vanilla nginx instance that neither has App Protect nor is an nginx ingress controller. Also, NGINX Amplify is more or less a paid product.

As I can directly attribute the issues to the kernel update, my guess is rather that the problem lies somewhere in that update?

Regards,
Sebastian

Bernard Stafford (bernard010) said : #3

https://open.vanillaforums.com/categories/questions
I would ask the same question at vanilla forums.
https://open.vanillaforums.com/discussion/9915/vanilla-on-nginx
This is different and has rewrite errors.

Seb Mel (sebmelc) said : #4

I think there has been a mix-up? The links do not seem to relate to this question.

gfdsf gfdsaf (vacer34) said : #5

Yes, I am also facing a similar issue; you can see my project here: https://besttennisreviews.com/

Launchpad Janitor (janitor) said : #6

This question was expired because it remained in the 'Open' state without activity for the last 15 days.