Comment 5 for bug 1996678

Michael Sinz (michaelsinz) wrote : Re: 1.5.9-0ubuntu1~20.04.5 sporadic timeouts

Just a note - this seems to be related to the kernel - specifically, we have seen this in Ubuntu 18.04 too with the update to the 5.4.0-1095 kernel from the 5.4.0-1094 kernel.

Same problem: at some point, sometimes minutes and sometimes many hours after starting, the Kubernetes node metrics fail, and that is actually due to containerd having problems getting metrics.

The metrics calls actually do continue to work, but they go from taking a few milliseconds to taking many seconds to complete, which is beyond the timeout for collecting metrics.
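For anyone trying to catch the moment the latency jumps, a rough sketch of a timing helper follows. It is not what we ran; the function name `time_command` and the idea of wrapping an arbitrary command are assumptions made for illustration. It just measures how long a command takes and reports whether it finished inside a given timeout.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (elapsed_seconds, finished_in_time).

    Hypothetical helper: cmd is whatever metrics query you want to watch.
    Output is discarded; only the wall-clock duration matters here.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs past this
            check=False,        # a non-zero exit still counts as "finished"
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished
```

Calling this in a loop against a stats query (for example `crictl stats`, assuming crictl is installed and pointed at containerd on the node) and logging the elapsed time should show the change from milliseconds to seconds when the problem starts.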

We tried changing containerd versions but it did not matter - just the kernel change showed the impact and rolling that back to the 5.4.0-1094 kernel fixed everything.
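To find which nodes are on the affected kernel, something along these lines could be run on each node. It is only a sketch: it assumes the running kernel release string looks like `5.4.0-1095-azure`, and the names `azure_abi`, `is_reported_bad` and `REPORTED_BAD` are made up for this example. Only 1095 is listed because that is the only build we have seen the problem on.

```python
import platform
import re

# Only the build reported in this comment; other builds are unknown.
REPORTED_BAD = {1095}


def azure_abi(release):
    """Return the ABI number from a release like '5.4.0-1095-azure', else None.

    Assumes the linux-azure 5.4 release string format; anything else
    (other flavours, other base versions) returns None.
    """
    m = re.match(r"^5\.4\.0-(\d+)-azure$", release)
    return int(m.group(1)) if m else None


def is_reported_bad(release):
    """True if the release matches a build reported as showing the slowdown."""
    return azure_abi(release) in REPORTED_BAD


def running_kernel_is_reported_bad():
    """Check the kernel this script is running on."""
    return is_reported_bad(platform.release())
```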

Note that the 5.4.0-1095 update has a massive change list: https://launchpad.net/ubuntu/+source/linux-azure-5.4/5.4.0-1095.101~18.04.1

It may be that the cgroup change is involved (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988584) but I don't yet know if that is true. It may be something with IPC that is actually the problem. I have not had a chance to debug the kernel yet (been busy dealing with our services and rolling back the kernel update).