ceph -- Unable to mount ceph volume on s390x

Bug #1875863 reported by bugproxy
This bug affects 1 person
Affects                  Status        Importance  Assigned to            Milestone
Ubuntu on IBM z Systems  Fix Released  High        Skipper Bug Screeners
linux (Ubuntu)           Fix Released  High        Canonical Kernel Team
  Focal                  Fix Released  Undecided   Canonical Kernel Team
  Groovy                 Fix Released  High        Canonical Kernel Team

Bug Description

SRU Justification:
==================

[Impact]

* Unable to mount ceph volumes on big endian systems, like s390x.

* The mount operation always fails with an IO error.

* This is caused by an endianness issue in the function handle_session, where the variable features is always little endian.

* But test_bit assumes host byte order, which causes a problem on big endian systems.

[Fix]

* 0fa8263367db9287aa0632f96c1a5f93cc478150 0fa8263367db "ceph: fix endianness bug when handling MDS session feature bits"

[Test Case]

* Setup ceph on s390x.

* Try to mount a ceph volume.

* If it mounts correctly the patch is applied and working.

* Without the patch a mount always fails on big endian / s390x.

[Regression Potential]

* There is regression potential with having code changes in ceph's session handler, which is common code.

* However, the patch was accepted (slightly changed) by the ceph maintainers and with that was accepted upstream, too.

* The patch is fairly limited (5 lines removed, 3 added), hence the changes are quite traceable.

__________

When mounting a ceph volume, mount operation fails with an IO error.
The problem is always reproducible.

Identified potential root cause as kernel endian bug:

In the function handle_session(), the variable @features always
contains the little-endian order of bytes, because the feature
mask sent by the MDS is little-endian (bits are packed bytewise
from left to right in encode_supported_features()).

However, test_bit(), called to check features availability, assumes
the host order of bytes in that variable. This leads to problems on
big endian architectures. Specifically, it is impossible to mount
a ceph volume on s390.
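
The mismatch described above can be reproduced in userspace. The following is a minimal sketch, not kernel code: all names (demo_pack_feature, demo_load_as_le_host, demo_load_as_be_host, demo_host_test_bit) are hypothetical helpers made up for illustration, and the 8-byte mask width is an assumption. It packs a feature bit bytewise, then shows how the same bytes look when copied verbatim into a 64-bit word on a little-endian host versus a big-endian host such as s390x.

```c
#include <stdint.h>

/* Pack feature bit nr bytewise: bit n lives in byte n / 8 (illustrative). */
static void demo_pack_feature(unsigned char *wire, unsigned nr)
{
    wire[nr / 8] |= (unsigned char)(1u << (nr % 8));
}

/* Value a little-endian host sees after copying 8 wire bytes verbatim. */
static uint64_t demo_load_as_le_host(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Value a big-endian host sees after copying the same 8 bytes verbatim. */
static uint64_t demo_load_as_be_host(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Host-order bit test on a single 64-bit word (stand-in for test_bit()). */
static int demo_host_test_bit(unsigned nr, uint64_t word)
{
    return (int)((word >> nr) & 1);
}
```

With bit 3 packed into an otherwise empty 8-byte mask, demo_host_test_bit(3, demo_load_as_le_host(wire)) finds the bit, while demo_host_test_bit(3, demo_load_as_be_host(wire)) does not, because the byte carrying the bit ends up in the most significant position of the word.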

A fixup was proposed to convert the little-endian byte order to the host one. The ceph maintainers modified that fixup to use the existing unpacking helpers for the conversion. The resulting patch is attached.
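
The idea behind such a conversion can be sketched in userspace as follows. This is not the attached patch and not the kernel's API: demo_decode_le64 and demo_feature_supported are hypothetical names, and treating the mask as at most 64 bits is an assumption made for the example. The point is only that decoding the wire bytes explicitly as little-endian yields the same host-order value on every architecture, so a host-order bit test then gives the right answer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode up to 8 little-endian wire bytes into a host-order value.
 * Works identically on little- and big-endian hosts because it never
 * reinterprets memory; missing high bytes are treated as zero.
 */
static uint64_t demo_decode_le64(const unsigned char *p, size_t len)
{
    uint64_t v = 0;
    size_t n = len < 8 ? len : 8;

    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Check feature bit nr after converting the mask to host order. */
static int demo_feature_supported(const unsigned char *wire, size_t len,
                                  unsigned nr)
{
    uint64_t features = demo_decode_le64(wire, len);

    if (nr >= 64)
        return 0; /* outside the range this sketch handles */
    return (int)((features >> nr) & 1);
}
```

For a mask whose first byte is 0x08, demo_feature_supported(wire, 8, 3) reports the feature as present regardless of the byte order of the machine running the check.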

Related discussion in the ceph-development mailing list:
https://marc.info/?l=ceph-devel&m=158815357301332&w=2

Revision history for this message
bugproxy (bugproxy) wrote : Kernel messages related to failed mount operation on the client machine

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-185690 severity-high targetmilestone-inin2004
Revision history for this message
bugproxy (bugproxy) wrote : Proposed fixup

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → ceph (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
assignee: nobody → Ubuntu OpenStack (ubuntu-openstack)
tags: added: openstack-ibm
Revision history for this message
Frank Heimes (fheimes) wrote :

Hi, was the attached patch already brought to upstream's attention?
Please provide a github ticket (PR/issues) - so far I/we couldn't find one.
Please notice that we want to avoid handling out-of-tree patches - thx.

Changed in ubuntu-z-systems:
status: New → Incomplete
James Page (james-page)
Changed in ceph (Ubuntu):
status: New → Invalid
Revision history for this message
Frank Heimes (fheimes) wrote :

Oh, just noticed that this is a kernel issue and fix.
In this case we need of course an upstream accepted kernel patch.
I didn't find anything like "ceph: fix up endian bug in managing feature bits" in linux-next yet.
Is it available at some staging tree?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-04-29 17:30 EDT-------
(In reply to comment #12)
> Oh, just noticed that this is a kernel issue and fix.
> In this case we need of course an upstream accepted kernel patch.
> I didn't find anything like "ceph: fix up endian bug in managing feature
> bits" in linux-next yet.
> Is it available at some staging tree?

Hi, It was queued for Linux-5.7-rcX (don't look for this in staging trees):
https://marc.info/?l=ceph-devel&m=158817653611155&w=2

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: Ubuntu OpenStack (ubuntu-openstack) → Skipper Bug Screeners (skipper-screen-team)
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
no longer affects: ceph (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-05-08 13:56 EDT-------
It was accepted today. Will be in Linux-5.7-rc5

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb24fdd8e6f5c6bb95129748a1801c6476492aba

Revision history for this message
Frank Heimes (fheimes) wrote :

Thanks Eduard,
found it in linux-next, could cleanly cherry-pick and compile it.
And it's coming just in time for the SRU cycle - perfect!

Revision history for this message
Frank Heimes (fheimes) wrote :

Kernel SRU request submitted:
https://lists.ubuntu.com/archives/kernel-team/2020-May/thread.html#109710
Updating status to 'In Progress'.

Changed in linux (Ubuntu):
status: New → In Progress
Changed in ubuntu-z-systems:
status: Incomplete → In Progress
Frank Heimes (fheimes)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in linux (Ubuntu Groovy):
status: Fix Committed → In Progress
Changed in linux (Ubuntu Focal):
status: New → In Progress
Frank Heimes (fheimes)
description: updated
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in linux (Ubuntu Focal):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-05-20 04:25 EDT-------
Hi,

IIUC, if this patch requires cephadm/ceph-container to do the testing as I understand from Eduard, then it's not ready today unfortunately, it's still WIP. I wonder if we can still bootstrap the cluster manually as we have been doing in the past months.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-05-20 06:53 EDT-------
@Canonical, Comment #10 is not precisely correct. IBM will verify the fix soon.

bugproxy (bugproxy)
tags: added: verification-failed-focal
removed: verification-needed-focal
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-05-25 11:40 EDT-------
Hi,

As of today the kernel 5.4.0.33.37-38 in focal-proposed doesn't seem to have this patch pulled in.

It seems it was only pulled in master-next branch this morning :
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/fs/ceph?h=master-next&id=f86b72f1b8b399e456e84151f3bfe0ad7cabe29d

Talked briefly with Frank on IRC, Frank also thinks that the patch didn't make it. He is helping on this. Thanks Frank.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-05-25 11:54 EDT-------
So 5.4.0-34.38 (only 34.37 kernel images and meta package linux-generic=33.38 were available before) was just tagged a few minutes ago from master-next to include this patch.
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?h=Ubuntu-5.4.0-34.38&id=dc6c325dcb0c741ce4e70d0091dca3f610cbcb3f

Quote from Frank on IRC:
<jfh> so the kernel that incl. the ceph patch and is supposed to be in proposed right now, needed to be re-spun due to a regression
<jfh> the next kernel that will come -34 will have the ceph patch included
<jfh> indeed a special case

Will test again as soon as the new kernel gets built. Thanks.

Revision history for this message
Frank Heimes (fheimes) wrote :

@kernel-team and @ubuntu-archive just for clarification: the 'verification-failed-focal' tag was only set because the kernel in proposed that should have included the patch mentioned here needed to be revoked due to a regression, and the kernel used for the verification didn't have it in.
So the verification was not possible, rather than failed.

Revision history for this message
Frank Heimes (fheimes) wrote :

@Tuan, would you be able to verify this LP ticket again?
The patch should be included in the kernel that is currently in proposed (the re-spin was done).
Thx

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-06-08 04:21 EDT-------
It seems to work now with 5.4.0-34.38 being the latest in focal-proposed.
Thanks Frank.

Revision history for this message
Frank Heimes (fheimes) wrote :

Hi Tuan, thx for re-verifying on short notice!

tags: added: verification-done-focal
removed: verification-failed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-37.41

---------------
linux (5.4.0-37.41) focal; urgency=medium

  * CVE-2020-0543
    - SAUCE: x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
    - SAUCE: x86/cpu: Add a steppings field to struct x86_cpu_id
    - SAUCE: x86/cpu: Add 'table' argument to cpu_matches()
    - SAUCE: x86/speculation: Add Special Register Buffer Data Sampling (SRBDS)
      mitigation
    - SAUCE: x86/speculation: Add SRBDS vulnerability and mitigation documentation
    - SAUCE: x86/speculation: Add Ivy Bridge to affected list

 -- Marcelo Henrique Cerri <email address hidden> Wed, 03 Jun 2020 11:24:23 -0300

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (linux-oracle-5.4/5.4.0-1019.19~18.04.1)

All autopkgtests for the newly accepted linux-oracle-5.4 (5.4.0-1019.19~18.04.1) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

zfs-linux/unknown (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#linux-oracle-5.4

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-42.46

---------------
linux (5.4.0-42.46) focal; urgency=medium

  * focal/linux: 5.4.0-42.46 -proposed tracker (LP: #1887069)

  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux (5.4.0-41.45) focal; urgency=medium

  * focal/linux: 5.4.0-41.45 -proposed tracker (LP: #1885855)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * CVE-2019-19642
    - kernel/relay.c: handle alloc_percpu returning NULL in relay_open

  * CVE-2019-16089
    - SAUCE: nbd_genl_status: null check for nla_nest_start

  * CVE-2020-11935
    - aufs: do not call i_readcount_inc()

  * ip_defrag.sh in net from ubuntu_kernel_selftests failed with 5.0 / 5.3 / 5.4
    kernel (LP: #1826848)
    - selftests: net: ip_defrag: ignore EPERM

  * Update lockdown patches (LP: #1884159)
    - SAUCE: acpi: disallow loading configfs acpi tables when locked down

  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc

  * Introduce the new NVIDIA 418-server and 440-server series, and update the
    current NVIDIA drivers (LP: #1881137)
    - [packaging] add signed modules for the 418-server and the 440-server
      flavours

 -- Khalid Elmously <email address hidden> Thu, 09 Jul 2020 19:50:26 -0400

Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-07-28 03:40 EDT-------
IBM bugzilla status-> closed, Fix Released with 20.04
