Project

General

Profile

Bug #36606

osd: checksum failure during upgrade test

Added by Patrick Donnelly about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:

Description

2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) fsck error: #-1:7b3f43c4:::osd_superblock:0# error during read: (5) Input/output error

From: /ceph/teuthology-archive/pdonnell-2018-10-27_02:23:12-fs-wip-pdonnell-testing-20181026.235100-distro-basic-smithi/3189896/teuthology.log

Recent run that did not fail: /ceph/teuthology-archive/pdonnell-2018-10-17_19:40:05-fs-wip-pdonnell-testing-20181017.175152-distro-basic-smithi/3152440/teuthology.log

Branch: https://github.com/ceph/ceph-ci/tree/wip-pdonnell-testing-20181026.235100

History

#1 Updated by Patrick Donnelly about 2 months ago

  • Project changed from fs to RADOS
  • Component(RADOS) OSD added

#2 Updated by Patrick Donnelly about 2 months ago

Note that this could be caused by a recent merge into luminous.

#3 Updated by Patrick Donnelly about 2 months ago

Does not affect filestore. Only upgrade tests (fs:upgrade) with bluestore (replicated or EC).

#4 Updated by Igor Fedotov about 2 months ago

I think mkfs doesn't run properly for bluestore since --no-mon-config param isn't recognized for unknown reason):

2018-10-27T10:17:32.370 INFO:teuthology.orchestra.run.smithi184:Running: 'sudo MALLOC_CHECK_=3 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-osd --no-mon-config --cluster ceph --mkfs --mkkey -i 1 --monmap /home/ubuntu/cephtest/ceph.monmap'
2018-10-27T10:17:32.483 INFO:teuthology.orchestra.run.smithi184.stdout:usage: ceph-osd -i <ID> [flags]
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:  --osd-data PATH data directory
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:  --osd-journal PATH
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:                    journal file or block device
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:  --mkfs            create a [new] data directory
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:  --mkkey           generate a new secret key. This is normally used in combination with --mkfs
2018-10-27T10:17:32.484 INFO:teuthology.orchestra.run.smithi184.stdout:  --convert-filestore
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:                    run any pending upgrade operations
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:  --flush-journal   flush all data out of journal
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:  --mkjournal       initialize a new journal
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:  --check-wants-journal
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:                    check whether a journal is desired
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:  --check-allows-journal
2018-10-27T10:17:32.485 INFO:teuthology.orchestra.run.smithi184.stdout:                    check whether a journal is allowed
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:  --check-needs-journal
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:                    check whether a journal is required
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:  --debug_osd <N>   set debug level (e.g. 10)
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:  --get-device-fsid PATH
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:                    get OSD fsid for the given block device
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:
2018-10-27T10:17:32.486 INFO:teuthology.orchestra.run.smithi184.stdout:  --conf/-c FILE    read configuration from the given configuration file
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --id/-i ID        set ID portion of my name
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --name/-n TYPE.ID set name
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --cluster NAME    set cluster name (default: ceph)
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --setuser USER    set uid to user or uid (and gid to user's gid)
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --setgroup GROUP  set gid to group or gid
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:  --version         show version and quit
2018-10-27T10:17:32.487 INFO:teuthology.orchestra.run.smithi184.stdout:
2018-10-27T10:17:32.488 INFO:teuthology.orchestra.run.smithi184.stdout:  -d                run in foreground, log to stderr
2018-10-27T10:17:32.488 INFO:teuthology.orchestra.run.smithi184.stdout:  -f                run in foreground, log to usual location
2018-10-27T10:17:32.488 INFO:teuthology.orchestra.run.smithi184.stdout:
2018-10-27T10:17:32.488 INFO:teuthology.orchestra.run.smithi184.stdout:  --debug_ms N      set message debug level (e.g. 1)
2018-10-27T10:17:32.488 INFO:teuthology.orchestra.run.smithi184.stderr:2018-10-27 10:17:32.448617 7fe5e5076e00 -1 unrecognized arg --no-mon-config

#5 Updated by Patrick Donnelly about 2 months ago

Igor Fedotov wrote:

I think mkfs doesn't run properly for bluestore since --no-mon-config param isn't recognized for unknown reason):

[...]

From looking at `git log -p`, it seems that option is deliberately not in luminous. So perhaps some recent QA change is incompatible with luminous and must be fixed.

#6 Updated by Neha Ojha about 2 months ago

  • Status changed from New to Verified

/a/nojha-2018-10-29_19:19:04-fs:upgrade-master-distro-basic-smithi/3201377/

#7 Updated by Sage Weil about 1 month ago

The --no-mon-config splats or normal.. qa/tasks/ceph.py tries first with --no-mon-config and, if it fails, does the mkfs command again without it. the log shows the second time succeeded.

#8 Updated by Neha Ojha about 1 month ago

Yes, the mkfs suceeds. That part of the logs is also present in the successful run of this test.

2018-10-18T14:24:22.754 INFO:teuthology.orchestra.run.smithi176:Running: 'sudo MALLOC_CHECK_=3 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-osd --no-mon-config --cluster ceph --mkfs --mkkey -i 0 --monmap /home/ubuntu/cephtest/ceph.monmap'
2018-10-18T14:24:22.892 INFO:teuthology.orchestra.run.smithi176.stdout:usage: ceph-osd -i <ID> [flags]
2018-10-18T14:24:22.892 INFO:teuthology.orchestra.run.smithi176.stdout:  --osd-data PATH data directory
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:  --osd-journal PATH
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:                    journal file or block device
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:  --mkfs            create a [new] data directory
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:  --mkkey           generate a new secret key. This is normally used in combination with --mkfs
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:  --convert-filestore
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:                    run any pending upgrade operations
2018-10-18T14:24:22.893 INFO:teuthology.orchestra.run.smithi176.stdout:  --flush-journal   flush all data out of journal
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:  --mkjournal       initialize a new journal
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:  --check-wants-journal
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:                    check whether a journal is desired
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:  --check-allows-journal
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:                    check whether a journal is allowed
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:  --check-needs-journal
2018-10-18T14:24:22.894 INFO:teuthology.orchestra.run.smithi176.stdout:                    check whether a journal is required
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:  --debug_osd <N>   set debug level (e.g. 10)
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:  --get-device-fsid PATH
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:                    get OSD fsid for the given block device
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:  --conf/-c FILE    read configuration from the given configuration file
2018-10-18T14:24:22.895 INFO:teuthology.orchestra.run.smithi176.stdout:  --id/-i ID        set ID portion of my name
2018-10-18T14:24:22.896 INFO:teuthology.orchestra.run.smithi176.stdout:  --name/-n TYPE.ID set name
2018-10-18T14:24:22.896 INFO:teuthology.orchestra.run.smithi176.stdout:  --cluster NAME    set cluster name (default: ceph)
2018-10-18T14:24:22.896 INFO:teuthology.orchestra.run.smithi176.stdout:  --setuser USER    set uid to user or uid (and gid to user's gid)
2018-10-18T14:24:22.896 INFO:teuthology.orchestra.run.smithi176.stdout:  --setgroup GROUP  set gid to group or gid
2018-10-18T14:24:22.896 INFO:teuthology.orchestra.run.smithi176.stdout:  --version         show version and quit
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stdout:
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stdout:  -d                run in foreground, log to stderr
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stdout:  -f                run in foreground, log to usual location
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stdout:
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stdout:  --debug_ms N      set message debug level (e.g. 1)
2018-10-18T14:24:22.897 INFO:teuthology.orchestra.run.smithi176.stderr:2018-10-18 14:24:22.891515 7f1ad96d7e00 -1 unrecognized arg --no-mon-config
2018-10-18T14:24:22.898 DEBUG:teuthology.orchestra.run:got remote process result: 1

Successful test: /a/pdonnell-2018-10-17_19:40:05-fs-wip-pdonnell-testing-20181017.175152-distro-basic-smithi/3152440/

#9 Updated by Neha Ojha about 1 month ago

The following seem to be the relevant pieces for one osd leading to the failure:

2018-10-27T10:18:03.564 INFO:tasks.ceph.osd.0.smithi184.stdout:starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2018-10-27T10:18:04.413 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:18:04.395137 7f0c2e88ee00 -1 osd.0 0 log_to_monitors {default=true}
2018-10-27T10:18:05.818 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:18:05.794096 7f0c16a28700 -1 osd.0 0 waiting for initial osdmap
2018-10-27T10:21:17.330 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.326 7f608492e600 -1 Falling back to public interface
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xda1180d2, expected 0x3e37199, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-10-27T10:21:17.633 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:17.630 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) fsck error: #-1:7b3f43c4:::osd_superblock:0# error during read: (5) Input/output error
2018-10-27T10:21:18.157 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:18.150 7f608492e600 -1 bluestore(/var/lib/ceph/osd/ceph-0) _mount fsck found 1 errors
2018-10-27T10:21:18.157 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:18.150 7f608492e600 -1 osd.0 0 OSD:init: unable to mount object store
2018-10-27T10:21:18.157 INFO:tasks.ceph.osd.0.smithi184.stderr:2018-10-27 10:21:18.150 7f608492e600 -1  ** ERROR: osd init failed: (5) Input/output error
2018-10-27T10:21:18.312 INFO:tasks.ceph.osd.0.smithi184.stderr:daemon-helper: command failed with exit status 1

Reproducer arguments for teuthology: -s fs:upgrade --filter 'snaps/{clusters/3-mds.yaml conf/{client.yaml mds.yaml mon.yaml osd.yaml} objectstore-ec/bluestore-ec-root.yaml overrides/{frag_enable.yaml multimds/yes.yaml whitelist_health.yaml whitelist_rstat.yaml whitelist_wrongly_marked_down.yaml} tasks/{0-luminous.yaml 1-client.yaml 2-upgrade.yaml 3-sanity.yaml 4-client-upgrade/no.yaml 5-client-sanity.yaml 6-snap-upgrade.yaml 7-client-sanity.yaml'

#11 Updated by Igor Fedotov about 1 month ago

  • Status changed from Verified to In Progress

#12 Updated by Igor Fedotov about 1 month ago

  • Status changed from In Progress to Need Review

#14 Updated by Sage Weil about 1 month ago

  • Status changed from Need Review to Pending Backport
  • Backport set to mimic,luminous

Igor, we should backport this, right?

#15 Updated by Igor Fedotov about 1 month ago

Sage, no, it's specific to Nautilus for now. We need it when/if we backport BlueFS migrate stuff.

#16 Updated by Igor Fedotov about 1 month ago

  • Status changed from Pending Backport to Resolved

#17 Updated by Igor Fedotov about 1 month ago

  • Backport deleted (mimic,luminous)

Also available in: Atom PDF