Bug #25006: bad csum during upgrade test
Status: Closed
% Done: 0%
Description
Run: http://pulpito.ceph.com/yuriw-2018-07-18_21:43:34-upgrade:luminous-x-mimic-distro-basic-smithi/
Job: 2796264
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2018-07-18_21:43:34-upgrade:luminous-x-mimic-distro-basic-smithi/2796264/teuthology.log
2018-07-19T12:26:20.247 INFO:teuthology.orchestra.run.smithi130:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2018-07-19T12:26:20.302 INFO:tasks.ceph.osd.5.smithi130.stderr:2018-07-19 12:26:20.290112 7fcb47faa700 -1 received signal: Hangup from PID: 50243 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5 UID: 0
2018-07-19T12:26:20.378 INFO:tasks.ceph.osd.5.smithi130.stderr:2018-07-19 12:26:20.375775 7fcb47faa700 -1 received signal: Hangup from PID: 50243 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5 UID: 0
2018-07-19T12:26:20.391 INFO:tasks.ceph.osd.1.smithi104.stderr:2018-07-19 12:26:20.384 7f56ea308700 -1 bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x10000 checksum at blob offset 0x0, got 0x1c104513, expected 0x315eeaf7, device location [0xed70000~10000], logical extent 0x1b0000~10000, object #14:1d1092d3:::benchmark_data_smithi105_10641_object5831:head#
2018-07-19T12:26:20.391 INFO:tasks.ceph.osd.1.smithi104.stderr:2018-07-19 12:26:20.384 7f56ea308700 -1 log_channel(cluster) log [ERR] : 14.8 missing primary copy of 14:1d1092d3:::benchmark_data_smithi105_10641_object5831:head, unfound
2018-07-19T12:26:20.393 INFO:teuthology.orchestra.run.smithi130:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_blocked_ops'
2018-07-19T12:26:20.503 INFO:tasks.ceph.osd.7.smithi130.stderr:2018-07-19 12:26:20.500133 7f4ba88a1700 -1 received signal: Hangup from PID: 15430 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 7 UID: 0
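For context, the _verify_csum error means BlueStore recomputed a crc32c (Castagnoli) checksum over the 0x10000-byte chunk it read back from disk and got a value different from the one stored in the blob metadata. Below is a minimal, hedged sketch of that kind of comparison in Python; the pure-Python crc32c and the raw-device read are illustrative only (BlueStore's exact seeding and chunk mapping may differ), the device path is hypothetical, and only the offset, length, and expected value are taken from the log line above.

# Illustrative sketch only: recompute a crc32c over the 64 KiB extent named in
# the log message and compare it with the "expected" value BlueStore reported.
# The device path below is hypothetical; the offset/length/expected values are
# taken from the log line. BlueStore's real seeding and chunking may differ.

def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = ~crc & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return ~crc & 0xFFFFFFFF

def check_extent(device: str, offset: int, length: int, expected: int) -> bool:
    with open(device, "rb") as dev:
        dev.seek(offset)
        data = dev.read(length)
    got = crc32c(data)
    print(f"got 0x{got:08x}, expected 0x{expected:08x}")
    return got == expected

# Values from the log: device location [0xed70000~10000], expected 0x315eeaf7.
# check_extent("/dev/sdb1", 0xed70000, 0x10000, 0x315eeaf7)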
Updated by Nathan Cutler over 5 years ago
The "missing primary copy" message happens on a mixed luminous/mimic cluster (MONs, MGR, and half of OSDs running mimic, other half of OSDs running luminous) in the third or fourth iteration of radosbench within "stress-tasks", right after a bluestore "bad crc32c/0x10000 checksum" message.
Here is the immediately preceding cluster health status:
{
    "fsid": "6fb081a2-4dae-443e-ad49-08c8b4670dc8",
    "health": {
        "checks": {
            "OSDMAP_FLAGS": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "nodeep-scrub flag(s) set"
                }
            },
            "OBJECT_MISPLACED": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "166/9278 objects misplaced (1.789%)"
                }
            },
            "PG_DEGRADED": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "Degraded data redundancy: 211/9278 objects degraded (2.274%), 1 pg degraded"
                }
            },
            "SMALLER_PGP_NUM": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "1 pools have pg_num > pgp_num"
                }
            }
        },
        "status": "HEALTH_WARN",
        "overall_status": "HEALTH_WARN"
    },
    "election_epoch": 12,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_names": [
        "a",
        "b",
        "c"
    ],
    "monmap": {
        "epoch": 2,
        "fsid": "6fb081a2-4dae-443e-ad49-08c8b4670dc8",
        "modified": "2018-07-19 11:55:04.062383",
        "created": "2018-07-19 11:53:29.762084",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "a",
                "addr": "172.21.15.104:6789/0",
                "public_addr": "172.21.15.104:6789/0"
            },
            {
                "rank": 1,
                "name": "b",
                "addr": "172.21.15.104:6790/0",
                "public_addr": "172.21.15.104:6790/0"
            },
            {
                "rank": 2,
                "name": "c",
                "addr": "172.21.15.104:6791/0",
                "public_addr": "172.21.15.104:6791/0"
            }
        ]
    },
    "osdmap": {
        "osdmap": {
            "epoch": 949,
            "num_osds": 8,
            "num_up_osds": 8,
            "num_in_osds": 8,
            "full": false,
            "nearfull": false,
            "num_remapped_pgs": 0
        }
    },
    "pgmap": {
        "pgs_by_state": [
            {
                "state_name": "active+clean",
                "count": 102
            },
            {
                "state_name": "active+recovering+degraded",
                "count": 1
            },
            {
                "state_name": "active+recovering",
                "count": 1
            }
        ],
        "num_pgs": 104,
        "num_pools": 2,
        "num_objects": 4639,
        "data_bytes": 19453181976,
        "bytes_used": 58726039552,
        "bytes_avail": 714368073728,
        "bytes_total": 773094113280,
        "degraded_objects": 211,
        "degraded_total": 9278,
        "degraded_ratio": 0.022742,
        "misplaced_objects": 166,
        "misplaced_total": 9278,
        "misplaced_ratio": 0.017892,
        "recovering_objects_per_sec": 62,
        "recovering_bytes_per_sec": 260815863,
        "recovering_keys_per_sec": 0,
        "num_objects_recovered": 351,
        "num_bytes_recovered": 1472200704,
        "num_keys_recovered": 0,
        "read_bytes_sec": 181,
        "write_bytes_sec": 52757806,
        "read_op_per_sec": 0,
        "write_op_per_sec": 416
    },
    ...
And here are the incriminating log messages:
2018-07-19T12:26:20.073 INFO:tasks.ceph.osd.0.smithi104.stderr:2018-07-19 12:26:20.060 7fd055462700 -1 received signal: Hangup from /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 0 (PID: 40895) UID: 0
2018-07-19T12:26:20.157 INFO:teuthology.orchestra.run.smithi130:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_historic_ops'
2018-07-19T12:26:20.202 INFO:tasks.ceph.osd.4.smithi130.stderr:2018-07-19 12:26:20.200241 7f110ceee700 -1 received signal: Hangup from PID: 12916 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4 UID: 0
2018-07-19T12:26:20.247 INFO:teuthology.orchestra.run.smithi130:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2018-07-19T12:26:20.302 INFO:tasks.ceph.osd.5.smithi130.stderr:2018-07-19 12:26:20.290112 7fcb47faa700 -1 received signal: Hangup from PID: 50243 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5 UID: 0
2018-07-19T12:26:20.378 INFO:tasks.ceph.osd.5.smithi130.stderr:2018-07-19 12:26:20.375775 7fcb47faa700 -1 received signal: Hangup from PID: 50243 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 5 UID: 0
2018-07-19T12:26:20.391 INFO:tasks.ceph.osd.1.smithi104.stderr:2018-07-19 12:26:20.384 7f56ea308700 -1 bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x10000 checksum at blob offset 0x0, got 0x1c104513, expected 0x315eeaf7, device location [0xed70000~10000], logical extent 0x1b0000~10000, object #14:1d1092d3:::benchmark_data_smithi105_10641_object5831:head#
2018-07-19T12:26:20.391 INFO:tasks.ceph.osd.1.smithi104.stderr:2018-07-19 12:26:20.384 7f56ea308700 -1 log_channel(cluster) log [ERR] : 14.8 missing primary copy of 14:1d1092d3:::benchmark_data_smithi105_10641_object5831:head, unfound
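(The dump_ops_in_flight / dump_historic_ops / dump_blocked_ops lines interleaved above are just teuthology polling the OSD admin sockets while the radosbench ops stall. A rough sketch of the same poll done by hand is below, shelling out to the same ceph --admin-daemon command that appears in the log; the asok path pattern and the OSD ids 4-7 are assumptions for this particular run, and the command normally needs root/sudo on the node.)

# Hedged sketch: poll OSD admin sockets for in-flight ops, mirroring the
# teuthology commands visible in the log above. The asok path pattern and the
# OSD ids are assumptions for this run; run with sufficient privileges
# (teuthology wraps the same call in sudo).
import json
import subprocess

def dump_ops_in_flight(osd_id: int, cluster: str = "ceph") -> dict:
    asok = f"/var/run/ceph/{cluster}-osd.{osd_id}.asok"
    out = subprocess.check_output(
        ["ceph", "--cluster", cluster, "--admin-daemon", asok,
         "dump_ops_in_flight"],
        timeout=30,
    )
    return json.loads(out)

if __name__ == "__main__":
    for osd in (4, 5, 6, 7):  # the OSDs being restarted during this upgrade step
        ops = dump_ops_in_flight(osd)
        print(f"osd.{osd}: {ops.get('num_ops', 0)} ops in flight")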
Updated by Nathan Cutler over 5 years ago
- Related to Bug #22464: Bluestore: many checksum errors, always 0x6706be76 (which matches a zero block) added
Updated by Nathan Cutler over 5 years ago
- Related to Bug #24901: Client reads fail due to bad CRC under high memory pressure on OSDs added
Updated by Nathan Cutler over 5 years ago
Also, I noticed this in the test yaml:
- parallel:
  - stress-tasks
- install.upgrade:
    client.0: null
    osd.4: null
- ceph.restart:
    daemons:
    - osd.4
    - osd.5
    - osd.6
    - osd.7
    wait-for-healthy: false
    wait-for-osds-up: true
The only thing within "parallel" is "stress-tasks". Shouldn't the "install.upgrade" and "ceph.restart" stanzas be included under "parallel" as well?
The test in its current form seems to do: (1) upgrade all MONs, MGRs, and half of the OSDs to mimic, (2) stress this "split luminous/mimic" cluster, (3) upgrade the remaining OSDs to mimic.
It does not appear to thrash or otherwise stress the cluster during the upgrade process.
Updated by Neha Ojha over 5 years ago
- Project changed from RADOS to bluestore
- Assignee changed from Neha Ojha to Sage Weil
Updated by Sage Weil over 5 years ago
Nathan Cutler wrote:
Also, I noticed this in the test yaml:
[...]
The only thing within "parallel" is "stress-tasks". Shouldn't the "install.upgrade" and "ceph.restart" stanzas be included under "parallel" as well?

The test in its current form seems to do: (1) upgrade all MONs, MGRs, and half of the OSDs to mimic, (2) stress this "split luminous/mimic" cluster, (3) upgrade the remaining OSDs to mimic.
It does not appear to thrash or otherwise stress the cluster during the upgrade process.
That's by design. The intent is to stress a mixed-version cluster with thrashing. The parallel tests do the upgrade and stress workload in parallel.
Updated by Sage Weil over 5 years ago
- Subject changed from "[ERR] : 14.8 missing primary copy of .." in upgrade:luminous-x-mimic to bad csum during upgrade test
- Status changed from New to Need More Info
- Assignee deleted (Sage Weil)
- Priority changed from Urgent to High
Looking at the log, I don't see any useful clues as to what might have gone wrong. No intervening writes, etc.
Updated by Sage Weil over 5 years ago
- Status changed from Need More Info to Can't reproduce