Bug #21218
closedthrash-eio + bluestore (hangs with unfound objects or read_log_and_missing assert)
0%
Description
2017-09-01T21:24:33.959 INFO:teuthology.orchestra.run.smithi037.stdout: "health": { 2017-09-01T21:24:33.975 INFO:teuthology.orchestra.run.smithi037.stdout: "checks": { 2017-09-01T21:24:34.070 INFO:teuthology.orchestra.run.smithi037.stdout: "OSDMAP_FLAGS": { 2017-09-01T21:24:34.081 INFO:teuthology.orchestra.run.smithi037.stdout: "severity": "HEALTH_WARN", 2017-09-01T21:24:34.139 INFO:teuthology.orchestra.run.smithi037.stdout: "summary": { 2017-09-01T21:24:34.194 INFO:teuthology.orchestra.run.smithi037.stdout: "message": "noscrub,nodeep-scrub flag(s) set" 2017-09-01T21:24:34.252 INFO:teuthology.orchestra.run.smithi037.stdout: } 2017-09-01T21:24:34.265 INFO:teuthology.orchestra.run.smithi037.stdout: }, 2017-09-01T21:24:34.370 INFO:teuthology.orchestra.run.smithi037.stdout: "OBJECT_UNFOUND": { 2017-09-01T21:24:34.420 INFO:teuthology.orchestra.run.smithi037.stdout: "severity": "HEALTH_WARN", 2017-09-01T21:24:34.461 INFO:teuthology.orchestra.run.smithi037.stdout: "summary": { 2017-09-01T21:24:34.483 INFO:teuthology.orchestra.run.smithi037.stdout: "message": "55/32900 objects unfound (0.167%)" 2017-09-01T21:24:34.504 INFO:teuthology.orchestra.run.smithi037.stdout: } 2017-09-01T21:24:34.633 INFO:teuthology.orchestra.run.smithi037.stdout: }, 2017-09-01T21:24:35.093 INFO:teuthology.orchestra.run.smithi037.stdout: "PG_DEGRADED": { 2017-09-01T21:24:35.102 INFO:teuthology.orchestra.run.smithi037.stdout: "severity": "HEALTH_WARN", 2017-09-01T21:24:35.168 INFO:teuthology.orchestra.run.smithi037.stdout: "summary": { 2017-09-01T21:24:35.365 INFO:teuthology.orchestra.run.smithi037.stdout: "message": "Degraded data redundancy: 110/65800 objects degraded (0.167%), 1 pg unclean, 1 pg degraded" 2017-09-01T21:24:35.384 INFO:teuthology.orchestra.run.smithi037.stdout: } 2017-09-01T21:24:35.485 INFO:teuthology.orchestra.run.smithi037.stdout: }, 2017-09-01T21:24:35.520 INFO:teuthology.orchestra.run.smithi037.stdout: "REQUEST_STUCK": { 2017-09-01T21:24:35.679 INFO:teuthology.orchestra.run.smithi037.stdout: "severity": "HEALTH_ERR", 2017-09-01T21:24:35.688 INFO:teuthology.orchestra.run.smithi037.stdout: "summary": { 2017-09-01T21:24:36.108 INFO:teuthology.orchestra.run.smithi037.stdout: "message": "27 stuck requests are blocked > 4096 sec" 2017-09-01T21:24:36.125 INFO:teuthology.orchestra.run.smithi037.stdout: } 2017-09-01T21:24:36.242 INFO:teuthology.orchestra.run.smithi037.stdout: } 2017-09-01T21:24:36.311 INFO:teuthology.orchestra.run.smithi037.stdout: }, 2017-09-01T21:24:36.359 INFO:teuthology.orchestra.run.smithi037.stdout: "status": "HEALTH_ERR", 2017-09-01T21:24:36.403 INFO:teuthology.orchestra.run.smithi037.stdout: "overall_status": "HEALTH_ERR" 2017-09-01T21:24:36.468 INFO:teuthology.orchestra.run.smithi037.stdout: },
but
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 0.52734 root default -3 0.26367 host smithi037 0 ssd 0.08789 osd.0 up 1.00000 0 1 ssd 0.08789 osd.1 up 1.00000 0.29178 2 ssd 0.08789 osd.2 up 1.00000 1.00000 -5 0.26367 host smithi134 3 ssd 0.08789 osd.3 up 1.00000 1.00000 4 ssd 0.08789 osd.4 up 1.00000 0.67581 5 ssd 0.08789 osd.5 up 1.00000 0.80031
/a/sage-2017-09-01_13:34:04-rados-wip-sage-testing2-2017-08-31-1648-distro-basic-smithi/1586491
Updated by Sage Weil over 6 years ago
another run, eio injection led to a read_log_and_missing assert:
2017-09-14 10:04:50.702877 7efbe3817700 0 bluestore(/var/lib/ceph/osd/ceph-1) read: inject random EIO 2017-09-14 10:04:50.702879 7efbe3817700 10 bluestore(/var/lib/ceph/osd/ceph-1) read 2.d_head #2:be213826:::benchmark_data_smithi164_25541_object9702:head# 0x0~100000 = -5 2017-09-14 10:04:50.702888 7efbe3817700 10 osd.1 pg_epoch: 68 pg[2.d( v 43'2305 (29'800,43'2305] local-lis/les=20/21 n=2305 ec=20/20 lis/c 20/20 les/c/f 21/67/0 20/20/20) [1,3] r=0 lpr=20 crt=43'2305 mlcod 43'2304 active+clean] rep_repair_primary_object 2:be213826:::benchmark_data_smithi164_25541_object9702:head peers osd.{1,3} 2017-09-14 10:04:50.702910 7efbe3817700 15 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.d_head #2:be213826:::benchmark_data_smithi164_25541_object9702:head# _ 2017-09-14 10:04:50.702917 7efbe3817700 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.d_head #2:be213826:::benchmark_data_smithi164_25541_object9702:head# _ = 0 2017-09-14 10:04:50.702933 7efbe3817700 10 osd.1 pg_epoch: 68 pg[2.d( v 43'2305 (29'800,43'2305] local-lis/les=20/21 n=2305 ec=20/20 lis/c 20/20 les/c/f 21/67/0 20/20/20) [1,3] r=0 lpr=20 crt=43'2305 mlcod 43'2304 active+clean m=1] 2.d unexpectedly missing 2:be213826:::benchmark_data_smithi164_25541_object9702:head v28'634, there should be a copy on shard 3 2017-09-14 10:04:50.702956 7efbe3817700 -1 log_channel(cluster) log [ERR] : 2.d missing primary copy of 2:be213826:::benchmark_data_smithi164_25541_object9702:head, will try copies on 3 ... 2017-09-14 10:07:17.438658 7feed0fd1d00 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.d_head #2:be213826:::benchmark_data_smithi164_25541_object9702:head# _ = 0 2017-09-14 10:07:17.441520 7feed0fd1d00 -1 /build/ceph-13.0.0-754-gc515880/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]' thread 7feed0fd1d00 time 2017-09-14 10:07:17.438670 /build/ceph-13.0.0-754-gc515880/src/osd/PGLog.h: 1456: FAILED assert(oi.version == i.second.have) ceph version 13.0.0-754-gc515880 (c515880f545000869652e851a0bf27a1640f30dc) mimic (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7feed065cebe] 2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, bool)+0x1833) [0x7feed0214e13] 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x377) [0x7feed01c5d67] 4: (OSD::load_pgs()+0x984) [0x7feed01270e4] 5: (OSD::init()+0x2127) [0x7feed013e8c7]
/a/sage-2017-09-13_13:31:57-rados-wip-sage-testing-2017-09-12-1750-distro-basic-smithi/1628024
description: rados/singleton/{all/thrash-eio.yaml msgr-failures/many.yaml msgr/simple.yaml objectstore/bluestore.yaml rados.yaml}
Updated by Sage Weil over 6 years ago
- Category set to Correctness/Safety
- Priority changed from High to Urgent
Updated by Sage Weil over 6 years ago
- Subject changed from thrash-eio + bluestore hangs with unfound objects to thrash-eio + bluestore (hangs with unfound objects or read_log_and_missing assert)
Updated by Daniel Oliveira over 6 years ago
Pls, would you have the reproducer for this, so I could give it a try and check it out in my environment?
Thanks,
Updated by Sage Weil over 6 years ago
- Project changed from RADOS to bluestore
- Category deleted (
Correctness/Safety)
Updated by Sage Weil over 6 years ago
- Priority changed from Urgent to High
demoting this since i haven't seen it recently. will update the bug next time it happens in qa.
Updated by Sage Weil over 6 years ago
- Priority changed from High to Urgent
/a/sage-2017-12-18_22:56:18-rados-wip-sage-testing-2017-12-18-1406-distro-basic-smithi/1976871
description: rados/singleton/{all/thrash-eio.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml}
Updated by Kefu Chai about 6 years ago
thrash-eio + bluestore
/a/kchai-2018-02-11_04:16:47-rados-wip-kefu-testing-2018-02-11-0959-distro-basic-mira/2181825
description: rados/singleton/{all/thrash-eio.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml}
2018-02-11 11:20:38.459 7f16260c8dc0 -1 /build/ceph-13.0.1-1854-gca7168c/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, O bjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, s td::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore ::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7f16260c8dc0 time 2018-02-11 11:20:38.464812 /build/ceph-13.0.1-1854-gca7168c/src/osd/PGLog.h: 1450: FAILED assert(oi.version == i.second.have) ceph version 13.0.1-1854-gca7168c (ca7168c71561bb0bd321e3d0495b1ee33416f91a) mimic (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x7f161d40f995] 2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, bool)+0x2b63) [0x55eb8c130243] 3: (PG::read_state(ObjectStore*)+0x1d5) [0x55eb8c0fdbe5] 4: (OSD::load_pgs()+0x786) [0x55eb8c03c606] 5: (OSD::init()+0x1def) [0x55eb8c03ee1f] 6: (main()+0x38bc) [0x55eb8bf2e87c] 7: (__libc_start_main()+0xf0) [0x7f161b627830] 8: (_start()+0x29) [0x55eb8bffb2f9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil about 6 years ago
For kefu's run above,
2018-02-11 11:17:18.271 7f6a6378a700 0 bluestore(/var/lib/ceph/osd/ceph-1) read: inject random EIO 2018-02-11 11:17:18.271 7f6a6378a700 10 bluestore(/var/lib/ceph/osd/ceph-1) read 2.3_head #2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head# 0x0~100000 = -5 2018-02-11 11:17:18.271 7f6a6378a700 10 osd.1 pg_epoch: 42 pg[2.3( v 42'566 (0'0,42'566] local-lis/les=15/16 n=566 ec=15/15 lis/c 15/15 les/c/f 16/42/0 15/15/15) [1,2] r=0 lpr=15 crt=42'566 mlcod 42'565 active+clean] rep_repair_primary_object 2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head peers osd.{1,2} 2018-02-11 11:17:18.271 7f6a6378a700 15 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.3_head #2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head# _ 2018-02-11 11:17:18.271 7f6a6378a700 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 2.3_head #2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head# _ = 0 2018-02-11 11:17:18.271 7f6a6378a700 10 osd.1 pg_epoch: 42 pg[2.3( v 42'566 (0'0,42'566] local-lis/les=15/16 n=566 ec=15/15 lis/c 15/15 les/c/f 16/42/0 15/15/15) [1,2] r=0 lpr=15 crt=42'566 mlcod 42'565 active+clean m=1] 2.3 unexpectedly missing 2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head v35'395, there should be a copy on shard 2 2018-02-11 11:17:18.271 7f6a6378a700 -1 log_channel(cluster) log [ERR] : 2.3 missing primary copy of 2:cb2d15b0:::benchmark_data_mira053_26158_object6085:head, will try copies on 2
which means we run this code in rep_repair_primary_object()
missing_loc.add_missing(soid, v, eversion_t());
then osd gets killed.
on restart, we fail assert that the missing set version (eversion_t(), as per above statement) matches oi (the actual object we have that we got an EIO from).
Updated by David Zafman about 6 years ago
This assert can only happen in the following two cases:
osd debug verify missing on start = true. Used in teuthology testing
internal value force_rebuild_missing set to true when upgrade from Jewel
I propose that we change the assert or simply bypass the store->getattr() if i.second.have eversion_t().
assert(i.second.have eversion_t() || oi.version == i.second.have)
Either way this assumes that recovery of the object can proceed correctly from this state.
Updated by David Zafman about 6 years ago
- Status changed from 12 to Fix Under Review
Updated by David Zafman about 6 years ago
- Status changed from Fix Under Review to 7
Updated by David Zafman about 6 years ago
- Project changed from bluestore to RADOS
Updated by Sage Weil about 6 years ago
- Status changed from 7 to Pending Backport
- Backport set to luminous
Updated by Nathan Cutler about 6 years ago
- Copied to Backport #23024: luminous: thrash-eio + bluestore (hangs with unfound objects or read_log_and_missing assert) added
Updated by Nathan Cutler about 6 years ago
- Status changed from Pending Backport to Resolved