Bug #46043
osd/ECBackend.cc: 1551: FAILED assert(!(*m).is_missing(hoid))
Description
We observed this crash on one of the customer servers:
2020-06-04 03:41:22.606085 7f18b1407700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/osd/ECBackend.cc: In function 'void ECBackend::get_all_avail_shards(const hobject_t&, const std::set<pg_shard_t>&, std::set<int>&, std::map<shard_id_t, pg_shard_t>&, bool)' thread 7f18b1407700 time 2020-06-04 03:41:22.487645
/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/osd/ECBackend.cc: 1551: FAILED assert(!(*m).is_missing(hoid))

 ceph version 12.2.12-573-g67074fa839 (67074fa839ea075078c95b0f6d1679c37f1c61f5) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x55b04fb8bb2e]
 2: (ECBackend::get_all_avail_shards(hobject_t const&, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > const&, std::set<int, std::less<int>, std::allocator<int> >&, std::map<shard_id_t, pg_shard_t, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, pg_shard_t> > >&, bool)+0x827) [0x55b04f8fbae7]
 3: (ECBackend::get_min_avail_to_read_shards(hobject_t const&, std::set<int, std::less<int>, std::allocator<int> > const&, bool, bool, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> >*)+0xf9) [0x55b04f8fbc79]
 4: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x24b) [0x55b04f90851b]
 5: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x621) [0x55b04f90b311]
 6: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x103e) [0x55b04f7c2eae]
 7: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x468) [0x55b04f7caee8]
 8: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x754) [0x55b04f61feb4]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xed4) [0x55b04f62f6f4]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x829) [0x55b04fb912d9]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b04fb93220]
 12: (()+0x8724) [0x7f18d276f724]
 13: (clone()+0x6d) [0x7f18d17f5e8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Not sure if this is related, but the crash happened just after a pg repair finished and reported 1 object missing. The last entries in the log for the crashed thread:
-522> 2020-06-04 03:41:21.967954 7f18b1407700 -1 log_channel(cluster) log [ERR] : 5.5b5s0 repair 1 missing, 0 inconsistent objects
-521> 2020-06-04 03:41:22.007437 7f18b1407700 -1 log_channel(cluster) log [ERR] : 5.5b5 repair 7 errors, 6 fixed
-520> 2020-06-04 03:41:22.007497 7f18b1407700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-519> 2020-06-04 03:41:22.007831 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.13:6806/50740 -- MOSDScrubReserve(5.5b5s4 RELEASE e595283) v1 -- 0x55b070402400 con 0
-518> 2020-06-04 03:41:22.007849 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.16:6818/2888562 -- MOSDScrubReserve(5.5b5s1 RELEASE e595283) v1 -- 0x55b093a30a00 con 0
-517> 2020-06-04 03:41:22.007860 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.19:6828/27993 -- MOSDScrubReserve(5.5b5s3 RELEASE e595283) v1 -- 0x55b0bdd2f200 con 0
-516> 2020-06-04 03:41:22.007871 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.11:6822/53584 -- MOSDScrubReserve(5.5b5s2 RELEASE e595283) v1 -- 0x55b06fb62000 con 0
-515> 2020-06-04 03:41:22.007880 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.12:6836/80022 -- MOSDScrubReserve(5.5b5s5 RELEASE e595283) v1 -- 0x55b07c2ce000 con 0
-514> 2020-06-04 03:41:22.007893 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.13:6806/50740 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->4)=(empty) epoch 595283) v5 -- 0x55b060e52760 con 0
-513> 2020-06-04 03:41:22.007908 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.16:6818/2888562 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->1)=(empty) epoch 595283) v5 -- 0x55b0990e4f00 con 0
-512> 2020-06-04 03:41:22.007918 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.19:6828/27993 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->3)=(empty) epoch 595283) v5 -- 0x55b08613ba20 con 0
-511> 2020-06-04 03:41:22.007931 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.11:6822/53584 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->2)=(empty) epoch 595283) v5 -- 0x55b07f4dbc00 con 0
-510> 2020-06-04 03:41:22.007943 7f18b1407700 1 -- 192.168.40.18:6820/25658 --> 192.168.40.12:6836/80022 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->5)=(empty) epoch 595283) v5 -- 0x55b07c5d0000 con 0
I am attaching the full log to the ticket.
A crash with a similar backtrace has already been reported [1], but I decided to open a new ticket because that issue is already resolved and the fix was backported to 12.2.12 [2]. Note that the crash we observed is not in the upstream build but in the SUSE Enterprise Storage build, and I rechecked that it includes the backported fix [3].
[1] https://tracker.ceph.com/issues/37919
[2] https://tracker.ceph.com/issues/38105
[3] https://github.com/SUSE/ceph/tree/v12.2.12-573-g67074fa839/src/osd/PG.cc#L685
Updated by Neha Ojha almost 4 years ago
- Status changed from New to Need More Info
It is really hard to say what caused this assert without enough debug logging, and I doubt we will be able to reproduce this ourselves. If you see this happen again, can you upload osd logs with debug_osd=20?
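For reference, a minimal ceph.conf fragment that persists the requested logging level across OSD restarts (it can also be raised at runtime with `ceph tell osd.* injectargs '--debug_osd 20'`; expect substantially larger log files while it is in effect):

```ini
# ceph.conf fragment: verbose OSD debug logging for the next occurrence
[osd]
    debug osd = 20
```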