Project

General

Profile

Bug #46043

osd/ECBackend.cc: 1551: FAILED assert(!(*m).is_missing(hoid))

Added by Mykola Golub over 3 years ago. Updated over 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observed this crush on on one of the customer servers:

2020-06-04 03:41:22.606085 7f18b1407700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/osd/ECBackend.cc: In function 'void ECBackend::get_all_avail_shards(const hobject_t&, 
const std::set<pg_shard_t>&, std::set<int>&, std::map<shard_id_t, pg_shard_t>&, bool)' thread 7f18b1407700 time 2020-06-04 03:41:22.487645
/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/osd/ECBackend.cc: 1551: FAILED assert(!(*m).is_missing(hoid))

 ceph version 12.2.12-573-g67074fa839 (67074fa839ea075078c95b0f6d1679c37f1c61f5) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x55b04fb8bb2e]
 2: (ECBackend::get_all_avail_shards(hobject_t const&, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> > const&, std::set<int, std::less<int>, std::allocator<int> >&, std::map<shard_id_t, pg_shard_t, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, pg_shard_t> > >&, bool)+0x827) [0x55b04f8fbae7]
 3: (ECBackend::get_min_avail_to_read_shards(hobject_t const&, std::set<int, std::less<int>, std::allocator<int> > const&, bool, bool, std::set<pg_shard_t, std::less<pg_shard_t>, std::allocator<pg_shard_t> >*)+0xf9) [0x55b04f8fbc79]
 4: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x24b) [0x55b04f90851b]
 5: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x621) [0x55b04f90b311]
 6: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x103e) [0x55b04f7c2eae]
 7: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x468) [0x55b04f7caee8]
 8: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x754) [0x55b04f61feb4]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xed4) [0x55b04f62f6f4]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x829) [0x55b04fb912d9]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b04fb93220]
 12: (()+0x8724) [0x7f18d276f724]
 13: (clone()+0x6d) [0x7f18d17f5e8d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Not sure if this is related but the crash happened just after pg repair finished and reporte 1 object missing. The last entries in the log for the crashed thread:

  -522> 2020-06-04 03:41:21.967954 7f18b1407700 -1 log_channel(cluster) log [ERR] : 5.5b5s0 repair 1 missing, 0 inconsistent objects
  -521> 2020-06-04 03:41:22.007437 7f18b1407700 -1 log_channel(cluster) log [ERR] : 5.5b5 repair 7 errors, 6 fixed
  -520> 2020-06-04 03:41:22.007497 7f18b1407700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, tri
mmed: , trimmed_dups: , clear_divergent_priors: 0
  -519> 2020-06-04 03:41:22.007831 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.13:6806/50740 -- MOSDScrubReserve(5.5b5s4 RELEASE e595283) v1 -- 0x55b070402400 con 0
  -518> 2020-06-04 03:41:22.007849 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.16:6818/2888562 -- MOSDScrubReserve(5.5b5s1 RELEASE e595283) v1 -- 0x55b093a30a00 con 0
  -517> 2020-06-04 03:41:22.007860 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.19:6828/27993 -- MOSDScrubReserve(5.5b5s3 RELEASE e595283) v1 -- 0x55b0bdd2f200 con 0
  -516> 2020-06-04 03:41:22.007871 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.11:6822/53584 -- MOSDScrubReserve(5.5b5s2 RELEASE e595283) v1 -- 0x55b06fb62000 con 0
  -515> 2020-06-04 03:41:22.007880 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.12:6836/80022 -- MOSDScrubReserve(5.5b5s5 RELEASE e595283) v1 -- 0x55b07c2ce000 con 0
  -514> 2020-06-04 03:41:22.007893 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.13:6806/50740 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595
283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->4)=(empty) epoch 595283) v5 -- 0x55b060e52760 con 0
  -513> 2020-06-04 03:41:22.007908 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.16:6818/2888562 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,5
95283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->1)=(empty) epoch 595283) v5 -- 0x55b0990e4f00 con 0
  -512> 2020-06-04 03:41:22.007918 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.19:6828/27993 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595
283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->3)=(empty) epoch 595283) v5 -- 0x55b08613ba20 con 0
  -511> 2020-06-04 03:41:22.007931 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.11:6822/53584 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595
283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->2)=(empty) epoch 595283) v5 -- 0x55b07f4dbc00 con 0
  -510> 2020-06-04 03:41:22.007943 7f18b1407700  1 -- 192.168.40.18:6820/25658 --> 192.168.40.12:6836/80022 -- pg_info((query:595283 sent:595283 5.5b5s0( v 595283'3546457 (595283'3544858,595
283'3546457] local-lis/les=594236/594237 n=191357 ec=138329/138329 lis/c 594236/594236 les/c/f 594237/594237/0 593659/594236/591705) 0->5)=(empty) epoch 595283) v5 -- 0x55b07c5d0000 con 0

I am attaching the full log to the ticket.

A crush with the similar backtrace has already been reported [1] but I decided to open a new ticket because that issue is already resolved and the fix backported to 12.2.12 [2]. Note, the crush we observed is not for the upstream but for the SUSE Enterprise Storage build, and I rechecked it includes the backported fix [3].

[1] https://tracker.ceph.com/issues/37919
[2] https://tracker.ceph.com/issues/38105
[3] https://github.com/SUSE/ceph/tree/v12.2.12-573-g67074fa839/src/osd/PG.cc#L685

ceph-osd.99.log.gz (342 KB) Mykola Golub, 06/17/2020 07:16 AM

History

#1 Updated by Neha Ojha over 3 years ago

  • Status changed from New to Need More Info

It is really hard to say what caused this assert without enough debug logging and I doubt we will able to reproduce this ourselves. If you see this happen again, can you upload osd logs with debug_osd=20.

Also available in: Atom PDF