
Bug #23030

osd: crash during recovery with assert(p != recovery_info.ss.clone_snap) and assert(recovery_info.oi.legacy_snaps.size())

Added by Paul Emmerich about 6 years ago. Updated almost 5 years ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've got some OSDs in a 5/3 EC pool crashing during recovery. The crash happens simultaneously on 5 to 10 OSDs, and some OSDs are involved in every crash.

Unfortunately I can't get a 20/20 log at the moment: the crash happens too rarely, and the log files are huge because there's a lot of backfilling going on in that cluster.

Here's one crash log filtered for the crashing thread:

 -1653> 2018-02-16 15:29:20.784840 7fdd15a51700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -1552> 2018-02-16 15:29:24.051448 7fdd15a51700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -1508> 2018-02-16 15:29:25.186091 7fdd15a51700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
(some similar write_log_and_missing lines removed)
  -871> 2018-02-16 15:30:53.332108 7fdd15a51700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
     0> 2018-02-16 15:31:20.992155 7fdd15a51700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fdd15a51700 time 2018-02-16 15:31:20.987932
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55acd22d7550]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x12b1) [0x55acd1eaabd1]
 3: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*)+0xf36) [0x55acd203d1c6]
 4: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x22e) [0x55acd204c3be]
 5: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55acd1f503b0]
 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5ae) [0x55acd1ebbf9e]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x55acd1d46899]
 8: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55acd1fbe897]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x55acd1d7443e]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55acd22dd069]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55acd22df000]
 12: (()+0x7e25) [0x7fdd30531e25]
 13: (clone()+0x6d) [0x7fdd2f62534d]

Here's another OSD failing with a different assert and different log messages at the exact same time:

 (... lots of messages like this ...)
  -162> 2018-02-16 16:01:28.956984 7fab65ca5700  5 osd.22 pg_epoch: 44934 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cb5d7:::rbd_data.6.6fcb1643c9869.0000000000003150:136
  -161> 2018-02-16 16:01:29.321960 7fab65ca5700  5 osd.22 pg_epoch: 44934 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cb5d7:::rbd_data.6.6fcb1643c9869.0000000000003150:17d
  -156> 2018-02-16 16:01:29.828697 7fab65ca5700  5 osd.22 pg_epoch: 44934 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cb5d7:::rbd_data.6.6fcb1643c9869.0000000000003150:head
  -154> 2018-02-16 16:01:30.065048 7fab65ca5700  5 osd.22 pg_epoch: 44934 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cb97e:::rbd_data.6.40cf02ae8944a.00000000000024e0:head
   -58> 2018-02-16 16:01:31.221451 7fab65ca5700  5 osd.22 pg_epoch: 44936 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cc33a:::rbd_data.6.78a042ae8944a.00000000000029c9:head
    -8> 2018-02-16 16:01:41.822049 7fab65ca5700  5 osd.22 pg_epoch: 44937 pg[7.fes0( v 43650'6880385 (43531'6870338,43650'6880385] local-lis/les=44883/44884 n=14777 ec=159/159 lis/c 44883/37761 les/c/f 44884/37918/0 44881/44883/44690) [22,12,5,30,2,13,58,59]/[22,12,5,30,2,13,2147483647,59] r=0 lpr=44883 pi=[37761,44883)/9 rops=1 bft=58(6) crt=43650'6880385 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling snaptrimq=[d86~1,db4~1,dcf~1,dd2~2,dd8~3,ddc~5,de2~1,de4~1,de6~1,de8~1,dea~4,def~1,df1~1,df9~1,dfc~3,e00~5,e06~1,e08~3,e0c~3,e10~4,e15~1,e17~3,e1b~2,e21~3,e25~c,e32~1,e34~1,e36~2,e39~3,e3d~5,e43~1,e45~1,e47~5,e4d~3,ec2~7,eca~7,ed2~8,edb~1,edd~6,ee4~7,eec~5,ef2~4,efa~3,eff~8,f09~3,f0d~2,f10~3,f16~1,f18~3,f23~3,f28~b,f34~26,f6a~1,f6c~2,f79~63]] backfill_pos is 7:7f5cc5c3:::rbd_data.6.4074d2ae8944a.0000000000008a41:head
     0> 2018-02-16 16:01:48.550013 7fab65ca5700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fab65ca5700 time 2018-02-16 16:01:48.545620
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 353: FAILED assert(recovery_info.oi.legacy_snaps.size())

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5648fa4a7550]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x12d5) [0x5648fa07abf5]
 3: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*)+0xf36) [0x5648fa20d1c6]
 4: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x22e) [0x5648fa21c3be]
 5: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x5648fa1203b0]
 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5ae) [0x5648fa08bf9e]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x5648f9f16899]
 8: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x5648fa18e897]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x5648f9f4443e]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x5648fa4ad069]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5648fa4af000]
 12: (()+0x7e25) [0x7fab80785e25]
 13: (clone()+0x6d) [0x7fab7f87934d]
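
For context, both failed asserts sit in the same snap-mapping block of PrimaryLogPG::on_local_recover(). Here is a simplified, paraphrased sketch of that block (Luminous-era logic, not the verbatim source); which branch is taken, and therefore which assert can fire, depends on whether the recovered clone carries a legacy (pre-luminous) snap list on its object_info or a clone_snaps map on its SnapSet:

 // Paraphrased sketch, not verbatim Ceph source.
 if (soid.is_snap()) {  // recovering a clone, not a head object
   std::set<snapid_t> snaps;
   if (recovery_info.ss.is_legacy()) {
     // Legacy path: the snaps are stored on the object_info itself,
     // so an empty list is unexpected.
     assert(recovery_info.oi.legacy_snaps.size());    // PrimaryLogPG.cc:353
     snaps.insert(recovery_info.oi.legacy_snaps.begin(),
                  recovery_info.oi.legacy_snaps.end());
   } else {
     // New path: the snaps come from the SnapSet's clone_snaps map,
     // which therefore must contain an entry for this clone.
     auto p = recovery_info.ss.clone_snaps.find(soid.snap);
     assert(p != recovery_info.ss.clone_snaps.end()); // PrimaryLogPG.cc:358
     snaps.insert(p->second.begin(), p->second.end());
   }
   // The computed snap set is then registered with the OSD's SnapMapper.
   snap_mapper.add_oid(recovery_info.soid, snaps, &_t);
 }

Either way, the assert means the clone being pushed arrived without usable snap metadata, so it looks like a symptom of corruption on the source rather than a bug in the recovery path itself.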

http://tracker.ceph.com/issues/21827 seems related, but we have no inconsistent PGs here.

Two crash logs are attached but without 20/20 debug :(

crash-osd-22.txt.gz (255 KB) Paul Emmerich, 02/17/2018 12:35 AM

crash-osd-50.txt.gz (76.5 KB) Paul Emmerich, 02/17/2018 12:35 AM

History

#1 Updated by Patrick Donnelly about 6 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from OSD crashes during recovery with assert(p != recovery_info.ss.clone_snap) and assert(recovery_info.oi.legacy_snaps.size()) to osd: crash during recovery with assert(p != recovery_info.ss.clone_snap) and assert(recovery_info.oi.legacy_snaps.size())
  • Category deleted (OSD)

#2 Updated by Greg Farnum about 6 years ago

  • Assignee set to Sage Weil

This snapshot assert looks like "Ceph Luminous - pg is down due to src/osd/SnapMapper.cc: 246: FAILED assert(r == -2)"

See the archived discussion at https://marc.info/?t=151606589500002&r=1&w=2 :/

Sage, any more thoughts?
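
For reference, the SnapMapper assert quoted above comes from the duplicate-mapping check at the top of SnapMapper::add_oid(); -2 is -ENOENT. A paraphrased sketch (Luminous-era, simplified, not verbatim):

 // Paraphrased sketch, not verbatim Ceph source.
 void SnapMapper::add_oid(const hobject_t &oid,
                          const std::set<snapid_t> &snaps,
                          MapCacher::Transaction<std::string, bufferlist> *t)
 {
   object_snaps out;
   int r = get_snaps(oid, &out);
   // The object must not already have a snap mapping; a stale entry
   // left behind earlier trips this as FAILED assert(r == -2).
   assert(r == -ENOENT);  // SnapMapper.cc:246
   // ... store the new object->snaps and snap->object mappings via *t ...
 }

So a stale on-disk mapping would make the PG hit the same object's assert every time it retries, which matches the behavior described in that thread.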

#3 Updated by Neha Ojha almost 5 years ago

  • Status changed from New to Fix Under Review
