Bug #12954

osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone))

Added by David Zafman over 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From Chris Taylor on the ceph-users mailing list

------
Fellow Ceph Users,

I have 3 OSD nodes and 3 MONs on separate servers. Our storage was near full on some OSDs, so we added additional drives, almost doubling our space. Since then we have been getting OSDs that keep respawning. We added additional RAM to the OSD nodes, from 12G to 24G. It started with one OSD respawning, so we reweighted it to 0 and then removed it from the CRUSH map. Then another OSD started doing the same thing. We repeated the process again, and we are now on the 5th drive. Some of these drives are new, so we're thinking there may be something more going on than just bad/corrupt drives. Can someone please help?

Here are some snippets from the log files. If there is anything else you want to see, let me know.

Thanks,
Chris

ceph-osd-03:syslog

Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744712] init: ceph-osd (ceph/23) main process (458) killed by ABRT signal
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744736] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871768] init: ceph-osd (ceph/23) main process (938) killed by ABRT signal
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871791] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527056] init: ceph-osd (ceph/23) main process (1463) killed by ABRT signal
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527079] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159178] init: ceph-osd (ceph/23) main process (1963) killed by ABRT signal
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159228] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843268] init: ceph-osd (ceph/23) main process (2478) killed by ABRT signal
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843282] init: ceph-osd (ceph/23) main process ended, respawning

ceph-osd-03:ceph-osd.23.log

   -11> 2015-08-27 11:37:53.054359 7fef96fa4700  1 -- 10.21.0.23:6802/10219 <== osd.10 10.21.0.22:6800/3623 21 ==== osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 ==== 1226+0+15250 (4145454591 0 2457813372) 0x10229600 con 0xf885440
   -10> 2015-08-27 11:37:53.054407 7fef96fa4700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054223, event: header_read, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -9> 2015-08-27 11:37:53.054420 7fef96fa4700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054226, event: throttled, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -8> 2015-08-27 11:37:53.054427 7fef96fa4700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054348, event: all_read, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -7> 2015-08-27 11:37:53.054434 7fef96fa4700  5 -- op tracker -- seq: 390, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -6> 2015-08-27 11:37:53.054571 7fef9ecdb700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054570, event: reached_pg, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -5> 2015-08-27 11:37:53.054606 7fef9ecdb700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054606, event: started, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -4> 2015-08-27 11:37:53.054737 7fef9ecdb700  5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054737, event: done, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -3> 2015-08-27 11:37:53.054832 7fef9dcd9700  2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep] scrub_compare_maps   osd.23 has 25 items
    -2> 2015-08-27 11:37:53.054877 7fef9dcd9700  2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep] scrub_compare_maps replica 10 has 25 items
    -1> 2015-08-27 11:37:53.055207 7fef9dcd9700  2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
     0> 2015-08-27 11:37:53.060242 7fef9dcd9700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fef9dcd9700 time 2015-08-27 11:37:53.055278
osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone))

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
 2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79da36]
 3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa2a) [0x88491a]
 4: (PG::scrub_compare_maps()+0xd89) [0x7f3c39]
 5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f6d9e]
 6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f870e]
 7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cca59]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb3d1e]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4dc0]
 10: (()+0x8182) [0x7fefc0fb9182]
 11: (clone()+0x6d) [0x7fefbf52447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

root@ceph-admin:~# ceph -s
    cluster d960d672-e035-413d-ba39-8341f4131760
     health HEALTH_WARN
            3 pgs backfill
            2 pgs backfilling
            5 pgs stuck unclean
            2 requests are blocked > 32 sec
            recovery 363/10993480 objects degraded (0.003%)
            recovery 181602/10993480 objects misplaced (1.652%)
            pool libvirt-pool pg_num 512 > pgp_num 412
     monmap e1: 3 mons at {ceph-mon1=10.20.0.11:6789/0,ceph-mon2=10.20.0.12:6789/0,ceph-mon3=10.20.0.13:6789/0}
            election epoch 4780, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
     osdmap e141113: 46 osds: 42 up, 42 in; 5 remapped pgs
      pgmap v17289866: 1600 pgs, 4 pools, 20158 GB data, 5316 kobjects
            42081 GB used, 32394 GB / 74476 GB avail
            363/10993480 objects degraded (0.003%)
            181602/10993480 objects misplaced (1.652%)
                1595 active+clean
                   3 active+remapped+wait_backfill
                   2 active+remapped+backfilling
recovery io 83489 kB/s, 20 objects/s
  client io 13079 kB/s rd, 24569 kB/s wr, 113 op/s
root@ceph-admin:~# 
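
The failed assertion is raised in SnapSet::get_clone_bytes(): the SnapSet stored with the object's head lists a clone for which the clone_size map has no entry. A minimal sketch of the failure mode, assuming a simplified SnapSet with only the members named in the assert (an illustration, not the actual osd_types.cc code):

#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;   // simplified stand-in for Ceph's snapid_t

// Minimal SnapSet sketch: only the members needed to show the failure mode.
struct SnapSet {
  std::vector<snapid_t> clones;             // clones the head object claims to have
  std::map<snapid_t, uint64_t> clone_size;  // bytes recorded per clone

  uint64_t get_clone_bytes(snapid_t clone) const {
    // osd/osd_types.cc:4074 -- aborts the OSD when the entry is missing
    assert(clone_size.count(clone));
    return clone_size.find(clone)->second;
  }
};

int main() {
  SnapSet ss;
  ss.clones = {4, 7};
  ss.clone_size[4] = 4194304;   // clone 7 has no size entry: the SnapSet is corrupt

  uint64_t total = 0;
  for (snapid_t c : ss.clones)
    total += ss.get_clone_bytes(c);   // asserts on clone 7, as in the scrub crash
  return total == 0;
}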

Related issues

Copied to Ceph - Backport #15842: hammer: osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone)) Resolved

History

#1 Updated by David Zafman over 8 years ago

  • Assignee set to David Zafman

Scrub shouldn't crash an OSD because of a corrupt SnapSet.

I've already rewritten ReplicatedPG::_scrub() for bug #12738. Adding additional checking fits right in on top of that change.

Another question remains: What caused the SnapSet for this object to be wrong?
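
A hedged sketch of the kind of additional checking described here: detect the missing clone_size entry during scrub and count it as a scrub error instead of letting the assert abort the OSD. The function check_snapset() and its parameters are illustrative stand-ins, not the actual ReplicatedPG::_scrub() change:

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using snapid_t = uint64_t;   // simplified stand-in for Ceph's snapid_t

// Hypothetical scrub-side guard: before asking the SnapSet for per-clone
// byte counts, verify that every listed clone has a clone_size entry and
// report missing ones as scrub errors rather than asserting.
static int check_snapset(const std::vector<snapid_t>& clones,
                         const std::map<snapid_t, uint64_t>& clone_size)
{
  int errors = 0;
  for (snapid_t clone : clones) {
    if (!clone_size.count(clone)) {
      std::cerr << "scrub error: snapset has no clone_size entry for clone "
                << clone << std::endl;
      ++errors;              // mark the object inconsistent instead of aborting
    }
  }
  return errors;
}

int main()
{
  std::map<snapid_t, uint64_t> clone_size = { {4, 4194304} };
  std::vector<snapid_t> clones = {4, 7};   // clone 7 lost its size entry
  return check_snapset(clones, clone_size) ? 1 : 0;
}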

#2 Updated by Paul Emmerich over 8 years ago

We are seeing the same bug after adding more OSDs:

    -4> 2015-09-24 10:30:13.723276 7fdbadac3700 2 osd.109 pg_epoch: 88719 pg[34.b54( v 88202'1241904 (65518'1238899,88202'1241904] local-les=88718 n=1654 ec=27285 les/c 88718/88719 88717/88717/88717) [109,48,44] r=0 lpr=88717 crt=87661'1241902 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps osd.109 has 25 items
    -3> 2015-09-24 10:30:13.723313 7fdbadac3700 2 osd.109 pg_epoch: 88719 pg[34.b54( v 88202'1241904 (65518'1238899,88202'1241904] local-les=88718 n=1654 ec=27285 les/c 88718/88719 88717/88717/88717) [109,48,44] r=0 lpr=88717 crt=87661'1241902 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps replica 44 has 25 items
    -2> 2015-09-24 10:30:13.723319 7fdbadac3700 2 osd.109 pg_epoch: 88719 pg[34.b54( v 88202'1241904 (65518'1238899,88202'1241904] local-les=88718 n=1654 ec=27285 les/c 88718/88719 88717/88717/88717) [109,48,44] r=0 lpr=88717 crt=87661'1241902 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps replica 48 has 25 items
    -1> 2015-09-24 10:30:13.723489 7fdbadac3700 2 osd.109 pg_epoch: 88719 pg[34.b54( v 88202'1241904 (65518'1238899,88202'1241904] local-les=88718 n=1654 ec=27285 les/c 88718/88719 88717/88717/88717) [109,48,44] r=0 lpr=88717 crt=87661'1241902 lcod 0'0 mlcod 0'0 active+clean+scrubbing] 
     0> 2015-09-24 10:30:13.725361 7fdbadac3700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fdbadac3700 time 2015-09-24 10:30:13.723517
osd/osd_types.cc: 4076: FAILED assert(clone_size.count(clone))

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc376b]
 2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79dc06]
 3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa2a) [0x8852da]
 4: (PG::scrub_compare_maps()+0xd89) [0x7f45a9]
 5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f770e]
 6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f907e]
 7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cc9d9]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb425e]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbb5300]
 10: (()+0x8182) [0x7fdbd8071182]
 11: (clone()+0x6d) [0x7fdbd65dc47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#3 Updated by Ruifeng Yang over 8 years ago

We hit the same bug:

2015-09-29 15:25:51.655118 7fc178fff700 0 log_channel(cluster) log [INF] : 7.18c scrub starts
2015-09-29 15:25:51.696026 7fc178fff700 -1 log_channel(cluster) log [ERR] : be_compare_scrubmaps: 7.18c shard 21: soid c1c09d8c/rbd_data.20be92ae8944a.000000000000c89e/head//7 size 4194304 != known size 4063232
2015-09-29 15:25:51.700754 7fc178fff700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fc178fff700 time 2015-09-29 15:25:51.696113
osd/osd_types.cc: 4076: FAILED assert(clone_size.count(clone))

ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc36ab]
2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79dcf6]
3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa2a) [0x88577a]
4: (PG::scrub_compare_maps()+0xd89) [0x7f46c9]
5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f782e]
6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f919e]
7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6ccab9]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xbb4326]
9: (ThreadPool::WorkThread::entry()+0x10) [0xbb53d0]
10: (()+0x8182) [0x7fc19baae182]
11: (clone()+0x6d) [0x7fc19a01947d]

#4 Updated by David Zafman over 8 years ago

  • Status changed from New to Fix Under Review

Crashes will be fixed by https://github.com/ceph/ceph/pull/5783

In particular commit https://github.com/dzafman/ceph/commit/3ec5b62d5f68c11514b56c8ac2f4165c07bf521e

I will resolve on the assumption that the snapset corruption occurred because of a bug in an older release.

#5 Updated by David Zafman over 8 years ago

  • Status changed from Fix Under Review to Resolved

eb0ca424815e94c78a2d09dbf787d102172f4ddf

#6 Updated by Nathan Cutler almost 8 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to hammer

#7 Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #15842: hammer: osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone)) added

#8 Updated by Nathan Cutler almost 8 years ago

@David Zafman: Is the fix for this issue included in https://github.com/ceph/ceph/pull/7702 which has been merged into hammer and will be part of the 0.94.7 point release?

#9 Updated by David Zafman almost 8 years ago

Nathan Cutler wrote:

@David Zafman: Is the fix for this issue included in https://github.com/ceph/ceph/pull/7702 which has been merged into hammer and will be part of the 0.94.7 point release?

Yes, I see that eb0ca424815e94c78a2d09dbf787d102172f4ddf was cherry-picked as part of pull request #7702, which has been merged to the hammer branch.

#11 Updated by Nathan Cutler almost 8 years ago

  • Status changed from Pending Backport to Resolved
