Bug #24801
PG num_bytes becomes huge
Description
dzafman-2018-07-05_12:45:56-rados-wip-19753-distro-basic-smithi/2739140
description: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml 2-recovery-overrides/{more-active-recovery.yaml} backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/off.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml supported-random-distro$/{rhel_latest.yaml} thrashers/morepggrow.yaml thrashosds-health.yaml workloads/snaps-few-objects.yaml}
While testing the new backfill out-of-space handling, I noticed a job that was taking a long time to finish. Analysis on one of the test machines showed that backfill_toofull was set. That is as it should be, given that the pg's num_bytes had the value 9223372036854561577. With the new code, this made it appear that the contents would have no chance of backfilling to a target OSD.
PG_STAT            2.5b
OBJECTS            1
MISSING_ON_PRIMARY 0
DEGRADED           1
MISPLACED          2
UNFOUND            0
BYTES              9223372036854561577
LOG                380
DISK_LOG           380
STATE              active+undersized+degraded+remapped+backfill_toofull
STATE_STAMP        2018-07-06 17:07:15.464466
VERSION            515'822
REPORTED           4772:10438
UP                 [7,1,4]
UP_PRIMARY         7
ACTING             [6,3]
ACTING_PRIMARY     6
LAST_SCRUB         234'585
SCRUB_STAMP        2018-07-06 09:42:34.599920
LAST_DEEP_SCRUB    0'0
DEEP_SCRUB_STAMP   2018-07-06 09:38:35.758926
SNAPTRIMQ_LEN      17
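One way to read that suspicious BYTES value: it is what a small negative int64 looks like with its sign bit stripped. This suggests num_bytes underflowed below zero somewhere (an inference from the number alone, not from the code):

```python
# The observed BYTES value from the pg dump above.
observed = 9223372036854561577
INT64_MAX = 2**63 - 1

# Hypothesis (not confirmed against the code): num_bytes went slightly
# negative, and the sign bit was lost somewhere along the way. Masking a
# negative int64 with INT64_MAX (i.e. clearing the sign bit) reproduces
# the observed value exactly.
negative_num_bytes = -214231
assert (negative_num_bytes & INT64_MAX) == observed

# The value sits just 214230 bytes below INT64_MAX.
assert INT64_MAX - observed == 214230
```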
The new code adjusted the kb_used after backfill completed to verify fullness before taking the reservation:
2018-07-06 09:47:27.279 7fb7a6666700 20 osd.0 pg_epoch: 548 pg[2.5b( v 515'822 (179'442,515'822] lb MIN (bitwise) local-lis/les=518/519 n=0 ec=545/17 lis/c 547/505 les/c/f 548/506/547 517/547/547) [5,6,0]/[2,6,3] r=-1 lpr=547 pi=[505,547)/2 luod=0'0 crt=515'822 lcod 0'0 active+remapped mbc={}] state<Started/ReplicaActive/RepNotRecovering>: react Before kb_used 1106096
2018-07-06 09:47:27.279 7fb7a6666700 20 osd.0 pg_epoch: 548 pg[2.5b( v 515'822 (179'442,515'822] lb MIN (bitwise) local-lis/les=518/519 n=0 ec=545/17 lis/c 547/505 les/c/f 548/506/547 517/547/547) [5,6,0]/[2,6,3] r=-1 lpr=547 pi=[505,547)/2 luod=0'0 crt=515'822 lcod 0'0 active+remapped mbc={}] state<Started/ReplicaActive/RepNotRecovering>: react After kb_used 9007199255846878
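A quick sanity check ties the two log lines together. Assuming the reservation code adds the pg's num_bytes converted to KiB onto kb_used, the "After" figure is exactly the "Before" figure plus the bogus num_bytes from the pg dump:

```python
before_kb = 1106096            # "Before kb_used" from the log
after_kb  = 9007199255846878   # "After kb_used" from the log
num_bytes = 9223372036854561577  # the bogus pg num_bytes from the pg dump

# Assumption: the adjustment added to kb_used is num_bytes expressed in KiB.
# The arithmetic matches exactly, which is why the huge num_bytes blows
# straight through the backfill-full check.
assert before_kb + num_bytes // 1024 == after_kb
```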
Related issues
History
#1 Updated by David Zafman over 5 years ago
The OSD logs and this bug point to a slight flaw in https://github.com/ceph/ceph/pull/22797. I add the adjustment amount so that the normal space handling can determine whether we are over backfill full. When that check fails, there is a brief window in which fullness is triggered, even though we will not actually take the backfill reservation and will then unset the adjustment amount. We need to do a tentative fullness check and commit the adjustment amount only if we will be proceeding with the backfill.
2018-07-06T18:00:15.613 INFO:tasks.ceph.osd.1.smithi011.stderr:2018-07-06 18:00:15.606 7f33d2055700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now -2147483648% full
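The fix described above can be sketched as follows (hypothetical names and ratio, not the actual Ceph code): compute the fullness check against a trial value first, and only commit the adjustment to kb_used if the reservation will actually be taken, so a rejected reservation never leaves the OSD looking (even briefly) over the backfill-full threshold.

```python
def try_reserve_backfill(kb_used, kb_total, adjustment_kb, backfill_full_ratio=0.9):
    """Hypothetical sketch of a tentative fullness check.

    Returns (new_kb_used, reserved). The adjustment is applied to a trial
    value only; kb_used is committed solely when the backfill proceeds.
    """
    tentative = kb_used + adjustment_kb          # trial value, not committed
    if tentative / kb_total >= backfill_full_ratio:
        return kb_used, False                    # reject: kb_used untouched
    return tentative, True                       # proceed: commit adjustment
```

For example, an adjustment that would push usage past the ratio is rejected without ever mutating kb_used, while a small one is committed along with the reservation.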
#2 Updated by Josh Durgin over 5 years ago
- Priority changed from Normal to High
#3 Updated by David Zafman over 5 years ago
- Blocks Bug #19753: Deny reservation if expected backfill size would put us over backfill_full_ratio added
#4 Updated by David Zafman over 5 years ago
- Status changed from New to In Progress
#5 Updated by David Zafman over 5 years ago
So far, with an assert added to object_stat_sum_t::add(), we saw the following. Still not sure why num_bytes is off.
#7  0x0000558461d846de in ceph::__ceph_assert_fail (ctx=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/assert.cc:79
#8  0x00005584620abb99 in object_stat_sum_t::add (this=this@entry=0x55846e7f8748, o=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/osd_types.cc:2151
#9  0x0000558461f63271 in add (o=..., this=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/osd_types.h:1928
#10 PG::prepare_write_info (this=this@entry=0x55846e7f8000, km=km@entry=0x7f5e93fc44e0) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:3478
#11 0x0000558461f634eb in PG::write_if_dirty (this=this@entry=0x55846e7f8000, t=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:3569
#12 0x0000558461f6352d in PG::finish_split_stats (this=this@entry=0x55846e7f8000, stats=..., t=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:2607
Python Exception <type 'exceptions.ValueError'> Cannot find type const std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >::_Rep_type:
#13 0x0000558461eb0a48 in OSD::split_pgs (this=<optimized out>, parent=<optimized out>, childpgids=std::set with 1 elements, out_pgs=<optimized out>, curmap=std::shared_ptr (count 3, weak 1) 0x55846cb08000, nextmap=std::shared_ptr (count 50, weak 1) 0x55846c100d00, rctx=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:8226
#14 0x0000558461ebcb35 in OSD::advance_pg (this=this@entry=0x55846bf9e000, osd_epoch=<optimized out>, pg=pg@entry=0x55846cdbe000, handle=..., rctx=rctx@entry=0x7f5e93fc48f0) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:7930
#15 0x0000558461ebcee1 in OSD::dequeue_peering_evt (this=0x55846bf9e000, sdata=<optimized out>, pg=0x55846cdbe000, evt=std::shared_ptr (count 2, weak 0) 0x55846d718910, handle=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:9037
#16 0x0000558462104220 in PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OpQueueItem.cc:34
#17 0x0000558461eb4b3c in run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f5e93fc4aa0) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OpQueueItem.h:134
#18 OSD::ShardedOpWQ::_process (this=0x55846bf9f050, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:10009
#19 0x000055846242b716 in ShardedThreadPool::shardedthreadpool_worker (this=0x55846bf9e938, thread_index=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/WorkQueue.cc:340
#20 0x000055846242c310 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/WorkQueue.h:691
#21 0x00007f5ebc91de25 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f5ebb7f7bad in clone () from /lib64/libc.so.6
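The backtrace shows the assert firing inside stats accumulation during pg split (PG::finish_split_stats via OSD::split_pgs). The idea behind the debugging assert can be sketched like this (a toy stand-in, not the actual object_stat_sum_t code): check the invariant at the point of accumulation, so the underflow is caught before the bad stats are persisted.

```python
class StatSum:
    """Toy stand-in for object_stat_sum_t: accumulate per-pg stats and
    assert the running totals stay sane."""

    def __init__(self, num_bytes=0, num_objects=0):
        self.num_bytes = num_bytes
        self.num_objects = num_objects

    def add(self, other):
        self.num_bytes += other.num_bytes
        self.num_objects += other.num_objects
        # Catch underflow where it happens, as the debugging assert in this
        # comment did, rather than after the stats are written out.
        assert self.num_bytes >= 0, f"num_bytes went negative: {self.num_bytes}"
```

With such a check, an add() that drives num_bytes below zero aborts immediately with a usable backtrace instead of leaving a near-INT64_MAX value in the pg stats.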
#6 Updated by David Zafman over 5 years ago
Fix is included in pull request https://github.com/ceph/ceph/pull/22797
#7 Updated by David Zafman about 5 years ago
- Status changed from In Progress to Resolved