
Bug #24801

PG num_bytes becomes huge

Added by David Zafman over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

dzafman-2018-07-05_12:45:56-rados-wip-19753-distro-basic-smithi/2739140

description: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml 2-recovery-overrides/{more-active-recovery.yaml} backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/off.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml supported-random-distro$/{rhel_latest.yaml} thrashers/morepggrow.yaml thrashosds-health.yaml workloads/snaps-few-objects.yaml}

While testing the new backfill out-of-space handling, I noticed a job that was taking a long time to finish. Upon analysis on one of the test machines, I noticed that backfill_toofull was set. That is as it should be, given that the pg's num_bytes had the value 9223372036854561577: to the new code, contents that large would have no chance of backfilling to a target OSD.

PG_STAT OBJECTS             MISSING_ON_PRIMARY DEGRADED MISPLACED           UNFOUND BYTES               LOG DISK_LOG STATE                                                STATE_STAMP                VERSION  REPORTED   UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
2.5b                      1                  0        1                   2       0 9223372036854561577 380      380 active+undersized+degraded+remapped+backfill_toofull 2018-07-06 17:07:15.464466  515'822 4772:10438 [7,1,4]          7   [6,3]              6    234'585 2018-07-06 09:42:34.599920             0'0 2018-07-06 09:38:35.758926            17
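
For scale (an observation added for this writeup, not part of the original dump): the BYTES value is only 214230 below INT64_MAX (2^63 - 1), the signature of a signed 64-bit stats counter gone wrong rather than a plausible size for a pg holding a single object:

#include <cstdint>
#include <cstdio>

int main() {
  // BYTES as reported by ceph pg dump for pg 2.5b
  uint64_t logged = 9223372036854561577ULL;
  // Distance from INT64_MAX; prints 214230
  printf("%lld\n", (long long)(INT64_MAX - logged));
  return 0;
}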

The new code adjusted the kb_used after backfill completed, to verify fullness before taking the reservation:

2018-07-06 09:47:27.279 7fb7a6666700 20 osd.0 pg_epoch: 548 pg[2.5b( v 515'822 (179'442,515'822] lb MIN (bitwise) local-lis/les=518/519 n=0 ec=545/17 lis/c 547/505 les/c/f 548/506/547 517/547/547) [5,6,0]/[2,6,3] r=-1 lpr=547 pi=[505,547)/2 luod=0'0 crt=515'822 lcod 0'0 active+remapped mbc={}] state<Started/ReplicaActive/RepNotRecovering>: react Before kb_used 1106096
2018-07-06 09:47:27.279 7fb7a6666700 20 osd.0 pg_epoch: 548 pg[2.5b( v 515'822 (179'442,515'822] lb MIN (bitwise) local-lis/les=518/519 n=0 ec=545/17 lis/c 547/505 les/c/f 548/506/547 517/547/547) [5,6,0]/[2,6,3] r=-1 lpr=547 pi=[505,547)/2 luod=0'0 crt=515'822 lcod 0'0 active+remapped mbc={}] state<Started/ReplicaActive/RepNotRecovering>: react After kb_used 9007199255846878
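
The two log lines check out arithmetically (a standalone verification written for this report, not Ceph code): "After" is exactly "Before" plus the corrupt num_bytes converted to KB, showing how a single bad pg stat inflated the OSD's apparent usage:

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t before_kb = 1106096;                 // "Before kb_used" from the log
  uint64_t num_bytes = 9223372036854561577ULL;  // corrupt pg num_bytes
  uint64_t after_kb = before_kb + num_bytes / 1024;
  printf("%llu\n", (unsigned long long)after_kb);  // prints 9007199255846878
  return 0;
}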

Related issues

Blocks RADOS - Bug #19753: Deny reservation if expected backfill size would put us over backfill_full_ratio Resolved 04/24/2017

History

#1 Updated by David Zafman over 5 years ago

The OSD logs and this bug point to a slight flaw in https://github.com/ceph/ceph/pull/22797. I add the adjustment amount so that the normal space handling can determine whether we are over backfill full. When that check fails there is a brief window in which fullness is triggered, even though we will not actually take the backfill reservation and will then unset the adjustment amount. We need to do a tentative fullness check and only commit the adjustment amount if we will be proceeding with the backfill.

2018-07-06T18:00:15.613 INFO:tasks.ceph.osd.1.smithi011.stderr:2018-07-06 18:00:15.606 7f33d2055700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now -2147483648% full
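
A minimal sketch of that approach, with hypothetical names (the real change is in the pull request above): compute the projected usage tentatively, deny the reservation if it would cross backfill_full_ratio, and commit the adjustment only once the backfill will actually proceed.

#include <cstdint>

struct OsdSpaceState {
  uint64_t kb_used;             // current usage
  uint64_t kb_total;            // capacity
  uint64_t backfill_adjust_kb;  // committed backfill adjustment (hypothetical field)

  // Returns true if the backfill reservation is accepted.
  bool try_reserve_backfill(uint64_t backfill_bytes, double backfill_full_ratio) {
    // Tentative check only: nothing is committed yet, so a rejection
    // can never leave a transient "full" state behind.
    uint64_t projected_kb = kb_used + backfill_bytes / 1024;
    if (static_cast<double>(projected_kb) / kb_total >= backfill_full_ratio)
      return false;  // deny; usage counters are untouched
    // Commit the adjustment only together with the granted reservation.
    backfill_adjust_kb = backfill_bytes / 1024;
    return true;
  }
};

int main() {
  OsdSpaceState osd{1106096, 100000000, 0};
  return osd.try_reserve_backfill(1ULL << 30, 0.90) ? 0 : 1;
}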

#2 Updated by Josh Durgin over 5 years ago

  • Priority changed from Normal to High

#3 Updated by David Zafman over 5 years ago

  • Blocks Bug #19753: Deny reservation if expected backfill size would put us over backfill_full_ratio added

#4 Updated by David Zafman over 5 years ago

  • Status changed from New to In Progress

#5 Updated by David Zafman over 5 years ago

So far, with an assert added to object_stat_sum_t::add() (the shape of the check is sketched below), we saw the backtrace that follows. Still not sure why num_bytes is off.
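
A guess at the shape of that debugging assert (hypothetical; the real object_stat_sum_t in src/osd/osd_types.h has many more fields, and the actual check may differ): trip the moment the accumulated byte count goes negative, so the backtrace identifies the caller that fed in the bad delta.

#include <cassert>
#include <cstdint>

struct object_stat_sum_t {
  int64_t num_bytes = 0;
  // (many other counters elided)

  void add(const object_stat_sum_t& o) {
    num_bytes += o.num_bytes;
    // Abort as soon as the sum goes negative so the backtrace
    // shows which caller supplied the bad delta.
    assert(num_bytes >= 0);
  }
};

int main() {
  object_stat_sum_t sum, delta;
  delta.num_bytes = -1;  // any delta that drives the sum negative
  sum.add(delta);        // assert fires here
  return 0;
}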

#7  0x0000558461d846de in ceph::__ceph_assert_fail (ctx=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/assert.cc:79
#8  0x00005584620abb99 in object_stat_sum_t::add (this=this@entry=0x55846e7f8748, o=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/osd_types.cc:2151
#9  0x0000558461f63271 in add (o=..., this=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/osd_types.h:1928
#10 PG::prepare_write_info (this=this@entry=0x55846e7f8000, km=km@entry=0x7f5e93fc44e0) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:3478
#11 0x0000558461f634eb in PG::write_if_dirty (this=this@entry=0x55846e7f8000, t=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:3569
#12 0x0000558461f6352d in PG::finish_split_stats (this=this@entry=0x55846e7f8000, stats=..., t=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/PG.cc:2607
#13 0x0000558461eb0a48 in OSD::split_pgs (this=<optimized out>, parent=<optimized out>, childpgids=std::set with 1 elements, out_pgs=<optimized out>, curmap=std::shared_ptr (count 3, weak 1) 0x55846cb08000, nextmap=std::shared_ptr (count 50, weak 1) 0x55846c100d00, rctx=<optimized out>)
    at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:8226
#14 0x0000558461ebcb35 in OSD::advance_pg (this=this@entry=0x55846bf9e000, osd_epoch=<optimized out>, pg=pg@entry=0x55846cdbe000, handle=..., rctx=rctx@entry=0x7f5e93fc48f0)
    at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:7930
#15 0x0000558461ebcee1 in OSD::dequeue_peering_evt (this=0x55846bf9e000, sdata=<optimized out>, pg=0x55846cdbe000, evt=std::shared_ptr (count 2, weak 0) 0x55846d718910,
    handle=...) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:9037
#16 0x0000558462104220 in PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...)
    at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OpQueueItem.cc:34
#17 0x0000558461eb4b3c in run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f5e93fc4aa0)
    at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OpQueueItem.h:134
#18 OSD::ShardedOpWQ::_process (this=0x55846bf9f050, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/osd/OSD.cc:10009
#19 0x000055846242b716 in ShardedThreadPool::shardedthreadpool_worker (this=0x55846bf9e938, thread_index=<optimized out>)
    at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/WorkQueue.cc:340
#20 0x000055846242c310 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-14.0.0-2411-gb84f4a5/src/common/WorkQueue.h:691
#21 0x00007f5ebc91de25 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f5ebb7f7bad in clone () from /lib64/libc.so.6
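
One reading consistent with this trace (an inference added here, not a confirmed diagnosis): frames #12-#13 run through PG::finish_split_stats and OSD::split_pgs, i.e. pg stats being divided between parent and child during a pg split. If num_bytes had already underflowed to a small negative value, halving it with unsigned arithmetic would produce exactly the BYTES figure from the pg dump above:

#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical: num_bytes underflowed to -428462 before the split.
  int64_t num_bytes = -428462;
  // Splitting the stats in half with unsigned arithmetic reinterprets
  // the negative value as a huge unsigned one before dividing.
  uint64_t half = (uint64_t)num_bytes / 2;
  printf("%llu\n", (unsigned long long)half);  // prints 9223372036854561577
  return 0;
}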

#6 Updated by David Zafman over 5 years ago

The fix is included in pull request https://github.com/ceph/ceph/pull/22797

#7 Updated by David Zafman about 5 years ago

  • Status changed from In Progress to Resolved
