Bug #13098

OSD crashed when reached pool's max_bytes quota

Added by huang jun about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: hammer
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

[environment]
ceph version 0.94.3-196-g19ff928 (19ff92806fd1e0fb866737f58e379aa8078b8017)

[huangjun@code253 src]$ uname -a
Linux code253 3.18.19 #1 SMP Thu Jul 23 14:03:27 CST 2015 x86_64 x86_64 x86_64 GNU/Linux

[procedure to reproduce the problem]
Create an erasure-coded pool and overlay a cache pool in readforward mode:
$ ./ceph osd pool create ec-ca 1 1
$ ./ceph osd pool create ec 1 1 erasure default
$ ./ceph osd tier add ec ec-ca
$ ./ceph osd tier cache-mode ec-ca readforward
$ ./ceph osd tier set-overlay ec ec-ca
$ ./ceph osd pool set ec-ca hit_set_type bloom
$ ./ceph osd pool set-quota ec-ca max_bytes 20480000
$ ./ceph osd pool set-quota ec max_bytes 20480000
$ ./ceph osd pool set ec-ca target_max_bytes 20480000
Then use rados bench to write some objects:
$ ./rados -p ec-ca bench 10 write
The bench process stopped when the pools "ec-ca" and "ec" became full.
Further attempts to put objects into pool "ec-ca" returned an ENOSPC error.
At this point, 2 OSDs were down.

[huangjun@code253 src]$ ./ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster f7e9783e-8d0b-49cf-a34d-8506df7ecea9
    health HEALTH_WARN
    25 pgs degraded
    1 pgs down
    1 pgs peering
    25 pgs stuck degraded
    1 pgs stuck inactive
    26 pgs stuck unclean
    25 pgs stuck undersized
    25 pgs undersized
    recovery 40/162 objects degraded (24.691%)
    'ec-ca' at/near target max
    pool 'ec-ca' is full
    pool 'ec' is full
    monmap e1: 1 mons at {a=192.168.0.253:6789/0}
    election epoch 2, quorum 0 a
    mdsmap e5: 1/1/1 up {0=a=up:active}
    osdmap e32: 3 osds: 1 up, 1 in
    pgmap v69: 26 pgs, 5 pools, 196 MB data, 54 objects
    236 GB used, 1441 GB / 1768 GB avail
    40/162 objects degraded (24.691%)
    25 active+undersized+degraded
    1 down+peering

Here is the osd.1.log:
-2> 2015-09-15 21:42:05.349239 7f8485ffb700 5 - op tracker -- seq: 658, time: 2015-09-15 21:42:05.349199, event: all_read, op: osd_op(osd.0.4:38 benchmark_data_code253_275199_object42 [assert-version v147,copy-get max 8388608] 3.d3b56ef4 ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e26)
-1> 2015-09-15 21:42:05.349246 7f8485ffb700 5 - op tracker -- seq: 658, time: 0.000000, event: dispatched, op: osd_op(osd.0.4:38 benchmark_data_code253_275199_object42 [assert-version v147,copy-get max 8388608] 3.d3b56ef4 ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e26)
0> 2015-09-15 21:42:05.351613 7f84c67fc700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*, ceph::buffer::list::iterator&, OSDOp&, ObjectContextRef&, bool)' thread 7f84c67fc700 time 2015-09-15 21:42:05.349021
osd/ReplicatedPG.cc: 6057: FAILED assert(cursor.data_complete)

ceph version 0.94.3-196-g19ff928 (19ff92806fd1e0fb866737f58e379aa8078b8017)
1: (ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*, ceph::buffer::list::iterator&, OSDOp&, std::tr1::shared_ptr<ObjectContext>&, bool)+0x14ad) [0x941e2d]
2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x469b) [0x94ceeb]
3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x61) [0x9572d1]
4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xea3) [0x958303]
5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2ab1) [0x9602c1]
6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4e3) [0x8f7573]
7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x680f08]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59e) [0x69856e]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x702) [0xac09a2]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xac4200]
11: /lib64/libpthread.so.0() [0x381a0079d1]
12: (clone()+0x6d) [0x3819ce8b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Duplicated by Ceph - Bug #12449: ceph-osd core dumped when writing data to the backing storage pool which has a quota set on its cache pool Duplicate 07/23/2015
Copied to Ceph - Backport #13335: hammer: OSD crashed when reached pool's max_bytes quota Resolved

Associated revisions

Revision a1eb380c (diff)
Added by Sage Weil about 5 years ago

osd/ReplicatedPG: fix ENOSPC checking

1. We need to return ENOSPC before we apply our side-effects to the obc
cache in finish_ctx().

2. Consider object count changes too, not just bytes.

3. Consider cluster full flag, not just the pool flag.

4. Reply only if FULL_TRY; silently drop ops that were sent despite the
full flag.

Fixes: #13098
Signed-off-by: Sage Weil <>
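
For readers who do not have the diff open, the four points of the commit message can be modeled roughly as below. This is only a sketch under stated assumptions, not the actual patch: FullState, Op, Verdict, CacheModel, check_full, apply_side_effects and execute are hypothetical names invented for illustration, and the real logic in osd/ReplicatedPG.cc works on Ceph's own types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration; not Ceph's real types.
struct FullState {
  bool pool_full = false;     // pool-level full flag (e.g. quota reached)
  bool cluster_full = false;  // cluster-wide full flag
};

struct Op {
  int64_t delta_bytes = 0;    // net change in stored bytes caused by the op
  int64_t delta_objects = 0;  // net change in object count caused by the op
  bool full_try = false;      // client asked for an explicit error when full
};

enum class Verdict { Proceed, ReplyENOSPC, DropSilently };

// Items 2-4: look at bytes and objects, at pool and cluster flags, and reply
// only when the client set FULL_TRY.
Verdict check_full(const FullState& st, const Op& op) {
  const bool full = st.pool_full || st.cluster_full;
  const bool grows = op.delta_bytes > 0 || op.delta_objects > 0;
  if (!full || !grows) return Verdict::Proceed;
  return op.full_try ? Verdict::ReplyENOSPC : Verdict::DropSilently;
}

// Stand-in for the cached object state that must stay untouched on rejection.
struct CacheModel {
  int64_t bytes = 0;
  int64_t objects = 0;
};

// Side effects may only be applied to an op that passed the full check.
void apply_side_effects(CacheModel& cache, const Op& op, Verdict v) {
  assert(v == Verdict::Proceed);
  cache.bytes += op.delta_bytes;
  cache.objects += op.delta_objects;
}

// Item 1: decide first, mutate afterwards.
Verdict execute(const FullState& st, const Op& op, CacheModel& cache) {
  const Verdict v = check_full(st, op);
  if (v == Verdict::Proceed) apply_side_effects(cache, op, v);
  return v;
}
```

In this model an op that only shrinks usage (for example a delete) still proceeds while the full flag is set, and a rejected write leaves the cache model unchanged, which is the ordering that item 1 asks for.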

History

#1 Updated by Sage Weil about 5 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Community (user)

#2 Updated by Sage Weil about 5 years ago

  • Status changed from New to 12

Easily reproduced on hammer.

#3 Updated by Sage Weil about 5 years ago

  • Assignee set to Sage Weil

#4 Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review
  • Assignee deleted (Sage Weil)
  • Backport set to hammer,firefly

#5 Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to 7

#6 Updated by Sage Weil about 5 years ago

  • Status changed from 7 to Pending Backport
  • Backport changed from hammer,firefly to hammer

We probably want to backport a simpler version of this patch that does not include the rados flags.

#8 Updated by Loic Dachary almost 5 years ago

  • Duplicated by Bug #12449: ceph-osd core dumped when writing data to the backing storage pool which has a quota set on its cache pool added

#9 Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

#10 Updated by Josh Durgin over 4 years ago

  • Copied to Backport #14824: hammer: rbd and pool quota do not go well together added

#11 Updated by Josh Durgin over 4 years ago

  • Status changed from Resolved to Pending Backport

#12 Updated by Loic Dachary over 4 years ago

  • Status changed from Pending Backport to Resolved

#13 Updated by Loic Dachary over 4 years ago

  • Status changed from Resolved to Pending Backport

#14 Updated by Loic Dachary over 4 years ago

  • Status changed from Pending Backport to Resolved
