Bug #19956

radosgw Segmentation fault when service is restarted during lifecycle processing

Added by Ben Hines almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
-
Target version:
-
Start date:
05/17/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
kraken
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

1. Trigger a lifecycle gc run
2. Restart the service during this run. (this also happens when the process is HUP'd for the nightly log rotation)

---> ceph will sometimes segfault on restart

/var/log/secure:
May 16 20:55:08 <> ; USER=root ; COMMAND=/bin/systemctl restart ceph.target

/var/log/ceph/client.<host>.log

2017-05-16 20:55:16.798034 7fb5e9b9b700 20 reading from .rgw:.bucket.meta.<>:default.42048218.4
2017-05-16 20:55:16.798049 7fb5e9b9b700 20 get_system_obj_state: rctx=0x7fb5e9b97c90 obj=.rgw:.bucket.meta.<>:default.42048218.4 state=0x7fb6a062b298 s->prefetch_data=0
2017-05-16 20:55:16.799199 7fb5e9b9b700 20 get_system_obj_state: s->obj_tag was set empty
2017-05-16 20:55:16.799248 7fb5e9b9b700 20 rados->read ofs=0 len=524288
2017-05-16 20:55:16.801033 7fb5e9b9b700 20 rados->read r=0 bl.length=257
2017-05-16 20:55:16.801101 7fb5e9b9b700 20  bucket index object: .dir.default.42048218.4.15
2017-05-16 20:55:16.906092 7fb6003c8700  0 gc::process: removing .rgw.buckets:default.42048218.37_%SHADOW_DIR%/generatedshaders/9ffbb662f47bc73c.hlsl/7c5b3169b15b90bf85d0a78b640b35d6
2017-05-16 20:55:16.918084 7fb6003c8700  2 garbage collection: stop
2017-05-16 20:55:16.950814 7fb5e9b9b700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb5e9b9b700 thread_name:lifecycle_thr

 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
 1: (()+0x50bdca) [0x7fb694bfadca]
 2: (()+0xf370) [0x7fb689e58370]
 3: (RGWGC::add_chain(librados::ObjectWriteOperation&, cls_rgw_obj_chain&, std::string const&)+0x62) [0x7fb694b42d62]
 4: (RGWGC::send_chain(cls_rgw_obj_chain&, std::string const&, bool)+0x5f) [0x7fb694b42e3f]
 5: (RGWRados::Object::complete_atomic_modification()+0xb9) [0x7fb694a01a29]
 6: (RGWRados::Object::Delete::delete_obj()+0xd70) [0x7fb694a512a0]
 7: (RGWRados::delete_obj(RGWObjectCtx&, RGWBucketInfo&, rgw_obj const&, int, unsigned short, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&)+0x1c8) [0x7fb694a51ce8]
 8: (rgw_remove_object(RGWRados*, RGWBucketInfo&, rgw_bucket&, rgw_obj_key&)+0x25d) [0x7fb694948ced]
 9: (RGWLC::bucket_lc_process(std::string&)+0x1281) [0x7fb694b4d971]
 10: (RGWLC::process(int, int)+0x4f8) [0x7fb694b4e928]
 11: (RGWLC::process()+0x71) [0x7fb694b4edc1]
 12: (RGWLC::LCWorker::entry()+0x243) [0x7fb694b4f033]
 13: (()+0x7dc5) [0x7fb689e50dc5]
 14: (clone()+0x6d) [0x7fb6886b173d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Duplicated by rgw - Bug #20755: radosgw crash when service is restarted during lifecycle processing Closed 07/24/2017
Duplicated by rgw - Bug #20756: radosgw daemon crash when service is restarted during lifecycle processing Closed 07/24/2017
Copied to rgw - Backport #20836: kraken: radosgw Segmentation fault when service is restarted during lifecycle processing Rejected

History

#1 Updated by Ben Hines almost 2 years ago

Yikes, the same fault also happened when I pushed a new period map, which caused the 'realm reloader' to trigger. That makes this seem more serious.


2017-05-16 21:41:53.691726 7f653addf700  1 rgw realm reloader: Frontends paused
2017-05-16 21:41:53.691938 7f66397dc700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
2017-05-16 21:41:54.615538 7f66377d8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f66377d8700 thread_name:lifecycle_thr

 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
 1: (()+0x50bdca) [0x7f66e2845dca]
 2: (()+0xf370) [0x7f66d7a95370]
 3: (RGWGC::add_chain(librados::ObjectWriteOperation&, cls_rgw_obj_chain&, std::string const&)+0x62) [0x7f66e278dd62]
 4: (RGWGC::send_chain(cls_rgw_obj_chain&, std::string const&, bool)+0x5f) [0x7f66e278de3f]
 5: (RGWRados::Object::complete_atomic_modification()+0xb9) [0x7f66e264ca29]
 6: (RGWRados::Object::Delete::delete_obj()+0xd70) [0x7f66e269c2a0]

#2 Updated by Ben Hines almost 2 years ago

I got a coredump + symbols for this. Let me know if you want the dump; here is the stack with full symbols:

#0 0x00007f6a6cb1723b in raise () from /lib64/libpthread.so.0
#1 0x00007f6a778b9e95 in reraise_fatal (signum=11) at /usr/src/debug/ceph-11.2.0/src/global/signal_handler.cc:72
#2 handle_fatal_signal (signum=11) at /usr/src/debug/ceph-11.2.0/src/global/signal_handler.cc:134
#3 <signal handler called>
#4 RGWGC::add_chain (this=this@entry=0x0, op=..., chain=..., tag="default.68996150.61684839") at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_gc.cc:58
#5 0x00007f6a77801e3f in RGWGC::send_chain (this=0x0, chain=..., tag="default.68996150.61684839", sync=sync@entry=false)
at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_gc.cc:64
#6 0x00007f6a776c0a29 in RGWRados::Object::complete_atomic_modification (this=0x7f69cc8578d0) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_rados.cc:7870
#7 0x00007f6a777102a0 in RGWRados::Object::Delete::delete_obj (this=this@entry=0x7f69cc857840) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_rados.cc:8295
#8 0x00007f6a77710ce8 in RGWRados::delete_obj (this=<optimized out>, obj_ctx=..., bucket_info=..., obj=..., versioning_status=0, bilog_flags=<optimized out>,
expiration_time=...) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_rados.cc:8330
#9 0x00007f6a77607ced in rgw_remove_object (store=0x7f6a810fe000, bucket_info=..., bucket=..., key=...) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_bucket.cc:519
#10 0x00007f6a7780c971 in RGWLC::bucket_lc_process (this=this@entry=0x7f6a81959c00, shard_id=":globalcache307:default.42048218.11")
at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_lc.cc:283
#11 0x00007f6a7780d928 in RGWLC::process (this=this@entry=0x7f6a81959c00, index=<optimized out>, max_lock_secs=max_lock_secs@entry=60)
at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_lc.cc:482
#12 0x00007f6a7780ddc1 in RGWLC::process (this=0x7f6a81959c00) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_lc.cc:412
#13 0x00007f6a7780e033 in RGWLC::LCWorker::entry (this=0x7f6a81a820d0) at /usr/src/debug/ceph-11.2.0/src/rgw/rgw_lc.cc:51
#14 0x00007f6a6cb0fdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f6a6b37073d in clone () from /lib64/libc.so.6
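The `this=this@entry=0x0` in frame #4 is the crash: the log shows "garbage collection: stop" just before the fault, i.e. the RGWGC instance is torn down on restart while the lifecycle thread is still deleting objects, so `add_chain()`'s first member access (`cct->_conf`) dereferences a null `this`. A minimal C++ model of the race, with a defensive guard for illustration (all names here are made up; this is not the actual upstream fix):

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Toy stand-in for RGWGC (name and members are hypothetical).
struct FakeGC {
  int min_wait = 300;   // plays the role of cct->_conf->rgw_gc_obj_min_wait
  int chains_sent = 0;
  int send_chain(const std::string& tag) {
    // In 11.2.0, reaching here with a null `this` segfaults on the
    // first member access, exactly as in frame #4 above.
    (void)tag;
    ++chains_sent;
    return 0;
  }
};

// Defensive sketch: skip enqueueing the GC chain once the GC has been
// torn down, instead of calling through a dangling/null pointer.
int complete_modification(FakeGC* gc, const std::string& tag) {
  if (gc == nullptr)
    return -ECANCELED;  // shutting down; nothing to enqueue
  return gc->send_chain(tag);
}
```

The real fix has to address the shutdown ordering between the lifecycle worker and the GC object; the null check above only illustrates where the unguarded call path breaks.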

#3 Updated by Ben Hines almost 2 years ago

rgw_gc.cc:58:
  cls_rgw_gc_set_entry(op, cct->_conf->rgw_gc_obj_min_wait, info);

Maybe it doesn't like my 'gc obj min wait' setting? Here's my config:


keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw thread pool size = 800
debug civetweb = 20
rgw frontends = civetweb enable_keep_alive=yes port=80 num_threads=500 error_log_file=/var/log/ceph/civetweb.error.log access_log_file=/var/log/ceph/civetweb.access.log
rgw num rados handles = 32
rgw cache lru size = 30000
rgw cache enabled = true
rgw override bucket index max shards = 23
rgw enable apis = s3
debug rgw = 20
rgw lifecycle work time = 00:01-23:59
rgw enable ops log = False
rgw gc max objs = 2647
rgw lc max objs = 2647
rgw gc obj min wait = 300
rgw gc processor period = 600
rgw gc processor max time = 600
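Given the `this=0x0` in the stack above, the fault happens while reading `cct->_conf` through a null `this`, before the value of 'rgw gc obj min wait' is ever used, so the setting's value itself seems unlikely to be the trigger. For reference, a rough sketch (helper name invented) of how these gc timings interact: an object deleted at time `t` becomes eligible for collection `min_wait` seconds later and is actually removed on the first processor cycle at or after that point:

```cpp
#include <cassert>

// Illustration only: with the settings above (min_wait = 300,
// period = 600), actual removal lands on the first GC processor
// run at or after the eligibility time. Helper name is made up.
long first_gc_run(long delete_time, long min_wait, long period) {
  long eligible = delete_time + min_wait;
  long run = 0;  // processor runs at 0, period, 2*period, ...
  while (run < eligible)
    run += period;
  return run;
}
```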

#4 Updated by wei qiaomiao over 1 year ago

I think this PR: http://tracker.ceph.com/issues/19956 will help you.

#5 Updated by Ben Hines over 1 year ago

wei qiaomiao wrote:

I think this PR: http://tracker.ceph.com/issues/19956 will help you.

The link just goes back here. Which PR? If it's in Luminous I can wait. Thanks! :)

#6 Updated by Orit Wasserman over 1 year ago

  • Priority changed from Normal to High

#7 Updated by Ben Hines over 1 year ago

Looks like the fix is at https://github.com/ceph/ceph/pull/16495, and this was duplicated by http://tracker.ceph.com/issues/20756.

#9 Updated by Casey Bodley over 1 year ago

  • Status changed from New to Pending Backport
  • Backport set to jewel kraken

#10 Updated by Casey Bodley over 1 year ago

  • Backport changed from jewel kraken to kraken

#11 Updated by Casey Bodley over 1 year ago

  • Related to Bug #20755: radosgw crash when service is restarted during lifecycle processing added

#12 Updated by Casey Bodley over 1 year ago

  • Related to deleted (Bug #20755: radosgw crash when service is restarted during lifecycle processing)

#13 Updated by Casey Bodley over 1 year ago

  • Duplicated by Bug #20755: radosgw crash when service is restarted during lifecycle processing added

#14 Updated by Casey Bodley over 1 year ago

  • Duplicated by Bug #20756: radosgw daemon crash when service is restarted during lifecycle processing added

#15 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #20836: kraken: radosgw Segmentation fault when service is restarted during lifecycle processing added

#16 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved
