Project

General

Profile

Actions

Bug #8846

closed

radosgw on 0.80.4 crashes when doing a multi-part upload

Added by Sylvain Munaut almost 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is the tracelog (from a self compiled version since I started debugging this myself. However same exact issue when using the official package for precise)

   -17> 2014-07-16 14:08:05.789600 7f3cc67fc700  1 ====== starting new request req=0x7f3cbc0094a0 =====
   -16> 2014-07-16 14:08:05.789621 7f3cc67fc700  2 req 0:0.000021::PUT /test/testmulti::initializing
   -15> 2014-07-16 14:08:05.789647 7f3cc67fc700  2 req 0:0.000048:s3:PUT /test/testmulti::getting op
   -14> 2014-07-16 14:08:05.789652 7f3cc67fc700  2 req 0:0.000053:s3:PUT /test/testmulti:put_obj:authorizing
   -13> 2014-07-16 14:08:05.789790 7f3cc67fc700  2 req 0:0.000190:s3:PUT /test/testmulti:put_obj:reading permissions
   -12> 2014-07-16 14:08:05.789909 7f3cc67fc700  2 req 0:0.000300:s3:PUT /test/testmulti:put_obj:init op
   -11> 2014-07-16 14:08:05.789925 7f3cc67fc700  2 req 0:0.000325:s3:PUT /test/testmulti:put_obj:verifying op mask
   -10> 2014-07-16 14:08:05.789928 7f3cc67fc700  2 req 0:0.000329:s3:PUT /test/testmulti:put_obj:verifying op permissions
    -9> 2014-07-16 14:08:05.789931 7f3cc67fc700  5 Searching permissions for uid=test mask=50
    -8> 2014-07-16 14:08:05.789934 7f3cc67fc700  5 Found permission: 15
    -7> 2014-07-16 14:08:05.789935 7f3cc67fc700  5 Searching permissions for group=1 mask=50
    -6> 2014-07-16 14:08:05.789937 7f3cc67fc700  5 Permissions for group not found
    -5> 2014-07-16 14:08:05.789938 7f3cc67fc700  5 Searching permissions for group=2 mask=50
    -4> 2014-07-16 14:08:05.789940 7f3cc67fc700  5 Permissions for group not found
    -3> 2014-07-16 14:08:05.789941 7f3cc67fc700  5 Getting permissions id=test owner=test perm=2
    -2> 2014-07-16 14:08:05.789954 7f3cc67fc700  2 req 0:0.000354:s3:PUT /test/testmulti:put_obj:verifying op params
    -1> 2014-07-16 14:08:05.789958 7f3cc67fc700  2 req 0:0.000358:s3:PUT /test/testmulti:put_obj:executing
     0> 2014-07-16 14:08:05.814949 7f3cc67fc700 -1 *** Caught signal (Aborted) **
 in thread 7f3cc67fc700

 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: /home/whatever/ceph/src/.libs/lt-radosgw() [0x5c3eba]
 2: (()+0xfcb0) [0x7f3d0d147cb0]
 3: (gsignal()+0x35) [0x7f3d0c293425]
 4: (abort()+0x17b) [0x7f3d0c296b8b]
 5: (()+0x7439e) [0x7f3d0c2d139e]
 6: (()+0x7eb96) [0x7f3d0c2dbb96]
 7: (RGWPutObjProcessor_Aio::pop_pending()+0x26) [0x50c5f6]
 8: (RGWPutObjProcessor_Aio::wait_pending_front()+0x9) [0x50c619]
 9: (RGWPutObjProcessor_Aio::throttle_data(void*, bool)+0x70) [0x50c6d0]
 10: (RGWPutObj::execute()+0x5d1) [0x554521]
 11: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c489c]
 12: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c4fe7]
 13: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c0684]
 14: (()+0x7e9a) [0x7f3d0d13fe9a]
 15: (clone()+0x6d) [0x7f3d0c3513fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 0 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   5/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file 
--- end dump of recent events ---
Aborted (core dumped)
Actions #1

Updated by Sylvain Munaut almost 10 years ago

(gdb) bt
#0  0x00007ffff5352425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff5355b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff539039e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff539ab96 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x000000000050c5f6 in deallocate (__p=0x7fff600217d8, this=<optimized out>)
    at /usr/include/c++/4.6/ext/new_allocator.h:98
#5  _M_put_node (__p=0x7fff600217d8, this=<optimized out>)
    at /usr/include/c++/4.6/bits/stl_list.h:331
#6  _M_erase (__position=..., this=<optimized out>) at /usr/include/c++/4.6/bits/stl_list.h:1539
#7  pop_front (this=<optimized out>) at /usr/include/c++/4.6/bits/stl_list.h:974
#8  RGWPutObjProcessor_Aio::pop_pending (this=<optimized out>) at rgw/rgw_rados.cc:1095
#9  0x000000000050c619 in RGWPutObjProcessor_Aio::wait_pending_front (this=0x7fff600217a0)
    at rgw/rgw_rados.cc:1101
#10 0x000000000050c6d0 in RGWPutObjProcessor_Aio::throttle_data (this=0x7fff600217a0, 
    handle=<optimized out>, need_to_wait=true) at rgw/rgw_rados.cc:1136
#11 0x0000000000554521 in RGWPutObj::execute (this=0x7fff60021470) at rgw/rgw_op.cc:1650
#12 0x00000000004c489c in process_request (store=0x876f00, rest=<optimized out>, req=0x7fff6000dae0, 
    client_io=0x7ffef87bfb60, olog=0x0) at rgw/rgw_main.cc:600
#13 0x00000000004c4fe7 in civetweb_callback (conn=0x7fff60009300) at rgw/rgw_main.cc:687
#14 0x00000000004c0684 in handle_request (conn=0x7fff60009300) at civetweb/src/civetweb.c:5560
#15 process_new_connection (conn=0x7fff60009300) at civetweb/src/civetweb.c:6310
---Type <return> to continue, or q <return> to quit---
#16 worker_thread_run (thread_func_param=0x88ef20) at civetweb/src/civetweb.c:6418
#17 worker_thread (thread_func_param=0x88ef20) at civetweb/src/civetweb.c:6453
#18 0x00007ffff61fee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007ffff54103fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000000000000000 in ?? ()
Actions #2

Updated by Ian Colle almost 10 years ago

  • Assignee set to Yehuda Sadeh
  • Priority changed from Normal to Urgent
Actions #3

Updated by Sylvain Munaut almost 10 years ago

For more info, I'm doing a multipart upload and it crashes at the last part of the file. Two first part are 5M and the last one is 1M.

In the trace above, I added a debug print in RGWPutObjProcessor_Aio::pop_pending and pending.empty() is true which I think is the issue.

Actions #4

Updated by Yehuda Sadeh almost 10 years ago

Did that happen before (e.g., 0.80.3)? can you add:

debug ms = 1
debug rgw = 20

Actions #5

Updated by Sylvain Munaut almost 10 years ago

Yes, it happens since ea68b9372319fd0bab40856db26528d36359102e as I reported on the ML. (and now realize I forgot to say that in the ticket).

Reverting that commit over 0.80.4 fixes the issue.

This also prevents the crash and makes my test pass :

diff --git a/src/rgw/rgw_rados.cc b/src/rgw/rgw_rados.cc
index be8248e..4b4c590 100644
--- a/src/rgw/rgw_rados.cc
+++ b/src/rgw/rgw_rados.cc
@@ -1132,7 +1132,7 @@ int RGWPutObjProcessor_Aio::throttle_data(void *handle, bool need_to_wait)
   }
   size_t orig_size = pending.size();
   while (pending_has_completed()
-         || need_to_wait) {
+         || (need_to_wait && !pending.empty())) {
     int r = wait_pending_front();
     if (r < 0)
       return r;

However since I don't really understand what the code really does and all its implications, I have no idea if it's a correct fix or not.

Actions #6

Updated by Yehuda Sadeh almost 10 years ago

Are you by any chance using a non-default chunk size?

Actions #7

Updated by Sylvain Munaut almost 10 years ago

Oh yes, I've raised it to 5M to avoid having each part of a multi-part generate 2 objects in rados.

Actions #8

Updated by Yehuda Sadeh almost 10 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Yehuda Sadeh to Josh Durgin
Actions #9

Updated by Benjamin Gilbert almost 10 years ago

Even with the default chunk size, this can be triggered by performing a multipart upload consisting of a single small part.

Actions #10

Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Resolved
Actions #11

Updated by Sage Weil almost 10 years ago

  • Status changed from Resolved to Pending Backport

oops, we still need to do dumpling

Actions #12

Updated by Sage Weil over 9 years ago

  • Assignee deleted (Josh Durgin)
Actions #13

Updated by Yehuda Sadeh over 9 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF