Bug #8846
closedradosgw on 0.80.4 crashes when doing a multi-part upload
0%
Description
This is the tracelog (from a self compiled version since I started debugging this myself. However same exact issue when using the official package for precise)
-17> 2014-07-16 14:08:05.789600 7f3cc67fc700 1 ====== starting new request req=0x7f3cbc0094a0 ===== -16> 2014-07-16 14:08:05.789621 7f3cc67fc700 2 req 0:0.000021::PUT /test/testmulti::initializing -15> 2014-07-16 14:08:05.789647 7f3cc67fc700 2 req 0:0.000048:s3:PUT /test/testmulti::getting op -14> 2014-07-16 14:08:05.789652 7f3cc67fc700 2 req 0:0.000053:s3:PUT /test/testmulti:put_obj:authorizing -13> 2014-07-16 14:08:05.789790 7f3cc67fc700 2 req 0:0.000190:s3:PUT /test/testmulti:put_obj:reading permissions -12> 2014-07-16 14:08:05.789909 7f3cc67fc700 2 req 0:0.000300:s3:PUT /test/testmulti:put_obj:init op -11> 2014-07-16 14:08:05.789925 7f3cc67fc700 2 req 0:0.000325:s3:PUT /test/testmulti:put_obj:verifying op mask -10> 2014-07-16 14:08:05.789928 7f3cc67fc700 2 req 0:0.000329:s3:PUT /test/testmulti:put_obj:verifying op permissions -9> 2014-07-16 14:08:05.789931 7f3cc67fc700 5 Searching permissions for uid=test mask=50 -8> 2014-07-16 14:08:05.789934 7f3cc67fc700 5 Found permission: 15 -7> 2014-07-16 14:08:05.789935 7f3cc67fc700 5 Searching permissions for group=1 mask=50 -6> 2014-07-16 14:08:05.789937 7f3cc67fc700 5 Permissions for group not found -5> 2014-07-16 14:08:05.789938 7f3cc67fc700 5 Searching permissions for group=2 mask=50 -4> 2014-07-16 14:08:05.789940 7f3cc67fc700 5 Permissions for group not found -3> 2014-07-16 14:08:05.789941 7f3cc67fc700 5 Getting permissions id=test owner=test perm=2 -2> 2014-07-16 14:08:05.789954 7f3cc67fc700 2 req 0:0.000354:s3:PUT /test/testmulti:put_obj:verifying op params -1> 2014-07-16 14:08:05.789958 7f3cc67fc700 2 req 0:0.000358:s3:PUT /test/testmulti:put_obj:executing 0> 2014-07-16 14:08:05.814949 7f3cc67fc700 -1 *** Caught signal (Aborted) ** in thread 7f3cc67fc700 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f) 1: /home/whatever/ceph/src/.libs/lt-radosgw() [0x5c3eba] 2: (()+0xfcb0) [0x7f3d0d147cb0] 3: (gsignal()+0x35) [0x7f3d0c293425] 4: (abort()+0x17b) [0x7f3d0c296b8b] 5: (()+0x7439e) [0x7f3d0c2d139e] 6: (()+0x7eb96) [0x7f3d0c2dbb96] 7: (RGWPutObjProcessor_Aio::pop_pending()+0x26) [0x50c5f6] 8: (RGWPutObjProcessor_Aio::wait_pending_front()+0x9) [0x50c619] 9: (RGWPutObjProcessor_Aio::throttle_data(void*, bool)+0x70) [0x50c6d0] 10: (RGWPutObj::execute()+0x5d1) [0x554521] 11: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c489c] 12: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c4fe7] 13: /home/whatever/ceph/src/.libs/lt-radosgw() [0x4c0684] 14: (()+0x7e9a) [0x7f3d0d13fe9a] 15: (clone()+0x6d) [0x7f3d0c3513fd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 0 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 5/ 5 rgw 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) 99/99 (stderr threshold) max_recent 10000 max_new 1000 log_file --- end dump of recent events --- Aborted (core dumped)
Updated by Sylvain Munaut almost 10 years ago
(gdb) bt #0 0x00007ffff5352425 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007ffff5355b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007ffff539039e in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007ffff539ab96 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #4 0x000000000050c5f6 in deallocate (__p=0x7fff600217d8, this=<optimized out>) at /usr/include/c++/4.6/ext/new_allocator.h:98 #5 _M_put_node (__p=0x7fff600217d8, this=<optimized out>) at /usr/include/c++/4.6/bits/stl_list.h:331 #6 _M_erase (__position=..., this=<optimized out>) at /usr/include/c++/4.6/bits/stl_list.h:1539 #7 pop_front (this=<optimized out>) at /usr/include/c++/4.6/bits/stl_list.h:974 #8 RGWPutObjProcessor_Aio::pop_pending (this=<optimized out>) at rgw/rgw_rados.cc:1095 #9 0x000000000050c619 in RGWPutObjProcessor_Aio::wait_pending_front (this=0x7fff600217a0) at rgw/rgw_rados.cc:1101 #10 0x000000000050c6d0 in RGWPutObjProcessor_Aio::throttle_data (this=0x7fff600217a0, handle=<optimized out>, need_to_wait=true) at rgw/rgw_rados.cc:1136 #11 0x0000000000554521 in RGWPutObj::execute (this=0x7fff60021470) at rgw/rgw_op.cc:1650 #12 0x00000000004c489c in process_request (store=0x876f00, rest=<optimized out>, req=0x7fff6000dae0, client_io=0x7ffef87bfb60, olog=0x0) at rgw/rgw_main.cc:600 #13 0x00000000004c4fe7 in civetweb_callback (conn=0x7fff60009300) at rgw/rgw_main.cc:687 #14 0x00000000004c0684 in handle_request (conn=0x7fff60009300) at civetweb/src/civetweb.c:5560 #15 process_new_connection (conn=0x7fff60009300) at civetweb/src/civetweb.c:6310 ---Type <return> to continue, or q <return> to quit--- #16 worker_thread_run (thread_func_param=0x88ef20) at civetweb/src/civetweb.c:6418 #17 worker_thread (thread_func_param=0x88ef20) at civetweb/src/civetweb.c:6453 #18 0x00007ffff61fee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #19 0x00007ffff54103fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #20 0x0000000000000000 in ?? ()
Updated by Ian Colle almost 10 years ago
- Assignee set to Yehuda Sadeh
- Priority changed from Normal to Urgent
Updated by Sylvain Munaut almost 10 years ago
For more info, I'm doing a multipart upload and it crashes at the last part of the file. Two first part are 5M and the last one is 1M.
In the trace above, I added a debug print in RGWPutObjProcessor_Aio::pop_pending and pending.empty() is true which I think is the issue.
Updated by Yehuda Sadeh almost 10 years ago
Did that happen before (e.g., 0.80.3)? can you add:
debug ms = 1
debug rgw = 20
Updated by Sylvain Munaut almost 10 years ago
Yes, it happens since ea68b9372319fd0bab40856db26528d36359102e as I reported on the ML. (and now realize I forgot to say that in the ticket).
Reverting that commit over 0.80.4 fixes the issue.
This also prevents the crash and makes my test pass :
diff --git a/src/rgw/rgw_rados.cc b/src/rgw/rgw_rados.cc index be8248e..4b4c590 100644 --- a/src/rgw/rgw_rados.cc +++ b/src/rgw/rgw_rados.cc @@ -1132,7 +1132,7 @@ int RGWPutObjProcessor_Aio::throttle_data(void *handle, bool need_to_wait) } size_t orig_size = pending.size(); while (pending_has_completed() - || need_to_wait) { + || (need_to_wait && !pending.empty())) { int r = wait_pending_front(); if (r < 0) return r;
However since I don't really understand what the code really does and all its implications, I have no idea if it's a correct fix or not.
Updated by Yehuda Sadeh almost 10 years ago
Are you by any chance using a non-default chunk size?
Updated by Sylvain Munaut almost 10 years ago
Oh yes, I've raised it to 5M to avoid having each part of a multi-part generate 2 objects in rados.
Updated by Yehuda Sadeh almost 10 years ago
- Status changed from New to Fix Under Review
- Assignee changed from Yehuda Sadeh to Josh Durgin
Updated by Benjamin Gilbert almost 10 years ago
Even with the default chunk size, this can be triggered by performing a multipart upload consisting of a single small part.
Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Resolved
Updated by Sage Weil almost 10 years ago
- Status changed from Resolved to Pending Backport
oops, we still need to do dumpling
Updated by Yehuda Sadeh over 9 years ago
- Status changed from Pending Backport to Resolved