Bug #5925 (closed): hung ceph_test_rados_delete_pools_parallel

Added by Sage Weil over 10 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-08-09T02:08:24.034 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7718]: created object 0...
2013-08-09T02:08:24.034 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7718]: created object 25...
2013-08-09T02:08:24.034 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7718]: created object 49...
2013-08-09T02:08:24.034 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7718]: finishing.
2013-08-09T02:08:24.035 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7658)
2013-08-09T02:08:24.035 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7659)
2013-08-09T02:08:24.035 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: *******************************
2013-08-09T02:08:24.036 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7718)
2013-08-09T02:08:24.036 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7719)
2013-08-09T02:08:24.036 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: *******************************
2013-08-09T02:08:24.036 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: ******* SUCCESS **********
2013-08-09T02:08:24.036 INFO:teuthology.task.workunit.client.0.err:[10.214.132.36]: + ceph_test_rados_delete_pools_parallel
2013-08-09T02:08:25.774 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_1_[7762]: creating pool foo
2013-08-09T02:08:25.775 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_1_[7762]: created object 0...
2013-08-09T02:08:25.775 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_1_[7762]: created object 25...
2013-08-09T02:08:25.775 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_1_[7762]: created object 49...
2013-08-09T02:08:25.775 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_1_[7762]: finishing.
2013-08-09T02:08:27.270 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_2_[7763]: deleting pool foo
2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7762)
2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7763)
2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: *******************************
2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7799]: creating pool foo
2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7799]: rados_write error -2
2013-08-09T02:08:27.289 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7762)
2013-08-09T02:08:27.289 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7763)
2013-08-09T02:08:27.289 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: *******************************
2013-08-09T02:08:27.290 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: waitpid(7799)
2013-08-09T02:08:27.290 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: test2: got error: run_until_finished: runnable process_3: got error: process_3 returned exit_status 254
2013-08-09T02:08:40.198 INFO:teuthology.task.mon_thrash.mon_thrasher:reviving mon.i
2013-08-09T02:08:40.199 INFO:teuthology.task.ceph.mon.i:Restarting
2013-08-09T02:08:40.199 DEBUG:teuthology.orchestra.run:Running [10.214.132.36]: '/home/ubuntu/cephtest/100911/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/100911/archive/coverage sudo /home/ubuntu/cephtest/100911/daemon-helper kill ceph-mon -f -i i'
2013-08-09T02:08:40.203 INFO:teuthology.task.ceph.mon.i:Started

And then we thrash forever; not sure why this is stuck.

Also, I noticed a couple of processes are still running on the host.

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-08-09_01:00:22-rados-next-testing-basic-plana/100911


Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #9582: librados: segmentation fault on timeout (Resolved, Sage Weil, 09/24/2014)
Copied to Ceph - Bug #9845: hung ceph_test_rados_delete_pools_parallel (Resolved, David Zafman, 08/09/2013)

Actions #1

Updated by David Zafman over 10 years ago

This looks like a test case race.

Two processes are hung while the mon thrasher keeps running:
ubuntu 7800 1 0 02:09 ? 00:00:02 ceph_test_rados_delete_pools_parallel
ubuntu 7801 1 0 02:09 ? 00:00:02 ceph_test_rados_delete_pools_parallel

Thread 13 (LWP 7800):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1 0x00000000004050b2 in CrossProcessSem::wait (this=0x243a960) at test/system/cross_process_sem.cc:63
#2 0x0000000000404a22 in StRadosDeletePool::run (this=0x7fff684d8e90) at test/system/st_rados_delete_pool.cc:46
#3 0x0000000000405397 in systest_runnable_pthread_helper (arg=0x7fff684d8e90) at test/system/systest_runnable.cc:202
#4 0x000000000040542b in SysTestRunnable::start (this=0x7fff684d8e90) at test/system/systest_runnable.cc:102
#5 0x000000000040706b in SysTestRunnable::run_until_finished (runnables=...) at test/system/systest_runnable.cc:174
#6 0x0000000000403987 in main (argc=1, argv=0x7fff684d9078) at test/system/rados_delete_pools_parallel.cc:101

int StRadosDeletePool::run() {
  rados_t cl;
  RETURN1_IF_NONZERO(rados_create(&cl, NULL));
  rados_conf_parse_argv(cl, m_argc, m_argv);
  RETURN1_IF_NONZERO(rados_conf_read_file(cl, NULL));
  rados_conf_parse_env(cl, NULL);
  RETURN1_IF_NONZERO(rados_connect(cl));
  m_pool_setup_sem->wait();   // <<<< thread from 7800 waiting here

Thread 13 (LWP 7801):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1 0x00000000004050b2 in CrossProcessSem::wait (this=0x243a940) at test/system/cross_process_sem.cc:63
#2 0x0000000000404d06 in StRadosListObjects::run (this=0x7fff684d8db0) at test/system/st_rados_list_objects.cc:61
#3 0x0000000000405397 in systest_runnable_pthread_helper (arg=0x7fff684d8db0) at test/system/systest_runnable.cc:202
#4 0x000000000040542b in SysTestRunnable::start (this=0x7fff684d8db0) at test/system/systest_runnable.cc:102
#5 0x000000000040706b in SysTestRunnable::run_until_finished (runnables=...) at test/system/systest_runnable.cc:174
#6 0x0000000000403987 in main (argc=1, argv=0x7fff684d9078) at test/system/rados_delete_pools_parallel.cc:101

int StRadosListObjects::run() {
  rados_t cl;
  RETURN1_IF_NONZERO(rados_create(&cl, NULL));
  rados_conf_parse_argv(cl, m_argc, m_argv);
  RETURN1_IF_NONZERO(rados_conf_read_file(cl, NULL));
  rados_conf_parse_env(cl, NULL);
  RETURN1_IF_NONZERO(rados_connect(cl));
  m_pool_setup_sem->wait();   // <<<< thread from 7801 waiting here

Actions #2

Updated by David Zafman over 10 years ago

The pool-create process died after getting a write error (-2, i.e. -ENOENT):


2013-08-09T02:08:27.288 INFO:teuthology.task.workunit.client.0.out:[10.214.132.36]: process_3_[7799]: rados_write error -2

Around this time PGs were still being created presumably for the newly created pool:

2013-08-09 02:08:27.844542 mon.0 10.214.132.35:6789/0 1247 : [INF] pgmap v575: 104 pgs: 16 creating, 88 active+clean; 9646 bytes data, 892 MB used, 2792 GB / 2793 GB avail
2013-08-09 02:08:29.799293 mon.0 10.214.132.35:6789/0 1248 : [INF] pgmap v576: 104 pgs: 16 creating, 88 active+clean; 9646 bytes data, 893 MB used, 2792 GB / 2793 GB avail

I tried adding a 30 second delay to my own OSD to increase the time it takes for a new pool's PGs to be created, but a write following a new pool creation hangs until PG creation finishes.

Actions #3

Updated by Ian Colle over 10 years ago

  • Priority changed from Urgent to Immediate
Actions #4

Updated by Ian Colle over 10 years ago

  • Priority changed from Immediate to High
Actions #5

Updated by Sage Weil over 10 years ago

  • Priority changed from High to Normal
Actions #6

Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce
Actions #7

Updated by Joao Eduardo Luis over 9 years ago

I've been reproducing this reliably with wip-9321.giant. Hung job in plana12.

Thread 13 (Thread 0x7f188a2d5700 (LWP 25047)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188cb0fdab in ceph::log::Log::entry (this=0x2452540) at log/Log.cc:345
#2  0x00007f188c477e9a in start_thread (arg=0x7f188a2d5700) at pthread_create.c:308
#3  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 12 (Thread 0x7f1889387700 (LWP 25052)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f188c8f84a6 in WaitUntil (when=..., mutex=..., this=0x246ec58) at ./common/Cond.h:71
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x246ec58) at ./common/Cond.h:79
#3  CephContextServiceThread::entry (this=0x246ebe0) at common/ceph_context.cc:58
#4  0x00007f188c477e9a in start_thread (arg=0x7f1889387700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 11 (Thread 0x7f1888b86700 (LWP 25055)):
#0  0x00007f188bc82a43 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f188c8bcffb in AdminSocket::entry (this=0x2452b30) at common/admin_socket.cc:239
#2  0x00007f188c477e9a in start_thread (arg=0x7f1888b86700) at pthread_create.c:308
#3  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

Thread 10 (Thread 0x7f1883fff700 (LWP 25061)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:85
#1  0x00007f188c852732 in get_read (this=0x2472780) at ./common/RWLock.h:68
#2  RLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:122
#3  Objecter::tick (this=0x24726c0) at osdc/Objecter.cc:1543
#4  0x00007f188c816489 in Context::complete (this=0x7f1858001290, r=<optimized out>) at ./include/Context.h:64
#5  0x00007f188c8d55ed in SafeTimer::timer_thread (this=0x2472838) at common/Timer.cc:105
#6  0x00007f188c8d81ed in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#7  0x00007f188c477e9a in start_thread (arg=0x7f1883fff700) at pthread_create.c:308
#8  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f18837fe700 (LWP 25063)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188c9e8358 in Wait (mutex=..., this=0x2472220) at ./common/Cond.h:55
#2  DispatchQueue::entry (this=0x24721b8) at msg/simple/DispatchQueue.cc:196
#3  0x00007f188ca5fa7d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:104
#4  0x00007f188c477e9a in start_thread (arg=0x7f18837fe700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f1882ffd700 (LWP 25065)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188c9e9d5b in Wait (mutex=..., this=0x24723c0) at ./common/Cond.h:55
#2  DispatchQueue::run_local_delivery (this=0x24721b8) at msg/simple/DispatchQueue.cc:114
#3  0x00007f188ca5fa5d in DispatchQueue::LocalDeliveryThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:118
#4  0x00007f188c477e9a in start_thread (arg=0x7f1882ffd700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f18827fc700 (LWP 25067)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188ca59f49 in Wait (mutex=..., this=0x2472628) at ./common/Cond.h:55
#2  SimpleMessenger::reaper_entry (this=0x2472000) at msg/simple/SimpleMessenger.cc:211
#3  0x00007f188ca604dd in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/simple/SimpleMessenger.h:199
#4  0x00007f188c477e9a in start_thread (arg=0x7f18827fc700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f1881ffb700 (LWP 25073)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1  0x00007f188c8d57fb in WaitUntil (when=..., mutex=..., this=0x2455230) at common/Cond.h:71
#2  SafeTimer::timer_thread (this=0x2455220) at common/Timer.cc:114
#3  0x00007f188c8d81ed in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#4  0x00007f188c477e9a in start_thread (arg=0x7f1881ffb700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f18817fa700 (LWP 25074)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188c8d9360 in Wait (mutex=..., this=0x2455348) at ./common/Cond.h:55
#2  Finisher::finisher_thread_entry (this=0x24552e8) at common/Finisher.cc:81
#3  0x00007f188c477e9a in start_thread (arg=0x7f18817fa700) at pthread_create.c:308
#4  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f1888385700 (LWP 30639)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188ca49852 in Wait (mutex=..., this=0x7f1878025048) at ./common/Cond.h:55
#2  Pipe::writer (this=0x7f1878020e20) at msg/simple/Pipe.cc:1795
#3  0x00007f188ca55aed in Pipe::Writer::entry (this=<optimized out>) at msg/simple/Pipe.h:62
#4  0x00007f188c477e9a in start_thread (arg=0x7f1888385700) at pthread_create.c:308
#5  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f18806f7700 (LWP 30640)):
#0  0x00007f188bc82a43 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f188ca3d9e7 in Pipe::tcp_read_wait (this=<optimized out>) at msg/simple/Pipe.cc:2351
#2  0x00007f188ca3e068 in Pipe::tcp_read (this=0x7f1878020e20, buf=0x7f18806f6e1a "\377\214\030\177", len=1) at msg/simple/Pipe.cc:2321
#3  0x00007f188ca50a50 in Pipe::reader (this=0x7f1878020e20) at msg/simple/Pipe.cc:1480
#4  0x00007f188ca55b0d in Pipe::Reader::entry (this=<optimized out>) at msg/simple/Pipe.h:50
#5  0x00007f188c477e9a in start_thread (arg=0x7f18806f7700) at pthread_create.c:308
#6  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f1880ff9700 (LWP 30641)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f188ca39d37 in Wait (mutex=..., this=0x7f187c002480) at ./common/Cond.h:55
#2  Pipe::DelayedDelivery::entry (this=0x7f187c0023b0) at msg/simple/Pipe.cc:209
#3  0x00007f188c477e9a in start_thread (arg=0x7f1880ff9700) at pthread_create.c:308
#4  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f188eebc780 (LWP 25044)):
#0  0x00007f188c479148 in pthread_join (threadid=139743270532864, thread_return=0x0) at pthread_join.c:89
#1  0x00007f188c9015e2 in Thread::join (this=0x2475230, prval=<optimized out>) at common/Thread.cc:140
#2  0x00007f188c8d728b in SafeTimer::shutdown (this=0x2472838) at common/Timer.cc:77
#3  0x00007f188c8568e0 in Objecter::shutdown (this=0x24726c0) at osdc/Objecter.cc:394
#4  0x00007f188c82006a in librados::RadosClient::shutdown (this=0x2455010) at librados/RadosClient.cc:303
#5  0x00007f188c809919 in rados_shutdown (cluster=0x2455010) at librados/librados.cc:1916
#6  0x000000000040473d in StRadosCreatePool::run (this=0x7fff41ba5960) at test/system/st_rados_create_pool.cc:112
#7  0x000000000040538a in systest_runnable_pthread_helper (arg=<optimized out>) at test/system/systest_runnable.cc:203
#8  0x000000000040544b in SysTestRunnable::start (this=0x7fff41ba5960) at test/system/systest_runnable.cc:102
#9  0x000000000040708b in SysTestRunnable::run_until_finished (runnables=...) at test/system/systest_runnable.cc:174
#10 0x0000000000403957 in main (argc=1, argv=0x7fff41ba5d18) at test/system/rados_delete_pools_parallel.cc:101
Actions #8

Updated by David Zafman over 9 years ago

  • Status changed from Can't reproduce to 12 (Verified)
Actions #9

Updated by David Zafman over 9 years ago

Yup, there is a shutdown race. Thread 1 is waiting for the timer thread while holding the Objecter::rwlock in write mode. The timer thread is in Objecter::tick() trying to get the rwlock in read mode.

Objecter::shutdown() should drop the rwlock so that the timer thread can exit. An assert(tick_event) needs to be done after we know that we aren't shutting down, since it would be NULL in this case.
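
For illustration, here is a standalone sketch of the ordering described above. This is not the real Objecter/SafeTimer code: the names (DemoObjecter, timer_thread) are invented, and std::shared_mutex stands in for Objecter::rwlock. The point is only that the shutdown path must drop its write lock before waiting for the timer thread, because the periodic callback takes the same lock for reading.

// Hypothetical sketch, not Ceph code.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

struct DemoObjecter {
  std::shared_mutex rwlock;        // stands in for Objecter::rwlock
  std::mutex timer_lock;
  std::condition_variable cond;
  bool stopping = false;
  std::thread timer;

  void timer_thread() {
    std::unique_lock<std::mutex> tl(timer_lock);
    while (!stopping) {
      tl.unlock();
      {
        std::shared_lock<std::shared_mutex> rl(rwlock);  // reader, like Objecter::tick()
        // periodic work under the read lock
      }
      tl.lock();
      cond.wait_for(tl, std::chrono::milliseconds(10));
    }
  }

  void start() { timer = std::thread(&DemoObjecter::timer_thread, this); }

  void shutdown() {
    {
      std::unique_lock<std::shared_mutex> wl(rwlock);    // writer, like Objecter::shutdown()
      // tear down state that needs the write lock
    }  // <-- write lock released before the timer thread is stopped and joined
    {
      std::lock_guard<std::mutex> tl(timer_lock);
      stopping = true;
      cond.notify_all();
    }
    timer.join();
    // Joining while still holding the write lock can hang: the timer thread may
    // be blocked acquiring the read lock and never reach the stop check, which
    // is the deadlock visible in the backtraces above.
  }
};

int main() {
  DemoObjecter o;
  o.start();
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  o.shutdown();
  return 0;
}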

Actions #10

Updated by David Zafman over 9 years ago

I don't think this would have happened when safe_callbacks was true. It was set to false in a fix for #9582. See also #9650.

Actions #11

Updated by David Zafman over 9 years ago

  • Status changed from 12 (Verified) to Can't reproduce

Filed #9845 to describe the recent occurrence. This bug was probably something else, so I'm setting it back to "Can't reproduce".

Actions #12

Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from Can't reproduce to 12 (Verified)

Hit this again while testing a mon patch. Setting this back to 'Verified' until I check with David or Sam on what to do and on whether my findings hold any value.

Took a closer look and it seems to be a deadlock.

Disregarding all the noise from other threads, these are the two threads that seem relevant:

Thread 9 (Thread 0x7f9676067700 (LWP 7219)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:87
#1  0x00007f967b060ac0 in RWLock::get_read (this=0x1cb0780) at ./common/RWLock.h:68
#2  0x00007f967b088a58 in RLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:122
#3  Objecter::tick (this=0x1cb06c0) at osdc/Objecter.cc:1543
#4  0x00007f967b04bf39 in Context::complete (this=0x1cba5f0, r=<optimized out>) at ./include/Context.h:64
#5  0x00007f967b10a38c in SafeTimer::timer_thread (this=0x1cb0838) at common/Timer.cc:105
#6  0x00007f967b10c48d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#7  0x00007f967ac47182 in start_thread (arg=0x7f9676067700) at pthread_create.c:312
#8  0x00007f967a459fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f967d781840 (LWP 7207)):
#0  0x00007f967ac4866b in pthread_join (threadid=140284201957120, thread_return=0x0) at pthread_join.c:92
#1  0x00007f967b134f12 in Thread::join (this=0x1cb3230, prval=prval@entry=0x0) at common/Thread.cc:140
#2  0x00007f967b108a58 in SafeTimer::shutdown (this=this@entry=0x1cb0838) at common/Timer.cc:77
#3  0x00007f967b08c14c in Objecter::shutdown (this=0x1cb06c0) at osdc/Objecter.cc:394
#4  0x00007f967b057e12 in librados::RadosClient::shutdown (this=this@entry=0x1c93a50) at librados/RadosClient.cc:303
#5  0x00007f967b03233e in rados_shutdown (cluster=0x1c93a50) at librados/librados.cc:1916
#6  0x0000000000404653 in StRadosCreatePool::run (this=0x7fff201a42f0) at test/system/st_rados_create_pool.cc:112
#7  0x000000000040534a in systest_runnable_pthread_helper (arg=arg@entry=0x7fff201a42f0) at test/system/systest_runnable.cc:203
#8  0x000000000040540b in SysTestRunnable::start (this=0x7fff201a42f0) at test/system/systest_runnable.cc:102
#9  0x0000000000406d3a in SysTestRunnable::run_until_finished (runnables=...) at test/system/systest_runnable.cc:174
#10 0x0000000000403897 in main (argc=1, argv=0x7fff201a44f8) at test/system/rados_delete_pools_parallel.cc:101

python tells me that 'threadid=140284201957120' is in fact Thread 0x7f9676067700 (i.e., Thread 9):

>>> hex(140284201957120)
'0x7f9676067700'

So we have Thread 1 waiting on Thread 9 to join. Thread 9, for its part, is waiting to grab a read lock in Objecter::tick().

The lock in question is 'Objecter::rwlock', which gdb tells me has one writer and one queued reader. The reader is Thread 9 and the writer is pid '7207' (or LWP 7207, Thread 1):

(gdb) p this->rwlock
$4 = (RWLock) {
  _vptr.RWLock = 0x7f967b6c9eb0 <vtable for RWLock+16>, 
  L = {
    __data = {
      __lock = 0, 
      __nr_readers = 0, 
      __readers_wakeup = 1, 
      __writer_wakeup = 0, 
      __nr_readers_queued = 1, 
      __nr_writers_queued = 0, 
      __writer = 7207, 
      __shared = 0, 
      __pad1 = 0, 
      __pad2 = 0, 
      __flags = 0
    }, 
    __size = "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000'\034", '\000' <repeats 29 times>, 
    __align = 0
  }, 
  name = 0x7f967b36940e "Objecter::rwlock", 
  id = -1, 
  nrlock = {
    val = 0
  }, 
  nwlock = {
    val = 1
  }
}

This is due to Thread 1 having called Objecter::shutdown() as part of the shutdown procedure. The first thing Objecter::shutdown() does is grab 'rwlock' for writing. Later in the code it calls 'timer.shutdown()', which waits for the timer thread to finish.

What is still puzzling me, though, is that Objecter::shutdown() grabs the timer lock before calling SafeTimer::shutdown(), which in turn cancels all pending events prior to joining the thread. Canceling events means we don't run anything left in the queue; we simply erase it. Also, when dispatching an event, SafeTimer::timer_thread() grabs the timer lock, takes the context out of the queue, relinquishes the lock and only then completes the Context. This also means that, when the Context from Thread 9 was run, the timer lock was still available to be taken, so it must have started running before Objecter::shutdown() grabbed the timer lock.

So the flow would/could have been:

- Objecter::shutdown():
  - grabs 'rwlock' for writing
  - does stuff

- SafeTimer::timer_thread()
  - grabs timer lock
  - figures out Context C to run
  - relinquishes timer lock
  - C->complete(0)
    - Objecter::tick()
      - tries to grab 'rwlock' for reading
      - blocks waiting for 'rwlock' to become available

- Objecter::shutdown():
  - grabs timer lock
  - timer.shutdown()
    - SafeTimer::shutdown()
      - timer thread->join()

And we're in a pickle.
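
To make the middle step concrete, here is a minimal standalone sketch (invented code, not the real SafeTimer) of the dispatch order in the flow above: the timer thread removes the event from its schedule under the timer lock, releases the lock, and only then runs the callback. That is what leaves the timer lock free for the shutdown path to take, even while the tick callback is still blocked on 'rwlock'.

// Hypothetical sketch of "dispatch the callback outside the timer lock".
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
  std::mutex timer_lock;
  std::condition_variable cond;
  std::queue<std::function<void()>> schedule;
  bool stopping = false;

  std::thread timer([&] {
    std::unique_lock<std::mutex> l(timer_lock);
    while (!stopping) {
      if (!schedule.empty()) {
        auto cb = std::move(schedule.front());
        schedule.pop();          // event is no longer in the schedule
        l.unlock();              // timer lock released before the callback runs
        cb();                    // if cb blocks here (as Objecter::tick does on
        l.lock();                // rwlock), the timer lock itself stays free
      } else {
        cond.wait(l);
      }
    }
  });

  {
    std::lock_guard<std::mutex> l(timer_lock);
    schedule.push([] { std::this_thread::sleep_for(std::chrono::milliseconds(200)); });
    cond.notify_one();
  }
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let dispatch begin

  {
    std::lock_guard<std::mutex> l(timer_lock);  // acquired while the callback is still in flight
    std::cout << "took the timer lock while the callback was running" << std::endl;
    stopping = true;
    cond.notify_one();
  }
  timer.join();
  return 0;
}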

Now the funny thing is that Objecter::shutdown() checks whether 'tick_event' is set and, if so, cancels the event and sets 'tick_event = NULL'. This has me believing that we may have run into this situation before we reached that check, or that by the time we reached it it no longer mattered because the tick context was already running.

Actions #13

Updated by Joao Eduardo Luis over 9 years ago

My last statement about the tick event was inaccurate.

gdb tells me that 'tick_event' is still set by the time we issue timer.shutdown() in Objecter::shutdown():

(gdb) p this->tick_event
$8 = (Objecter::C_Tick *) 0x1cba5f0

which happens to be the same Context being run by Thread 9:

Thread 9 (Thread 0x7f9676067700 (LWP 7219)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:87
#1  0x00007f967b060ac0 in RWLock::get_read (this=0x1cb0780) at ./common/RWLock.h:68
#2  0x00007f967b088a58 in RLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:122
#3  Objecter::tick (this=0x1cb06c0) at osdc/Objecter.cc:1543
#4  0x00007f967b04bf39 in Context::complete (this=0x1cba5f0, r=<optimized out>) at ./include/Context.h:64

-- 8< --

Note that 'Context::complete (this=0x1cba5f0)' matches Objecter::tick_event above.

And taking into consideration that in Objecter::shutdown() we see:

if (tick_event) {
  Mutex::Locker l(timer_lock);
  if (timer.cancel_event(tick_event))
    tick_event = NULL;
}

So we only set 'tick_event' to NULL if we manage to cancel the event. SafeTimer::cancel_event() will return false if the event is not found, which is to be expected if the Context is already running. This means that by the time we reach this if-branch the tick Context has already started running, but it must have started after we entered Objecter::shutdown().
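
A tiny standalone illustration (names invented) of why a cancel that works by erasing the event from the schedule cannot stop a callback the timer thread has already dequeued: once dequeued, the event is simply not found, so the cancel reports failure and 'tick_event' is left set even though the callback is in flight.

// Hypothetical sketch of cancel-by-erase racing an in-flight callback.
#include <cassert>
#include <functional>
#include <map>

int main() {
  std::map<int, std::function<void()>> schedule;   // event id -> callback
  schedule[1] = [] { /* Objecter::tick-like work */ };

  // The timer thread dequeues event 1 and starts running it (outside any lock).
  auto in_flight = schedule[1];
  schedule.erase(1);

  // The shutdown path now tries to cancel event 1, mirroring the quoted check:
  // the cancel returns false because the event is no longer scheduled.
  auto cancel_event = [&](int id) { return schedule.erase(id) > 0; };
  bool cancelled = cancel_event(1);
  assert(!cancelled);   // cancellation is too late; the callback is already running
  (void)in_flight;
  return 0;
}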

Actions #14

Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from 12 (Verified) to Can't reproduce

And then I read David's comments on this ticket and felt dumb.

Actions #15

Updated by Joao Eduardo Luis over 9 years ago

This was fun, though.

I'll stop with the noise now and test this with the patch from #9845.
