Project

General

Profile

Actions

Bug #9845

closed

hung ceph_test_rados_delete_pools_parallel

Added by David Zafman over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 10 (Thread 0x7f1883fff700 (LWP 25061)):
#0  pthread_rwlock_rdlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_rdlock.S:85
#1  0x00007f188c852732 in get_read (this=0x2472780) at ./common/RWLock.h:68
#2  RLocker (lock=..., this=<synthetic pointer>) at ./common/RWLock.h:122
#3  Objecter::tick (this=0x24726c0) at osdc/Objecter.cc:1543
#4  0x00007f188c816489 in Context::complete (this=0x7f1858001290, r=<optimized out>) at ./include/Context.h:64
#5  0x00007f188c8d55ed in SafeTimer::timer_thread (this=0x2472838) at common/Timer.cc:105
#6  0x00007f188c8d81ed in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#7  0x00007f188c477e9a in start_thread (arg=0x7f1883fff700) at pthread_create.c:308
#8  0x00007f188bc8e3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f188eebc780 (LWP 25044)):
#0  0x00007f188c479148 in pthread_join (threadid=139743270532864, thread_return=0x0) at pthread_join.c:89
#1  0x00007f188c9015e2 in Thread::join (this=0x2475230, prval=<optimized out>) at common/Thread.cc:140
#2  0x00007f188c8d728b in SafeTimer::shutdown (this=0x2472838) at common/Timer.cc:77
#3  0x00007f188c8568e0 in Objecter::shutdown (this=0x24726c0) at osdc/Objecter.cc:394
#4  0x00007f188c82006a in librados::RadosClient::shutdown (this=0x2455010) at librados/RadosClient.cc:303
#5  0x00007f188c809919 in rados_shutdown (cluster=0x2455010) at librados/librados.cc:1916
#6  0x000000000040473d in StRadosCreatePool::run (this=0x7fff41ba5960) at test/system/st_rados_create_pool.cc:112
#7  0x000000000040538a in systest_runnable_pthread_helper (arg=<optimized out>) at test/system/systest_runnable.cc:203
#8  0x000000000040544b in SysTestRunnable::start (this=0x7fff41ba5960) at test/system/systest_runnable.cc:102
#9  0x000000000040708b in SysTestRunnable::run_until_finished (runnables=...) at test/system/systest_runnable.cc:174
#10 0x0000000000403957 in main (argc=1, argv=0x7fff41ba5d18) at test/system/rados_delete_pools_parallel.cc:101

I was able to reproduce by adding a 2 second sleep in Objecter::tick().

void Objecter::tick()
{
+ struct timespec ts;
+ ts.tv_sec = 2;
+ ts.tv_nsec = 0;
+ nanosleep(&ts, NULL);
RWLock::RLocker rl(rwlock);
ldout(cct, 10) << "tick" << dendl;

Yup, there is a shutdown race. Thread 1 is waiting for the timer thread while holding the Objecter::rwlock in write mode. The timer thread 10 is in Objecter::tick() trying to get the rwlock in read mode.

Objecter::shutdown() should drop the rwlock so that the timer thread can exit. An assert(tick_event) needs to be done after we know that we aren't shutting down, since it would be NULL in this case.

I don't think this would have happened when safe_callbacks was true. It was set to false in a fix for #9582. See also #9650.


Related issues 2 (0 open2 closed)

Related to Ceph - Bug #9582: librados: segmentation fault on timeoutResolvedSage Weil09/24/2014

Actions
Copied from Ceph - Bug #5925: hung ceph_test_rados_delete_pools_parallelCan't reproduceDavid Zafman08/09/2013

Actions
Actions #1

Updated by David Zafman over 9 years ago

  • Status changed from 12 to Fix Under Review
Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF