Bug #990
osd: PG::replay_queued_ops
% Done: 0%
Regression: No
Severity: 3 - minor
Description
I upgraded my cluster to commit 24caedc8f549eeeba48b2d4a44927ee16e65c42a. After doing so, one of my OSDs crashed with the following backtrace:
(gdb) bt
#0 0x00007f37c6dde7bb in raise () from /lib/libpthread.so.0
#1 0x0000000000620ce3 in reraise_fatal (signum=18688) at common/signal.cc:63
#2 0x0000000000621a0b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3 <signal handler called>
#4 0x00007f37c59aea75 in raise () from /lib/libc.so.6
#5 0x00007f37c59b25c0 in abort () from /lib/libc.so.6
#6 0x00007f37c62648e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7 0x00007f37c6262d16 in ?? () from /usr/lib/libstdc++.so.6
#8 0x00007f37c6262d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9 0x00007f37c6262e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000607caa in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=0x649dd0 "void PG::replay_queued_ops()") at common/assert.cc:86
#11 0x00000000005649f6 in PG::replay_queued_ops (this=0x1011000) at osd/PG.cc:1984
#12 0x00000000004e468f in OSD::activate_pg (this=0xc3b000, pgid=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece. ) at osd/OSD.cc:4903
#13 0x00000000004e62c8 in OSD::check_replay_queue (this=0xc3b000) at osd/OSD.cc:4885
#14 0x000000000051b85d in OSD::tick (this=0xc3b000) at osd/OSD.cc:1754
#15 0x000000000060352b in SafeTimer::timer_thread (this=0xc3b048) at common/Timer.cc:102
#16 0x0000000000605d1d in SafeTimerThread::entry (this=<value optimized out>) at common/Timer.cc:38
#17 0x00007f37c6dd59ca in start_thread () from /lib/libpthread.so.0
#18 0x00007f37c5a6170d in clone () from /lib/libc.so.6
#19 0x0000000000000000 in ?? ()
(gdb)
I upgraded my OSDs machine by machine (four at a time).
When I started the OSD again with full logging, it didn't crash again and the cluster fully recovered.
The core can be found at atom3.ceph.widodh.nl:/core.atom3.18688; logging is available on noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log
History
#1 Updated by Sage Weil almost 13 years ago
- Target version set to v0.28
#2 Updated by Sage Weil almost 13 years ago
- Target version changed from v0.28 to v0.26.1
#3 Updated by Wido den Hollander almost 13 years ago
According to sjust, this should have been fixed in the osd_wip1000 branch, but I just hit the bug again on several OSDs (11 in total).
The backtrace is a bit different (line numbers):
(gdb) bt
#0 0x00007fb35a4fa7bb in raise () from /lib/libpthread.so.0
#1 0x000000000061a493 in reraise_fatal (signum=2306) at common/signal.cc:63
#2 0x000000000061b1bb in handle_fatal_signal (signum=6) at common/signal.cc:110
#3 <signal handler called>
#4 0x00007fb3590caa75 in raise () from /lib/libc.so.6
#5 0x00007fb3590ce5c0 in abort () from /lib/libc.so.6
#6 0x00007fb3599808e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7 0x00007fb35997ed16 in ?? () from /usr/lib/libstdc++.so.6
#8 0x00007fb35997ed43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9 0x00007fb35997ee3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x000000000060034a in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=0x643b10 "void PG::replay_queued_ops()") at common/assert.cc:86
#11 0x0000000000557076 in PG::replay_queued_ops (this=0x2f66000) at osd/PG.cc:1984
#12 0x00000000004e29cf in OSD::activate_pg (this=0x2927000, pgid=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece. ) at osd/OSD.cc:4893
#13 0x00000000004e3998 in OSD::check_replay_queue (this=0x2927000) at osd/OSD.cc:4875
#14 0x000000000051b1a2 in OSD::tick (this=0x2927000) at osd/OSD.cc:1742
#15 0x00000000005fbbcb in SafeTimer::timer_thread (this=0x2927048) at common/Timer.cc:102
#16 0x00000000005fe3bd in SafeTimerThread::entry (this=<value optimized out>) at common/Timer.cc:38
#17 0x00007fb35a4f19ca in start_thread () from /lib/libpthread.so.0
#18 0x00007fb35917d70d in clone () from /lib/libc.so.6
#19 0x0000000000000000 in ?? ()
(gdb)
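To check which frames actually moved between the two backtraces, they can be compared mechanically. The following is a minimal Python sketch and not part of the original report: `parse_frames` and `diff_frames` are hypothetical helper names, and the regular expression assumes gdb's usual `in <function> (<args>) at <file>:<line>` frame layout. The sample strings are excerpts of frames #11 to #14 from the two traces, with the arguments shortened.

```python
import re

# Frame layout assumed here: "... in <function> (<args>) at <file>:<line>".
FRAME_RE = re.compile(r"\bin\s+(\S+)\s+\(.*\)\s+at\s+(\S+):(\d+)", re.S)


def parse_frames(trace):
    """Map function name -> (file, line) for frames that carry a source location."""
    # Frames are located by their "#<n> " markers, so the trace may be
    # one frame per line or flattened onto a single line.
    starts = [m.start() for m in re.finditer(r"#\d+\s", trace)]
    frames = {}
    for begin, end in zip(starts, starts[1:] + [len(trace)]):
        match = FRAME_RE.search(trace[begin:end])
        if match:  # library frames ("from /lib/...") have no file:line and are skipped
            func, path, line = match.groups()
            frames[func] = (path, int(line))  # a repeated function keeps its last frame
    return frames


def diff_frames(first, second):
    """Return the functions whose source location differs between two traces."""
    fa, fb = parse_frames(first), parse_frames(second)
    return {fn: (fa[fn], fb[fn]) for fn in fa if fn in fb and fa[fn] != fb[fn]}


# Excerpts of the two backtraces in this report (arguments shortened).
first = (
    "#11 0x00000000005649f6 in PG::replay_queued_ops (this=0x1011000) at osd/PG.cc:1984 "
    "#12 0x00000000004e468f in OSD::activate_pg (this=0xc3b000) at osd/OSD.cc:4903 "
    "#13 0x00000000004e62c8 in OSD::check_replay_queue (this=0xc3b000) at osd/OSD.cc:4885 "
    "#14 0x000000000051b85d in OSD::tick (this=0xc3b000) at osd/OSD.cc:1754"
)
second = (
    "#11 0x0000000000557076 in PG::replay_queued_ops (this=0x2f66000) at osd/PG.cc:1984 "
    "#12 0x00000000004e29cf in OSD::activate_pg (this=0x2927000) at osd/OSD.cc:4893 "
    "#13 0x00000000004e3998 in OSD::check_replay_queue (this=0x2927000) at osd/OSD.cc:4875 "
    "#14 0x000000000051b1a2 in OSD::tick (this=0x2927000) at osd/OSD.cc:1742"
)

for func, (old, new) in sorted(diff_frames(first, second).items()):
    print(func, old, new)
```

On these excerpts only the osd/OSD.cc frames are reported as changed. The failing assertion sits at osd/PG.cc:1984 in both traces.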
#4 Updated by Sage Weil almost 13 years ago
- Target version changed from v0.26.1 to v0.27.1
#5 Updated by Sage Weil over 12 years ago
- Status changed from New to Closed