Bug #4509
mon: on DataHealthService: FAILED assert(!stats.empty())

Added by Joao Eduardo Luis about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Tamil hit this last night.

I have been unable to reproduce it, but the only scenario in which this could happen is if ::statfs() somehow fails. My guess is that it was triggered by some odd behavior in teuthology's upgrade task, but that's just a theory.

Creating the ticket for future reference. I will push a patch shortly that forces the monitor to shut down cleanly if it is unable to obtain stats from its own disk -- if something like that goes wrong, we definitely want to shut down the monitor as soon as possible, before things get worse.

2013-03-19T14:10:22.109 INFO:teuthology.task.ceph.mon.c.err:mon/DataHealthService.cc: In function 'void DataHealthService::share_stats()' thread 7f8310a33700 time 2013-03-19 14:10:22.238341
2013-03-19T14:10:22.109 INFO:teuthology.task.ceph.mon.c.err:mon/DataHealthService.cc: 134: FAILED assert(!stats.empty())
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: ceph version 0.58-754-gb7e2a0d (b7e2a0d464d63ee00e3c39a1376fd7fc4894c938)
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 1: (DataHealthService::share_stats()+0xacb) [0x57ad5b]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 2: (DataHealthService::service_tick()+0x2b8) [0x57b4c8]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 3: (QuorumService::C_Tick::finish(int)+0x17) [0x57bf87]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 4: (Context::complete(int)+0xa) [0x4c1d9a]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 5: (SafeTimer::timer_thread()+0x425) [0x633ff5]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 6: (SafeTimerThread::entry()+0xd) [0x634c2d]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 7: (()+0x7e9a) [0x7f8314f95e9a]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 8: (clone()+0x6d) [0x7f8313545cbd]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-03-19T14:10:22.113 INFO:teuthology.task.ceph.mon.c.err:2013-03-19 14:10:22.239510 7f8310a33700 -1 mon/DataHealthService.cc: In function 'void DataHealthService::share_stats()' thread 7f8310a33700 time 2013-03-19 14:10:22.238341
Actions #1

Updated by Joao Eduardo Luis about 11 years ago

Ran a monitor with its store on dev/foo; mid-execution, moved dev/foo to dev/foo.bar to force statfs(dev/foo) to fail. The monitor shut down cleanly.

10 mon.b@1(probing).data_health(0) service_tick
 0 log [ERR] : update_stats statfs error: (2) No such file or directory
 1 -- 127.0.0.1:6790/0 --> mon.1 127.0.0.1:6790/0 -- log(1 entries) v1 -- ?+0 0x7f3ad805e1e0
 0 log [ERR] : something went wrong obtaining our disk stats: (2) No such file or directory
 1 -- 127.0.0.1:6790/0 <== mon.1 127.0.0.1:6790/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x7f3ad805e1e0 con 0x1b73870
 1 -- 127.0.0.1:6790/0 --> mon.1 127.0.0.1:6790/0 -- log(1 entries) v1 -- ?+0 0x7f3ad8060230
 0 ** Shutdown via Data Health Service **
20 mon.b@1(probing) e1 have connection
10 mon.b@1(probing) e1 do not have session, making new one
-1 mon.b@1(probing) e1 *** Got Signal Interrupt ***
Actions #2

Updated by Ian Colle about 11 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from In Progress to Resolved

This should have been resolved with commit 51d62d325c93a8aa7c93045d2e28b505f1491f2f being merged into master a couple of hours ago.
