Bug #4509
mon: on DataHealthService: FAILED assert(!stats.empty())
Description
Tamil hit this last night.
I have been unable to reproduce it, but the only way this could happen is if ::statfs() somehow fails. My guess is that it was triggered by some odd behavior in teuthology's upgrade task, but that's just a theory.
Creating the ticket for future reference. I'll push a patch shortly that forces the monitor to shut down cleanly if it is unable to obtain stats for its own disk. If something like that goes wrong, we definitely want to shut the monitor down as soon as possible, before things get worse.
2013-03-19T14:10:22.109 INFO:teuthology.task.ceph.mon.c.err:mon/DataHealthService.cc: In function 'void DataHealthService::share_stats()' thread 7f8310a33700 time 2013-03-19 14:10:22.238341
2013-03-19T14:10:22.109 INFO:teuthology.task.ceph.mon.c.err:mon/DataHealthService.cc: 134: FAILED assert(!stats.empty())
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: ceph version 0.58-754-gb7e2a0d (b7e2a0d464d63ee00e3c39a1376fd7fc4894c938)
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 1: (DataHealthService::share_stats()+0xacb) [0x57ad5b]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 2: (DataHealthService::service_tick()+0x2b8) [0x57b4c8]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 3: (QuorumService::C_Tick::finish(int)+0x17) [0x57bf87]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 4: (Context::complete(int)+0xa) [0x4c1d9a]
2013-03-19T14:10:22.110 INFO:teuthology.task.ceph.mon.c.err: 5: (SafeTimer::timer_thread()+0x425) [0x633ff5]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 6: (SafeTimerThread::entry()+0xd) [0x634c2d]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 7: (()+0x7e9a) [0x7f8314f95e9a]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: 8: (clone()+0x6d) [0x7f8313545cbd]
2013-03-19T14:10:22.111 INFO:teuthology.task.ceph.mon.c.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-03-19T14:10:22.113 INFO:teuthology.task.ceph.mon.c.err:2013-03-19 14:10:22.239510 7f8310a33700 -1 mon/DataHealthService.cc: In function 'void DataHealthService::share_stats()' thread 7f8310a33700 time 2013-03-19 14:10:22.238341
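For illustration, the fix described above (report the statfs failure and shut down, rather than carrying an empty stats set into an assert) might look roughly like the sketch below. This is not the actual Ceph patch: DiskStats, get_disk_stats() and data_health_tick() are hypothetical names, and the sketch uses POSIX statvfs() instead of the Linux-specific ::statfs() mentioned in the report. The error message mirrors the one the monitor logs in the follow-up comment.

```cpp
#include <sys/statvfs.h>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical stand-in for the monitor's per-disk statistics.
struct DiskStats {
  uint64_t total_kb = 0;
  uint64_t avail_kb = 0;
  int avail_percent = 0;
};

// Returns 0 on success or a negative errno on failure, so the caller must
// handle the error explicitly instead of an assert tripping later on.
int get_disk_stats(const std::string& path, DiskStats& out) {
  struct statvfs st;
  if (::statvfs(path.c_str(), &st) < 0) {
    return -errno;
  }
  uint64_t frsize = st.f_frsize;  // widen before multiplying
  out.total_kb = st.f_blocks * frsize / 1024;
  out.avail_kb = st.f_bavail * frsize / 1024;
  out.avail_percent =
      out.total_kb ? static_cast<int>(out.avail_kb * 100 / out.total_kb) : 0;
  return 0;
}

// Hypothetical tick handler: returns true if the monitor may keep running,
// false if it should shut down cleanly because its own disk can't be read.
bool data_health_tick(const std::string& store_path) {
  DiskStats stats;
  int r = get_disk_stats(store_path, stats);
  if (r < 0) {
    std::cerr << "something went wrong obtaining our disk stats: (" << -r
              << ") " << std::strerror(-r) << "\n";
    return false;  // caller initiates a clean shutdown
  }
  return true;
}
```

The design point is that the error travels back as a return value to the code that owns the monitor's lifecycle, which can then shut down in an orderly way; an assert aborts the process from wherever the empty stats happen to be noticed.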
Updated by Joao Eduardo Luis about 11 years ago
Ran a monitor with its store on dev/foo; mid-execution, moved dev/foo to dev/foo.bar to force statfs(dev/foo) to fail. The monitor shut down cleanly:
10 mon.b@1(probing).data_health(0) service_tick
0 log [ERR] : update_stats statfs error: (2) No such file or directory
1 -- 127.0.0.1:6790/0 --> mon.1 127.0.0.1:6790/0 -- log(1 entries) v1 -- ?+0 0x7f3ad805e1e0
0 log [ERR] : something went wrong obtaining our disk stats: (2) No such file or directory
1 -- 127.0.0.1:6790/0 <== mon.1 127.0.0.1:6790/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x7f3ad805e1e0 con 0x1b73870
1 -- 127.0.0.1:6790/0 --> mon.1 127.0.0.1:6790/0 -- log(1 entries) v1 -- ?+0 0x7f3ad8060230
0 ** Shutdown via Data Health Service **
20 mon.b@1(probing) e1 have connection
10 mon.b@1(probing) e1 do not have session, making new one
-1 mon.b@1(probing) e1 *** Got Signal Interrupt ***
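The rename trick used in this test can be checked in isolation with a small sketch like the one below. It is only an illustration of why statfs fails with "(2) No such file or directory" once the store directory is moved; statvfs_errno() and simulate_store_move() are hypothetical helpers, the sketch assumes a POSIX system with a writable /tmp, and it uses statvfs() and mkdtemp() rather than anything from the Ceph tree.

```cpp
#include <sys/statvfs.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <string>

// Returns 0 if statvfs() succeeds on `path`, otherwise the errno it set.
int statvfs_errno(const std::string& path) {
  struct statvfs st;
  return ::statvfs(path.c_str(), &st) == 0 ? 0 : errno;
}

// Mimics the manual test: create a store directory, rename it away
// (store -> store.bar), then statvfs() the old path again.
// Returns the errno from the second call (expected: ENOENT), or -1 if
// the setup itself failed.
int simulate_store_move() {
  char tmpl[] = "/tmp/mon-store-XXXXXX";
  char* dir = ::mkdtemp(tmpl);
  if (dir == nullptr) return -1;
  std::string store = dir;
  std::string moved = store + ".bar";

  int before = statvfs_errno(store);  // directory exists: should be 0
  if (std::rename(store.c_str(), moved.c_str()) != 0) {
    ::rmdir(store.c_str());
    return -1;
  }
  int after = statvfs_errno(store);   // old path no longer exists
  ::rmdir(moved.c_str());             // clean up the renamed directory
  return before == 0 ? after : -1;
}
```

Calling simulate_store_move() should yield ENOENT (errno 2), the same error reported in the update_stats log line above.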
Updated by Joao Eduardo Luis about 11 years ago
- Status changed from In Progress to Resolved
This should be resolved by commit 51d62d325c93a8aa7c93045d2e28b505f1491f2f, which was merged into master a couple of hours ago.