Bug #4357

osd: FAILED assert("join on thread that was never started" == 0)

Added by Wido den Hollander about 11 years ago. Updated almost 11 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I found #1650, which seems related, but it is rather old and covers a different use case.

I got a message from my monitoring system that the health of a small cluster was not OK. It turned out that all 12 OSDs had gone down with the same backtrace/message:

    -3> 2013-03-05 19:04:18.679948 7f5f37300780 10 -- [2a00:f10:113:0:d585:1138:64c6:be36]:6806/8564 wait: dispatch queue is stopped
    -2> 2013-03-05 19:04:18.679971 7f5f37300780 20 -- [2a00:f10:113:0:d585:1138:64c6:be36]:6806/8564 wait: stopping accepter thread
    -1> 2013-03-05 19:04:18.679984 7f5f37300780 10 accepter.stop accepter
     0> 2013-03-05 19:04:18.683892 7f5f37300780 -1 common/Thread.cc: In function 'int Thread::join(void**)' thread 7f5f37300780 time 2013-03-05 19:04:18.679999
common/Thread.cc: 117: FAILED assert("join on thread that was never started" == 0)

 ceph version 0.56.3-19-g8c6f522 (8c6f52215240f48b5e4d5bb99a5f2f451e7ce70a)
 1: (Thread::join(void**)+0x41) [0x823ee1]
 2: (Accepter::stop()+0x7b) [0x8af5fb]
 3: (SimpleMessenger::wait()+0xa4a) [0x81d6ba]
 4: (main()+0x2282) [0x5733f2]
 5: (__libc_start_main()+0xed) [0x7f5f34f9c76d]
 6: /usr/bin/ceph-osd() [0x575909]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I attached the logs of two OSDs, but I want to stress again that ALL 12 OSDs went down with the same backtrace within about 2 minutes, rendering the cluster unable to do any I/O.
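
For context, the assert fires in Ceph's Thread wrapper when join() is called on a thread whose create() was never run. A minimal sketch of that invariant (a simplified stand-in, not the actual common/Thread.cc code):

    #include <cassert>
    #include <pthread.h>

    // Simplified stand-in for Ceph's Thread wrapper, only to show what the
    // "join on thread that was never started" assert guards against.
    class ThreadSketch {
      pthread_t thread_id = 0;            // stays 0 until create() succeeds
      static void *entry(void *) { return nullptr; }
    public:
      void create() { pthread_create(&thread_id, nullptr, entry, nullptr); }
      int join(void **retval = nullptr) {
        // The real common/Thread.cc aborts here if the thread never started.
        assert(thread_id != 0 && "join on thread that was never started");
        return pthread_join(thread_id, retval);
      }
    };

    int main() {
      ThreadSketch t;
      t.create();                         // comment this out to hit the assert
      return t.join();
    }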

ceph-osd.0.log.1.gz (2.12 MB) Wido den Hollander, 03/06/2013 12:44 AM

ceph-osd.4.log.1.gz (2.07 MB) Wido den Hollander, 03/06/2013 12:44 AM

ceph-osd.10.log.gz - osd.10 logs 02-04-2013 (244 KB) Wido den Hollander, 04/02/2013 08:47 AM

History

#2 Updated by Ian Colle about 11 years ago

  • Assignee set to Greg Farnum

Possible messenger issue?

#3 Updated by Greg Farnum about 11 years ago

Perhaps, but the most common cause of this is startup and shutdown being done in an order that is somehow inappropriate. In particular, this call trace appears when the messenger is shutting down, and that shouldn't happen unless the daemon is going through an orderly shutdown. There were some OSD/messenger interaction changes recently-ish that I haven't had a chance to review; did those do something in this area? (Sam or Sage?)
(At normal priority somebody else is going to need to do due diligence; I don't think I'll have time until next week.)
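
To make the ordering hazard concrete, a hypothetical sketch (illustrative names, not SimpleMessenger's actual code) of a wait() path that tears the accepter down unconditionally, and the kind of guard that avoids joining a thread that was never started:

    #include <thread>

    struct AccepterSketch {
      std::thread listener;
      void start() { listener = std::thread([] { /* accept loop */ }); }
      void stop() {
        // Joining unconditionally when the accepter never started is the
        // situation behind the assert in this bug (with std::thread it would
        // throw instead of asserting); the joinable() check avoids it.
        if (listener.joinable())
          listener.join();
      }
    };

    struct MessengerSketch {
      AccepterSketch accepter;
      void wait() {
        // ... "dispatch queue is stopped" ...
        accepter.stop();    // "stopping accepter thread" in the log above
      }
    };

    int main() {
      MessengerSketch ms;
      ms.wait();            // stop() without start(): safe only due to the guard
    }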

#4 Updated by Wido den Hollander about 11 years ago

I've seen this one again on a 0.56.3 cluster last week.

All 12 OSDs went down with exactly the same backtrace. I didn't get a chance to gather logs or anything else.

I upgraded the cluster to 0.56.4 today and we'll see if it comes up again.

The workload is just one RBD client with 2 VMs running on it (4 disks in total).

#5 Updated by Sage Weil about 11 years ago

  • Assignee deleted (Greg Farnum)

#6 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Need More Info

#7 Updated by Wido den Hollander almost 11 years ago

So I had all my OSDs in this cluster go down again.

It's a 3-node cluster with 12 OSDs in total; see the timestamps:

root@ceph01:~# ls -al /core*
-rw------- 1 root root 360800256 Mar 19 16:43 /core
-rw------- 1 root root 420282368 Apr  2 12:13 /core.ceph01.1090
-rw------- 1 root root 424177664 Apr  2 12:13 /core.ceph01.1166
-rw------- 1 root root 380547072 Apr  2 12:13 /core.ceph01.1267
-rw------- 1 root root 399876096 Apr  2 12:13 /core.ceph01.1342
root@ceph01:~#
root@ceph02:~# ls -al /core*
-rw------- 1 root root 349495296 Mar 19 16:44 /core
-rw------- 1 root root 380661760 Apr  2 12:00 /core.ceph02.1149
-rw------- 1 root root 394264576 Apr  2 12:00 /core.ceph02.1224
-rw------- 1 root root 406724608 Apr  2 12:00 /core.ceph02.1300
-rw------- 1 root root 398086144 Apr  2 12:01 /core.ceph02.1376
root@ceph02:~#

Somehow 'ceph03' didn't produce any core dumps (no idea why), but its logs show the same thing.

All 12 OSDs crashed between 12:00 and 12:15 today.

I attached the logs of osd.10.

I haven't started the OSDs yet since this isn't a production cluster; I'm waiting for feedback about the desired logging levels before starting them up again.

#8 Updated by Wido den Hollander almost 11 years ago

I started all OSDs with "debug ms = 1" since it seems like this is messenger-related.

It will take some time (days) before they crash again.
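
For reference, the setting as it would typically look in ceph.conf; the [osd] section placement here is an assumption:

    [osd]
        debug ms = 1    # messenger-level logging, as enabled above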

#9 Updated by Wido den Hollander almost 11 years ago

I saw this happening again, all 12 OSDs went down with the same backtrace.

What I've started to notice is that it happens at a 7-day interval, always around 11:00 in the morning.

So you'd think something special must be happening at that time, but there is no cron job running or doing anything then.

This cluster just runs one RBD-backed VM, which runs a cron job every hour.

#10 Updated by Sage Weil almost 11 years ago

  • Priority changed from Normal to High

Wido, do you still hit this problem?

#11 Updated by Wido den Hollander almost 11 years ago

No, since the upgrade to Cuttlefish I haven't seen this one again, BUT my cluster did crash yesterday with a different backtrace:

2013-05-22 09:01:40.208758 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.6 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.208778 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.7 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.208795 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.8 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.208812 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.9 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.208828 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.10 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.208850 7fc6532a7700 -1 osd.0 7392 heartbeat_check: no reply from osd.11 since 2013-05-22 08:59:28.154396 (cutoff 2013-05-22 09:01:20.208650)
2013-05-22 09:01:40.519282 7fc64e2ee700  0 -- [2a00:f10:113:0:58f1:f3ee:edc5:2d30]:6801/28919 >> [2a00:f10:113:0:8466:4397:52ab:bae1]:6807/25491 pipe(0x89ad280 sd=27 :37457 s=2 pgs=527 cs=181 l=0).fault with nothing to send, going to standby
2013-05-22 09:01:40.668892 7fc65b2b7700  0 log [WRN] : map e7393 wrongly marked me down
2013-05-22 09:01:40.670420 7fc65b2b7700 -1 accepter.accepter.bind unable to bind to [2a00:f10:113:0:58f1:f3ee:edc5:2d30]:7100 on any port in range 6800-7100: Cannot assign requested address
2013-05-22 09:01:40.682370 7fc65b2b7700 -1 accepter.accepter.bind unable to bind to [2a00:f10:113:0:58f1:f3ee:edc5:2d30]:7100 on any port in range 6800-7100: Cannot assign requested address
2013-05-22 09:01:41.337783 7fc65b2b7700 -1 ./osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int) const' thread 7fc65b2b7700 time 2013-05-22 09:01:41.329589
./osd/OSDMap.h: 351: FAILED assert(is_up(osd))

 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
 1: (OSDService::prepare_to_stop()+0xe1) [0x5ff1a1]
 2: (OSD::shutdown()+0x28) [0x60d1f8]
 3: (OSD::handle_osd_map(MOSDMap*)+0x1212) [0x6442c2]
 4: (OSD::_dispatch(Message*)+0x2fb) [0x646bfb]
 5: (OSD::ms_dispatch(Message*)+0x1d6) [0x6472e6]
 6: (DispatchQueue::entry()+0x3f1) [0x8e01a1]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x82b4dd]
 8: (()+0x7e9a) [0x7fc6682d4e9a]
 9: (clone()+0x6d) [0x7fc666b52ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

All 12 OSDs went down again on a Wednesday morning, around 08:00 ~ 09:00. Same symptoms, different backtrace.
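
For context, the failed assert is in OSDMap::get_inst(), which requires is_up(osd). A hedged sketch (illustrative names, not the real OSD shutdown path) of that precondition and the check that avoids tripping it:

    #include <cassert>
    #include <set>

    struct OsdMapSketch {
      std::set<int> up;                 // OSDs the map currently considers up
      bool is_up(int osd) const { return up.count(osd) != 0; }
      int get_inst(int osd) const {
        assert(is_up(osd));             // ./osd/OSDMap.h:351 in the backtrace
        return osd;                     // stand-in for the real entity_inst_t
      }
    };

    // Asking the map for our own address after being marked down ("wrongly
    // marked me down" above) trips the assert; checking first avoids it.
    int lookup_self(const OsdMapSketch &map, int whoami) {
      if (!map.is_up(whoami))
        return -1;                      // skip instead of asserting
      return map.get_inst(whoami);
    }

    int main() {
      OsdMapSketch map;                 // empty map: this OSD is not up
      return lookup_self(map, 0) == -1 ? 0 : 1;
    }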

#12 Updated by Sage Weil almost 11 years ago

A variation of this I just fixed in c2e262fc9493b4bb22c2b7b4990aa1ee7846940e. But note the following:

2013-05-22 09:01:40.670420 7fc65b2b7700 -1 accepter.accepter.bind unable to bind to [2a00:f10:113:0:58f1:f3ee:edc5:2d30]:7100 on any port in range 6800-7100: Cannot assign requested address
2013-05-22 09:01:40.682370 7fc65b2b7700 -1 accepter.accepter.bind unable to bind to [2a00:f10:113:0:58f1:f3ee:edc5:2d30]:7100 on any port in range 6800-7100: Cannot assign requested address

You should probably increase that range. I'll increase the default 'ms bind port max' to 7300 (from 7100) too.
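
A hedged ceph.conf example of widening that range; the option names are those quoted above, and the minimum shown is assumed to be the stock default:

    [osd]
        ms bind port min = 6800
        ms bind port max = 7300    # raised from the old 7100 default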

#13 Updated by Wido den Hollander almost 11 years ago

  • Status changed from Need More Info to Can't reproduce

I'm closing this one for now. It hasn't popped up anymore; if it does, I'll re-open.
