Bug #3787

Ceph OSD crashes on ceph tell osd.x

Added by Seb Mel over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I recently set up a small two-node test cluster to test the 0.48.3 -> 0.56.1 upgrade. After upgrading one of the nodes to 0.56.1 (OSD, MON, MDS, RadosGW), we noticed that the Ceph OSD crashes after issuing:
ceph tell osd.0 (where 0 is the upgraded OSD)

-6> 2013-01-11 08:02:33.338507 7f2a40788700 10 monclient: tick
-5> 2013-01-11 08:02:33.338540 7f2a40788700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-01-11 08:02:03.338538)
-4> 2013-01-11 08:02:33.338559 7f2a40788700 10 monclient: renew subs? (now: 2013-01-11 08:02:33.338558; renew after: 2013-01-11 08:04:40.569279) -- no
-3> 2013-01-11 08:02:33.341715 7f2a4bf9f700 5 osd.0 34 tick
-2> 2013-01-11 08:02:33.947731 7f2a3ae75700 1 -- 10.251.46.216:6800/8872 >> :/0 pipe(0x2f49d80 sd=34 :6800 pgs=0 cs=0 l=0).accept sd=34 10.251.46.216:35825/0
-1> 2013-01-11 08:02:33.948104 7f2a43f8f700 1 -- 10.251.46.216:6800/8872 <== client.? 10.251.46.216:0/8958 1 ==== command(tid 1: ) v1 ==== 20+0+0 (2743466195 0 0) 0x31f0e00 con 0x2f506e0
0> 2013-01-11 08:02:33.949617 7f2a3d782700 -1 *** Caught signal (Segmentation fault) **
in thread 7f2a3d782700
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: /usr/bin/ceph-osd() [0x79c2e9]
2: (()+0xeff0) [0x7f2a504b1ff0]
3: (std::string::compare(char const*) const+0x16) [0x7f2a4f7aab66]
4: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x311) [0x5fc741]
5: (OSD::CommandWQ::_process(OSD::Command*)+0x37) [0x64af47]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82b47b]
7: (ThreadPool::WorkThread::entry()+0x10) [0x82dc60]
8: (()+0x68ca) [0x7f2a504a98ca]
9: (clone()+0x6d) [0x7f2a4efd8b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 100000
max_new 1000

The setup is running on a standard Debian Squeeze installation with the packages from http://ceph.com/debian-bobtail/

Interestingly, the Ceph 0.48.3 OSD also crashes when issuing ceph tell osd.1:
-3> 2013-01-11 08:15:33.524493 7faa4ef46700 1 -- 10.58.214.195:6803/9677 <== client.? 10.251.46.216:0/9448 1 ==== command(tid 1: ) v1 ==== 20+0+0 (2743466195 0 0) 0x318a1c0 con 0x298f3c0
-2> 2013-01-11 08:15:33.524524 7faa4ef46700 5 throttle(osd_client_bytes 0x7fff3498e0b0) put 20 (0xb68cc8 -> 0)
-1> 2013-01-11 08:15:33.524530 7faa4ef46700 5 throttle(msgr_dispatch_throttler-client 0x1f879e0) put 20 (0xb68cc8 -> 0)
0> 2013-01-11 08:15:33.525740 7faa49e3b700 -1 *** Caught signal (Segmentation fault) **
in thread 7faa49e3b700

ceph version 0.48.3argonaut (commit:920f82e805efec2cae05b79c155c07df0f3ed5dd)
1: /usr/bin/ceph-osd() [0x707249]
2: (()+0xeff0) [0x7faa5ca56ff0]
3: (std::string::compare(char const*) const+0x16) [0x7faa5bf64b66]
4: (OSD::do_command(Connection*, unsigned long, std::vector<std::string, std::allocator<std::string> >&, ceph::buffer::list&)+0x383) [0x5b6103]
5: (OSD::CommandWQ::_process(OSD::Command*)+0x35) [0x5f9fd5]
6: (ThreadPool::worker()+0x76f) [0x78f61f]
7: (ThreadPool::WorkThread::entry()+0xd) [0x5e5efd]
8: (()+0x68ca) [0x7faa5ca4e8ca]
9: (clone()+0x6d) [0x7faa5b792b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- end dump of recent events ---

Other than that the cluster seems to work fine.
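
Both traces fault in std::string::compare called from OSD::do_command, and the logged message is `command(tid 1: )` with no arguments. That would be consistent with the handler comparing the first word of an empty command vector, though the traces alone do not prove it. Below is a minimal sketch of that failure mode and a guard against it. The function `dispatch_command`, its return codes, and the "version" command are hypothetical and only for illustration; this is not Ceph's code and not the committed fix.

```cpp
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical stand-in for a command handler; NOT Ceph's OSD::do_command.
// Returns 0 on success and -EINVAL when the command is rejected.
int dispatch_command(const std::vector<std::string>& cmd, std::string& out) {
  // Guard: reading cmd[0] from an empty vector is undefined behavior and
  // may show up as a segfault inside std::string::compare.
  if (cmd.empty()) {
    out = "no command given";
    return -EINVAL;
  }
  if (cmd[0] == "version") {
    out = "example-version";  // placeholder reply, assumption for the sketch
    return 0;
  }
  out = "unrecognized command: " + cmd[0];
  return -EINVAL;
}
```

With a guard like this, an argument-less request is rejected with an error message instead of reaching the string comparison.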

History

#1 Updated by Sage Weil over 7 years ago

  • Status changed from New to 12
  • Priority changed from Normal to Urgent

Verified this happens on master. Should be an easy fix. Thanks for the report!

#2 Updated by Ian Colle over 7 years ago

  • Assignee set to Samuel Just

#3 Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Fix Under Review

wip_3787

#4 Updated by Samuel Just over 7 years ago

  • Status changed from Fix Under Review to Resolved

8cf79f252a1bcea5713065390180a36f31d66dfd

#5 Updated by Ian Colle over 7 years ago

Should this be backported to Bobtail?
