Support #22664


Some random OSDs go down (with an Abort signal on exception) after replacement/rebuild of a defective OSD drive

Added by Enrico Labedzki over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Hello,

currently we are facing strange behavior where some OSDs randomly go down with an Abort signal. As far as we can determine, this only happens during the replacement/rebuild of defective OSDs.

After restarting the down OSDs, everything seems to work fine for 3-5 hours (depending on the misplaced/degraded object count/percentage); then it happens again (or occasionally does not, which is the rare case).

So we collected all log files from the OSDs that went down. All of them appear to be waiting for a heartbeat reply from the OSD under maintenance, but after a while they crashed with the same exception every time.

==================== %< ===========================
...
2018-01-11 12:42:03.900667 7f57878d4700 1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:43.900649)
2018-01-11 12:42:04.900819 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:44.900800)
2018-01-11 12:42:05.292625 7f57667a5700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:45.292617)
2018-01-11 12:42:05.795245 7f57667a5700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:45.795238)
2018-01-11 12:42:05.900963 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:45.900944)
2018-01-11 12:42:06.901098 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:46.901080)
2018-01-11 12:42:07.901818 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:47.901802)
2018-01-11 12:42:08.698679 7f57667a5700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:48.698646)
2018-01-11 12:42:08.901934 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:48.901920)
2018-01-11 12:42:09.902077 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:49.902060)
2018-01-11 12:42:10.902225 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:50.902208)
2018-01-11 12:42:11.902376 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:51.902359)
2018-01-11 12:42:12.902511 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:52.902495)
2018-01-11 12:42:13.401454 7f57667a5700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:53.401440)
2018-01-11 12:42:13.902652 7f57878d4700 -1 osd.105 2424118 heartbeat_check: no reply from 10.10.0.97:6802 osd.291 ever on either front or back, first ping sent 2018-01-11 12:41:26.160994 (cutoff 2018-01-11 12:41:53.902620)
2018-01-11 12:42:16.293007 7f572c08f700 0 -- 10.10.9.10:6821/20479 >> 10.10.9.18:6842/23514 pipe(0x55a8e3366000 sd=849 :33852 s=2 pgs=50 cs=1 l=0 c=0x55a8f2e4c180).fault with nothing to send, going to standby
2018-01-11 12:42:16.294554 7f5726f16700 0 -- 10.10.0.148:0/20479 >> 10.10.0.97:6807/23514 pipe(0x55a8eaae2800 sd=909 :0 s=1 pgs=0 cs=0 l=1 c=0x55a8dea2f900).fault
2018-01-11 12:42:16.296506 7f572e52c700 0 -- 10.10.0.148:0/20479 >> 10.10.9.18:6843/23514 pipe(0x55a8ebb1a000 sd=908 :0 s=1 pgs=0 cs=0 l=1 c=0x55a8e2946480).fault
2018-01-11 12:42:16.489525 7f5770fba700 0 log_channel(cluster) log [INF] : 11.114 continuing backfill to osd.137 from (2309072'767485,2420345'770485] MIN to 2420345'770485
2018-01-11 12:42:16.489549 7f5770fba700 0 log_channel(cluster) log [INF] : 11.114 starting backfill to osd.291 from (0'0,0'0] MAX to 2420345'770485
2018-01-11 12:42:16.513792 7f57328ce700 0 -- 10.10.9.10:6821/20479 >> 10.10.9.18:6842/23514 pipe(0x55a8e3366000 sd=849 :33852 s=1 pgs=50 cs=1 l=0 c=0x55a8f2e4c180).fault
2018-01-11 12:42:39.010489 7f576f7b7700 0 log_channel(cluster) log [INF] : 11.114 continuing backfill to osd.137 from (2309072'767485,2420345'770485] MIN to 2420345'770485
2018-01-11 12:43:11.914106 7f57878d4700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f57878d4700 time 2018-01-11 12:43:11.909228
common/Thread.cc: 160: FAILED assert(ret == 0)

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55a8c6dc349b]
2: (Thread::create(char const*, unsigned long)+0xba) [0x55a8c6da6afa]
3: (SimpleMessenger::connect_rank(entity_addr_t const&, int, PipeConnection*, Message*)+0x172) [0x55a8c6d9c112]
4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x30e) [0x55a8c6d9c78e]
5: (OSDService::get_con_osd_hb(int, unsigned int)+0x245) [0x55a8c673ad55]
6: (OSD::_add_heartbeat_peer(int)+0x97) [0x55a8c6752007]
7: (OSD::maybe_update_heartbeat_peers()+0xba8) [0x55a8c6753088]
8: (OSD::tick()+0x236) [0x55a8c6782c56]
9: (Context::complete(int)+0x9) [0x55a8c6789d59]
10: (SafeTimer::timer_thread()+0xec) [0x55a8c6dac88c]
11: (SafeTimerThread::entry()+0xd) [0x55a8c6dae22d]
12: (()+0x8184) [0x7f579167b184]
13: (clone()+0x6d) [0x7f578f7a4ffd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2018-01-11 12:42:51.413371 7f577f966700 5 -- op tracker -- seq: 339778374, time: 2018-01-11 12:42:51.413371, event: journaled_completion_queued, op: osd_repop(client.211547439.0:753060205 14.2cf 14:f3456c03:::rbd_data.a00d45b238e1f29.0000000000006e2d:head v 2424135'77575210)
-9999> 2018-01-11 12:42:51.413410 7f577d962700 5 -- op tracker -- seq: 339778374, time: 2018-01-11 12:42:51.413409, event: commit_sent, op: osd_repop(client.211547439.0:753060205 14.2cf 14:f3456c03:::rbd_data.a00d45b238e1f29.0000000000006e2d:head v 2424135'77575210)
...
================== %< ===========================
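For context on the backtrace above (our reading of it, not an official analysis): Thread::create() wraps pthread_create() and asserts that the call returned 0, so when the OSD cannot spawn another messenger/heartbeat thread (pthread_create() typically returns EAGAIN once a process or kernel thread limit is exhausted), the failure surfaces as this abort instead of being handled. The following is a minimal standalone sketch of that pattern, illustrative only and not Ceph source; create_or_abort is a made-up helper name.

// Illustrative sketch only -- not Ceph code. Shows why a failed
// pthread_create() can surface as "FAILED assert(ret == 0)":
// the return code is asserted rather than handled, so EAGAIN aborts.
#include <cassert>
#include <cstdio>
#include <cstring>
#include <pthread.h>

static void* worker(void*) {
    return nullptr;  // stand-in for the messenger pipe thread body
}

// Hypothetical helper mirroring the assert-on-failure pattern in the backtrace.
static void create_or_abort(pthread_t* tid) {
    int ret = pthread_create(tid, nullptr, worker, nullptr);
    if (ret != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(ret));  // e.g. EAGAIN
    assert(ret == 0);  // under exhausted limits this aborts the process
}

int main() {
    pthread_t tid;
    create_or_abort(&tid);
    pthread_join(tid, nullptr);
    printf("thread created fine; with exhausted limits the assert would abort\n");
    return 0;
}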

We packed all logs into an archive for further investigation.

In this case we replaced osd.291 on the net-ceph8 node, as you can see from the attached logs.

All our nodes run Ubuntu 16.04 LTS (xenial) with Ceph 10.2.10 (jewel).

If you need more information, please let us know.

Thank you.


Files

osd.logs.7z.001 (1000 KB) Enrico Labedzki, 01/11/2018 02:38 PM
osd.logs.7z.002 (1000 KB) Enrico Labedzki, 01/11/2018 02:39 PM
osd.logs.7z.003 (1000 KB) Enrico Labedzki, 01/11/2018 02:39 PM
osd.logs.7z.004 (728 KB) Enrico Labedzki, 01/11/2018 02:39 PM
Actions #1

Updated by Greg Farnum over 6 years ago

  • Tracker changed from Bug to Support
  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to Closed

It's failing to create a new thread. You probably need to bump the ulimit; this is discussed in the documentation. :)
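A quick way to check whether one of those limits is the bottleneck is something like the following sketch (our own illustration, not from the Ceph documentation; print_proc_value is an illustrative helper). It prints the per-process RLIMIT_NPROC, which is what `ulimit -u` governs, plus the kernel-wide thread and pid caps -- the usual reasons pthread_create() starts returning EAGAIN on busy OSD nodes.

// Hedged diagnostic sketch: prints the limits that commonly cause
// pthread_create() to fail with EAGAIN on Linux. Compare them against the
// thread counts of the running ceph-osd processes.
#include <sys/resource.h>
#include <cstdio>

static void print_proc_value(const char* path) {
    FILE* f = fopen(path, "r");
    if (!f) { printf("%s: (unavailable)\n", path); return; }
    long v = 0;
    if (fscanf(f, "%ld", &v) == 1)
        printf("%s: %ld\n", path, v);
    fclose(f);
}

int main() {
    rlimit lim{};
    if (getrlimit(RLIMIT_NPROC, &lim) == 0)   // same limit `ulimit -u` reports
        printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
               (unsigned long long)lim.rlim_cur,
               (unsigned long long)lim.rlim_max);

    // System-wide caps on tasks/threads; either can be the bottleneck on
    // dense OSD nodes during heavy peering/backfill.
    print_proc_value("/proc/sys/kernel/threads-max");
    print_proc_value("/proc/sys/kernel/pid_max");
    return 0;
}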

Actions #2

Updated by Enrico Labedzki over 6 years ago

Hi Greg,

can you point me to the link? As far as we have seen so far, all ulimits are already 10 times higher than needed on all nodes.

Yesterday we rebooted all affected nodes after an uptime of roughly 2 years. Since yesterday evening no OSDs have gone down, not even during my replacement of a new HDD one hour ago.
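One thing that may explain this pattern (an assumption on our side, not a confirmed diagnosis): a daemon keeps the resource limits it inherited when it was started, so an OSD that has been running for ~2 years may still be living with older, lower limits than what a freshly opened shell's `ulimit -u` shows. The effective values of a running ceph-osd can be read from /proc/<pid>/limits, e.g. with a small sketch like the one below (show_limits and the filtering are just illustrative).

// Hedged sketch: print the effective limits of an already-running process
// (e.g. a ceph-osd), since a long-running daemon keeps whatever limits it
// was started with. Usage: ./show_limits <pid>
#include <cstdio>
#include <cstring>
#include <string>

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    std::string path = std::string("/proc/") + argv[1] + "/limits";
    FILE* f = fopen(path.c_str(), "r");
    if (!f) { perror(path.c_str()); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        // "Max processes" is the per-user task limit that caps thread
        // creation; keep the header line for context.
        if (strncmp(line, "Limit", 5) == 0 || strstr(line, "Max processes"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}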

It seems to be a long-running issue.

Can someone explain this behavior?

Should we reboot the nodes after an upgrade (we didn't do that for any of the past upgrades)?

Thanks
