Bug #46071

open

potential rocksdb failure: few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph

Added by Prayank Saxena almost 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The data node went down abruptly due to an issue with its SPS-BD Smart Array PCIe SAS Expander. Once the hardware was replaced, the node came back up and 19 of its 22 OSDs also came up. The remaining three OSDs did not come up properly, and the logs below are generated in /var/log/ceph/ceph-osd.<id>.log:

-13> 2020-06-18 09:33:55.909586 7f43e1931e00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1592472835909582, "job": 1, "event": "recovery_started", "log_files": [16448]}
-12> 2020-06-18 09:33:55.909590 7f43e1931e00 4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_open.cc:482] Recovering log #16448 mode 2
-11> 2020-06-18 09:33:55.909653 7f43e1931e00 4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/version_set.cc:2395] Creating manifest 16450
-10> 2020-06-18 09:33:55.910367 7f43e1931e00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1592472835910364, "job": 1, "event": "recovery_finished"}
-9> 2020-06-18 09:33:55.910524 7f43e1931e00 5 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_files.cc:307] [JOB 2] Delete /var/lib/ceph/osd/ceph-136/current/omap//MANIFEST-016447 type=3 #16447 - OK
-8> 2020-06-18 09:33:55.910545 7f43e1931e00  5 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_files.cc:307] [JOB 2] Delete /var/lib/ceph/osd/ceph-136/current/omap//016448.log type=0 #16448 - OK
-7> 2020-06-18 09:33:55.911557 7f43e1931e00  4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_open.cc:1063] DB pointer 0x561931c26000
-6> 2020-06-18 09:33:55.914731 7f43e1931e00 0 filestore(/var/lib/ceph/osd/ceph-136) mount(1757): enabling WRITEAHEAD journal mode: checkpoint is not enabled
-5> 2020-06-18 09:33:55.920586 7f43e1931e00 2 journal open /var/lib/ceph/osd/ceph-136/journal fsid 287094cc-90f3-4618-b5c1-2e2b323eefd0 fs_op_seq 440674854
-4> 2020-06-18 09:33:55.920674 7f43e1931e00 1 journal _open /var/lib/ceph/osd/ceph-136/journal fd 33: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
-3> 2020-06-18 09:33:55.921597 7f43e1931e00 2 journal read_entry 645660672 : seq 440674860 177594 bytes
-2> 2020-06-18 09:33:55.921632 7f43e1931e00 -1 journal do_read_entry(-1): bad header magic
-1> 2020-06-18 09:33:55.921636 7f43e1931e00 -1 journal Unable to read past sequence 440674855 but header indicates the journal has committed up through 440674859, journal is corrupt
0> 2020-06-18 09:33:55.923445 7f43e1931e00 -1 *** Caught signal (Aborted) **
in thread 7f43e1931e00 thread_name:ceph-osd

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa74234) [0x5619270f2234]
2: (()+0x11390) [0x7f43dfe14390]
3: (gsignal()+0x38) [0x7f43dedaf428]
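For what it's worth, a quick way to see which OSDs on this node hit the same failure is to scan their logs for the two corruption markers shown above ("bad header magic" and "journal is corrupt"). Below is only a minimal sketch, assuming Python 3 on the node and the default Luminous log location /var/log/ceph/; the paths and marker strings are taken from this report and may need adjusting elsewhere.

#!/usr/bin/env python3
# Sketch: list OSDs whose logs contain the journal-corruption markers above.
import glob
import re

LOG_GLOB = "/var/log/ceph/ceph-osd.*.log"   # default Luminous log location
MARKERS = (
    "bad header magic",                     # journal do_read_entry failure
    "journal is corrupt",                   # header/sequence mismatch
)

def affected_osds(log_glob=LOG_GLOB):
    """Return {osd_id: [matching log lines]} for logs containing the markers."""
    hits = {}
    for path in glob.glob(log_glob):
        m = re.search(r"ceph-osd\.(\d+)\.log$", path)
        if not m:
            continue
        with open(path, errors="replace") as f:
            lines = [line.rstrip() for line in f
                     if any(mk in line for mk in MARKERS)]
        if lines:
            hits[int(m.group(1))] = lines
    return hits

if __name__ == "__main__":
    for osd_id, lines in sorted(affected_osds().items()):
        print("osd.%d: %d journal-corruption line(s)" % (osd_id, len(lines)))
        for line in lines[-3:]:             # show the most recent few
            print("    " + line)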

The following logs are generated in syslog:
<30>2020-06-18T10:36:39.187476+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.115.asok': error running ceph dump: exit status 22
<30>2020-06-18T10:36:39.238492+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.136.asok': error running ceph dump: exit status 22
<30>2020-06-18T10:36:39.649662+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.237.asok': error running ceph dump: exit status 22
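These telegraf errors look like a symptom rather than a separate problem: the .asok files are still present, but nothing is answering behind them, so the ceph command the inputs.ceph plugin runs fails (the "exit status 22" above). The same check can be done by hand; below is a minimal sketch, assuming Python 3 and the ceph CLI are installed on the node, with admin sockets under /var/run/ceph/ as in the syslog above.

#!/usr/bin/env python3
# Sketch: probe each OSD admin socket via the ceph CLI and report which ones answer.
import glob
import subprocess

def probe(sock):
    """Ask the daemon behind the socket for its version via 'ceph --admin-daemon'."""
    try:
        proc = subprocess.run(
            ["ceph", "--admin-daemon", sock, "version"],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return proc.returncode, proc.stdout.decode(errors="replace").strip()

if __name__ == "__main__":
    for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
        status, out = probe(sock)
        # A crashed OSD can leave a stale .asok behind; the CLI then exits
        # non-zero (telegraf surfaces this as "exit status 22").
        if status == 0:
            print("%s: OK (%s)" % (sock, out))
        else:
            print("%s: NOT RESPONDING (%s)" %
                  (sock, "exit %s" % status if status is not None else out))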

Note: a new node addition was also in progress when we hit this hardware issue.

#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph to potential rocksdb failure: few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph