Bug #46071

open

potential rocksdb failure: few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph

Added by Prayank Saxena almost 4 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The data node went down abruptly due to an issue with its SPS-BD Smart Array PCIe SAS Expander. Once the hardware was replaced, the node came back up and 19 of its 22 OSDs also came up. The remaining three OSDs did not come up properly, and the logs below are generated in /var/log/ceph/ceph-osd.<id>.log:

-13> 2020-06-18 09:33:55.909586 7f43e1931e00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1592472835909582, "job": 1, "event": "recovery_started", "log_files": [16448]}
-12> 2020-06-18 09:33:55.909590 7f43e1931e00 4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_open.cc:482] Recovering log #16448 mode 2
-11> 2020-06-18 09:33:55.909653 7f43e1931e00 4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/version_set.cc:2395] Creating manifest 16450
-10> 2020-06-18 09:33:55.910367 7f43e1931e00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1592472835910364, "job": 1, "event": "recovery_finished"}
-9> 2020-06-18 09:33:55.910524 7f43e1931e00 5 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_files.cc:307] [JOB 2] Delete /var/lib/ceph/osd/ceph-136/current/omap//MANIFEST-016447 type=3 #16447 - OK
-8> 2020-06-18 09:33:55.910545 7f43e1931e00  5 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_files.cc:307] [JOB 2] Delete /var/lib/ceph/osd/ceph-136/current/omap//016448.log type=0 #16448 - OK
-7> 2020-06-18 09:33:55.911557 7f43e1931e00  4 rocksdb: [/var/lib/jenkins/workspace/build-debian-ceph-luminous-ubuntu-xenial/build-area/ceph/src/rocksdb/db/db_impl_open.cc:1063] DB pointer 0x561931c26000
-6> 2020-06-18 09:33:55.914731 7f43e1931e00 0 filestore(/var/lib/ceph/osd/ceph-136) mount(1757): enabling WRITEAHEAD journal mode: checkpoint is not enabled
-5> 2020-06-18 09:33:55.920586 7f43e1931e00 2 journal open /var/lib/ceph/osd/ceph-136/journal fsid 287094cc-90f3-4618-b5c1-2e2b323eefd0 fs_op_seq 440674854
-4> 2020-06-18 09:33:55.920674 7f43e1931e00 1 journal _open /var/lib/ceph/osd/ceph-136/journal fd 33: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
-3> 2020-06-18 09:33:55.921597 7f43e1931e00 2 journal read_entry 645660672 : seq 440674860 177594 bytes
-2> 2020-06-18 09:33:55.921632 7f43e1931e00 -1 journal do_read_entry(-1): bad header magic
-1> 2020-06-18 09:33:55.921636 7f43e1931e00 -1 journal Unable to read past sequence 440674855 but header indicates the journal has committed up through 440674859, journal is corrupt
0> 2020-06-18 09:33:55.923445 7f43e1931e00 -1 *** Caught signal (Aborted) **
in thread 7f43e1931e00 thread_name:ceph-osd

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa74234) [0x5619270f2234]
2: (()+0x11390) [0x7f43dfe14390]
3: (gsignal()+0x38) [0x7f43dedaf428]
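For what it's worth, a quick way to see which OSDs on this node hit the same failure is to scan their logs for the two corruption markers shown above ("bad header magic" and "journal is corrupt"). Below is only a minimal sketch, assuming Python 3 on the node and the default Luminous log location /var/log/ceph/; the paths and marker strings are taken from this report and may need adjusting elsewhere.

#!/usr/bin/env python3
# Sketch: list OSDs whose logs contain the journal-corruption markers above.
import glob
import re

LOG_GLOB = "/var/log/ceph/ceph-osd.*.log"   # default Luminous log location
MARKERS = (
    "bad header magic",                     # journal do_read_entry failure
    "journal is corrupt",                   # header/sequence mismatch
)

def affected_osds(log_glob=LOG_GLOB):
    """Return {osd_id: [matching log lines]} for logs containing the markers."""
    hits = {}
    for path in glob.glob(log_glob):
        m = re.search(r"ceph-osd\.(\d+)\.log$", path)
        if not m:
            continue
        with open(path, errors="replace") as f:
            lines = [line.rstrip() for line in f
                     if any(mk in line for mk in MARKERS)]
        if lines:
            hits[int(m.group(1))] = lines
    return hits

if __name__ == "__main__":
    for osd_id, lines in sorted(affected_osds().items()):
        print("osd.%d: %d journal-corruption line(s)" % (osd_id, len(lines)))
        for line in lines[-3:]:             # show the most recent few
            print("    " + line)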

The following logs are generated in syslog:
<30>2020-06-18T10:36:39.187476+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.115.asok': error running ceph dump: exit status 22
<30>2020-06-18T10:36:39.238492+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.136.asok': error running ceph dump: exit status 22
<30>2020-06-18T10:36:39.649662+00:00 telegraf6445: 2020-06-18T10:36:39Z E! Error in plugin [inputs.ceph]: E! error reading from socket '/var/run/ceph/ceph-osd.237.asok': error running ceph dump: exit status 22
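These telegraf errors look like a symptom rather than a separate problem: the .asok files are still present, but nothing is answering behind them, so the ceph command the inputs.ceph plugin runs fails (the "exit status 22" above). The same check can be done by hand; below is a minimal sketch, assuming Python 3 and the ceph CLI are installed on the node, with admin sockets under /var/run/ceph/ as in the syslog above.

#!/usr/bin/env python3
# Sketch: probe each OSD admin socket via the ceph CLI and report which ones answer.
import glob
import subprocess

def probe(sock):
    """Ask the daemon behind the socket for its version via 'ceph --admin-daemon'."""
    try:
        proc = subprocess.run(
            ["ceph", "--admin-daemon", sock, "version"],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return proc.returncode, proc.stdout.decode(errors="replace").strip()

if __name__ == "__main__":
    for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
        status, out = probe(sock)
        # A crashed OSD can leave a stale .asok behind; the CLI then exits
        # non-zero (telegraf surfaces this as "exit status 22").
        if status == 0:
            print("%s: OK (%s)" % (sock, out))
        else:
            print("%s: NOT RESPONDING (%s)" %
                  (sock, "exit %s" % status if status is not None else out))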

Note: a new node addition was also in progress when we hit this hardware issue.

#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph to potential rocksdb failure: few osd's service not starting up after node reboot. Luminous 12.2.4 Ceph