Bug #40100
closed
Missing block.wal and block.db symlinks on restart
Added by Corey Bryant almost 5 years ago.
Updated almost 5 years ago.
Description
We are tracking a bug in Ubuntu (https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1828617) where a race on system restart causes missing block.wal and block.db symlinks.
There is a loop for each OSD that calls 'ceph-volume lvm trigger' 30 times until the OSD is activated, for example:
[2019-05-31 01:27:29,235][ceph_volume.process][INFO ] Running command: ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,435][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,530][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:35,531][systemd][WARNING] failed activating OSD, retries left: 30
[2019-05-31 01:27:44,122][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:44,174][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:44,175][systemd][WARNING] failed activating OSD, retries left: 29
...
The race appears to occur when 'ceph-volume lvm trigger' succeeds but the WAL and DB devices are not yet ready:
https://github.com/ceph/ceph/blob/luminous/src/ceph-volume/ceph_volume/systemd/main.py#L93
Then the symlinks don't get set up here:
https://github.com/ceph/ceph/blob/luminous/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L154
https://github.com/ceph/ceph/blob/luminous/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L177
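The retry behaviour described above can be sketched roughly as follows. This is an illustration, not the actual ceph-volume source; the `retry` helper and its defaults are approximations based on the log output and the environment variables mentioned later in this thread.

```python
# Rough sketch of the activation retry loop in the style of
# ceph_volume/systemd/main.py (illustrative, not the real source).
import os
import time

def retry(activate, tries=None, interval=None):
    """Call `activate` (e.g. a 'ceph-volume lvm trigger' wrapper) until it
    returns 0, or until the retry budget is exhausted."""
    if tries is None:
        tries = int(os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30))
    if interval is None:
        interval = int(os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5))
    while tries > 0:
        if activate() == 0:
            return True
        tries -= 1
        time.sleep(interval)
    return False
```

The log excerpt above shows exactly this shape: each failed 'ceph-volume lvm trigger' decrements the remaining retries until the OSD activates or the budget runs out.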
I wonder if we can have similar 'ceph-volume lvm trigger'-ish calls/loops for the WAL and DB devices of each OSD in src/ceph-volume/ceph_volume/systemd/main.py. We can determine whether an OSD has a DB or WAL device from the lvm tags.
Can we do something along these lines in ceph_volume/systemd/main.py, after the existing while loop?
- Using extra_data in ceph_volume/systemd/main.py, get ceph.wal_device and ceph.db_device from the lvs tags with matching ceph.osd_id, ceph.osd_fsid, and type=block, e.g. where extra_data=ceph.osd_id=0-e20dbce0-34f4-46b3-8efc-f41edbcae3d7:

  sudo lvs -o lv_tags | grep type=block | grep ceph.osd_id=0 | grep ceph\.osd_fsid=e20dbce0-34f4-46b3-8efc-f41edbcae3d7 | grep ceph\.wal_device
  sudo lvs -o lv_tags | grep type=block | grep ceph.osd_id=0 | grep ceph\.osd_fsid=e20dbce0-34f4-46b3-8efc-f41edbcae3d7 | grep ceph\.db_device

- Loop until the following is found, or up to CEPH_VOLUME_SYSTEMD_TRIES times, where ceph.wal_device=/dev/ceph-wal-8a073a5b-6e42-43bf-a99d-e30c649362ea/osd-wal-e20dbce0-34f4-46b3-8efc-f41edbcae3d7:

  sudo lvs -o lv_tags | grep type=wal | grep ceph.wal_device=/dev/ceph-wal-8a073a5b-6e42-43bf-a99d-e30c649362ea/osd-wal-e20dbce0-34f4-46b3-8efc-f41edbcae3d7

- Loop until the following is found, or up to CEPH_VOLUME_SYSTEMD_TRIES times, where ceph.db_device=/dev/ceph-db-c37da146-b9a3-4339-bb2f-819f223982d3/osd-db-e20dbce0-34f4-46b3-8efc-f41edbcae3d7:

  sudo lvs -o lv_tags | grep type=db | grep ceph.db_device=/dev/ceph-db-c37da146-b9a3-4339-bb2f-819f223982d3/osd-db-e20dbce0-34f4-46b3-8efc-f41edbcae3d7
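The tag lookups sketched in the steps above could be implemented along these lines. This is a hypothetical sketch: `parse_extra_data`, `block_device_tags`, and `extract_devices` are illustrative names, and the lv_tags layout is assumed from the example commands above.

```python
import subprocess

def parse_extra_data(extra_data):
    """Split extra_data like 'ceph.osd_id=0-e20dbce0-...' into
    (osd_id, osd_fsid)."""
    value = extra_data.split('=', 1)[1]          # e.g. '0-e20dbce0-...'
    osd_id, osd_fsid = value.split('-', 1)
    return osd_id, osd_fsid

def block_device_tags(lvs_output, osd_id, osd_fsid):
    """Return the lv_tags line of the matching type=block LV, or None.
    `lvs_output` is the text produced by 'lvs -o lv_tags'."""
    for line in lvs_output.splitlines():
        if ('type=block' in line
                and 'ceph.osd_id=%s' % osd_id in line
                and 'ceph.osd_fsid=%s' % osd_fsid in line):
            return line
    return None

def extract_devices(tags_line):
    """Pull ceph.wal_device/ceph.db_device values out of one lv_tags line."""
    devices = {}
    for tag in tags_line.strip().split(','):
        key, _, value = tag.partition('=')
        if key in ('ceph.wal_device', 'ceph.db_device'):
            devices[key] = value
    return devices

def wal_db_devices(extra_data):
    """Look up the WAL/DB devices for one OSD. A polling loop in
    systemd/main.py could call this up to CEPH_VOLUME_SYSTEMD_TRIES times,
    waiting for the tagged devices to appear."""
    osd_id, osd_fsid = parse_extra_data(extra_data)
    out = subprocess.check_output(['lvs', '-o', 'lv_tags'], text=True)
    tags_line = block_device_tags(out, osd_id, osd_fsid)
    return extract_devices(tags_line) if tags_line else {}
```

An OSD with no ceph.wal_device/ceph.db_device tags on its block LV would return an empty dict, so the loop could skip waiting entirely in that case.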
- Project changed from Ceph to ceph-volume
- Status changed from New to Fix Under Review
Proposed pull request (based on the work of coreycb): https://github.com/ceph/ceph/pull/28520
In the coreycb proposal, the command was evaluated before waiting for the WAL/DB devices to arrive. In order to keep an equivalent timeout, I propose putting both checks in the same loop.
I do not know the consequences in case "ceph-volume simple" is used, so I added a guard: "if sub_command == 'lvm':".
I commented in the PR, but want to reiterate here: we knew there was a chance that on certain systems the 30 tries at a 5 second interval wouldn't be enough, which is why we made them configurable rather than hard coded.
In this case, the problem can be addressed by changing the environment variables (as opposed to adding extra intervals or tries).
The environment variables are:
CEPH_VOLUME_SYSTEMD_TRIES
CEPH_VOLUME_SYSTEMD_INTERVAL
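For example, the retry budget could be raised with a systemd drop-in. This is a hypothetical fragment: the override path and unit name are assumptions, and the values (60 tries at a 10 second interval) are arbitrary examples; only the variable names come from the comment above.

```ini
# /etc/systemd/system/ceph-volume@.service.d/override.conf
# Hypothetical drop-in raising the retry budget; run
# 'systemctl daemon-reload' after creating it.
[Service]
Environment=CEPH_VOLUME_SYSTEMD_TRIES=60
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=10
```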
- Pull request ID set to 28791
- Status changed from Fix Under Review to Resolved
Looks like the fix was merged? Feel free to re-open if it's still an issue.