Bug #56830

closed

crash: cephfs::mirror::PeerReplayer::pick_directory()

Added by Telemetry Bot over 1 year ago. Updated about 1 year ago.

Status: Can't reproduce
Priority: Normal
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Telemetry
Tags:
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): cephfs-mirror, mgr/mirroring
Labels (FS): crash
Pull request ID:
Crash signature (v1): 3dcea1d9286cb3e9672e269bbce51783268dd21af2356f1cb2fba3a7c666fe1a


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=d6f26d40363a53f0bed9a466dc262bb7e12ae4202af3300262e83f593a87af79

Sanitized backtrace:

    cephfs::mirror::PeerReplayer::pick_directory()
    cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)
    cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry()

Crash dump sample:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f56d4ec6ce0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9009b) [0x7f56d40c109b]",
        "/lib64/libstdc++.so.6(+0x9653c) [0x7f56d40c753c]",
        "/lib64/libstdc++.so.6(+0x96597) [0x7f56d40c7597]",
        "/lib64/libstdc++.so.6(+0x967f8) [0x7f56d40c77f8]",
        "/lib64/libstdc++.so.6(+0x9204b) [0x7f56d40c304b]",
        "(cephfs::mirror::PeerReplayer::pick_directory[abi:cxx11]()+0x4b1) [0x559df4857ed1]",
        "(cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)+0x57b) [0x559df486b70b]",
        "(cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry()+0x14) [0x559df4871224]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f56d4ebc1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.1",
    "crash_id": "2022-07-04T06:16:43.280098Z_9bbad5d1-d329-45fa-809a-cca469a1304c",
    "entity_name": "client.7569faf8458207653cfd1218614297ed3448f79b",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "cephfs-mirror",
    "stack_sig": "3dcea1d9286cb3e9672e269bbce51783268dd21af2356f1cb2fba3a7c666fe1a",
    "timestamp": "2022-07-04T06:16:43.280098Z",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.42.2.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 14:49:57 UTC 2021" 
}

Actions #1

Updated by Telemetry Bot over 1 year ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.1 added
Actions #2

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Assignee set to Dhairya Parmar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) cephfs-mirror, mgr/mirroring added
  • Labels (FS) crash added

Dhairya,

Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.

Actions #3

Updated by Dhairya Parmar about 1 year ago

Venky Shankar wrote:

Dhairya,

Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.

Have we faced this issue anytime recently?

Actions #4

Updated by Venky Shankar about 1 year ago

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya,

Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.

Have we faced this issue anytime recently?

Not in our teuthology tests, but nothing major has changed in cephfs-mirror, so the race most likely still exists.

Actions #5

Updated by Dhairya Parmar about 1 year ago

Issue seems to be at:

std::rotate(m_directories.begin(), m_directories.begin() + 1, m_directories.end());

@ https://github.com/ceph/ceph/blob/main/src/tools/cephfs_mirror/PeerReplayer.cc#L315

If m_directories is empty, rotate tries to access an index that doesn't exist.

From https://en.cppreference.com/w/cpp/algorithm/rotate:

Return value
An iterator that is equal to:
last, if first == middle is true,
first, if middle == last is true,
first + (last - middle) otherwise, i.e. the new location of the element pointed by first.

std::rotate performs swaps so that the element at middle (here m_directories.begin() + 1) becomes the first element of the range [first, last), where first = m_directories.begin() and last = m_directories.end(). If m_directories is empty, begin() + 1 points past the end of the vector, so rotate tries to access an element that does not exist, hence the segmentation fault. This is easy to fix: only rotate when the vector has more than one element, otherwise leave it untouched.
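
A minimal sketch of the guard described above, assuming a plain std::vector<std::string> stands in for m_directories. The helper name rotate_directories is hypothetical and is not the actual PeerReplayer code or the change proposed in the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from PeerReplayer.cc): move the front directory
// to the back so the next pick starts from a different entry.
// std::rotate needs [first, middle) and [middle, last) to be valid ranges,
// so begin() + 1 must never be formed on an empty vector.
void rotate_directories(std::vector<std::string>& dirs) {
  if (dirs.size() < 2) {
    return;  // empty or single-element list: nothing to rotate
  }
  const std::string old_front = dirs.front();
  std::rotate(dirs.begin(), dirs.begin() + 1, dirs.end());
  assert(dirs.back() == old_front);  // the previous front is now last
}
```

With this guard, an empty or single-element list is left untouched, and a list such as {"/a", "/b", "/c"} becomes {"/b", "/c", "/a"}.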

Actions #6

Updated by Venky Shankar about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50333

See the updates in the PR.

Actions #7

Updated by Dhairya Parmar about 1 year ago

Dhairya Parmar wrote:

Issue seems to be at:
[...]
@ https://github.com/ceph/ceph/blob/main/src/tools/cephfs_mirror/PeerReplayer.cc#L315

If m_directories is empty, rotate tries to access an index that doesn't exist.

From https://en.cppreference.com/w/cpp/algorithm/rotate:

Return value
An iterator that is equal to:
last, if first == middle is true,
first, if middle == last is true,
first + (last - middle) otherwise, i.e. the new location of the element pointed by first.

std::rotate performs swaps so that the element at middle (here m_directories.begin() + 1) becomes the first element of the range [first, last), where first = m_directories.begin() and last = m_directories.end(). If m_directories is empty, begin() + 1 points past the end of the vector, so rotate tries to access an element that does not exist, hence the segmentation fault. This is easy to fix: only rotate when the vector has more than one element, otherwise leave it untouched.

This doesn't make sense as m_directories is only iterated if its size() is non-zero.

Actions #8

Updated by Dhairya Parmar about 1 year ago

  • Status changed from Fix Under Review to Can't reproduce

After thoroughly assessing the issue with the limited data available in the tracker, it's hard to tell what led to this failure. I'm closing this tracker for now; if anyone hits this again in the future, feel free to re-open.
