Bug #56830
closed
crash: cephfs::mirror::PeerReplayer::pick_directory()
0%
3dcea1d9286cb3e9672e269bbce51783268dd21af2356f1cb2fba3a7c666fe1a
Description
Sanitized backtrace:
cephfs::mirror::PeerReplayer::pick_directory()
cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)
cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry()
Crash dump sample:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f56d4ec6ce0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9009b) [0x7f56d40c109b]",
        "/lib64/libstdc++.so.6(+0x9653c) [0x7f56d40c753c]",
        "/lib64/libstdc++.so.6(+0x96597) [0x7f56d40c7597]",
        "/lib64/libstdc++.so.6(+0x967f8) [0x7f56d40c77f8]",
        "/lib64/libstdc++.so.6(+0x9204b) [0x7f56d40c304b]",
        "(cephfs::mirror::PeerReplayer::pick_directory[abi:cxx11]()+0x4b1) [0x559df4857ed1]",
        "(cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)+0x57b) [0x559df486b70b]",
        "(cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry()+0x14) [0x559df4871224]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f56d4ebc1ca]",
        "clone()"
    ],
    "ceph_version": "17.2.1",
    "crash_id": "2022-07-04T06:16:43.280098Z_9bbad5d1-d329-45fa-809a-cca469a1304c",
    "entity_name": "client.7569faf8458207653cfd1218614297ed3448f79b",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "cephfs-mirror",
    "stack_sig": "3dcea1d9286cb3e9672e269bbce51783268dd21af2356f1cb2fba3a7c666fe1a",
    "timestamp": "2022-07-04T06:16:43.280098Z",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.42.2.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 14:49:57 UTC 2021"
}
Updated by Telemetry Bot over 1 year ago
Updated by Venky Shankar over 1 year ago
- Category set to Correctness/Safety
- Assignee set to Dhairya Parmar
- Target version set to v18.0.0
- Backport set to pacific,quincy
- Component(FS) cephfs-mirror, mgr/mirroring added
- Labels (FS) crash added
Dhairya,
Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.
Updated by Dhairya Parmar about 1 year ago
Venky Shankar wrote:
Dhairya,
Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.
Have we faced this issue anytime recently?
Updated by Venky Shankar about 1 year ago
Dhairya Parmar wrote:
Venky Shankar wrote:
Dhairya,
Please take a look at this. I think there is some sort of race that is causing this crash while iterating the directory list in cephfs-mirror daemon.
Have we faced this issue anytime recently?
Not in our teuthology tests, but nothing major has changed in cephfs-mirror, so the race most likely still exists.
Updated by Dhairya Parmar about 1 year ago
Issue seems to be at:
std::rotate(m_directories.begin(), m_directories.begin() + 1, m_directories.end());
@ https://github.com/ceph/ceph/blob/main/src/tools/cephfs_mirror/PeerReplayer.cc#L315
If m_directories is empty, rotate tries to access an index that doesn't exist.
From https://en.cppreference.com/w/cpp/algorithm/rotate:
Return value
An iterator that is equal to:
last, if first == middle is true,
first, if middle == last is true,
first + (last - middle)[1] otherwise, i.e. the new location of the element pointed by first.
It performs swaps to make the element at index middle (here m_directories.begin() + 1) the first element of the range [first, last), where first = m_directories.begin() and last = m_directories.end(). If m_directories is empty, middle lies past last, so rotate tries to access an element that does not exist, hence the crash. This is easy to fix: rotate only when the vector holds more than one element, otherwise leave it untouched.
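For illustration, the guard proposed above could look like the minimal sketch below. This is not the actual PeerReplayer code: rotate_directories is a hypothetical helper name, and m_directories is assumed here to be a std::vector<std::string>.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from PeerReplayer.cc): rotate the list left by
// one so that the next pick starts from a different directory.
// The element type std::string is an assumption made for this sketch.
inline void rotate_directories(std::vector<std::string>& dirs) {
  // For an empty vector, begin() + 1 is already an invalid iterator, so the
  // size check has to happen before any iterator arithmetic.
  if (dirs.size() > 1) {
    std::rotate(dirs.begin(), dirs.begin() + 1, dirs.end());
  }
}
```

With exactly one element, begin() + 1 equals end(), which std::rotate treats as a no-op, so the guard only changes behavior for the empty case.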
Updated by Venky Shankar about 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 50333
See the updates in the PR.
Updated by Dhairya Parmar about 1 year ago
Dhairya Parmar wrote:
Issue seems to be at:
[...]
@ https://github.com/ceph/ceph/blob/main/src/tools/cephfs_mirror/PeerReplayer.cc#L315
if m_directories is empty, rotate tries to access the index that doesn't exist
From https://en.cppreference.com/w/cpp/algorithm/rotate:
Return value
An iterator that is equal to:
last, if first == middle is true,
first, if middle == last is true,
first + (last - middle)[1] otherwise, i.e. the new location of the element pointed by first.
It performs swaps to make element at index middle (here m_directories.begin() + 1) the first element of vector with range [first, last), where first = m_directories.begin() and last = m_directories.end(), if there is nothing in the vector m_directories, it basically tries to access value at an index that does not exist/doesn't belong to it and thus the segmentation fault. This is easy to fix, we only rotate when the vector size is more than 1, else don't touch it.
This doesn't make sense as m_directories is only iterated if its size() is non-zero.
Updated by Dhairya Parmar about 1 year ago
- Status changed from Fix Under Review to Can't reproduce
After thoroughly assessing the issue with the limited data available in the tracker, it's hard to tell what led to this failure. I'm closing this tracker for now; if anyone hits this again in the future, feel free to re-open.