Bug #57966

Ceph cluster osds failed when ms_cluster_type=async+rdma is used

Added by guoguo jie 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: common
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With IPoIB, the cluster currently runs normally. The steps were as follows.

Check cluster health and set basic options:
ceph health detail
ceph config set global mon_clock_drift_allowed 3
ceph config set global osd_pool_default_size 2

Add an internal IB cluster network:
ceph config set osd cluster_network 10.10.20.0/24
ceph config get osd cluster_network

After rebooting the system, check that the internal network is actually in use:
ceph osd metadata 0 | grep addr
ceph osd metadata 1 | grep addr
ceph osd dump | grep 10.10.

After the above operations, the Ceph cluster runs normally.
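
The address check in the steps above could be scripted roughly as follows. This is only a sketch: on_cluster_network is a hypothetical helper (not part of Ceph), and it assumes the output of ceph osd metadata contains a "back_addr" line holding the cluster-network address of the OSD.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph.
# Reads `ceph osd metadata <id>` output on stdin and succeeds (exit 0)
# if the "back_addr" line contains an address starting with the given
# dotted prefix, e.g. "10.10.20.".
on_cluster_network() {
    # Escape the dots so they match literally in the regular expression.
    prefix_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    # The prefix must not be preceded by a digit or dot, so "10.10.20."
    # does not match inside "110.10.20.".
    grep '"back_addr"' | grep -Eq "(^|[^0-9.])${prefix_re}"
}

# Usage against a live cluster (left commented out so the sketch
# can be sourced without a running cluster):
# for id in 0 1; do
#     if ceph osd metadata "$id" | on_cluster_network "10.10.20."; then
#         echo "osd.$id uses the cluster network"
#     else
#         echo "osd.$id does NOT use the cluster network"
#     fi
# done
```

The helper only inspects back_addr; the heartbeat addresses would need a similar check if they matter for the test.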

Next, I tried to enable the RDMA parameters.

I proceeded in the following order, setting the options only in the OSD section. On each node, the show_gids command gives the values for ms_async_rdma_device_name and ms_async_rdma_local_gid.

ceph config set osd ms_async_rdma_device_name mlx5_0
ceph config set osd.0 ms_async_rdma_local_gid fe80:0000:0000:0000:480f:cfff:fff3:9974
ceph config set osd.1 ms_async_rdma_local_gid fe80:0000:0000:0000:7010:6fff:ffa2:1430
ceph config set osd ms_cluster_type async+rdma
ceph osd metadata 0 | grep addr
ceph osd metadata 1 | grep addr
ceph osd dump | grep 10.10.

Check IB NIC traffic:
sar -n DEV 1 | grep ib

After restarting the OSD and MON services, ceph -s failed.
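
Since the local GID differs per OSD, it may help to generate the commands from a list instead of typing them by hand. The following is a minimal sketch: rdma_config_cmds is a hypothetical helper, and it only prints the same ceph config set commands used above so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) the RDMA settings.
# Usage: rdma_config_cmds <device> <osd_id>=<gid> [<osd_id>=<gid> ...]
rdma_config_cmds() {
    dev=$1
    shift
    echo "ceph config set osd ms_async_rdma_device_name $dev"
    for pair in "$@"; do
        id=${pair%%=*}    # text before the first "="
        gid=${pair#*=}    # text after the first "="
        echo "ceph config set osd.$id ms_async_rdma_local_gid $gid"
    done
    echo "ceph config set osd ms_cluster_type async+rdma"
}

# Example with the values from this report:
# rdma_config_cmds mlx5_0 \
#     0=fe80:0000:0000:0000:480f:cfff:fff3:9974 \
#     1=fe80:0000:0000:0000:7010:6fff:ffa2:1430
```

Printing instead of executing keeps the step reversible: the output can be inspected or piped to sh once the GIDs have been checked against show_gids on each node.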

Attachment: rdma.jpg (230 KB), guoguo jie, 11/03/2022 02:32 AM

History

#1 Updated by guoguo jie 3 months ago

The same problem occurs on Ceph 17.2.5:
root@ceph01:~# ceph crash info 2022-11-07T03:29:36.731174Z_bb6f8fea-ea87-4f83-a28a-b4356a2d3196
{
"assert_condition": "m",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc",
"assert_func": "int Infiniband::MemoryManager::Cluster::fill(uint32_t)",
"assert_line": 783,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc: In function 'int Infiniband::MemoryManager::Cluster::fill(uint32_t)' thread 7fbeac832700 time 2022-11-07T03:29:36.715990+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc: 783: FAILED ceph_assert(m)\n",
"assert_thread_name": "msgr-worker-0",
"backtrace": [
"/lib64/libpthread.so.0(+0x12cf0) [0x7fbeb3249cf0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x563a4b19e68b]",
"/usr/bin/ceph-osd(+0x5977f7) [0x563a4b19e7f7]",
"(Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b) [0x563a4be1a1db]",
"(Infiniband::init()+0x276) [0x563a4be227b6]",
"(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x563a4bbee1c0]",
"/usr/bin/ceph-osd(+0xfc9adf) [0x563a4bbd0adf]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x563a4bbe1f54]",
"/usr/bin/ceph-osd(+0xfdfc56) [0x563a4bbe6c56]",
"/lib64/libstdc++.so.6(+0xc2ba3) [0x7fbeb2892ba3]",
"/lib64/libpthread.so.0(+0x81ca) [0x7fbeb323f1ca]",
"clone()"
],
"ceph_version": "17.2.5",
"crash_id": "2022-11-07T03:29:36.731174Z_bb6f8fea-ea87-4f83-a28a-b4356a2d3196",
"entity_name": "osd.0",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "2ccdd1f47aacf4a8f55c21837c3d39a6f36fa33824ba49578bb6cab9c3598254",
"timestamp": "2022-11-07T03:29:36.731174Z",
"utsname_hostname": "ceph01",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-131-generic",
"utsname_sysname": "Linux",
"utsname_version": "#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022"
}
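
To compare crash reports across nodes or versions, the key fields can be pulled out of the dump above. The sketch below uses only sed; crash_field is a hypothetical helper, and it assumes the report is printed with one top-level "key": value pair per line, as in the output above. It is not a general JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of one top-level field from
# `ceph crash info <id>` output read on stdin. Handles simple string
# and number values that sit on a single line.
crash_field() {
    key=$1
    # Strip the key, optional surrounding quotes, and a trailing comma.
    sed -n "s/^[[:space:]]*\"${key}\":[[:space:]]*\"\{0,1\}\(.*[^\",]\)\"\{0,1\},\{0,1\}\$/\1/p"
}

# Usage against a live cluster (commented out):
# id=2022-11-07T03:29:36.731174Z_bb6f8fea-ea87-4f83-a28a-b4356a2d3196
# for f in entity_name ceph_version assert_func assert_line; do
#     printf '%s: %s\n' "$f" "$(ceph crash info "$id" | crash_field "$f")"
# done
```

Where jq is installed, piping the report through it is the more robust choice, since it parses the JSON properly instead of relying on the line layout.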
