Bug #57966

open

Ceph cluster osds failed when ms_cluster_type=async+rdma is used

Added by guoguo jie over 1 year ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: common
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

2ccdd1f47aacf4a8f55c21837c3d39a6f36fa33824ba49578bb6cab9c3598254
3306ab83ccd165a6cccc4c91250879bb7262941666bb0d9472b38382a5c62a3d
33fdcea66d9494023cab7cacb0df26c3001b94d44d5c95bab5239fb11d66b2c3
4f70e5a6707159820b2b8da3fcca4d263f45a3f9eb1bdef54bb1545558ee1cb3
766be26bfe831e0682d2048a8dc960defeda12d09e3023e77c277e519784b73b
a594420536d5797db898deac34687a997b2ef5e8914abdb809f37169e5214cb7
aeeb799c9ee05a227eb75f0cb7663cda11a7b7f979e8b6bd736fd7c967d3fd0c
e9a587500cd0d3a0ca3003144680586ba4656cdec011bfd2a91e3f3334bfa213
f318bb7dd5ed05f869f475d1e04f0f440e8297d0dbc2b4f00d44d37b5b941c71
f514f2bcfff5e613764a936508a85d6b0f61441266b5bc22f710f2a95eabe04d
f665a1ed57f0db8ce90b2de73edc1fcabb16d36f4af7dbb424d1f9ebacfcd2d0


Description

Currently, the cluster runs normally over IPoIB. The steps are as follows.

Check cluster health and set the basic options:
ceph health detail
ceph config set global mon_clock_drift_allowed 3
ceph config set global osd_pool_default_size 2

Add an internal IB cluster network:
ceph config set osd cluster_network 10.10.20.0/24
ceph config get osd cluster_network

After rebooting the system, check that the internal network is actually in use:
ceph osd metadata 0 | grep addr
ceph osd metadata 1 | grep addr
ceph osd dump | grep 10.10

After the above steps, the Ceph cluster runs normally.
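
The address check above can also be done programmatically instead of reading the grep output by eye. A minimal sketch, assuming the address fields reported by ceph osd metadata look like "[v2:IP:PORT/NONCE,v1:IP:PORT/NONCE]"; the function name and the sample values are hypothetical and not taken from this cluster:

```python
import ipaddress
import re

# Assumed address format: "[v2:10.10.20.11:6802/1234,v1:10.10.20.11:6803/1234]".
# Captures the IPv4 part that follows a "v1:" or "v2:" prefix.
ADDR_RE = re.compile(r"v[12]:(\d+\.\d+\.\d+\.\d+):\d+")


def addrs_in_network(addr_field: str, cidr: str) -> bool:
    """Return True if at least one IPv4 address is found in addr_field
    and every address found lies inside the network given by cidr."""
    network = ipaddress.ip_network(cidr)
    found = ADDR_RE.findall(addr_field)
    return bool(found) and all(ipaddress.ip_address(a) in network for a in found)


if __name__ == "__main__":
    # Hypothetical value of a cluster-side address field for one OSD.
    sample = "[v2:10.10.20.11:6802/1234,v1:10.10.20.11:6803/1234]"
    print(addrs_in_network(sample, "10.10.20.0/24"))
```

An empty or unparsable field is treated as a failed check, so a missing address is not mistaken for a correct one.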

Then try to enable the RDMA parameters:

Apply the settings in the following order, to the OSD section only. Run show_gids on each node to obtain the values for ms_async_rdma_device_name and ms_async_rdma_local_gid.

ceph config set osd ms_async_rdma_device_name mlx5_0
ceph config set osd.0 ms_async_rdma_local_gid fe80:0000:0000:0000:480f:cfff:fff3:9974
ceph config set osd.1 ms_async_rdma_local_gid fe80:0000:0000:0000:7010:6fff:ffa2:1430
ceph config set osd ms_cluster_type async+rdma
ceph osd metadata 0 | grep addr
ceph osd metadata 1 | grep addr
ceph osd dump | grep 10.10

Check IB NIC traffic:
sar -n DEV 1 | grep ib

After restarting the osd and mon services, ceph -s fails.
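
For clusters with more than two OSDs, the per-OSD GID settings are easier to generate than to type. A minimal sketch that only builds the command strings in the order used above and does not execute anything; the helper name rdma_config_commands is hypothetical, and the GIDs are the ones reported in this issue:

```python
from typing import Dict, List


def rdma_config_commands(device: str, gids: Dict[int, str]) -> List[str]:
    """Build the `ceph config set` commands in the order used in this report:
    device name first, then one local GID per OSD, then the messenger type."""
    cmds = [f"ceph config set osd ms_async_rdma_device_name {device}"]
    for osd_id in sorted(gids):
        cmds.append(
            f"ceph config set osd.{osd_id} ms_async_rdma_local_gid {gids[osd_id]}"
        )
    # Switch the cluster messenger last, once every OSD has its GID set.
    cmds.append("ceph config set osd ms_cluster_type async+rdma")
    return cmds


if __name__ == "__main__":
    gids = {
        0: "fe80:0000:0000:0000:480f:cfff:fff3:9974",
        1: "fe80:0000:0000:0000:7010:6fff:ffa2:1430",
    }
    for cmd in rdma_config_commands("mlx5_0", gids):
        print(cmd)
```

The output can be reviewed before it is pasted into a shell, which keeps the GID for each OSD visible next to its id.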


Files

rdma.jpg (230 KB) guoguo jie, 11/03/2022 02:32 AM
Actions #1

Updated by guoguo jie over 1 year ago

The same problem occurs on Ceph 17.2.5:
root@ceph01:~# ceph crash info 2022-11-07T03:29:36.731174Z_bb6f8fea-ea87-4f83-a28a-b4356a2d3196
{
"assert_condition": "m",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc",
"assert_func": "int Infiniband::MemoryManager::Cluster::fill(uint32_t)",
"assert_line": 783,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc: In function 'int Infiniband::MemoryManager::Cluster::fill(uint32_t)' thread 7fbeac832700 time 2022-11-07T03:29:36.715990+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.cc: 783: FAILED ceph_assert(m)\n",
"assert_thread_name": "msgr-worker-0",
"backtrace": [
"/lib64/libpthread.so.0(0x12cf0) [0x7fbeb3249cf0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x563a4b19e68b]",
"/usr/bin/ceph-osd(+0x5977f7) [0x563a4b19e7f7]",
"(Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b) [0x563a4be1a1db]",
"(Infiniband::init()+0x276) [0x563a4be227b6]",
"(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x563a4bbee1c0]",
"/usr/bin/ceph-osd(+0xfc9adf) [0x563a4bbd0adf]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x563a4bbe1f54]",
"/usr/bin/ceph-osd(+0xfdfc56) [0x563a4bbe6c56]",
"/lib64/libstdc++.so.6(+0xc2ba3) [0x7fbeb2892ba3]",
"/lib64/libpthread.so.0(+0x81ca) [0x7fbeb323f1ca]",
"clone()"
],
"ceph_version": "17.2.5",
"crash_id": "2022-11-07T03:29:36.731174Z_bb6f8fea-ea87-4f83-a28a-b4356a2d3196",
"entity_name": "osd.0",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "2ccdd1f47aacf4a8f55c21837c3d39a6f36fa33824ba49578bb6cab9c3598254",
"timestamp": "2022-11-07T03:29:36.731174Z",
"utsname_hostname": "ceph01",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-131-generic",
"utsname_sysname": "Linux",
"utsname_version": "#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022"
}

Actions #2

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated
  • Crash signature (v2) updated
  • Affected Versions v14.2.9, v15.2.13, v15.2.3, v15.2.4, v15.2.5, v15.2.6, v15.2.7, v15.2.9 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=3becd8a88a43122654d8611633ef59ac9fc4f251000fc5ea2c468e933beece2b

Assert condition: m
Assert function: int Infiniband::MemoryManager::Cluster::fill(uint32_t)

Sanitized backtrace:

    Infiniband::MemoryManager::Cluster::fill(unsigned int)
    Infiniband::init()
    RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)
    EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
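
The sanitized backtrace above can be reproduced from a raw "backtrace" array with a simple filter. The following is a heuristic sketch that happens to match this crash, not the telemetry service's actual algorithm: it keeps only frames of the form "(symbol+0xOFFSET) [0xADDRESS]", strips the offset and address, and drops the assert helper. The function name and the skip prefix are assumptions.

```python
import re
from typing import List

# Matches frames like "(Infiniband::init()+0x276) [0x563a4be227b6]".
SYMBOL_FRAME_RE = re.compile(r"^\((.+)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$")

# Assumption: assert machinery is not useful for grouping crashes.
SKIP_PREFIXES = ("ceph::__ceph_assert_fail",)


def sanitize_backtrace(frames: List[str]) -> List[str]:
    """Keep symbolized frames only, without offsets and addresses."""
    kept = []
    for frame in frames:
        match = SYMBOL_FRAME_RE.match(frame.strip())
        if match is None:
            continue  # e.g. "gsignal()", "clone()", or unsymbolized library frames
        symbol = match.group(1)
        if symbol.startswith(SKIP_PREFIXES):
            continue
        kept.append(symbol)
    return kept


if __name__ == "__main__":
    raw = [
        "gsignal()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x563a4b19e68b]",
        "/usr/bin/ceph-osd(+0x5977f7) [0x563a4b19e7f7]",
        "(Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b) [0x563a4be1a1db]",
        "(Infiniband::init()+0x276) [0x563a4be227b6]",
    ]
    for line in sanitize_backtrace(raw):
        print(line)
```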

Crash dump sample:
{
    "archived": "2022-11-25 16:05:57.397653",
    "assert_condition": "m",
    "assert_file": "msg/async/rdma/Infiniband.cc",
    "assert_func": "int Infiniband::MemoryManager::Cluster::fill(uint32_t)",
    "assert_line": 783,
    "assert_msg": "msg/async/rdma/Infiniband.cc: In function 'int Infiniband::MemoryManager::Cluster::fill(uint32_t)' thread 7f78f5a60700 time 2022-11-25T15:43:39.852981+0000\nmsg/async/rdma/Infiniband.cc: 783: FAILED ceph_assert(m)",
    "assert_thread_name": "msgr-worker-0",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f78fc477cf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x5620fadef68b]",
        "/usr/bin/ceph-osd(+0x5977f7) [0x5620fadef7f7]",
        "(Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b) [0x5620fba6b1db]",
        "(Infiniband::init()+0x276) [0x5620fba737b6]",
        "(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x5620fb83f1c0]",
        "/usr/bin/ceph-osd(+0xfc9adf) [0x5620fb821adf]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x5620fb832f54]",
        "/usr/bin/ceph-osd(+0xfdfc56) [0x5620fb837c56]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f78fbac0ba3]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f78fc46d1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2022-11-25T15:43:39.860119Z_ac381e49-70e6-4513-bd6a-28bf01110a32",
    "entity_name": "osd.e7b3f94cca3527912de62cf995260d506cc77f6a",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "2ccdd1f47aacf4a8f55c21837c3d39a6f36fa33824ba49578bb6cab9c3598254",
    "timestamp": "2022-11-25T15:43:39.860119Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.16.12",
    "utsname_sysname": "Linux",
    "utsname_version": "#7 SMP PREEMPT Thu Oct 27 04:27:09 CEST 2022" 
}

Actions #3

Updated by Ilya Dryomov 11 months ago

  • Target version deleted (v17.2.6)