Bug #47661

closed

Cannot allocate memory appears when using io_uring osd

Added by Jiang Yu over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone,
On Ubuntu 20.04, while trying BlueStore with io_uring on an NVMe device, I ran into a "Cannot allocate memory" error.

Kernel version:
root@test:~# uname -a
Linux test 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

ceph version 15.2.5

My steps are as follows:
1. Clone ceph and modify CMakeLists.txt:
include(CMakeDependentOption)
CMAKE_DEPENDENT_OPTION(WITH_LIBURING "Build with liburing library support" ON
"WITH_BLUESTORE;HAVE_LIBAIO" OFF)
set(HAVE_LIBURING ${WITH_LIBURING})
2. Run ./make-debs.sh.
3. Deploy a one-OSD cluster with the ceph-ansible 5.0 branch.
4. Stop the ceph processes and set bluestore_ioring = true in /etc/ceph/ceph.conf.
5. Restart the ceph service; the OSD then fails with "Cannot allocate memory".
Did I do something wrong?
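For reference, step 4 amounts to a one-line configuration change. The fragment below is a sketch; placing the option in the [osd] section is an assumption (it can also live under [global]):

```ini
# /etc/ceph/ceph.conf (fragment; section placement is an assumption)
[osd]
# Use io_uring instead of libaio for BlueStore block-device I/O
bluestore_ioring = true
```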

osd log:
2020-09-27T07:23:37.355+0000 7fb8fb328dc0 0 set uid:gid to 64045:64045 (ceph:ceph)
2020-09-27T07:23:37.355+0000 7fb8fb328dc0 0 ceph version 15.2.5-1-g4b2918df4b (4b2918df4b5f89cae6d6377ff9bdc4d95f87b9e9) octopus (stable), process ceph-osd, pid 731734
2020-09-27T07:23:37.355+0000 7fb8fb328dc0 0 pidfile_write: ignore empty --pid-file
2020-09-27T07:23:37.355+0000 7fb8fb328dc0 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2020-09-27T07:23:37.387+0000 7fb8fb328dc0 1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2020-09-27T07:23:37.387+0000 7fb8fb328dc0 -1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) _aio_start io_setup(2) failed: (12) Cannot allocate memory
2020-09-27T07:23:37.387+0000 7fb8fb328dc0 0 starting osd.0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2020-09-27T07:23:37.399+0000 7fb8fb328dc0 0 load: jerasure load: lrc load: isa
2020-09-27T07:23:37.399+0000 7fb8fb328dc0 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 -1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) _aio_start io_setup(2) failed: (12) Cannot allocate memory
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 0 osd.0:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 0 osd.0:1.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 0 osd.0:2.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 0 osd.0:3.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 0 osd.0:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 1 bluestore(/var/lib/ceph/osd/ceph-0) _mount path /var/lib/ceph/osd/ceph-0
2020-09-27T07:23:37.431+0000 7fb8fb328dc0 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2020-09-27T07:23:37.459+0000 7fb8fb328dc0 1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2020-09-27T07:23:37.459+0000 7fb8fb328dc0 -1 bdev(0x563828664000 /var/lib/ceph/osd/ceph-0/block) _aio_start io_setup(2) failed: (12) Cannot allocate memory
2020-09-27T07:23:37.459+0000 7fb8fb328dc0 -1 osd.0 0 OSD:init: unable to mount object store
2020-09-27T07:23:37.459+0000 7fb8fb328dc0 -1 ** ERROR: osd init failed: (12) Cannot allocate memory


Files

syslog (631 KB) syslog Jiang Yu, 09/27/2020 10:11 AM
ceph-osd.0.log (113 KB) ceph-osd.0.log Jiang Yu, 09/27/2020 10:11 AM

Related issues 3 (0 open, 3 closed)

Has duplicate Ceph - Bug #48872: KernelDevice.cc: 556: ceph_abort_msg("got unexpected error from io_getevents") (Duplicate)
Copied to bluestore - Backport #49038: pacific: Cannot allocate memory appears when using io_uring osd (Resolved, singuliere _)
Copied to bluestore - Backport #49039: octopus: Cannot allocate memory appears when using io_uring osd (Resolved)
Actions #2

Updated by Roman Penyaev over 3 years ago

Could you please try setting your MEMLOCK rlimit to unlimited? If you use systemd to start your OSDs, try the following change:

diff --git a/systemd/ceph-osd@.service.in b/systemd/ceph-osd@.service.in
index 4a2254e99478..001a1503545c 100644
--- a/systemd/ceph-osd@.service.in
+++ b/systemd/ceph-osd@.service.in
@@ -10,6 +10,7 @@ EnvironmentFile=-@SYSTEMD_ENV_FILE@
 ExecReload=/bin/kill -HUP $MAINPID
 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
 ExecStartPre=@CMAKE_INSTALL_FULL_LIBEXECDIR@/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
+LimitMEMLOCK=infinity
 LimitNOFILE=1048576
 LimitNPROC=1048576

Or update the limits.conf file accordingly.
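For setups that do not use systemd, the limits.conf entries would look roughly like the sketch below; the "ceph" user name is an assumption about which account runs the OSD processes:

```ini
# /etc/security/limits.conf (sketch; assumes OSDs run as the "ceph" user)
ceph    soft    memlock    unlimited
ceph    hard    memlock    unlimited
```

The new limit takes effect on the next login session (or service restart), and can be verified from a shell with `ulimit -l`.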

Actions #3

Updated by Jiang Yu over 3 years ago

Thank you very much for your reply; I will try it as soon as possible after the holiday.

Actions #4

Updated by Jiang Yu over 3 years ago


[Tue Oct 13 10:26:03 2020] pps pps0: new PPS source ptp2
[Tue Oct 13 10:26:03 2020] ixgbe 0000:04:00.0: registered PHC device on ens802f0
[Tue Oct 13 10:26:06 2020] ixgbe 0000:04:00.0 ens802f0: NIC Link is Up 1 Gbps, Flow Control: None
[Tue Oct 13 10:26:06 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ens802f0: link becomes ready
[Tue Oct 13 10:26:13 2020] aufs 5.4.3-20200302
[Wed Oct 14 01:45:47 2020] BUG: kernel NULL pointer dereference, address: 0000000000000008
[Wed Oct 14 01:45:47 2020] #PF: supervisor write access in kernel mode
[Wed Oct 14 01:45:47 2020] #PF: error_code(0x0002) - not-present page
[Wed Oct 14 01:45:47 2020] PGD 0 P4D 0
[Wed Oct 14 01:45:47 2020] Oops: 0002 [#1] SMP PTI
[Wed Oct 14 01:45:47 2020] CPU: 19 PID: 2909 Comm: kworker/u65:2 Tainted: G I 5.4.0-48-generic #52-Ubuntu
[Wed Oct 14 01:45:47 2020] Hardware name: Intel Corporation S2600TP/S2600TP, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[Wed Oct 14 01:45:47 2020] Workqueue: io_ring-wq io_sq_wq_submit_work
[Wed Oct 14 01:45:47 2020] RIP: 0010:io_sq_wq_submit_work+0x19d/0x550
[Wed Oct 14 01:45:47 2020] Code: 49 39 c5 74 4b 48 8d bb f8 02 00 00 89 55 8c 48 89 7d b0 e8 45 59 79 00 49 8b 8f e0 00 00 00 49 8b 87 e8 00 00 00 48 8b 7d b0 <48> 89 41 08 48 89 08 4d 89 af e0 00 00 00 4d 89 af e8 00 00 00 c6
[Wed Oct 14 01:45:47 2020] RSP: 0018:ffffac404783fdd0 EFLAGS: 00010046
[Wed Oct 14 01:45:47 2020] RAX: 0000000000000000 RBX: ffff9669991ef800 RCX: 0000000000000000
[Wed Oct 14 01:45:47 2020] RDX: 0000000000000001 RSI: fffffffffffffdef RDI: ffff9669991efaf8
[Wed Oct 14 01:45:47 2020] RBP: ffffac404783fe60 R08: 0000000000001000 R09: 0000000000000000
[Wed Oct 14 01:45:47 2020] R10: 0000000000001000 R11: 0000000000001000 R12: ffff9669991efab0
[Wed Oct 14 01:45:47 2020] R13: ffff96619fb807e0 R14: ffff966194e90d00 R15: ffff96619fb80700
[Wed Oct 14 01:45:47 2020] FS: 0000000000000000(0000) GS:ffff9661a7ac0000(0000) knlGS:0000000000000000
[Wed Oct 14 01:45:47 2020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Oct 14 01:45:47 2020] CR2: 0000000000000008 CR3: 0000000e596f2006 CR4: 00000000001606e0
[Wed Oct 14 01:45:47 2020] Call Trace:
[Wed Oct 14 01:45:47 2020] process_one_work+0x1eb/0x3b0
[Wed Oct 14 01:45:47 2020] worker_thread+0x4d/0x400
[Wed Oct 14 01:45:47 2020] kthread+0x104/0x140
[Wed Oct 14 01:45:47 2020] ? process_one_work+0x3b0/0x3b0
[Wed Oct 14 01:45:47 2020] ? kthread_park+0x90/0x90
[Wed Oct 14 01:45:47 2020] ret_from_fork+0x35/0x40
[Wed Oct 14 01:45:47 2020] Modules linked in: aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp input_leds joydev coretemp mei_me kvm_intel mei kvm ipmi_si ipmi_devintf ipmi_msghandler ioatdma rapl intel_cstate mac_hid acpi_pad acpi_power_meter sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel drm_vram_helper ttm aesni_intel drm_kms_helper crypto_simd cryptd syscopyarea glue_helper sysfillrect sysimgblt fb_sys_fops nvme megaraid_sas ixgbe drm igb ahci xfrm_algo lpc_ich i2c_i801 libahci i2c_algo_bit dca nvme_core mdio wmi
[Wed Oct 14 01:45:47 2020] CR2: 0000000000000008
[Wed Oct 14 01:45:47 2020] ---[ end trace 54e0d3a358e93a71 ]---
[Wed Oct 14 01:45:47 2020] RIP: 0010:io_sq_wq_submit_work+0x19d/0x550
[Wed Oct 14 01:45:47 2020] Code: 49 39 c5 74 4b 48 8d bb f8 02 00 00 89 55 8c 48 89 7d b0 e8 45 59 79 00 49 8b 8f e0 00 00 00 49 8b 87 e8 00 00 00 48 8b 7d b0 <48> 89 41 08 48 89 08 4d 89 af e0 00 00 00 4d 89 af e8 00 00 00 c6
[Wed Oct 14 01:45:47 2020] RSP: 0018:ffffac404783fdd0 EFLAGS: 00010046
[Wed Oct 14 01:45:47 2020] RAX: 0000000000000000 RBX: ffff9669991ef800 RCX: 0000000000000000
[Wed Oct 14 01:45:47 2020] RDX: 0000000000000001 RSI: fffffffffffffdef RDI: ffff9669991efaf8
[Wed Oct 14 01:45:47 2020] RBP: ffffac404783fe60 R08: 0000000000001000 R09: 0000000000000000
[Wed Oct 14 01:45:47 2020] R10: 0000000000001000 R11: 0000000000001000 R12: ffff9669991efab0
[Wed Oct 14 01:45:47 2020] R13: ffff96619fb807e0 R14: ffff966194e90d00 R15: ffff96619fb80700
[Wed Oct 14 01:45:47 2020] FS: 0000000000000000(0000) GS:ffff9661a7ac0000(0000) knlGS:0000000000000000
[Wed Oct 14 01:45:47 2020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Oct 14 01:45:47 2020] CR2: 0000000000000008 CR3: 0000000e596f2006 CR4: 00000000001606e0

Hello, I added LimitMEMLOCK=infinity to the systemd unit, and that solves the "Cannot allocate memory" problem.
But then the kernel crashed; the dmesg log is above.

Actions #5

Updated by Jiang Yu over 3 years ago

The kernel panic can be solved by upgrading to kernel 5.4.0-49.
But the ceph OSD then crashes after running for about 12 hours:

@
2020-10-14T08:49:26.070+0000 7f5d7d3c1700 1 osd.0 pg_epoch: 114 pg[2.e( empty local-lis/les=111/112 n=0 ec=16/16 lis/c=111/111 les/c/f=112/112/0 sis=114) [0] r=0 lpr=114 pi=[111,114)/1 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [] -> [0], acting [] -> [0], acting_primary ? -> 0, up_primary ? -> 0, role -1 -> 0, features acting 4540138292836696063 upacting 4540138292836696063
2020-10-14T08:49:26.070+0000 7f5d7d3c1700 1 osd.0 pg_epoch: 114 pg[2.e( empty local-lis/les=111/112 n=0 ec=16/16 lis/c=111/111 les/c/f=112/112/0 sis=114) [0] r=0 lpr=114 pi=[111,114)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-10-14T08:49:26.070+0000 7f5d813c9700 1 osd.0 pg_epoch: 114 pg[2.1e( empty local-lis/les=111/112 n=0 ec=16/16 lis/c=111/111 les/c/f=112/112/0 sis=114) [0] r=0 lpr=114 pi=[111,114)/1 crt=0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [] -> [0], acting [] -> [0], acting_primary ? -> 0, up_primary ? -> 0, role -1 -> 0, features acting 4540138292836696063 upacting 4540138292836696063
2020-10-14T08:49:26.070+0000 7f5d813c9700 1 osd.0 pg_epoch: 114 pg[2.1e( empty local-lis/les=111/112 n=0 ec=16/16 lis/c=111/111 les/c/f=112/112/0 sis=114) [0] r=0 lpr=114 pi=[111,114)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-10-14T08:50:43.237+0000 7f5d963f3700 -1 bdev(0x560427a64700 /var/lib/ceph/osd/ceph-0/block) _aio_thread got (4) Interrupted system call
2020-10-14T08:50:43.237+0000 7f5d97bf6700 -1 bdev(0x560427a64000 /var/lib/ceph/osd/ceph-0/block) _aio_thread got (4) Interrupted system call
2020-10-14T08:50:43.245+0000 7f5d963f3700 -1 /tmp/release/Ubuntu/WORKDIR/ceph-15.2.5-1-gb487d67976/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f5d963f3700 time 2020-10-14T08:50:43.241901+0000
/tmp/release/Ubuntu/WORKDIR/ceph-15.2.5-1-gb487d67976/src/os/bluestore/KernelDevice.cc: 518: ceph_abort_msg("got unexpected error from io_getevents")

ceph version 15.2.5-1-gb487d67976 (b487d679763071929892379bcebfcd5825ba43a4) octopus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x56041ba2e7e6]
2: (KernelDevice::_aio_thread()+0x1a67) [0x56041c18caf7]
3: (KernelDevice::AioCompletionThread::entry()+0x11) [0x56041c18e171]
4: (()+0x9609) [0x7f5d9f58b609]
5: (clone()+0x43) [0x7f5d9f0f9293]
@

Actions #6

Updated by Yanhu Cao about 3 years ago

Jiang Yu wrote:

The kernel panic problem can be solved by upgrading to 5.4.0-49.
But ceph osd will crash abnormally after running for about 12 hours

[...]

Fixed by https://github.com/ceph/ceph/pull/38901

Actions #7

Updated by Igor Fedotov about 3 years ago

  • Pull request ID set to 38901
Actions #8

Updated by Igor Fedotov about 3 years ago

  • Status changed from New to Pending Backport
  • Backport set to pacific, octopus
Actions #9

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49038: pacific: Cannot allocate memory appears when using io_uring osd added
Actions #10

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49039: octopus: Cannot allocate memory appears when using io_uring osd added
Actions #11

Updated by singuliere _ about 3 years ago

  • Has duplicate Bug #48872: KernelDevice.cc: 556: ceph_abort_msg("got unexpected error from io_getevents") added
Actions #13

Updated by Igor Fedotov about 3 years ago

  • Status changed from Pending Backport to Resolved