Bug #40586
OSDs get killed by OOM due to a broken switch
Description
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136278] Node 1 Normal: 26515*4kB (UEM) 1226*8kB (UEM) 40*16kB (UEM) 84*32kB (EM) 9*64kB (E) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 119772kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136291] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136293] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136296] 145262 total pagecache pages
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136298] 0 pages in swap cache
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136300] Swap cache stats: add 0, delete 0, find 0/0
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136301] Free swap = 0kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136302] Total swap = 0kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136304] 16670077 pages RAM
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136306] 0 pages HighMem/MovableOnly
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136307] 397117 pages reserved
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136309] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136857] [1471476] 167 1471476 1205637 775120 1958 0 0 ceph-osd
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.137077] Out of memory: Kill process 1471476 (ceph-osd) score 47 or sacrifice child
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.146054] Killed process 1471476 (ceph-osd) total-vm:4822548kB, anon-rss:3097860kB, file-rss:2556kB, shmem-rss:0kB
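The per-process row in the OOM report counts memory in pages, while the "Killed process" line reports kB. A minimal sketch of converting between the two is shown here, assuming 4 kB pages (an assumption, though it is consistent with the figures in this log); `parse_oom_row` and `PAGE_KB` are hypothetical names used only for illustration:

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages; the log itself does not state the page size

# Matches one row of the OOM task table:
# [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+"
    r"(?P<swapents>\d+)\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)


def parse_oom_row(line):
    """Parse one task-table row and add kB equivalents of the page counts."""
    m = ROW.search(line)
    if m is None:
        return None
    row = {k: (v if k == "name" else int(v)) for k, v in m.groupdict().items()}
    row["total_vm_kb"] = row["total_vm"] * PAGE_KB
    row["rss_kb"] = row["rss"] * PAGE_KB
    return row


line = ("Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136857] "
        "[1471476] 167 1471476 1205637 775120 1958 0 0 ceph-osd")
row = parse_oom_row(line)
print(row["name"], row["total_vm_kb"], row["rss_kb"])
```

For the ceph-osd row above, `total_vm_kb` comes out as 4822548, which matches the `total-vm:4822548kB` value on the "Killed process" line.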
There are also many blocked heartbeat messages on the OSD side:
2019-06-12 09:06:22.272360 7fc6c775d700 -1 osd.35 17677 heartbeat_check: no reply from [2025:3100::15]:6817 osd.56 since back 2019-06-11 21:09:50.945481 front 2019-06-11 21:09:50.944307 (oldest deadline 2019-06-11 21:27:45.460051)
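To gauge how long a peer has been silent, the timestamps in a `heartbeat_check` line can be subtracted from the time the line was logged. The following is a rough sketch that assumes the exact line layout shown above; `heartbeat_silence` is a hypothetical helper and not part of Ceph:

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumes the layout of the log line quoted in this report.
HB = re.compile(
    rf"^(?P<now>{TS}) .*heartbeat_check: no reply from (?P<addr>\S+) "
    rf"(?P<peer>osd\.\d+) since back (?P<back>{TS}) front (?P<front>{TS})"
)


def heartbeat_silence(line):
    """Return the peer name and seconds since its last back/front reply."""
    m = HB.search(line)
    if m is None:
        return None
    now = datetime.strptime(m["now"], FMT)
    return {
        "peer": m["peer"],
        "addr": m["addr"],
        "back_silence_s": (now - datetime.strptime(m["back"], FMT)).total_seconds(),
        "front_silence_s": (now - datetime.strptime(m["front"], FMT)).total_seconds(),
    }


line = ("2019-06-12 09:06:22.272360 7fc6c775d700 -1 osd.35 17677 heartbeat_check: "
        "no reply from [2025:3100::15]:6817 osd.56 since back "
        "2019-06-11 21:09:50.945481 front 2019-06-11 21:09:50.944307 "
        "(oldest deadline 2019-06-11 21:27:45.460051)")
info = heartbeat_silence(line)
print(info["peer"], round(info["back_silence_s"] / 3600, 2), "hours")
```

For the line quoted above, the gap between the log time and the last `back` reply is just under 12 hours.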
History
#1 Updated by Greg Farnum over 4 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
- Component(RADOS) OSD added
Is this something you're working on, Xie?
#2 Updated by xie xingguo over 4 years ago
Greg Farnum wrote:
Is this something you're working on, Xie?
Ah, sorry, I forgot to link the PR; it should be all set now.
#3 Updated by xie xingguo over 4 years ago
- Pull request ID set to 28752
#4 Updated by Kefu Chai over 4 years ago
- Status changed from New to Pending Backport
#5 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #40625: nautilus: OSDs get killed by OOM due to a broken switch added
#6 Updated by Neha Ojha over 4 years ago
- Status changed from Pending Backport to Resolved