Bug #40586

OSDs get killed by OOM due to a broken switch

Added by xie xingguo 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 06/29/2019
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136278] Node 1 Normal: 26515*4kB (UEM) 1226*8kB (UEM) 40*16kB (UEM) 84*32kB (EM) 9*64kB (E) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 119772kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136291] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136293] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136296] 145262 total pagecache pages
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136298] 0 pages in swap cache
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136300] Swap cache stats: add 0, delete 0, find 0/0
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136301] Free swap = 0kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136302] Total swap = 0kB
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136304] 16670077 pages RAM
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136306] 0 pages HighMem/MovableOnly
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136307] 397117 pages reserved
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136309] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.136857] [1471476] 167 1471476 1205637 775120 1958 0 0 ceph-osd
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.137077] Out of memory: Kill process 1471476 (ceph-osd) score 47 or sacrifice child
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.146054] Killed process 1471476 (ceph-osd) total-vm:4822548kB, anon-rss:3097860kB, file-rss:2556kB, shmem-rss:0kB

There are a lot of blocked heartbeat messages on the OSD side:

2019-06-12 09:06:22.272360 7fc6c775d700 -1 osd.35 17677 heartbeat_check: no reply from [2025:3100::15]:6817 osd.56 since back 2019-06-11 21:09:50.945481 front 2019-06-11 21:09:50.944307 (oldest deadline 2019-06-11 21:27:45.460051)
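
To make the scale of the stall easier to see, the following Python sketch computes how long the peer had been silent when the heartbeat_check line above was logged. The regular expression is inferred from this one line only, so other Ceph versions may format the message differently; the function name heartbeat_silence is hypothetical.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Pattern inferred from the single heartbeat_check line in this report.
HB_RE = re.compile(
    rf"^(?P<now>{TS}) .*heartbeat_check: no reply from (?P<addr>\S+) "
    rf"(?P<peer>osd\.\d+) since back (?P<back>{TS}) front (?P<front>{TS})"
)


def _parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def heartbeat_silence(line):
    """Return the peer name and the time elapsed since its last back/front reply."""
    m = HB_RE.search(line)
    if m is None:
        return None
    now = _parse_ts(m["now"])
    return {
        "peer": m["peer"],
        "addr": m["addr"],
        "back_gap": now - _parse_ts(m["back"]),
        "front_gap": now - _parse_ts(m["front"]),
    }


line = ("2019-06-12 09:06:22.272360 7fc6c775d700 -1 osd.35 17677 heartbeat_check: "
        "no reply from [2025:3100::15]:6817 osd.56 since back 2019-06-11 21:09:50.945481 "
        "front 2019-06-11 21:09:50.944307 (oldest deadline 2019-06-11 21:27:45.460051)")
info = heartbeat_silence(line)
print(info["peer"], info["back_gap"], info["front_gap"])
```

For the line shown, both gaps come out at just under twelve hours, meaning osd.35 had not heard from osd.56 on either interface since the previous evening.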


Related issues

Copied to RADOS - Backport #40625: nautilus: OSDs get killed by OOM due to a broken switch Resolved

History

#1 Updated by Greg Farnum 3 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Component(RADOS) OSD added

Is this something you're working on, Xie?

#2 Updated by xie xingguo 3 months ago

Greg Farnum wrote:

Is this something you're working on, Xie?

Ah, sorry, I forgot to link the PR. It should be all set now.

#3 Updated by xie xingguo 3 months ago

  • Pull request ID set to 28752

#4 Updated by Kefu Chai 3 months ago

  • Status changed from New to Pending Backport

#5 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #40625: nautilus: OSDs get killed by OOM due to a broken switch added

#6 Updated by Neha Ojha about 1 month ago

  • Status changed from Pending Backport to Resolved
