Support #39594

OSD marked as down, had timed out after 15, handle_connect_reply connect got RESETSESSION

Added by Alon Avrahami almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:
Description

Hi,

Recently we have been seeing random slow requests in our cluster. In the monitor's ceph.log I could see that, at the same time, osd.138 was marked as down.
In the OSD log on the OSD node itself, I could see a lot of heartbeat timeout errors and RESETSESSION messages.

Our Ceph cluster (version 12.2.10):

  cluster:
    id:     220fe94c-bd2a-475b-aeca-6ba8baaf67f3
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ecprdbcph10-opens,ecprdbcph11-opens,ecprdbcph12-opens
    mgr: ecprdbcph11-opens(active), standbys: ecprdbcph10-opens, ecprdbcph12-opens
    osd: 432 osds: 432 up, 432 in

  data:
    pools:   5 pools, 14196 pgs
    objects: 21.16M objects, 81.7TiB
    usage:   243TiB used, 386TiB / 629TiB avail
    pgs:     14193 active+clean
             3     active+clean+scrubbing+deep

  io:
    client: 565KiB/s rd, 300MiB/s wr, 1.35kop/s rd, 29.49kop/s wr

The OSD nodes use Intel S3520 SSDs (all-flash Ceph).

From the monitor logs:

2019-05-04 14:24:48.277652 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091063 : cluster [WRN] Health check failed: 6 slow requests are blocked > 32 sec. Implicated osds 2,26,253,310 (REQUEST_SLOW)
2019-05-04 14:24:53.483776 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091064 : cluster [WRN] Health check update: 657 slow requests are blocked > 32 sec. Implicated osds 2,5,26,33,64,77,82,88,89,92,138,150,198,202,213,253,295,300,307,310,317,318,320,331,371,383 (REQUEST_SLOW)
2019-05-04 14:24:59.078858 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091067 : cluster [WRN] Health check update: 720 slow requests are blocked > 32 sec. Implicated osds 2,5,15,17,26,33,52,64,77,82,88,89,92,96,106,138,150,179,180,198,202,213,244,253,294,295,300,306,307,310,317,318,320,331,338,339,346,371,383,393 (REQUEST_SLOW)
2019-05-04 14:25:00.759688 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091070 : cluster [INF] osd.138 failed (root=default,host=ecprdbcph03-opens) (2 reporters from different host after 20.000287 >= grace 20.000000)
2019-05-04 14:25:01.307509 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091074 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-05-04 14:25:02.329018 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091076 : cluster [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
2019-05-04 14:25:03.397927 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091080 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-05-04 14:25:03.432752 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091081 : cluster [INF] osd.138 10.63.4.10:6812/939555 boot
2019-05-04 14:25:04.079552 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091083 : cluster [WRN] Health check update: 450 slow requests are blocked > 32 sec. Implicated osds 5,15,17,26,52,77,82,88,89,92,96,106,108,180,196,198,201,202,226,244,253,295,300,306,307,310,317,318,320,331,338,339,343,346,360,383,393 (REQUEST_SLOW)
2019-05-04 14:25:04.411669 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091084 : cluster [WRN] Health check failed: Degraded data redundancy: 23869/63260499 objects degraded (0.038%), 13 pgs degraded (PG_DEGRADED)
2019-05-04 14:25:02.357542 osd.138 osd.138 10.63.4.10:6812/939555 5137 : cluster [WRN] Monitor daemon marked osd.138 down, but it is still running
2019-05-04 14:25:08.345950 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091086 : cluster [WRN] Health check update: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-05-04 14:25:09.080107 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091087 : cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec. Implicated osds 108 (REQUEST_SLOW)
2019-05-04 14:25:10.402311 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091088 : cluster [WRN] Health check update: Degraded data redundancy: 2/63261411 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
2019-05-04 14:25:10.402373 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091089 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2019-05-04 14:25:10.402437 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091090 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are blocked > 32 sec. Implicated osds 108)
2019-05-04 14:25:12.412343 mon.ecprdbcph10-opens mon.0 10.63.4.53:6789/0 1091091 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2/63261411 objects degraded (0.000%), 1 pg degraded)
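
For context on the down-marking above: the monitor marked osd.138 down because two reporters from a different host flagged it as unresponsive for longer than the 20-second grace period, and the OSD then reported itself as wrongly marked down. A minimal way to double-check the settings involved on our side, assuming the default admin socket paths and that these option names match our 12.2.10 build, is:

  # On a monitor host: how many distinct reporters are required before an OSD is marked down
  ceph daemon mon.ecprdbcph10-opens config get mon_osd_min_down_reporters

  # The heartbeat grace period (the "grace 20.000000" in the failure line above)
  ceph daemon mon.ecprdbcph10-opens config get osd_heartbeat_grace

  # The same grace value as seen by osd.138 on its own node
  ceph daemon osd.138 config get osd_heartbeat_grace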

osd.138 logs from the OSD node:

2019-05-04 14:24:59.399647 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.464991 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.465374 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.468935 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.468941 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.559877 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.559967 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.575380 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.575407 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.578161 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.578648 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.609742 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.609920 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.621853 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.621892 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.639703 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.639890 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.654052 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.654103 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.654132 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.654161 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.714984 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.715023 7f7e7630e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.794605 7f7e75b0d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.794654 7f7e7530c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:24:59.803753 7f7e5d2f1700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7e5d2f1700' had timed out after 15
2019-05-04 14:25:01.554932 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.12:6806/4135337 conn(0x562e1b85e800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2206 cs=5 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.555088 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.5.17:6801/4683887 conn(0x562e30277800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2610 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.555129 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.204:6849/2919375 conn(0x562e1b693800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=16851 cs=11 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.555178 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.208:6845/1757525 conn(0x562e2e3f9000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=9317 cs=81 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.555217 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.5.65:6829/662782 conn(0x562e6e115800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2714 cs=7 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.556989 7f7e7530c700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.208:6801/1757869 conn(0x562e2c91b800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2302 cs=5 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.557055 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.53:6810/3860755 conn(0x562e1b498800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1429 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.557126 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.5.65:6813/662321 conn(0x562e35f3d800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2986 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.557226 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.5.11:6849/3928333 conn(0x562e1bfaa000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1592 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.557746 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.5.17:6813/5682552 conn(0x562e1b626800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=4014 cs=11 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.558012 7f7e75b0d700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.57:6861/2447153 conn(0x562e1ce66000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=3976 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.558061 7f7e7530c700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.205:6861/3922940 conn(0x562e1c143000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2341 cs=3 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.558199 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.58:6838/3212110 conn(0x562e21a4e000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=16191 cs=21 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:01.558353 7f7e7630e700  0 -- 10.63.4.10:6837/1939555 >> 10.63.4.5:6849/2451876 conn(0x562e53383000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=13675 cs=817 l=0).handle_connect_reply connect got RESETSESSION
2019-05-04 14:25:02.357530 7f7e68307700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.138 down, but it is still running
2019-05-04 14:25:02.357550 7f7e68307700  0 log_channel(cluster) log [DBG] : map e282767 wrongly marked me down at e282766
2019-05-04 14:25:02.357554 7f7e68307700  1 osd.138 282767 start_waiting_for_healthy

I can't find a reason why the OSD was marked down. It happens from time to time and causes a lot of slow requests in our cluster.
I will upload any requested file for RCA.
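
For reference, the repeated "OSD::osd_op_tp thread ... had timed out after 15" lines indicate that an op worker thread exceeded its internal heartbeat timeout (osd_op_thread_timeout, 15 seconds by default), which is typically why peers then report the OSD down. A quick sketch of how we can pull the relevant values and inspect what the thread was working on, assuming the default admin socket path on the OSD node, is:

  # Internal heartbeat timeouts for the op worker threads
  ceph daemon osd.138 config get osd_op_thread_timeout
  ceph daemon osd.138 config get osd_op_thread_suicide_timeout

  # Ops currently in flight and the most recently completed ops with their durations
  ceph daemon osd.138 dump_ops_in_flight
  ceph daemon osd.138 dump_historic_ops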


Files

ceph.log.1 (6.22 KB) - Monitor logs - Alon Avrahami, 05/06/2019 09:52 AM
ceph-osd.138.log.1.gz (346 KB) - OSD logs - Alon Avrahami, 05/06/2019 09:52 AM
#1 - Updated by Neha Ojha almost 5 years ago

  • Tracker changed from Bug to Support

The ceph-users mailing list might be a good place to seek help on this kind of issue.
