Project

General

Profile

Bug #36364

Bluestore OSD IO Hangs near Flush (flush in 90.330556)

Added by Gavin Baker about 4 years ago. Updated almost 4 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our Mimic Ceph cluster has been having an issue with intermittent slow OPs (metadata, etc) that seem to hang various OSDs. These OSDs are eventaully marked down after op_thread timeouts. This issue appears to have cropped up since our conversion from Filestore to Bluestore, but pre-dates our switch to Mimic. The issue affects all of our Bluestore OSDs and happens intermittently making it hard to troubleshoot. We most frequently experience this when backfilling (rebalancing or adding OSDs) or when scrubbing, but it also occurs with only client IO. CephFS client load does not seem to directly cause the issue, but does seem to make the hung OSDs occur more frequently.

Our Cluster consists of 7 OSD nodes (3x 60 10TB OSDs, 4x 24 10TB OSDs) for a total of 276 OSDs. Each OSD node has a single 40GbE NIC for public/cluster network traffic and runs CentOS 7.5. Most of our client IO comes from CephFS traffic for computational clients, but we do have a few RBDs that run low-load admin services. Our cluster skews to storing smallish files and contains 300+ M objects with 303 TiB used (we also have some very large files).

In the interest of ruling out OS/hardware issues we have extensively tested and monitored the network stack and OS. Network drivers have been upgraded and the OS/Ceph packages are kept up with the most recent stable releases. We did take care of a minor issue with dropped RX packets that had no impact on the issue described above (fixed with increased RX ring buffer). We also tested turning off the Meltdown/Spectre microcode patches to determine if increased I/O overhead was to blame. This seemed to provide limited to no improvement for our OSD stability. We also monitor smartctl to keep an eye out for failing drives. Drives with pending sectors or large numbers of errors/realloced sectors are removed and replaced.

In the end we have traced the issue to the Bluestore OSDs (as far as we can tell). iostat indicates that when OSDs are hung utilization for that disk is 100% despite all other values (except ) are 0. When the OSD returns the OSD logs indicate a flush has taken a very long time to complete("bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) flush in 90.330556").

  data:
    pools:   4 pools, 17664 pgs
    objects: 307.9 M objects, 303 TiB
    usage:   937 TiB used, 1.5 PiB / 2.5 PiB avail

Here is some relevant cluster state/config info and logs from one crashed OSD with debug_bluestore set to 10.

# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)

# uname -r
3.10.0-862.14.4.el7.x86_64

# cat /etc/*release
CentOS Linux release 7.5.1804 (Core) 
NAME="CentOS Linux" 
VERSION="7 (Core)" 
ID="centos" 
ID_LIKE="rhel fedora" 
VERSION_ID="7" 
PRETTY_NAME="CentOS Linux 7 (Core)" 
ANSI_COLOR="0;31" 
CPE_NAME="cpe:/o:centos:centos:7" 
HOME_URL="https://www.centos.org/" 
BUG_REPORT_URL="https://bugs.centos.org/" 

CENTOS_MANTISBT_PROJECT="CentOS-7" 
CENTOS_MANTISBT_PROJECT_VERSION="7" 
REDHAT_SUPPORT_PRODUCT="centos" 
REDHAT_SUPPORT_PRODUCT_VERSION="7" 

CentOS Linux release 7.5.1804 (Core) 
CentOS Linux release 7.5.1804 (Core)    

# ceph osd count-metadata osd_objectstore
{
    "bluestore": 276
}

# ceph config dump
WHO    MASK LEVEL    OPTION                     VALUE                                                        RO 
...
global      advanced osd_heartbeat_grace        40                                                              
global      advanced osd_heartbeat_interval     12                                                              
global      advanced osd_mon_heartbeat_interval 60                                                              
global      advanced osd_mon_report_interval    12                                                              
global      advanced osd_pool_default_size      3                                                               
 ...                                                 
  osd       advanced debug_bluestore            10/20                                                                                                                       
  osd       advanced osd_op_complaint_time      45.000000                                                       
  osd       advanced osd_op_queue               prioritized                                                  *  
  osd       advanced osd_op_thread_timeout      30                                                              
  osd       advanced osd_recovery_sleep_hdd     0.200000                                                        
  osd       advanced osd_scrub_begin_hour       16                                                              
  osd       advanced osd_scrub_end_hour         8                                                              
  osd       advanced osd_scrub_sleep            0.200000                                                                                                            
#tail /var/log/messages    
Oct  9 11:59:57 bstore1 ceph-osd: 2018-10-09 11:59:57.119 7fcb15662700 -1 osd.237 813090 get_health_metrics reporting 9 slow ops, oldest is osd_op(mds.0.385003:3737195 2.805 2:a011d57d:::200.005d20cb:head [write 2485722~2968 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e813081)
# iostat -x 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.56    0.00    3.59   35.49    0.00   56.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
sdh               0.33    23.67   43.00   72.67  9440.00  7245.33   288.51     1.90   16.41   27.45    9.88   3.26  37.70
sds               0.00    19.00    0.00   67.00     0.00  5898.67   176.08     0.50    7.61    0.00    7.61   4.10  27.47
sdu               0.00    25.67   50.67  100.33 11744.00 15214.67   357.07     2.39   15.84   25.34   11.05   3.27  49.43
sdd               0.67    27.33   36.33  120.67  8838.67 17849.33   339.97     2.37   15.12   28.97   10.95   3.92  61.57
sdi               0.67   126.33   30.00  127.67  6976.00 17444.00   309.77     2.11   13.42   29.83    9.56   3.49  55.10
sdx               1.67     6.33   69.33   11.67 14969.33    86.67   371.75     1.59   18.23   20.03    7.54   2.99  24.20
sdo               1.00     5.67   49.67    2.00 11644.00    26.67   451.77     0.92   17.82   18.30    6.00   2.95  15.27
sdw               0.00    22.33    0.00  115.00     0.00 18294.67   318.17     0.97    8.46    0.00    8.46   3.33  38.30
sdj               0.00    27.00  341.33  103.67  5588.00 12138.67    79.67     2.29    5.00    2.17   14.34   1.23  54.93
sdav              0.00    13.33    0.00   57.00     0.00  7234.67   253.85     0.76   13.29    0.00   13.29   4.23  24.10
sdbf              1.33    26.33  107.33  117.00  6156.00 17884.00   214.32     1.78    7.93    4.88   10.74   2.10  47.10
sdm               0.00    29.33    0.67  113.67     2.67 15373.33   268.97     1.11    9.69   12.50    9.67   3.58  40.90
...
sdbg              0.00     0.00    0.00    0.00     0.00     0.00     0.00     7.00    0.00    0.00    0.00   0.00 100.00
...
              
# Debug logs of hung OSD, threads eventually time out
2018-10-09 10:01:47.507 7f3399d30700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) flush in 0.008229
2018-10-09 10:01:47.508 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) _sync_write 0x45dca418000~7000 buffered
2018-10-09 10:01:47.522 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) flush in 0.013645
2018-10-09 10:01:47.650 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) _sync_write 0x45dca41e000~3000 buffered
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56ef0000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56ef0000~80000 aio 0x55b0ee084010
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56f70000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56f70000~80000 aio 0x55b0bb31eb50
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea57070000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea57070000~80000 aio 0x55b0bb31f7d0
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56ff0000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56ff0000~80000 aio 0x55b0bdefca10
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56d70000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56d70000~80000 aio 0x55b0bdefd7d0
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56cf0000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56cf0000~80000 aio 0x55b0bdefc790
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56e70000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56e70000~80000 aio 0x55b0ee084a10
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56df0000~80000
2018-10-09 10:01:47.926 7f338f51b700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_read 0xea56df0000~80000 aio 0x55b0ee084f10
2018-10-09 10:01:48.326 7f338ed1a700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) read 0x2316c7f0000~1000 (direct)
2018-10-09 10:01:54.317 7f338dd18700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x46344535077~ebf
2018-10-09 10:02:04.855 7f338dd18700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d1c0000~10000 aio 0x55b0e40c17d0
2018-10-09 10:02:09.397 7f338dd18700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x46359976598~124d
2018-10-09 10:02:14.443 7f338d517700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x4635dd3bc14~f35
2018-10-09 10:02:18.342 7f33a9840700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.342 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.429 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.429 7f33aa041700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.488 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.488 7f33a9840700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:02:18.537 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30

# Flush appears to have taken 90 seconds here
2018-10-09 10:03:07.902 7f33a9840700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:07.921 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 16 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:08.881 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 16 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:09.836 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 20 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:10.795 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 20 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:11.667 7f33a883e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338ed1a700' had timed out after 30
2018-10-09 10:03:11.667 7f33a883e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:11.756 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 20 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:12.739 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 20 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:13.740 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 20 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:14.747 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 23 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:15.758 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 23 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:16.667 7f33a883e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338ed1a700' had timed out after 30
2018-10-09 10:03:16.667 7f33a883e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:16.730 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 23 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:16.803 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338ed1a700' had timed out after 30
2018-10-09 10:03:16.803 7f33a903f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:16.803 7f33a9840700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338ed1a700' had timed out after 30
2018-10-09 10:03:16.803 7f33a9840700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:17.778 7f33a4d46700 -1 osd.133 812202 get_health_metrics reporting 23 slow ops, oldest is osd_repop(client.43887872.0:35847302 1.1aaa e812202/812001)
2018-10-09 10:03:17.899 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d1f0000~10000 aio 0x55b09d829a50
2018-10-09 10:03:17.899 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d200000~f000 aio 0x55b09d828dd0
2018-10-09 10:03:17.899 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d220000~10000 aio 0x55b08a4e4510
2018-10-09 10:03:17.899 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d230000~f000 aio 0x55b08a4e4010
2018-10-09 10:03:17.904 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d250000~10000 aio 0x55b0833c2650
2018-10-09 10:03:17.904 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d260000~f000 aio 0x55b0833c2290
2018-10-09 10:03:17.904 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d280000~10000 aio 0x55b0833c3b90
2018-10-09 10:03:17.904 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d290000~f000 aio 0x55b0833c3910
2018-10-09 10:03:17.910 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d2b0000~10000 aio 0x55b0e81b9a50
2018-10-09 10:03:17.910 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d2c0000~f000 aio 0x55b0b824ec90
2018-10-09 10:03:17.910 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d2e0000~10000 aio 0x55b0ebf91e10
2018-10-09 10:03:17.910 7f339c89e700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d2f0000~f000 aio 0x55b0ebf91550
2018-10-09 10:03:17.943 7f338f51b700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f338f51b700' had timed out after 30
2018-10-09 10:03:17.944 7f338f51b700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x463712eb51a~10e4
2018-10-09 10:03:17.944 7f33a903f700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.3:6888/4079901 conn(0x55b086b76000 :7033 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2234 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:17.944 7f338f51b700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x463392854f6~102f
2018-10-09 10:03:17.945 7f338f51b700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) read_random 0x463837f0448~ef7
2018-10-09 10:03:17.951 7f338ed1a700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f338ed1a700' had timed out after 30
2018-10-09 10:03:17.952 7f33aa041700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.2:6805/53756 conn(0x55b0cce36c00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1540 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d320000~10000 aio 0x55b0ae23cb50
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d330000~f000 aio 0x55b0ae23da50
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d350000~10000 aio 0x55b0ae23dcd0
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d360000~f000 aio 0x55b0ae23ca10
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d380000~10000 aio 0x55b0ae23d690
2018-10-09 10:03:17.955 7f339d8a0700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) aio_write 0x6326d390000~f000 aio 0x55b0ae23c8d0
2018-10-09 10:03:17.980 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) flush in 90.330556
2018-10-09 10:03:17.980 7f33a9840700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.2:6858/53974 conn(0x55b0b1a03200 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1650 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:17.984 7f3399d30700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) flush in 0.004259
2018-10-09 10:03:17.985 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) _sync_write 0x45dca420000~51000 buffered
2018-10-09 10:03:18.011 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) flush in 0.026079
2018-10-09 10:03:18.012 7f3399d30700  5 bdev(0x55b0795dc000 /var/lib/ceph/osd/ceph-133/block) flush in 0.000225
2018-10-09 10:03:18.012 7f3399d30700  5 bdev(0x55b0795dca80 /var/lib/ceph/osd/ceph-133/block) _sync_write 0x45dca470000~7000 buffered
2018-10-09 10:03:18.012 7f33aa041700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.3:6942/1079283 conn(0x55b0a79d4c00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1594 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33a903f700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.2:6966/1054543 conn(0x55b0a7dbe000 :7033 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1565 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33aa041700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.2:6947/3053834 conn(0x55b0be469800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2366 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33a9840700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.2:6894/2053764 conn(0x55b0e1122400 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1950 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33a9840700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.3:7008/5079187 conn(0x55b0db953000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2647 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33a903f700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.6:6890/15309 conn(0x55b0a7c38a00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1563 cs=1 l=0).handle_connect_reply connect got RESETSESSION
2018-10-09 10:03:18.012 7f33a9840700  0 -- 192.168.6.1:7033/3251877 >> 192.168.6.3:6807/1079941 conn(0x55b0a7d77200 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=1746 cs=1 l=0).handle_connect_reply connect got RESETSESSION

History

#1 Updated by Igor Fedotov about 4 years ago

Gavin,
some questions, please.

What kind of disks are used for BlueStore?
What's the HW config for OSD nodes?

You have just main device per OSD, no DB/WAL, right?

#2 Updated by Igor Fedotov about 4 years ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)

#3 Updated by Gavin Baker about 4 years ago

We used the default configuration for the Bluestore OSD creation, so all DB/WAL on the OSDs. Our cluster does not contain any SSDs at the moment to offload those features. All OSDs are 10TB Seagate Enterprise HDDs. CPU load is minimal on the entire cluster, even under heavy client/backfill load. Network traffic never gets close to maxing out the NICs.

60 Drive Nodes: (we used to run with 40 drives, had the same issues)

2x Intel Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (10 cores/socket)
512GB Memory
2x SSDs (logs/OS)
1x 40GbE NIC

24 Drive Nodes:

Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
256GB Memory
2x SSDs (logs/OS)
1x 40GbE NIC

#4 Updated by Gavin Baker about 4 years ago

To clarify the behaviour I see on iostat... The disk util of a hung disk goes to 100, while the average queue length stays at the same value (avgqu-sz == some positive value). All other iostat values are 0 for the affected disk. Its as if no IO is happening on the disk despite the high utilization. This eventually hangs all the IO in the Ceph Storage cluster making the other disk's stats drop to zero. There is no dmesg or /var/log/message (other than Ceph entries) output that would indicate a hardware or external software issue.

#5 Updated by Sage Weil about 4 years ago

  • Status changed from New to Need More Info

This flush time is suspiciously close to 90s (flush in 90.330556)...

These aren't SMR drives, right?

#6 Updated by Sage Weil about 4 years ago

The code is just timing fdatasync(2), so the problem is almost certainly going to be below ceph (kernel or hardware)

  utime_t start = ceph_clock_now();
  int r = ::fdatasync(fd_direct);
  utime_t end = ceph_clock_now();
  utime_t dur = end - start;
  if (r < 0) {
    r = -errno;
    derr << __func__ << " fdatasync got: " << cpp_strerror(r) << dendl;
    ceph_abort();
  }
  dout(5) << __func__ << " in " << dur << dendl;;

#7 Updated by Gavin Baker about 4 years ago

No not SMR, these drives are Seagate Exos 10TB Enterprise sata drives. We are seeing this behavior on multiple types of chassis/mobo and raid card models. It could be the drives, but I dont see any obvious issues smart issues with them when I run health checks. I have gone through and updated raid/bios firmware and drivers on all of our storage servers. One thing that bothers me is that when a drive has this issue it seems to cause a lot of IO to hang (slow ops) on various OSDs. Maybe thats just a result of a high PG count? Do you have any other advice on further troubleshooting steps?

#8 Updated by Gavin Baker about 4 years ago

I also less frequently get these dmesg errors. Not sure if they are related.

[2041833.966150] bstore_kv_sync D ffff9b0dcffa0000 0 79243 1 0x00000080
[2041833.966153] Call Trace:
[2041833.966161] [<ffffffff9d72094d>] ? blk_peek_request+0x9d/0x2a0
[2041833.966166] [<ffffffff9db18f39>] schedule+0x29/0x70
[2041833.966169] [<ffffffff9db168a9>] schedule_timeout+0x239/0x2c0
[2041833.966171] [<ffffffff9d71d5e9>] ? __blk_run_queue+0x39/0x50
[2041833.966173] [<ffffffff9d721273>] ? blk_queue_bio+0x3b3/0x400
[2041833.966177] [<ffffffff9d4fa982>] ? ktime_get_ts64+0x52/0xf0
[2041833.966180] [<ffffffff9db1844d>] io_schedule_timeout+0xad/0x130
[2041833.966182] [<ffffffff9db1956d>] wait_for_completion_io+0xfd/0x140
[2041833.966185] [<ffffffff9d4d2010>] ? wake_up_state+0x20/0x20
[2041833.966187] [<ffffffff9d722c44>] blkdev_issue_flush+0xb4/0x110
[2041833.966191] [<ffffffff9d65d305>] blkdev_fsync+0x35/0x50
[2041833.966194] [<ffffffff9d653277>] do_fsync+0x67/0xb0
[2041833.966196] [<ffffffff9d653583>] SyS_fdatasync+0x13/0x20
[2041833.966200] [<ffffffff9db2579b>] system_call_fastpath+0x22/0x27

#9 Updated by Gavin Baker about 4 years ago

[2041833.966145] INFO: task bstore_kv_sync:79243 blocked for more than 120 seconds.
[2041833.966148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2041833.966150] bstore_kv_sync D ffff9b0dcffa0000 0 79243 1 0x00000080
[2041833.966153] Call Trace:
[2041833.966161] [<ffffffff9d72094d>] ? blk_peek_request+0x9d/0x2a0
[2041833.966166] [<ffffffff9db18f39>] schedule+0x29/0x70
[2041833.966169] [<ffffffff9db168a9>] schedule_timeout+0x239/0x2c0
[2041833.966171] [<ffffffff9d71d5e9>] ? __blk_run_queue+0x39/0x50
[2041833.966173] [<ffffffff9d721273>] ? blk_queue_bio+0x3b3/0x400
[2041833.966177] [<ffffffff9d4fa982>] ? ktime_get_ts64+0x52/0xf0
[2041833.966180] [<ffffffff9db1844d>] io_schedule_timeout+0xad/0x130
[2041833.966182] [<ffffffff9db1956d>] wait_for_completion_io+0xfd/0x140
[2041833.966185] [<ffffffff9d4d2010>] ? wake_up_state+0x20/0x20
[2041833.966187] [<ffffffff9d722c44>] blkdev_issue_flush+0xb4/0x110
[2041833.966191] [<ffffffff9d65d305>] blkdev_fsync+0x35/0x50
[2041833.966194] [<ffffffff9d653277>] do_fsync+0x67/0xb0
[2041833.966196] [<ffffffff9d653583>] SyS_fdatasync+0x13/0x20
[2041833.966200] [<ffffffff9db2579b>] system_call_fastpath+0x22/0x27

#10 Updated by Igor Fedotov about 4 years ago

May be benchmark this drive using FIO?
And try to simulate the use pattern: mixed read + write + fdatasync.

#11 Updated by Igor Fedotov about 4 years ago

BTW - do these drives/controllers have write caching enabled? May be try to disable if so? AFAIR there were some talks at ceph-users about issues with write caching recently.

#12 Updated by Gavin Baker about 4 years ago

Igor Fedotov wrote:

BTW - do these drives/controllers have write caching enabled? May be try to disable if so? AFAIR there were some talks at ceph-users about issues with write caching recently.

All drives and controllers have write caching turned off. All controllers are in HBA mode.

#13 Updated by Gavin Baker about 4 years ago

Igor Fedotov wrote:

May be benchmark this drive using FIO?
And try to simulate the use pattern: mixed read + write + fdatasync.

I will be working on testing this after I drain a few OSDs in the next few days.

#14 Updated by Sage Weil almost 4 years ago

  • Status changed from Need More Info to Can't reproduce

Also available in: Atom PDF