Bug #21318 (closed)

segv in rocksdb::BlockBasedTable::NewIndexIterator

Added by Tomasz Kusmierz over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-disk
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi. If you want more background, there is a thread on ceph-users:
"OSD's flapping on ordinary scrub with cluster being static (after upgrade to 12.1.1)"
There I was told to:
1. go and install 12.2, so I waited until it was available for Proxmox from their own tree and upgraded today to the stable release;
2. open a bug here.

So the problem is that the OSDs crash both on a normal scrub and on a deep scrub.
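
In case it helps with reproduction: the crash may be re-triggerable by manually scrubbing one of the PGs that was in flight when an OSD died. This is only a sketch, and 1.d4 is simply the PG from the deep-scrub log below:

ceph pg deep-scrub 1.d4   # ask the primary OSD to deep-scrub this PG
ceph pg scrub 1.d4        # or a normal scrub
ceph -w                   # watch the cluster log for the OSD being marked down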

Some data on the main issue:

FAIL ON DEEP SCRUB
2017-08-26 01:50:04.421944 osd.0 osd.0 192.168.1.240:6814/12991 7 : cluster [INF] 6.5 scrub ok
2017-08-26 01:50:09.331095 osd.0 osd.0 192.168.1.240:6814/12991 8 : cluster [INF] 1.1c scrub starts
2017-08-26 01:51:03.339509 osd.0 osd.0 192.168.1.240:6814/12991 9 : cluster [INF] 1.1c scrub ok
2017-08-26 02:21:00.706695 osd.10 osd.10 192.168.1.240:6806/21564 7 : cluster [INF] 1.d1 scrub starts
2017-08-26 02:21:34.066183 osd.10 osd.10 192.168.1.240:6806/21564 8 : cluster [INF] 1.d1 scrub ok
2017-08-26 02:21:56.943046 osd.8 osd.8 192.168.1.240:6810/22002 7 : cluster [INF] 1.17 scrub starts
2017-08-26 02:23:06.341409 osd.8 osd.8 192.168.1.240:6810/22002 8 : cluster [INF] 1.17 scrub ok
2017-08-26 02:35:51.099649 osd.8 osd.8 192.168.1.240:6810/22002 9 : cluster [INF] 1.89 scrub starts
2017-08-26 02:36:42.605600 osd.8 osd.8 192.168.1.240:6810/22002 10 : cluster [INF] 1.89 scrub ok
2017-08-26 02:38:27.132698 osd.8 osd.8 192.168.1.240:6810/22002 11 : cluster [INF] 1.ce scrub starts
2017-08-26 02:38:49.820489 osd.8 osd.8 192.168.1.240:6810/22002 12 : cluster [INF] 1.ce scrub ok
2017-08-26 03:23:27.619669 osd.8 osd.8 192.168.1.240:6810/22002 13 : cluster [INF] 1.8c scrub starts
2017-08-26 03:23:49.679403 osd.8 osd.8 192.168.1.240:6810/22002 14 : cluster [INF] 1.8c scrub ok
2017-08-26 03:32:19.475812 osd.0 osd.0 192.168.1.240:6814/12991 10 : cluster [INF] 1.d4 deep-scrub starts
2017-08-26 03:38:46.708163 mon.0 mon.0 192.168.1.240:6789/0 1201 : cluster [INF] osd.0 failed (root=default,host=proxmox1) (connection refused reported by osd.8)
2017-08-26 03:38:46.759470 mon.0 mon.0 192.168.1.240:6789/0 1207 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-08-26 03:38:49.820122 mon.0 mon.0 192.168.1.240:6789/0 1212 : cluster [WRN] Health check failed: Reduced data availability: 12 pgs inactive (PG_AVAILABILITY)
2017-08-26 03:38:49.820165 mon.0 mon.0 192.168.1.240:6789/0 1213 : cluster [WRN] Health check failed: Degraded data redundancy: 292260/3786364 objects degraded (7.719%), 38 pgs unclean, 38 pgs degraded (PG_DEGRADED)
2017-08-26 03:38:51.088934 mon.0 mon.0 192.168.1.240:6789/0 1214 : cluster [WRN] Health check update: Reduced data availability: 16 pgs inactive (PG_AVAILABILITY)
2017-08-26 03:38:51.088975 mon.0 mon.0 192.168.1.240:6789/0 1215 : cluster [WRN] Health check update: Degraded data redundancy: 392568/3786364 objects degraded (10.368%), 52 pgs unclean, 52 pgs degraded (PG_DEGRADED)
2017-08-26 03:38:53.090178 mon.0 mon.0 192.168.1.240:6789/0 1216 : cluster [WRN] Health check update: Reduced data availability: 29 pgs inactive (PG_AVAILABILITY)
2017-08-26 03:38:53.090216 mon.0 mon.0 192.168.1.240:6789/0 1217 : cluster [WRN] Health check update: Degraded data redundancy: 592033/3786364 objects degraded (15.636%), 82 pgs unclean, 82 pgs degraded (PG_DEGRADED)
2017-08-26 03:39:37.928816 mon.0 mon.0 192.168.1.240:6789/0 1220 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-08-26 03:39:37.941007 mon.0 mon.0 192.168.1.240:6789/0 1221 : cluster [INF] osd.0 192.168.1.240:6814/15727 boot
2017-08-26 03:39:39.949551 mon.0 mon.0 192.168.1.240:6789/0 1226 : cluster [WRN] Health check update: Degraded data redundancy: 436309/3786364 objects degraded (11.523%), 82 pgs unclean, 60 pgs degraded (PG_DEGRADED)
2017-08-26 03:39:41.974996 mon.0 mon.0 192.168.1.240:6789/0 1227 : cluster [WRN] Health check update: Degraded data redundancy: 379236/3786364 objects degraded (10.016%), 74 pgs unclean, 52 pgs degraded (PG_DEGRADED)
2017-08-26 03:39:43.120495 mon.0 mon.0 192.168.1.240:6789/0 1228 : cluster [WRN] Health check update: Degraded data redundancy: 22 pgs unclean (PG_DEGRADED)
2017-08-26 03:39:43.120534 mon.0 mon.0 192.168.1.240:6789/0 1229 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 29 pgs inactive)
2017-08-26 03:39:45.121340 mon.0 mon.0 192.168.1.240:6789/0 1230 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 22 pgs unclean)
2017-08-26 03:39:45.121386 mon.0 mon.0 192.168.1.240:6789/0 1231 : cluster [INF] Cluster is now healthy
2017-08-26 03:40:11.568499 osd.10 osd.10 192.168.1.240:6806/21564 9 : cluster [INF] 1.b5 scrub starts
2017-08-26 03:40:51.874519 osd.10 osd.10 192.168.1.240:6806/21564 10 : cluster [INF] 1.b5 scrub ok
2017-08-26 03:41:15.794026 osd.8 osd.8 192.168.1.240:6810/22002 15 : cluster [INF] 1.77 scrub starts
2017-08-26 03:42:19.561924 osd.8 osd.8 192.168.1.240:6810/22002 16 : cluster [INF] 1.77 scrub ok
2017-08-26 03:42:30.895351 osd.0 osd.0 192.168.1.240:6814/15727 1 : cluster [INF] 1.d4 deep-scrub starts
2017-08-26 03:42:30.842869 osd.8 osd.8 192.168.1.240:6810/22002 17 : cluster [INF] 1.12 scrub starts
2017-08-26 03:43:15.478366 osd.8 osd.8 192.168.1.240:6810/22002 18 : cluster [INF] 1.12 scrub ok
2017-08-26 03:47:17.962016 osd.0 osd.0 192.168.1.240:6814/15727 2 : cluster [INF] 1.d4 deep-scrub ok
2017-08-26 03:48:30.668792 osd.10 osd.10 192.168.1.240:6806/21564 11 : cluster [INF] 1.1 scrub starts
2017-08-26 03:49:05.879546 osd.10 osd.10 192.168.1.240:6806/21564 12 : cluster [INF] 1.1 scrub ok
2017-08-26 03:50:53.709500 osd.10 osd.10 192.168.1.240:6806/21564 13 : cluster [INF] 1.9d scrub starts
2017-08-26 03:52:13.278975 osd.10 osd.10 192.168.1.240:6806/21564 14 : cluster [INF] 1.9d scrub ok
2017-08-26 04:31:37.144944 osd.10 osd.10 192.168.1.240:6806/21564 15 : cluster [INF] 1.82 scrub starts
2017-08-26 04:32:35.917646 osd.10 osd.10 192.168.1.240:6806/21564 16 : cluster [INF] 1.82 scrub ok
2017-08-26 04:33:03.930435 osd.9 osd.9 192.168.1.240:6802/32411 36 : cluster [INF] 1.f4 scrub starts
2017-08-26 04:34:08.360134 osd.9 osd.9 192.168.1.240:6802/32411 37 : cluster [INF] 1.f4 scrub ok

FAIL ON NORMAL SCRUB
2017-08-25 23:28:55.310602 osd.8 osd.8 192.168.1.240:6806/2820 29 : cluster [INF] 6.3 deep-scrub starts
2017-08-25 23:28:55.415144 osd.8 osd.8 192.168.1.240:6806/2820 30 : cluster [INF] 6.3 deep-scrub ok
2017-08-25 23:29:01.273979 osd.8 osd.8 192.168.1.240:6806/2820 31 : cluster [INF] 1.d2 scrub starts
2017-08-25 23:30:47.518484 osd.8 osd.8 192.168.1.240:6806/2820 32 : cluster [INF] 1.d2 scrub ok
2017-08-25 23:31:40.311045 osd.8 osd.8 192.168.1.240:6806/2820 33 : cluster [INF] 1.6e scrub starts
2017-08-25 23:32:22.150274 osd.8 osd.8 192.168.1.240:6806/2820 34 : cluster [INF] 1.6e scrub ok
2017-08-25 23:32:58.297062 osd.9 osd.9 192.168.1.240:6802/7091 32 : cluster [INF] 1.d5 scrub starts
2017-08-25 23:35:19.285841 osd.9 osd.9 192.168.1.240:6802/7091 33 : cluster [INF] 1.d5 scrub ok
2017-08-25 23:36:38.375447 osd.8 osd.8 192.168.1.240:6806/2820 35 : cluster [INF] 1.3 scrub starts
2017-08-25 23:37:25.012116 osd.8 osd.8 192.168.1.240:6806/2820 36 : cluster [INF] 1.3 scrub ok
2017-08-25 23:38:29.406144 osd.8 osd.8 192.168.1.240:6806/2820 37 : cluster [INF] 1.45 scrub starts
2017-08-25 23:38:53.020365 mon.0 mon.0 192.168.1.240:6789/0 831 : cluster [INF] osd.9 failed (root=default,host=proxmox1) (connection refused reported by osd.8)
2017-08-25 23:38:53.166364 mon.0 mon.0 192.168.1.240:6789/0 832 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-08-25 23:38:56.200767 mon.0 mon.0 192.168.1.240:6789/0 837 : cluster [WRN] Health check failed: Degraded data redundancy: 100309/3786338 objects degraded (2.649%), 14 pgs unclean, 14 pgs degraded (PG_DEGRADED)
2017-08-25 23:38:58.155562 mon.0 mon.0 192.168.1.240:6789/0 838 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive (PG_AVAILABILITY)
2017-08-25 23:38:58.155601 mon.0 mon.0 192.168.1.240:6789/0 839 : cluster [WRN] Health check update: Degraded data redundancy: 715775/3786338 objects degraded (18.904%), 101 pgs unclean, 102 pgs degraded (PG_DEGRADED)
2017-08-25 23:39:30.172451 mon.0 mon.0 192.168.1.240:6789/0 840 : cluster [WRN] Health check update: Degraded data redundancy: 715775/3786338 objects degraded (18.904%), 102 pgs unclean, 102 pgs degraded (PG_DEGRADED)
2017-08-25 23:39:47.851497 mon.0 mon.0 192.168.1.240:6789/0 843 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-08-25 23:39:47.864774 mon.0 mon.0 192.168.1.240:6789/0 844 : cluster [INF] osd.9 192.168.1.240:6802/32411 boot
2017-08-25 23:39:50.876761 mon.0 mon.0 192.168.1.240:6789/0 849 : cluster [WRN] Health check update: Degraded data redundancy: 672540/3786338 objects degraded (17.762%), 96 pgs unclean, 96 pgs degraded (PG_DEGRADED)
2017-08-25 23:39:52.184954 mon.0 mon.0 192.168.1.240:6789/0 850 : cluster [WRN] Health check update: Degraded data redundancy: 476349/3786338 objects degraded (12.581%), 69 pgs unclean, 69 pgs degraded (PG_DEGRADED)
2017-08-25 23:39:50.533429 osd.0 osd.0 192.168.1.240:6814/16223 13 : cluster [INF] 1.80 scrub starts
2017-08-25 23:39:55.056537 mon.0 mon.0 192.168.1.240:6789/0 851 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive)
2017-08-25 23:39:55.056574 mon.0 mon.0 192.168.1.240:6789/0 852 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 476349/3786338 objects degraded (12.581%), 69 pgs unclean, 69 pgs degraded)
2017-08-25 23:39:55.056591 mon.0 mon.0 192.168.1.240:6789/0 853 : cluster [INF] Cluster is now healthy
2017-08-25 23:40:17.806395 osd.0 osd.0 192.168.1.240:6814/16223 14 : cluster [INF] 1.80 scrub ok
2017-08-25 23:40:19.775012 osd.9 osd.9 192.168.1.240:6802/32411 1 : cluster [INF] 1.5a scrub starts
2017-08-25 23:40:46.458847 osd.9 osd.9 192.168.1.240:6802/32411 2 : cluster [INF] 1.5a scrub ok
2017-08-25 23:40:53.807218 osd.9 osd.9 192.168.1.240:6802/32411 3 : cluster [INF] 1.56 scrub starts
2017-08-25 23:41:16.197304 osd.9 osd.9 192.168.1.240:6802/32411 4 : cluster [INF] 1.56 scrub ok
2017-08-25 23:41:24.814502 osd.9 osd.9 192.168.1.240:6802/32411 5 : cluster [INF] 1.92 deep-scrub starts
2017-08-25 23:51:35.881952 osd.9 osd.9 192.168.1.240:6802/32411 6 : cluster [INF] 1.92 deep-scrub ok
2017-08-25 23:52:54.476268 osd.10 osd.10 192.168.1.240:6810/4355 39 : cluster [INF] 1.f2 scrub starts
2017-08-25 23:53:21.208291 osd.10 osd.10 192.168.1.240:6810/4355 40 : cluster [INF] 1.f2 scrub ok
2017-08-25 23:53:47.475879 osd.10 osd.10 192.168.1.240:6810/4355 41 : cluster [INF] 1.c8 deep-scrub starts
2017-08-26 00:01:08.611371 osd.10 osd.10 192.168.1.240:6810/4355 42 : cluster [INF] 1.c8 deep-scrub ok

root@proxmox1:/# ceph pg dump | egrep -v '^(0\.|1\.|2\.|3\.)' | egrep -v '(^pool\ (0|1|2|3))' | column -t
dumped all
version 9678
stamp 2017-08-27 01:27:53.321763
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
6.4 0 0 0 0 0 0 0 0 active+clean 2017-08-26 22:55:06.289033 0'0 4725:2848 [10,8] 10 [10,8] 10 0'0 2017-08-26 22:55:06.288961 0'0 2017-08-20 23:32:51.270895
6.5 0 0 0 0 0 0 0 0 active+clean 2017-08-26 23:03:07.062129 0'0 4726:2101 [0,10] 0 [0,10] 0 0'0 2017-08-26 01:50:04.421951 0'0 2017-08-22 14:26:19.915612
6.6 0 0 0 0 0 0 0 0 active+clean 2017-08-27 00:46:09.548107 0'0 4726:2344 [10,9] 10 [10,9] 10 0'0 2017-08-27 00:46:09.548029 0'0 2017-08-24 13:08:56.447183
6.7 0 0 0 0 0 0 0 0 active+clean 2017-08-26 22:52:44.635393 0'0 4725:1481 [10,8] 10 [10,8] 10 0'0 2017-08-25 22:02:26.297723 0'0 2017-08-23 15:55:58.299570
6.3 0 0 0 0 0 0 0 0 active+clean 2017-08-26 22:52:44.632667 0'0 4725:1971 [8,10] 8 [8,10] 8 0'0 2017-08-25 23:28:55.415148 0'0 2017-08-25 23:28:55.415148
5.0 18661 0 0 0 0 12583538 1563 1563 active+clean 2017-08-26 22:03:03.809158 4652'1197298 4725:1382436 [10,9] 10 [10,9] 10 4623'1197263 2017-08-26 19:49:19.819627 4270'1161119 2017-08-20 02:04:03.373813
6.2 0 0 0 0 0 0 0 0 active+clean 2017-08-26 22:52:45.677622 0'0 4725:1440 [9,8] 9 [9,8] 9 0'0 2017-08-26 20:58:34.722865 0'0 2017-08-26 20:58:34.722865
5.1 18878 0 0 0 0 12583048 1573 1573 active+clean 2017-08-26 23:03:07.062298 4640'959478 4726:1131301 [0,8] 0 [0,8] 0 4596'958844 2017-08-26 13:47:19.329350 4393'956123 2017-08-25 09:32:09.556396
6.1 0 0 0 0 0 0 0 0 active+clean 2017-08-26 22:52:44.736333 0'0 4725:1615 [8,9] 8 [8,9] 8 0'0 2017-08-26 01:28:24.476136 0'0 2017-08-22 16:20:13.243273
5.2 18472 0 0 0 0 32462655 1592 1592 active+clean 2017-08-26 22:52:44.634997 4652'952265 4725:1174014 [10,8] 10 [10,8] 10 4652'952265 2017-08-26 22:45:06.916647 4270'930889 2017-08-23 05:50:46.370503
6.0 0 0 0 0 0 0 0 0 active+clean 2017-08-26 23:03:07.061426 0'0 4726:2441 [10,0] 10 [10,0] 10 0'0 2017-08-26 21:59:03.746276 0'0 2017-08-23 02:26:18.206975
5.3 18512 0 0 0 0 10928869 1519 1519 active+clean 2017-08-26 23:03:07.062484 4639'984496 4726:1199339 [0,8] 0 [0,8] 0 4531'983789 2017-08-26 00:09:32.283691 4270'975964 2017-08-23 16:15:09.546043
5 74523 0 0 0 0 68558110 6247 6247
6 0 0 0 0 0 0 0 0
1 1821197 0 0 0 0 6962542387273 401319 401319
sum 1895720 0 0 0 0 6962610945383 407566 407566
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
8 4288G 3163G 7451G [0,9,10] 177 93
10 4240G 3211G 7451G [0,8,9] 175 93
0 1984G 809G 2794G [8,9,10] 82 37
9 2492G 1233G 3725G [0,8,10] 102 45
sum 13005G 8418G 21424G

root@proxmox1:~# ceph versions
{
    "mon": {
        "ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc)": 1
    },
    "mgr": {
        "ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc)": 1
    },
    "osd": {
        "ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc)": 4
    },
    "mds": {
        "ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc)": 1
    },
    "overall": {
        "ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc)": 7
    }
}
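
All daemons above report 12.1.2 (rc). If a per-daemon check of the running version is useful after the upgrade, the daemons can also be asked directly (a sketch; osd.0 is just an example id):

ceph tell osd.0 version     # version as reported by a running OSD
ceph daemon osd.0 version   # same, via the local admin socket on this host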

Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host proxmox1 {
    id -2           # do not change unnecessarily
    id -3 class hdd # do not change unnecessarily
    # weight 20.922
    alg straw
    hash 0          # rjenkins1
    item osd.10 weight 7.277
    item osd.9 weight 3.639
    item osd.0 weight 2.729
    item osd.8 weight 7.277
}
root default {
    id -1           # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    # weight 20.922
    alg straw
    hash 0          # rjenkins1
    item proxmox1 weight 20.922
}

# rules
rule replicated_ruleset {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}

# end crush map
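
For reference, one way to double-check that rule 0 still maps two distinct OSDs per PG on this single-host layout is to re-extract the compiled map and test-map it with crushtool (a minimal sketch; the file names are arbitrary):

ceph osd getcrushmap -o crush.bin       # dump the compiled crush map
crushtool -d crush.bin -o crush.txt     # decompile; should match the text above
crushtool -i crush.bin --test --rule 0 --num-rep 2 --show-mappings | head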

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #20557: segmentation fault with rocksdb|BlueStore and jemalloc (Closed, 07/10/2017)
