The transcript below shows a test on a cluster with newly added OSDs while rebalancing
is still in progress. The test shows the effect of

- stopping+starting a new OSD (osd-phy6, ID 289),
- stopping+starting an old OSD (osd-phy9, ID 74).

In each test, we wait for peering to complete before taking "ceph status".

We set noout and norebalance to avoid disturbances during the test and wait for
recovery activity to cease. The status at this point follows after the short aside below.

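For reference, the flags are set and cleared with the standard commands shown here;
these exact invocations are not part of the recorded session, and the wait loop is
only one illustrative way to block until peering has settled:

# ceph osd set noout
# ceph osd set norebalance
# while ceph status | grep -qE 'peering|activating'; do sleep 10; done

(After the test: "ceph osd unset norebalance" and "ceph osd unset noout".)
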
# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            8235970/1498751416 objects misplaced (0.550%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 8235970/1498751416 objects misplaced (0.550%)
         3163 active+clean
         40 active+remapped+backfill_wait
         6 active+remapped+backfilling
         5 active+clean+scrubbing+deep
         1 active+clean+snaptrim

  io:
    client: 74 MiB/s rd, 42 MiB/s wr, 1.19 kop/s rd, 889 op/s wr

# docker stop osd-phy6
osd-phy6

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            1 osds down
            8342724/1498792326 objects misplaced (0.557%)
            Degraded data redundancy: 5717609/1498792326 objects degraded (0.381%), 74 pgs degraded
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 271 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 5717609/1498792326 objects degraded (0.381%)
         8342724/1498792326 objects misplaced (0.557%)
         3089 active+clean
         74 active+undersized+degraded
         31 active+remapped+backfill_wait
         11 active+remapped+backfilling
         5 active+clean+scrubbing+deep
         4 active+clean+remapped+snaptrim
         1 active+clean+scrubbing

  io:
    client: 69 MiB/s rd, 45 MiB/s wr, 1.28 kop/s rd, 838 op/s wr

# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; 1 osds down; 8342692/1498794289 objects misplaced (0.557%); Degraded data redundancy: 5717610/1498794289 objects degraded (0.381%), 74 pgs degraded, 74 pgs undersized; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OSD_DOWN 1 osds down
    osd.289 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
OBJECT_MISPLACED 8342692/1498794289 objects misplaced (0.557%)
PG_DEGRADED Degraded data redundancy: 5717610/1498794289 objects degraded (0.381%), 74 pgs degraded, 74 pgs undersized
    pg 11.2 is stuck undersized for 70.197385, current state active+undersized+degraded, last acting [87,292,2147483647,296,229,168,0,263]
    pg 11.16 is stuck undersized for 70.178478, current state active+undersized+degraded, last acting [2147483647,181,60,233,237,294,293,292]
    pg 11.1f is stuck undersized for 70.190040, current state active+undersized+degraded, last acting [230,238,182,292,84,2147483647,86,239]
    pg 11.39 is stuck undersized for 70.193683, current state active+undersized+degraded, last acting [158,148,293,73,168,2,2147483647,236]
    pg 11.3b is stuck undersized for 70.200823, current state active+undersized+degraded, last acting [2147483647,85,229,145,170,172,0,230]
    pg 11.47 is stuck undersized for 70.196419, current state active+undersized+degraded, last acting [3,296,2147483647,0,233,84,182,238]
    pg 11.59 is stuck undersized for 70.190002, current state active+undersized+degraded, last acting [2147483647,76,73,235,156,263,234,172]
    pg 11.63 is stuck undersized for 70.160846, current state active+undersized+degraded, last acting [0,146,1,156,2147483647,228,172,238]
    pg 11.66 is stuck undersized for 70.086237, current state active+undersized+degraded, last acting [291,159,296,233,2147483647,293,170,145]
    pg 11.6d is stuck undersized for 70.210387, current state active+undersized+degraded, last acting [84,235,73,290,295,2147483647,0,183]
    pg 11.7b is stuck undersized for 70.202578, current state active+undersized+degraded, last acting [2147483647,146,293,294,296,181,0,263]
    pg 11.7d is stuck undersized for 70.178488, current state active+undersized+degraded, last acting [294,2,263,2147483647,170,237,292,235]
    pg 11.7f is active+undersized+degraded, acting [148,232,2147483647,230,87,236,168,72]
    pg 11.146 is stuck undersized for 70.197744, current state active+undersized+degraded, last acting [235,183,156,295,2147483647,294,146,260]
    pg 11.155 is stuck undersized for 70.203091, current state active+undersized+degraded, last acting [73,72,170,259,260,63,84,2147483647]
    pg 11.15d is stuck undersized for 70.135909, current state active+undersized+degraded, last acting [259,182,0,63,234,294,233,2147483647]
    pg 11.171 is stuck undersized for 70.209391, current state active+undersized+degraded, last acting [170,168,232,72,231,172,2147483647,237]
    pg 11.176 is stuck undersized for 70.202583, current state active+undersized+degraded, last acting [146,237,181,2147483647,294,72,236,293]
    pg 11.177 is stuck undersized for 70.192564, current state active+undersized+degraded, last acting [156,146,236,235,63,2147483647,3,291]
    pg 11.179 is stuck undersized for 70.190284, current state active+undersized+degraded, last acting [87,156,233,86,2147483647,172,259,158]
    pg 11.17e is stuck undersized for 70.188938, current state active+undersized+degraded, last acting [3,231,290,260,76,183,2147483647,293]
    pg 11.181 is stuck undersized for 70.175985, current state active+undersized+degraded, last acting [2147483647,290,239,148,1,228,145,2]
    pg 11.188 is stuck undersized for 70.208638, current state active+undersized+degraded, last acting [2147483647,170,237,172,291,168,232,85]
    pg 11.18b is stuck undersized for 70.186336, current state active+undersized+degraded, last acting [233,148,228,87,2147483647,182,235,0]
    pg 11.18f is stuck undersized for 70.197416, current state active+undersized+degraded, last acting [73,237,238,2147483647,156,0,292,182]
    pg 11.19d is stuck undersized for 70.083071, current state active+undersized+degraded, last acting [291,172,146,145,238,2147483647,296,231]
    pg 11.1a5 is stuck undersized for 70.184859, current state active+undersized+degraded, last acting [293,145,2,230,159,239,85,2147483647]
    pg 11.1a6 is stuck undersized for 70.209851, current state active+undersized+degraded, last acting [229,145,158,296,0,292,2147483647,239]
    pg 11.1ac is stuck undersized for 70.192130, current state active+undersized+degraded, last acting [234,84,2147483647,86,239,183,294,232]
    pg 11.1b0 is stuck undersized for 70.180993, current state active+undersized+degraded, last acting [168,293,290,2,2147483647,159,296,73]
    pg 11.1b1 is stuck undersized for 70.175329, current state active+undersized+degraded, last acting [172,259,168,260,73,2147483647,146,263]
    pg 11.1b7 is stuck undersized for 70.208713, current state active+undersized+degraded, last acting [263,172,2147483647,259,0,87,145,228]
    pg 11.1c1 is stuck undersized for 70.170314, current state active+undersized+degraded, last acting [182,148,263,293,2,2147483647,228,294]
    pg 11.1c3 is stuck undersized for 70.192088, current state active+undersized+degraded, last acting [234,290,63,239,85,156,76,2147483647]
    pg 11.1c7 is stuck undersized for 70.192194, current state active+undersized+degraded, last acting [1,2147483647,263,232,86,234,84,172]
    pg 11.1dd is stuck undersized for 70.183525, current state active+undersized+degraded, last acting [293,172,295,156,170,237,2147483647,86]
    pg 11.1de is stuck undersized for 69.972952, current state active+undersized+degraded, last acting [296,293,76,63,231,146,2147483647,168]
    pg 11.1e8 is stuck undersized for 70.172003, current state active+undersized+degraded, last acting [172,3,290,229,236,156,2147483647,228]
    pg 11.1f2 is stuck undersized for 70.196870, current state active+undersized+degraded, last acting [234,0,159,2147483647,232,73,290,181]
    pg 11.1f5 is stuck undersized for 70.190841, current state active+undersized+degraded, last acting [238,234,73,2147483647,158,291,172,168]
    pg 11.1fc is stuck undersized for 70.181133, current state active+undersized+degraded, last acting [172,86,85,230,182,2147483647,238,233]
    pg 11.1fd is stuck undersized for 70.221124, current state active+undersized+degraded, last acting [72,145,237,293,2147483647,60,87,172]
    pg 11.203 is stuck undersized for 70.193700, current state active+undersized+degraded, last acting [2147483647,235,168,60,87,63,295,230]
    pg 11.20d is stuck undersized for 70.197909, current state active+undersized+degraded, last acting [236,172,73,182,228,168,2147483647,293]
    pg 11.20f is stuck undersized for 70.196571, current state active+undersized+degraded, last acting [85,84,76,60,238,233,159,2147483647]
    pg 11.211 is stuck undersized for 70.197522, current state active+undersized+degraded, last acting [156,2147483647,170,234,0,238,1,231]
    pg 11.212 is stuck undersized for 70.201683, current state active+undersized+degraded, last acting [148,2147483647,85,182,84,232,86,230]
    pg 11.21e is stuck undersized for 70.202044, current state active+undersized+degraded, last acting [146,156,159,2147483647,230,238,239,2]
    pg 11.224 is stuck undersized for 70.095494, current state active+undersized+degraded, last acting [291,148,237,2147483647,170,1,156,233]
    pg 11.22d is stuck undersized for 70.195735, current state active+undersized+degraded, last acting [3,168,296,158,292,236,0,2147483647]
    pg 11.22f is stuck undersized for 70.192480, current state active+undersized+degraded, last acting [1,2147483647,292,60,296,231,259,72]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.2 query | jq ".acting,.up,.recovery_state"
[
  87,
  292,
  2147483647,
  296,
  229,
  168,
  0,
  263
]
[
  87,
  292,
  2147483647,
  296,
  229,
  168,
  0,
  263
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:35:49.109129",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:35:48.137595"
  }
]

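A note on the value 2147483647 that appears in the acting/up sets above: it is 2^31-1,
the placeholder Ceph prints for "no OSD" at a shard position of an erasure-coded PG, so
each affected PG is missing exactly one shard while osd.289 is down. A purely
illustrative one-liner to locate the hole in a query like the one above (for pg 11.2 it
would report position 2):

# ceph pg 11.2 query | jq '.acting | indices(2147483647)'
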
# docker start osd-phy6
osd-phy6

# After starting the OSD again, the cluster almost recovers. The PG showing
# up as backfill_toofull is due to a known bug (fixed in 13.2.10?). No
# degraded objects, just misplaced ones.

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_ERR
            noout,norebalance flag(s) set
            8181843/1498795556 objects misplaced (0.546%)
            Degraded data redundancy (low space): 1 pg backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 8181843/1498795556 objects misplaced (0.546%)
         3163 active+clean
         39 active+remapped+backfill_wait
         6 active+remapped+backfilling
         5 active+clean+scrubbing+deep
         1 active+remapped+backfill_toofull
         1 active+clean+snaptrim

  io:
    client: 35 MiB/s rd, 23 MiB/s wr, 672 op/s rd, 686 op/s wr

# docker stop osd-phy9
osd-phy9

# After stopping an old OSD, we observe immediate degradation. The cluster seems
# to lose track of objects already at this point. In contrast to stopping a new
# OSD as seen above, where recovery activity shows up only temporarily, the
# recovery operation does not stop here.

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            1 osds down
            7967641/1498798381 objects misplaced (0.532%)
            Degraded data redundancy: 5763425/1498798381 objects degraded (0.385%), 75 pgs degraded
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 271 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 5763425/1498798381 objects degraded (0.385%)
         7967641/1498798381 objects misplaced (0.532%)
         3092 active+clean
         70 active+undersized+degraded
         41 active+remapped+backfill_wait
         4 active+clean+scrubbing+deep
         4 active+undersized+degraded+remapped+backfilling
         2 active+clean+scrubbing
         1 active+undersized+degraded+remapped+backfill_wait
         1 active+clean+snaptrim

  io:
    client: 76 MiB/s rd, 76 MiB/s wr, 736 op/s rd, 881 op/s wr
    recovery: 93 MiB/s, 23 objects/s

# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; 1 osds down; 7966306/1498798501 objects misplaced (0.532%); Degraded data redundancy: 5762977/1498798501 objects degraded (0.385%), 75 pgs degraded; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OSD_DOWN 1 osds down
    osd.74 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
OBJECT_MISPLACED 7966306/1498798501 objects misplaced (0.532%)
PG_DEGRADED Degraded data redundancy: 5762977/1498798501 objects degraded (0.385%), 75 pgs degraded
    pg 11.4 is active+undersized+degraded, acting [86,2147483647,237,235,182,63,231,84]
    pg 11.5 is active+undersized+degraded, acting [1,2147483647,183,0,239,145,293,170]
    pg 11.a is active+undersized+degraded+remapped+backfilling, acting [170,156,148,2147483647,234,86,236,232]
    pg 11.b is active+undersized+degraded, acting [3,228,0,146,239,292,145,2147483647]
    pg 11.e is active+undersized+degraded, acting [237,181,73,183,72,290,2147483647,295]
    pg 11.1b is active+undersized+degraded, acting [85,183,72,2147483647,156,232,263,146]
    pg 11.1d is active+undersized+degraded, acting [290,296,183,86,293,2147483647,236,3]
    pg 11.37 is active+undersized+degraded, acting [60,233,181,183,2147483647,296,87,86]
    pg 11.50 is active+undersized+degraded+remapped+backfill_wait, acting [231,259,228,87,182,156,2147483647,172]
    pg 11.52 is active+undersized+degraded, acting [237,60,1,2147483647,233,232,292,86]
    pg 11.57 is active+undersized+degraded, acting [231,259,230,170,72,87,181,2147483647]
    pg 11.5a is active+undersized+degraded, acting [290,2147483647,237,183,293,84,295,1]
    pg 11.5c is active+undersized+degraded, acting [84,2147483647,259,0,85,234,146,148]
    pg 11.62 is active+undersized+degraded, acting [182,294,293,2147483647,63,234,181,0]
    pg 11.68 is active+undersized+degraded, acting [158,296,229,168,76,3,159,2147483647]
    pg 11.6a is active+undersized+degraded, acting [2147483647,288,238,172,1,237,0,290]
    pg 11.73 is active+undersized+degraded, acting [236,234,259,2147483647,170,63,3,0]
    pg 11.77 is active+undersized+degraded, acting [2147483647,72,87,183,236,156,290,293]
    pg 11.78 is active+undersized+degraded, acting [172,230,236,156,294,60,2147483647,76]
    pg 11.84 is active+undersized+degraded, acting [86,84,239,296,294,182,2147483647,293]
    pg 11.87 is active+undersized+degraded+remapped+backfilling, acting [148,60,231,260,235,87,2147483647,181]
    pg 11.8b is active+undersized+degraded, acting [263,170,2147483647,259,296,172,73,76]
    pg 11.15a is active+undersized+degraded, acting [2147483647,260,182,0,263,73,159,288]
    pg 11.15f is active+undersized+degraded+remapped+backfilling, acting [146,233,2147483647,76,234,172,181,229]
    pg 11.162 is active+undersized+degraded, acting [84,294,230,2,293,290,2147483647,295]
    pg 11.16d is active+undersized+degraded, acting [236,230,2147483647,183,0,1,235,181]
    pg 11.172 is active+undersized+degraded, acting [181,148,237,3,231,293,76,2147483647]
    pg 11.185 is active+undersized+degraded, acting [296,0,236,238,2147483647,294,181,146]
    pg 11.18a is active+undersized+degraded, acting [0,2147483647,159,145,293,233,85,146]
    pg 11.192 is active+undersized+degraded, acting [148,76,170,296,295,2147483647,3,235]
    pg 11.193 is active+undersized+degraded, acting [2147483647,148,295,230,232,168,76,290]
    pg 11.198 is active+undersized+degraded, acting [260,76,87,2147483647,145,183,229,239]
    pg 11.19a is active+undersized+degraded, acting [146,294,230,238,2147483647,0,295,288]
    pg 11.1a1 is active+undersized+degraded, acting [84,183,294,2147483647,234,170,263,238]
    pg 11.1a7 is active+undersized+degraded, acting [63,236,158,84,86,237,87,2147483647]
    pg 11.1ae is active+undersized+degraded, acting [296,172,238,2147483647,170,288,294,295]
    pg 11.1c5 is active+undersized+degraded, acting [76,172,236,232,2147483647,296,288,170]
    pg 11.1c6 is active+undersized+degraded, acting [236,72,230,170,2147483647,238,181,148]
    pg 11.1d1 is active+undersized+degraded, acting [259,170,291,3,156,2147483647,292,296]
    pg 11.1d4 is active+undersized+degraded, acting [263,228,182,84,2,2147483647,259,87]
    pg 11.1e1 is active+undersized+degraded, acting [158,145,233,1,259,296,2,2147483647]
    pg 11.1ea is active+undersized+degraded, acting [84,183,260,259,85,60,2,2147483647]
    pg 11.1ec is active+undersized+degraded, acting [292,293,233,2,2147483647,85,288,146]
    pg 11.1ed is active+undersized+degraded, acting [156,237,293,233,148,2147483647,291,85]
    pg 11.1ee is active+undersized+degraded, acting [1,229,0,63,228,2147483647,233,156]
    pg 11.201 is active+undersized+degraded, acting [229,239,296,63,76,294,182,2147483647]
    pg 11.206 is active+undersized+degraded, acting [235,288,76,158,296,263,85,2147483647]
    pg 11.20a is active+undersized+degraded, acting [158,1,263,232,0,230,292,2147483647]
    pg 11.218 is active+undersized+degraded, acting [0,296,87,2147483647,263,148,156,232]
    pg 11.21a is active+undersized+degraded, acting [2147483647,230,159,231,60,235,73,291]
    pg 11.21b is active+undersized+degraded, acting [84,159,238,87,291,230,2147483647,182]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
  86,
  2147483647,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  86,
  2147483647,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:41:55.983304",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": [
          {
            "tid": 8418509,
            "to_read": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=read_request_t(to_read=[1646592,24576,0], need={63(5)=[0,1],86(0)=[0,1],182(4)=[0,1],231(6)=[0,1],235(3)=[0,1],237(2)=[0,1]}, want_attrs=0)}",
            "complete": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=read_result_t(r=0, errors={}, noattrs, returned=(1646592, 24576, [63(5),4096, 86(0),4096, 231(6),4096, 235(3),4096, 237(2),4096]))}",
            "priority": 127,
            "obj_to_source": "{11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head=63(5),86(0),182(4),231(6),235(3),237(2)}",
            "source_to_obj": "{63(5)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,86(0)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,182(4)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,231(6)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,235(3)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head,237(2)=11:2003536f:::rbd_data.1.a508f96b8b4567.000000000000182c:head}",
            "in_progress": "182(4)"
          }
        ]
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:41:55.202575"
  }
]

# docker start osd-phy9
osd-phy9

# After starting the old OSD, a lot of objects remain degraded. It looks like PGs that were
# in state "...+remapped+backfilling" are affected. All others seem to recover, see below.

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_ERR
            noout,norebalance flag(s) set
            7954306/1498800854 objects misplaced (0.531%)
            Degraded data redundancy: 208493/1498800854 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized
            Degraded data redundancy (low space): 4 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 208493/1498800854 objects degraded (0.014%)
         7954306/1498800854 objects misplaced (0.531%)
         3162 active+clean
         39 active+remapped+backfill_wait
         6 active+clean+scrubbing+deep
         4 active+remapped+backfill_toofull
         3 active+undersized+degraded+remapped+backfilling
         1 active+clean+snaptrim

  io:
    client: 111 MiB/s rd, 42 MiB/s wr, 763 op/s rd, 750 op/s wr
    recovery: 66 MiB/s, 16 objects/s

# ceph health detail
HEALTH_ERR noout,norebalance flag(s) set; 7953632/1498800881 objects misplaced (0.531%); Degraded data redundancy: 208184/1498800881 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized; Degraded data redundancy (low space): 4 pgs backfill_toofull; 1 pools nearfull
OSDMAP_FLAGS noout,norebalance flag(s) set
OBJECT_MISPLACED 7953632/1498800881 objects misplaced (0.531%)
PG_DEGRADED Degraded data redundancy: 208184/1498800881 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized
    pg 11.a is stuck undersized for 311.488352, current state active+undersized+degraded+remapped+backfilling, last acting [170,156,148,2147483647,234,86,236,232]
    pg 11.87 is stuck undersized for 311.487625, current state active+undersized+degraded+remapped+backfilling, last acting [148,60,231,260,235,87,2147483647,181]
    pg 11.ed is stuck undersized for 311.465765, current state active+undersized+degraded+remapped+backfilling, last acting [233,2147483647,156,259,159,182,230,85]
PG_DEGRADED_FULL Degraded data redundancy (low space): 4 pgs backfill_toofull
    pg 11.8 is active+remapped+backfill_wait+backfill_toofull, acting [86,158,237,85,159,259,144,263]
    pg 11.5d is active+remapped+backfill_wait+backfill_toofull, acting [263,158,230,73,183,84,2,169]
    pg 11.165 is active+remapped+backfill_wait+backfill_toofull, acting [60,148,234,73,2,229,84,180]
    pg 11.1f0 is active+remapped+backfill_wait+backfill_toofull, acting [237,148,2,238,169,231,60,87]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:46:20.882683",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:46:19.736862"
  }
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
  170,
  156,
  148,
  2147483647,
  234,
  86,
  236,
  232
]
[
  170,
  156,
  292,
  289,
  234,
  86,
  236,
  232
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:41:55.982261",
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)"
      ],
      "waiting_on_backfill": [],
      "last_backfill_started": "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head",
      "backfill_info": {
        "begin": "11:5003695e:::rbd_data.1.318f016b8b4567.000000000004aa48:head",
        "end": "11:5003e7ba:::rbd_data.1.ac314b6b8b4567.00000000000a3032:head",
        "objects": [
          {
            "object": "11:5003695e:::rbd_data.1.318f016b8b4567.000000000004aa48:head",
            "version": "66191'195037"
          },

          [... many many similar entries removed ...]

          {
            "object": "11:5003e76b:::rbd_data.1.b023c26b8b4567.0000000000b8efd1:head",
            "version": "181835'908372"
          }
        ]
      },
      "peer_backfill_info": [
        "289(3)",
        {
          "begin": "MAX",
          "end": "MAX",
          "objects": []
        },
        "292(2)",
        {
          "begin": "MAX",
          "end": "MAX",
          "objects": []
        }
      ],
      "backfills_in_flight": [
        "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head"
      ],
      "recovering": [
        "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head"
      ],
      "pg_backend": {
        "recovery_ops": [
          {
            "hoid": "11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head",
            "v": "178993'819609",
            "missing_on": "289(3),292(2)",
            "missing_on_shards": "2,3",
            "recovery_info": "ObjectRecoveryInfo(11:500368d3:::rbd_data.1.af0b536b8b4567.000000000062aef2:head@178993'819609, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:{})",
            "recovery_progress": "ObjectRecoveryProgress(!first, data_recovered_to:4202496, data_complete:true, omap_recovered_to:, omap_complete:true, error:false)",
            "state": "WRITING",
            "waiting_on_pushes": "289(3),292(2)",
            "extent_requested": "0,8404992"
          }
        ],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:41:55.090085"
  }
]


# Below we temporarily move the new OSDs to different CRUSH locations and back.
# The peering this triggers leads to re-discovery of all objects.

# ceph osd crush move osd.288 host=bb-04
moved item id 288 name 'osd.288' to location {host=bb-04} in crush map
# ceph osd crush move osd.289 host=bb-05
moved item id 289 name 'osd.289' to location {host=bb-05} in crush map
# ceph osd crush move osd.290 host=bb-06
moved item id 290 name 'osd.290' to location {host=bb-06} in crush map
# ceph osd crush move osd.291 host=bb-21
moved item id 291 name 'osd.291' to location {host=bb-21} in crush map
# ceph osd crush move osd.292 host=bb-07
moved item id 292 name 'osd.292' to location {host=bb-07} in crush map
# ceph osd crush move osd.293 host=bb-18
moved item id 293 name 'osd.293' to location {host=bb-18} in crush map
# ceph osd crush move osd.295 host=bb-19
moved item id 295 name 'osd.295' to location {host=bb-19} in crush map
# ceph osd crush move osd.294 host=bb-22
moved item id 294 name 'osd.294' to location {host=bb-22} in crush map
# ceph osd crush move osd.296 host=bb-20
moved item id 296 name 'osd.296' to location {host=bb-20} in crush map

# All objects are found at this point. Notice that a slow op shows up for
# one of the mons (see the very end). It does not clear by itself; a restart
# is required.

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            59942033/1498816658 objects misplaced (3.999%)
            1 pools nearfull
            1 slow ops, oldest one blocked for 62 sec, mon.ceph-03 has slow ops

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 419 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 59942033/1498816658 objects misplaced (3.999%)
         2747 active+clean
         348 active+remapped+backfill_wait
         71 active+remapped+backfilling
         34 active+clean+snaptrim
         12 active+clean+snaptrim_wait
         3 active+clean+scrubbing+deep

  io:
    client: 130 MiB/s rd, 113 MiB/s wr, 1.38 kop/s rd, 1.53 kop/s wr

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:50:26.411524",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:50:20.931555"
  }
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
  170,
  156,
  148,
  74,
  234,
  86,
  236,
  232
]
[
  170,
  156,
  148,
  74,
  234,
  86,
  236,
  232
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:50:18.781335",
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed"
      },
      {
        "osd": "86(5)",
        "status": "already probed"
      },
      {
        "osd": "148(2)",
        "status": "already probed"
      },
      {
        "osd": "156(1)",
        "status": "already probed"
      },
      {
        "osd": "232(7)",
        "status": "already probed"
      },
      {
        "osd": "234(4)",
        "status": "already probed"
      },
      {
        "osd": "236(6)",
        "status": "already probed"
      },
      {
        "osd": "289(3)",
        "status": "not queried"
      },
      {
        "osd": "292(2)",
        "status": "not queried"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:50:18.043630"
  }
]

# ceph osd crush move osd.288 host=ceph-04
moved item id 288 name 'osd.288' to location {host=ceph-04} in crush map
# ceph osd crush move osd.289 host=ceph-05
moved item id 289 name 'osd.289' to location {host=ceph-05} in crush map
# ceph osd crush move osd.290 host=ceph-06
moved item id 290 name 'osd.290' to location {host=ceph-06} in crush map
# ceph osd crush move osd.291 host=ceph-21
moved item id 291 name 'osd.291' to location {host=ceph-21} in crush map
# ceph osd crush move osd.292 host=ceph-07
moved item id 292 name 'osd.292' to location {host=ceph-07} in crush map
# ceph osd crush move osd.293 host=ceph-18
moved item id 293 name 'osd.293' to location {host=ceph-18} in crush map
# ceph osd crush move osd.295 host=ceph-19
moved item id 295 name 'osd.295' to location {host=ceph-19} in crush map
# ceph osd crush move osd.294 host=ceph-22
moved item id 294 name 'osd.294' to location {host=ceph-22} in crush map
# ceph osd crush move osd.296 host=ceph-20
moved item id 296 name 'osd.296' to location {host=ceph-20} in crush map


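The same round trip can be scripted; this is a purely illustrative sketch that only
wraps the crush-move commands shown above, with the osd-to-bucket mapping taken from
this test (the reverse pass works the same way with the original ceph-XX hosts):

for pair in 288:bb-04 289:bb-05 290:bb-06 291:bb-21 292:bb-07 \
            293:bb-18 294:bb-22 295:bb-19 296:bb-20; do
    ceph osd crush move "osd.${pair%%:*}" "host=${pair##*:}"   # move to the alternate bb-XX bucket
done
# wait for peering to settle, then repeat with host=ceph-04, ceph-05, ... to move them back
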
# After these placement operations, we start observing slow ops. Not sure what
# is going on here, but something seems not to work the way it should. We recorded
# two ceph status reports to show the transition. In between these two, a PG went
# down as well; it was shown as "1 pg inactive" for a short time, but we didn't
# manage to catch that for the record.

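A way to catch such short-lived states (a suggestion, not part of the recorded session)
is to keep a watcher running in a second terminal:

# watch -n 1 ceph status
# ceph -w

The second command streams the cluster log, which includes PG state transitions.
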
# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            8630330/1498837232 objects misplaced (0.576%)
            1 pools nearfull
            8 slow ops, oldest one blocked for 212 sec, daemons [osd.169,osd.234,osd.288,osd.63,mon.ceph-03] have slow ops.

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 0.156% pgs not active
         8630330/1498837232 objects misplaced (0.576%)
         3158 active+clean
         41 active+remapped+backfill_wait
         6 active+clean+scrubbing+deep
         4 active+remapped+backfilling
         4 activating
         1 activating+remapped
         1 active+clean+snaptrim

  io:
    client: 85 MiB/s rd, 127 MiB/s wr, 534 op/s rd, 844 op/s wr

# ceph status
  cluster:
    id: xxx-x-xxx
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            8630330/1498844491 objects misplaced (0.576%)
            1 pools nearfull
            1 slow ops, oldest one blocked for 247 sec, mon.ceph-03 has slow ops

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-12=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 46 remapped pgs
         flags noout,norebalance

  data:
    pools: 11 pools, 3215 pgs
    objects: 178.0 M objects, 491 TiB
    usage: 685 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs: 8630330/1498844491 objects misplaced (0.576%)
         3162 active+clean
         42 active+remapped+backfill_wait
         6 active+clean+scrubbing+deep
         4 active+remapped+backfilling
         1 active+clean+snaptrim

  io:
    client: 51 MiB/s rd, 66 MiB/s wr, 1.22 kop/s rd, 1.18 kop/s wr

# ceph pg 11.4 query | jq ".acting,.up,.recovery_state"
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  86,
  74,
  237,
  235,
  182,
  63,
  231,
  84
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:53:20.059512",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:53:08.947343"
  }
]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state"
[
  170,
  156,
  148,
  74,
  234,
  86,
  236,
  232
]
[
  170,
  156,
  292,
  289,
  234,
  86,
  236,
  232
]
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-08-11 10:53:08.842425",
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)"
      ],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-08-11 10:53:04.962749"
  }
]


# This operation got stuck. In other experiments I saw hundreds of stuck ops and
# always needed a mon restart to clear them.

# ceph daemon mon.ceph-03 ops
{
  "ops": [
    {
      "description": "osd_pgtemp(e193916 {11.43=[236,168,85,86,228,169,60,148]} v193916)",
      "initiated_at": "2020-08-11 10:50:20.712996",
      "age": 310.065493,
      "duration": 310.065506,
      "type_data": {
        "events": [
          {
            "time": "2020-08-11 10:50:20.712996",
            "event": "initiated"
          },
          {
            "time": "2020-08-11 10:50:20.712996",
            "event": "header_read"
          },
          {
            "time": "2020-08-11 10:50:20.713012",
            "event": "throttled"
          },
          {
            "time": "2020-08-11 10:50:20.713015",
            "event": "all_read"
          },
          {
            "time": "2020-08-11 10:50:20.713110",
            "event": "dispatched"
          },
          {
            "time": "2020-08-11 10:50:20.713113",
            "event": "mon:_ms_dispatch"
          },
          {
            "time": "2020-08-11 10:50:20.713113",
            "event": "mon:dispatch_op"
          },
          {
            "time": "2020-08-11 10:50:20.713113",
            "event": "psvc:dispatch"
          },
          {
            "time": "2020-08-11 10:50:20.713125",
            "event": "osdmap:preprocess_query"
          },
          {
            "time": "2020-08-11 10:50:20.713155",
            "event": "forward_request_leader"
          },
          {
            "time": "2020-08-11 10:50:20.713184",
            "event": "forwarded"
          }
        ],
        "info": {
          "seq": 56785507,
          "src_is_mon": false,
          "source": "osd.85 192.168.32.69:6820/2538607",
          "forwarded_to_leader": true
        }
      }
    }
  ],
  "num_ops": 1
}

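For completeness: the number of stuck ops can be checked per monitor through the same
admin socket interface, and in our case the only remedy was restarting the affected
mon. The container name below is hypothetical; use whatever name the mon container has
on that host:

# ceph daemon mon.ceph-03 ops | jq .num_ops
# docker restart <mon-container-on-ceph-03>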