Bug #18251

There are some scrub errors after resetting a node.

Added by de lan over 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My Ceph environment:
3 mons on three nodes: 192.7.7.177, 192.7.7.180, 192.7.7.181
27 OSDs
Ceph version: 10.2.3.3

Before the node reset, the cluster is OK:
[root@ceph177 ~]# ceph -s
cluster 89b09a3d-9609-59e8-ce17-0b4170d65212
health HEALTH_WARN
too many PGs per OSD (601 > max 300)
monmap e1: 3 mons at {192.7.7.177=111.111.111.177:6789/0,192.7.7.180=111.111.111.180:6789/0,192.7.7.181=111.111.111.181:6789/0}
election epoch 134, quorum 0,1,2 192.7.7.177,192.7.7.180,192.7.7.181
osdmap e4586: 27 osds: 27 up, 27 in
flags sortbitwise
pgmap v1271837: 5416 pgs, 8 pools, 686 GB data, 284 kobjects
2216 GB used, 22922 GB / 25138 GB avail
5416 active+clean

client io 373 kB/s rd, 0 B/s wr, 417 op/s rd, 0 op/s wr

When I reset the 192.7.7.181 node:
[root@ceph177 ~]# ceph -s
cluster 89b09a3d-9609-59e8-ce17-0b4170d65212
health HEALTH_ERR
12 pgs inconsistent
38 scrub errors
too many PGs per OSD (601 > max 300)
monmap e1: 3 mons at {192.7.7.177=111.111.111.177:6789/0,192.7.7.180=111.111.111.180:6789/0,192.7.7.181=111.111.111.181:6789/0}
election epoch 142, quorum 0,1,2 192.7.7.177,192.7.7.180,192.7.7.181
osdmap e4627: 27 osds: 27 up, 27 in
flags sortbitwise
pgmap v1274284: 5416 pgs, 8 pools, 686 GB data, 284 kobjects
2216 GB used, 22922 GB / 25138 GB avail
5404 active+clean
12 active+clean+inconsistent

History

#1 Updated by de lan over 7 years ago

I can repair the inconsistent PGs with "ceph pg repair":

[root@ceph177 ~]# ceph health detail
HEALTH_ERR 12 pgs inconsistent; 38 scrub errors; too many PGs per OSD (601 > max 300)
pg 2.3b5 is active+clean+inconsistent, acting [5,26,7]
pg 4.37a is active+clean+inconsistent, acting [23,26,13]
pg 2.335 is active+clean+inconsistent, acting [5,4,6]
pg 2.f2 is active+clean+inconsistent, acting [14,17,16]
pg 2.a2 is active+clean+inconsistent, acting [5,9,7]
pg 4.4a is active+clean+inconsistent, acting [14,4,15]
pg 2.13a is active+clean+inconsistent, acting [2,26,4]
pg 2.11b is active+clean+inconsistent, acting [11,26,7]
pg 2.1a2 is active+clean+inconsistent, acting [5,19,26]
pg 4.20f is active+clean+inconsistent, acting [14,10,22]
pg 2.222 is active+clean+inconsistent, acting [14,26,1]
pg 2.25d is active+clean+inconsistent, acting [5,0,20]
38 scrub errors
too many PGs per OSD (601 > max 300)

[root@ceph177 ~]# for i in 2.3b5 4.37a 2.335 2.f2 2.a2 4.4a 2.13a 2.11b 2.1a2 4.20f 2.222 2.25d;do ceph pg repair $i;done
instructing pg 2.3b5 on osd.5 to repair
instructing pg 4.37a on osd.23 to repair
instructing pg 2.335 on osd.5 to repair
instructing pg 2.f2 on osd.14 to repair
instructing pg 2.a2 on osd.5 to repair
instructing pg 4.4a on osd.14 to repair
instructing pg 2.13a on osd.2 to repair
instructing pg 2.11b on osd.11 to repair
instructing pg 2.1a2 on osd.5 to repair
instructing pg 4.20f on osd.14 to repair
instructing pg 2.222 on osd.14 to repair
instructing pg 2.25d on osd.5 to repair
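
(For reference, one minimal way to confirm a repair actually completed -- standard Ceph commands, using 2.3b5 as just one example PG id from the list above -- is to trigger a deep scrub on the repaired PG and re-check health once it finishes:)

[root@ceph177 ~]# ceph pg deep-scrub 2.3b5
[root@ceph177 ~]# ceph health detail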

But when I reset 192.7.7.181 again, the same scrub errors appear in the same PGs.
Why?

#2 Updated by Samuel Just over 7 years ago

Can you describe in much more detail how you are "resetting" that node? Also, can you include the mount options for the osds?
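
(For reference, the OSD mount options can usually be gathered with something like the commands below; this assumes the OSD data directories are mounted under the default /var/lib/ceph/osd path -- adjust if your layout differs:)

[root@ceph177 ~]# mount | grep /var/lib/ceph/osd        # assumes default OSD data path
[root@ceph177 ~]# grep /var/lib/ceph/osd /proc/mounts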

#3 Updated by Sage Weil over 7 years ago

  • Status changed from New to Need More Info

Look in ceph.log for the actual scrub error.

Also, make sure you aren't running with nobarrier -- that will corrupt your data on power loss.
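
(For example, assuming the default cluster log path on a monitor node and the default mount layout -- adjust the paths if your setup differs -- something like this shows the individual scrub errors and whether any filesystem is mounted with nobarrier:)

[root@ceph177 ~]# grep -E 'scrub|inconsistent' /var/log/ceph/ceph.log   # default cluster log path (assumption)
[root@ceph177 ~]# grep nobarrier /proc/mounts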

#4 Updated by Greg Farnum almost 7 years ago

  • Status changed from Need More Info to Closed
