Bug #19974

closed

"HEALTH_WARN too few PGs per OSD" test_wait_for_health_ok

Added by Kefu Chai almost 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1313: wait_for_health:  grep HEALTH_OK
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1314: wait_for_health:  ((  12 >= 12  ))
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1315: wait_for_health:  ceph health detail
1: HEALTH_ERR 4 pgs are stuck inactive for more than 300 seconds; 4 pgs stuck inactive; 4 pgs stuck unclean; too few PGs per OSD (4 < min 30)
1: pg 1.0 is stuck inactive for 324.041240, current state creating, last acting [0]
1: pg 1.1 is stuck inactive for 324.041247, current state creating, last acting [0]
1: pg 1.3 is stuck inactive for 324.041250, current state creating, last acting [0]
1: pg 1.2 is stuck inactive for 324.041253, current state creating, last acting [0]
1: pg 1.0 is stuck unclean for 324.041266, current state creating, last acting [0]
1: pg 1.1 is stuck unclean for 324.041270, current state creating, last acting [0]
1: pg 1.2 is stuck unclean for 324.041273, current state creating, last acting [0]
1: pg 1.3 is stuck unclean for 324.041276, current state creating, last acting [0]
1: too few PGs per OSD (4 < min 30)
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1316: wait_for_health:  return 1
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1324: wait_for_health_ok:  return 1
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1335: test_wait_for_health_ok:  return 1
1: /var/ceph/ceph/qa/workunits/ceph-helpers.sh:1721: run_tests:  return 1
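For readers unfamiliar with the helper, the trace above is consistent with a bounded polling loop: grep the health output for HEALTH_OK, and once the retry counter reaches its limit (12 in this run), print `ceph health detail` and return 1. The following is only a rough sketch of that pattern, not the actual code in ceph-helpers.sh. The function name `wait_for_health_sketch`, the `HEALTH_CMD` override, and the fixed delay are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded health-polling loop.
# This is NOT the real wait_for_health from ceph-helpers.sh.
# HEALTH_CMD is an assumed override so the loop can run without a cluster.
HEALTH_CMD=${HEALTH_CMD:-"ceph health"}

wait_for_health_sketch() {
    local pattern=$1          # string to look for, e.g. HEALTH_OK
    local max_tries=${2:-12}  # assumed retry limit (12 matches the trace)
    local delay=${3:-1}       # assumed fixed delay in seconds between polls
    local tries=0

    # Poll until the health output contains the pattern.
    until $HEALTH_CMD | grep -q "$pattern"; do
        tries=$((tries + 1))
        if (( tries >= max_tries )); then
            # Out of retries: print the detailed report for debugging, then fail.
            $HEALTH_CMD detail
            return 1
        fi
        sleep "$delay"
    done
    return 0
}
```

With a loop of this shape, a cluster that settles in HEALTH_WARN (for example "too few PGs per OSD") never matches HEALTH_OK, so the helper exhausts its retries and fails even when all PGs are otherwise fine.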

mon.a.log:

$ grep osd_pg_created mon.a.log| grep prepare_pg_created
2017-05-18 20:27:41.098481 7faaad62f700 10 mon.a@0(leader).osd e8 prepare_pg_created osd_pg_created(1.3) v1
2017-05-18 20:27:41.154452 7faaad62f700 10 mon.a@0(leader).osd e8 prepare_pg_created osd_pg_created(1.2) v1
2017-05-18 20:27:41.154565 7faaad62f700 10 mon.a@0(leader).osd e8 prepare_pg_created osd_pg_created(1.0) v1
2017-05-18 20:27:41.154700 7faaad62f700 10 mon.a@0(leader).osd e8 prepare_pg_created osd_pg_created(1.1) v1

mgr.x.log:

2017-05-18 08:04:30.537180 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 887 0x5582d076fe40 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:30.538225 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 887 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (1988358769 0 0) 0x5582d076fe40 con 0x5582d0347000
2017-05-18 08:04:35.538390 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 889 0x5582d076f080 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:35.539246 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 889 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (3580539246 0 0) 0x5582d076f080 con 0x5582d0347000
2017-05-18 08:04:40.539305 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 891 0x5582d076e2c0 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:40.540124 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 891 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (1736395539 0 0) 0x5582d076e2c0 con 0x5582d0347000
2017-05-18 08:04:45.540464 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 893 0x5582cfa73440 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:45.541428 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 893 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (3997986520 0 0) 0x5582cfa73440 con 0x5582d0347000
2017-05-18 08:04:50.541766 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 895 0x5582cfa72680 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:50.542639 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 895 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (1975083088 0 0) 0x5582cfa72680 con 0x5582d0347000
2017-05-18 08:04:55.543048 7fa1716eb700  5 -- 127.0.0.1:6800/5719 >> 127.0.0.1:6801/6074 conn(0x5582d0347000 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx osd.0 seq 897 0x5582cfa718c0 pg_stats(4 pgs tid 0 v 0) v1
2017-05-18 08:04:55.543918 7fa16aede700  1 -- 127.0.0.1:6800/5719 <== osd.0 127.0.0.1:6801/6074 897 ==== pg_stats(4 pgs tid 0 v 0) v1 ==== 2384+0+0 (2358870595 0 0) 0x5582cfa718c0 con 0x5582d0347000
...

So the mgr was receiving pg_stats messages from osd.0 every 5 seconds, but either the OSD failed to send it updated PG stats, or the mgr failed to update its pgmap from those pg_stats messages.
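To check quickly whether an OSD is still reporting to the mgr, one option is to count the delivered pg_stats messages per OSD in the mgr log. This is a minimal sketch that assumes the log line format shown in the excerpt above (`<== osd.N ... pg_stats(`); the function name `count_pg_stats` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count delivered pg_stats messages per OSD in a mgr log.
# Assumes messenger lines of the form "<== osd.N ... pg_stats(...)".
count_pg_stats() {
    local logfile=$1
    grep -F 'pg_stats(' "$logfile" \
        | grep -oE '<== osd\.[0-9]+' \
        | sort \
        | uniq -c \
        | awk '{print $3, $1}'   # print "osd.N count"
}
```

For example, `count_pg_stats mgr.x.log` prints one line per OSD with the number of pg_stats messages delivered. A steadily growing count together with a stale health report points at the content of the stats or at the mgr's handling of them, rather than at a lost connection.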

Actions #1

Updated by Kefu Chai almost 7 years ago

This can be reproduced locally, though only occasionally, by running "ceph-helpers.sh test_wait_for_health_ok".

Actions #2

Updated by Kefu Chai almost 7 years ago

./bin/ceph health detail -c ./src/test/td/ceph-helpers/ceph.conf
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2017-05-18 21:41:03.652407 7f039e255700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
HEALTH_WARN too few PGs per OSD (4 < min 30)
too few PGs per OSD (4 < min 30)
Actions #3

Updated by Kefu Chai almost 7 years ago

  • Subject changed from health report is stale to "HEALTH_WARN too few PGs per OSD" test_wait_for_health_ok
Actions #4

Updated by Kefu Chai almost 7 years ago

  • Status changed from 12 to Fix Under Review
  • Assignee set to Kefu Chai
  • Priority changed from Immediate to Normal
Actions #5

Updated by Kefu Chai almost 7 years ago

Dan, I was wrong. This is only related to the testing of Ceph.

Actions #6

Updated by Kefu Chai almost 7 years ago

  • Status changed from Fix Under Review to Resolved