Bug #63209
Status: Open
Optimize the design so that unknown PGs caused by node anomalies etc. do not block the continued creation of normal PGs
Description
While deploying a host-level, three-replica Ceph cluster, we encountered an anomaly on one of the nodes, so no OSD was created on that node. When we subsequently created a host-level pool, we found that PGs could not be created on the abnormal node, which was expected. What surprised us, however, was that PGs could not be created on any of the normal nodes following the abnormal node either. This left a large number of PGs in the cluster stuck in the 'creating' state.
Updated by wenjuan wang 7 months ago
The problem can be reproduced with the following steps.
- We started the Ceph cluster with vstart. The cluster consists of 5 nodes with 10 OSDs each; each node gets a three-replica pool, and each pool has 1024 PGs.
1. Create a Ceph cluster with vstart.
# Create 50 OSDs, corresponding to 5 hosts, with each host containing 10 OSDs.
MON=1 OSD=50 MGR=1 MDS=0 RGW=0 ../src/vstart.sh -b -X --without-dashboard -n

# Create hosts; the default host is szjd-yfq-pm-os01-bconest-4.
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-5 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-6 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-7 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-8 host

# Create roots
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-4 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-5 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-6 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-7 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-8 root

# Move each host to the corresponding root and each OSD to the corresponding host
ceph osd crush move szjd-yfq-pm-os01-bconest-4 root=root-szjd-yfq-pm-os01-bconest-4
ceph osd crush move szjd-yfq-pm-os01-bconest-5 root=root-szjd-yfq-pm-os01-bconest-5
ceph osd crush move szjd-yfq-pm-os01-bconest-6 root=root-szjd-yfq-pm-os01-bconest-6
ceph osd crush move szjd-yfq-pm-os01-bconest-7 root=root-szjd-yfq-pm-os01-bconest-7
ceph osd crush move szjd-yfq-pm-os01-bconest-8 root=root-szjd-yfq-pm-os01-bconest-8
for id in {0..9}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-4; done
for id in {10..19}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-5; done
for id in {20..29}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-6; done
for id in {30..39}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-7; done
for id in {40..49}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-8; done
2. Delete all OSDs under the host=szjd-yfq-pm-os01-bconest-5.
for id in {10..19}; do ceph osd stop osd.$id; done
for id in {10..19}; do ceph osd rm osd.$id; done
for id in {10..19}; do ceph osd crush rm osd.$id; done
for id in {10..19}; do ceph auth del osd.$id; done
3. Create a pool with 3 replicas at the node level.
# Create CRUSH rules
ceph osd crush rule create-simple wwj_test_crush_rule-0 root-szjd-yfq-pm-os01-bconest-4 host
ceph osd crush rule create-simple wwj_test_crush_rule-1 root-szjd-yfq-pm-os01-bconest-5 host
ceph osd crush rule create-simple wwj_test_crush_rule-2 root-szjd-yfq-pm-os01-bconest-6 host
ceph osd crush rule create-simple wwj_test_crush_rule-3 root-szjd-yfq-pm-os01-bconest-7 host
ceph osd crush rule create-simple wwj_test_crush_rule-4 root-szjd-yfq-pm-os01-bconest-8 host

# Create host-level pools
for id in {0..4}; do ceph osd pool create wwj-test-pool-$id 1024 1024 replicated wwj_test_crush_rule-$id 0 1 off; done

# Set pool size to 3
for id in {0..4}; do ceph osd pool set wwj-test-pool-$id size 3 --yes-i-really-mean-it; done
4. Reproduction successful.
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph -s
2023-10-16T01:55:00.003+0000 7fc676673700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:55:00.065+0000 7fc676673700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     0aacb1cc-5684-4aab-87f1-a11f067adf7b
    health: HEALTH_WARN
            2 mgr modules have failed dependencies
            Reduced data availability: 4096 pgs inactive
            Degraded data redundancy: 1024 pgs undersized

  services:
    mon: 1 daemons, quorum a (age 2d)
    mgr: x(active, since 2d)
    osd: 40 osds: 39 up (since 2d), 39 in (since 2d)

  data:
    pools:   5 pools, 5120 pgs
    objects: 0 objects, 0 B
    usage:   401 MiB used, 3.8 TiB / 3.8 TiB avail
    pgs:     80.000% pgs unknown
             4096 unknown
             1024 active+clean

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd pool ls detail
2023-10-16T01:55:19.279+0000 7fe557455700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:55:19.287+0000 7fe557455700 -1 WARNING: all dangerous and experimental features are enabled.
pool 16 'wwj-test-pool-0' replicated size 3 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 372 flags hashpspool stripe_width 0
pool 17 'wwj-test-pool-1' replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 373 flags hashpspool,creating stripe_width 0
pool 18 'wwj-test-pool-2' replicated size 3 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 374 flags hashpspool,creating stripe_width 0
pool 19 'wwj-test-pool-3' replicated size 3 min_size 1 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 375 flags hashpspool,creating stripe_width 0
pool 20 'wwj-test-pool-4' replicated size 3 min_size 1 crush_rule 5 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 376 flags hashpspool,creating stripe_width 0

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd tree
2023-10-16T01:56:55.460+0000 7fdc7ee41700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:56:56.011+0000 7fdc7ee41700 -1 WARNING: all dangerous and experimental features are enabled.
ID   CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
-13         0.87918  root root-szjd-yfq-pm-os01-bconest-8
 -8         0.87918      host szjd-yfq-pm-os01-bconest-8
 40    hdd  0.09769          osd.40                                up   1.00000  1.00000
 41    hdd  0.09769          osd.41                                up   1.00000  1.00000
 43    hdd  0.09769          osd.43                                up   1.00000  1.00000
 44    hdd  0.09769          osd.44                                up   1.00000  1.00000
 45    hdd  0.09769          osd.45                                up   1.00000  1.00000
 46    hdd  0.09769          osd.46                                up   1.00000  1.00000
 47    hdd  0.09769          osd.47                                up   1.00000  1.00000
 48    hdd  0.09769          osd.48                                up   1.00000  1.00000
 49    hdd  0.09769          osd.49                                up   1.00000  1.00000
-12         0.97687  root root-szjd-yfq-pm-os01-bconest-7
 -7         0.97687      host szjd-yfq-pm-os01-bconest-7
 30    hdd  0.09769          osd.30                                up   1.00000  1.00000
 31    hdd  0.09769          osd.31                                up   1.00000  1.00000
 32    hdd  0.09769          osd.32                                up   1.00000  1.00000
 33    hdd  0.09769          osd.33                                up   1.00000  1.00000
 34    hdd  0.09769          osd.34                                up   1.00000  1.00000
 35    hdd  0.09769          osd.35                                up   1.00000  1.00000
 36    hdd  0.09769          osd.36                                up   1.00000  1.00000
 37    hdd  0.09769          osd.37                                up   1.00000  1.00000
 38    hdd  0.09769          osd.38                                up   1.00000  1.00000
 39    hdd  0.09769          osd.39                                up   1.00000  1.00000
-11         0.97687  root root-szjd-yfq-pm-os01-bconest-6
 -6         0.97687      host szjd-yfq-pm-os01-bconest-6
 20    hdd  0.09769          osd.20                                up   1.00000  1.00000
 21    hdd  0.09769          osd.21                                up   1.00000  1.00000
 22    hdd  0.09769          osd.22                                up   1.00000  1.00000
 23    hdd  0.09769          osd.23                                up   1.00000  1.00000
 24    hdd  0.09769          osd.24                                up   1.00000  1.00000
 25    hdd  0.09769          osd.25                                up   1.00000  1.00000
 26    hdd  0.09769          osd.26                                up   1.00000  1.00000
 27    hdd  0.09769          osd.27                                up   1.00000  1.00000
 28    hdd  0.09769          osd.28                                up   1.00000  1.00000
 29    hdd  0.09769          osd.29                                up   1.00000  1.00000
-10               0  root root-szjd-yfq-pm-os01-bconest-5
 -5               0      host szjd-yfq-pm-os01-bconest-5
 -9         1.07455  root root-szjd-yfq-pm-os01-bconest-4
 -3         1.07455      host szjd-yfq-pm-os01-bconest-4
  0    hdd  0.09769          osd.0                                 up   1.00000  1.00000
  1    hdd  0.09769          osd.1                                 up   1.00000  1.00000
  2    hdd  0.09769          osd.2                                 up   1.00000  1.00000
  3    hdd  0.09769          osd.3                                 up   1.00000  1.00000
  4    hdd  0.09769          osd.4                                 up   1.00000  1.00000
  5    hdd  0.09769          osd.5                                 up   1.00000  1.00000
  6    hdd  0.09769          osd.6                                 up   1.00000  1.00000
  7    hdd  0.09769          osd.7                                 up   1.00000  1.00000
  8    hdd  0.09769          osd.8                                 up   1.00000  1.00000
  9    hdd  0.09769          osd.9                                 up   1.00000  1.00000
 42    hdd  0.09769          osd.42                                up   1.00000  1.00000
 -1               0  root default
As the results show, after szjd-yfq-pm-os01-bconest-5, PGs cannot be created on the remaining normal nodes either. For example, on node szjd-yfq-pm-os01-bconest-6, osd.20 through osd.29 are all up, yet the PGs are stuck in the creating state.
Updated by wenjuan wang 7 months ago
By setting debug_mon=20 and analyzing the mon log, we found the cause: the mon starts creating PGs after receiving the pool-creation command, but because of the anomaly on node szjd-yfq-pm-os01-bconest-5, the 1024 PGs mapped to that node got stuck at the update_pending_pgs() stage. As a result, pending_creatings.pgs.size() (1024) is no longer less than mon_osd_max_creating_pgs (default 1024), which blocks the creation of PGs on the subsequent normal nodes.
creating_pgs_t OSDMonitor::update_pending_pgs(const OSDMap::Incremental& inc,
                                              const OSDMap& nextmap)
{
  .........
  // process queue
  unsigned max = std::max<int64_t>(1, g_conf()->mon_osd_max_creating_pgs);
  const auto total = pending_creatings.pgs.size();
  while (pending_creatings.pgs.size() < max &&
         !pending_creatings.queue.empty()) {
    auto p = pending_creatings.queue.begin();
    int64_t poolid = p->first;
    dout(10) << __func__ << " pool " << poolid
             << " created " << p->second.created
             << " modified " << p->second.modified
             << " [" << p->second.start << "-" << p->second.end << ")"
             << dendl;
    int64_t n = std::min<int64_t>(max - pending_creatings.pgs.size(),
                                  p->second.end - p->second.start);
    ps_t first = p->second.start;
    ps_t end = first + n;
    for (ps_t ps = first; ps < end; ++ps) {
      const pg_t pgid{ps, static_cast<uint64_t>(poolid)};
      // NOTE: use the *current* epoch as the PG creation epoch so that the
      // OSD does not have to generate a long set of PastIntervals.
      pending_creatings.pgs.emplace(
        pgid,
        creating_pgs_t::pg_create_info(inc.epoch, p->second.modified));
      dout(10) << __func__ << " adding " << pgid << dendl;
    }
    p->second.start = end;
    if (p->second.done()) {
      dout(10) << __func__ << " done with queue for " << poolid << dendl;
      pending_creatings.queue.erase(p);
    } else {
      dout(10) << __func__ << " pool " << poolid
               << " now [" << p->second.start << "-" << p->second.end << ")"
               << dendl;
    }
  }
  dout(10) << __func__ << " queue remaining: "
           << pending_creatings.queue.size() << " pools" << dendl;
}
Therefore, we tried raising the mon_osd_max_creating_pgs configuration, for instance setting mon_osd_max_creating_pgs = 1000000.
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph daemon mon.a config show | grep mon_osd_max_creating_pgs
    "mon_osd_max_creating_pgs": "1024",
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph daemon mon.a config set mon_osd_max_creating_pgs 1000000
{
    "success": "mon_osd_max_creating_pgs = '1000000' (not observed, change may require restart) "
}

for id in {0..4}; do ceph osd pool create wwj-test-pool-$id 1024 1024 replicated wwj_test_crush_rule-$id 0 1 off; done
for id in {0..4}; do ceph osd pool set wwj-test-pool-$id size 3 --yes-i-really-mean-it; done

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph -s
2023-10-16T02:13:22.945+0000 7faddbdec700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T02:13:22.948+0000 7faddbdec700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     0aacb1cc-5684-4aab-87f1-a11f067adf7b
    health: HEALTH_WARN
            2 mgr modules have failed dependencies

  services:
    mon: 1 daemons, quorum a (age 2d)
    mgr: x(active, since 2d)
    osd: 40 osds: 40 up (since 16m), 40 in (since 16m)

  data:
    pools:   5 pools, 5120 pgs
    objects: 0 objects, 0 B
    usage:   446 MiB used, 3.9 TiB / 3.9 TiB avail
    pgs:     20.000% pgs unknown
             4096 active+clean
             1024 unknown

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd pool ls detail
2023-10-16T02:13:29.255+0000 7f2c89060700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T02:13:29.264+0000 7f2c89060700 -1 WARNING: all dangerous and experimental features are enabled.
pool 21 'wwj-test-pool-0' replicated size 3 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 485 flags hashpspool stripe_width 0
pool 22 'wwj-test-pool-1' replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 486 flags hashpspool,creating stripe_width 0
pool 23 'wwj-test-pool-2' replicated size 3 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 487 flags hashpspool stripe_width 0
pool 24 'wwj-test-pool-3' replicated size 3 min_size 1 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 488 flags hashpspool stripe_width 0
pool 25 'wwj-test-pool-4' replicated size 3 min_size 1 crush_rule 5 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 489 flags hashpspool stripe_width 0
We found that raising mon_osd_max_creating_pgs resolved the problem: PGs can now be created on all the normal nodes, and only the PGs mapped to the abnormal node remain unknown.
However, we have a concern: will relaxing the mon_osd_max_creating_pgs limit impact the mon? After a PG completes peering, it registers with the mon, so could a very large number of simultaneous creations cause a storm on the mon?