Bug #63209
Status: Open
Optimize the design so that unknown PGs caused by node anomalies etc. do not block the continued creation of normal PGs
Description
While deploying a host-level, three-replica Ceph cluster, we encountered an anomaly on one of the nodes, so no OSD was created on that node. When we subsequently created a host-level pool, we found that PGs could not be created on the abnormal node, which was expected. What surprised us, however, was that PGs could not be created on any of the normal nodes following the abnormal node either. This left a large number of PGs in the cluster stuck in the 'creating' state.
Updated by wenjuan wang 7 months ago
The problem can be reproduced with the following steps.
- We started the Ceph cluster with vstart. The cluster consists of 5 nodes with 10 OSDs each; each node gets a three-replica pool, and each pool has 1024 PGs.
1. Create a Ceph cluster with vstart.
# Create 50 OSDs, corresponding to 5 hosts, with each host containing 10 OSDs.
MON=1 OSD=50 MGR=1 MDS=0 RGW=0 ../src/vstart.sh -b -X --without-dashboard -n

# Create hosts; the default host is szjd-yfq-pm-os01-bconest-4.
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-5 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-6 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-7 host
ceph osd crush add-bucket szjd-yfq-pm-os01-bconest-8 host

# Create roots
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-4 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-5 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-6 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-7 root
ceph osd crush add-bucket root-szjd-yfq-pm-os01-bconest-8 root

# Move each host to the corresponding root and each OSD to the corresponding host
ceph osd crush move szjd-yfq-pm-os01-bconest-4 root=root-szjd-yfq-pm-os01-bconest-4
ceph osd crush move szjd-yfq-pm-os01-bconest-5 root=root-szjd-yfq-pm-os01-bconest-5
ceph osd crush move szjd-yfq-pm-os01-bconest-6 root=root-szjd-yfq-pm-os01-bconest-6
ceph osd crush move szjd-yfq-pm-os01-bconest-7 root=root-szjd-yfq-pm-os01-bconest-7
ceph osd crush move szjd-yfq-pm-os01-bconest-8 root=root-szjd-yfq-pm-os01-bconest-8
for id in {0..9}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-4; done
for id in {10..19}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-5; done
for id in {20..29}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-6; done
for id in {30..39}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-7; done
for id in {40..49}; do ceph osd crush move osd.$id host=szjd-yfq-pm-os01-bconest-8; done
2. Delete all OSDs under the host=szjd-yfq-pm-os01-bconest-5.
for id in {10..19}; do ceph osd stop osd.$id; done
for id in {10..19}; do ceph osd rm osd.$id; done
for id in {10..19}; do ceph osd crush rm osd.$id; done
for id in {10..19}; do ceph auth del osd.$id; done
3. Create a pool with 3 replicas at the node level.
# Create CRUSH rules
ceph osd crush rule create-simple wwj_test_crush_rule-0 root-szjd-yfq-pm-os01-bconest-4 host
ceph osd crush rule create-simple wwj_test_crush_rule-1 root-szjd-yfq-pm-os01-bconest-5 host
ceph osd crush rule create-simple wwj_test_crush_rule-2 root-szjd-yfq-pm-os01-bconest-6 host
ceph osd crush rule create-simple wwj_test_crush_rule-3 root-szjd-yfq-pm-os01-bconest-7 host
ceph osd crush rule create-simple wwj_test_crush_rule-4 root-szjd-yfq-pm-os01-bconest-8 host

# Create host-level pools
for id in {0..4}; do ceph osd pool create wwj-test-pool-$id 1024 1024 replicated wwj_test_crush_rule-$id 0 1 off; done

# Set pool size to 3
for id in {0..4}; do ceph osd pool set wwj-test-pool-$id size 3 --yes-i-really-mean-it; done
4. Reproduction successful.
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph -s
2023-10-16T01:55:00.003+0000 7fc676673700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:55:00.065+0000 7fc676673700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     0aacb1cc-5684-4aab-87f1-a11f067adf7b
    health: HEALTH_WARN
            2 mgr modules have failed dependencies
            Reduced data availability: 4096 pgs inactive
            Degraded data redundancy: 1024 pgs undersized

  services:
    mon: 1 daemons, quorum a (age 2d)
    mgr: x(active, since 2d)
    osd: 40 osds: 39 up (since 2d), 39 in (since 2d)

  data:
    pools:   5 pools, 5120 pgs
    objects: 0 objects, 0 B
    usage:   401 MiB used, 3.8 TiB / 3.8 TiB avail
    pgs:     80.000% pgs unknown
             4096 unknown
             1024 active+clean

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd pool ls detail
2023-10-16T01:55:19.279+0000 7fe557455700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:55:19.287+0000 7fe557455700 -1 WARNING: all dangerous and experimental features are enabled.
pool 16 'wwj-test-pool-0' replicated size 3 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 372 flags hashpspool stripe_width 0
pool 17 'wwj-test-pool-1' replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 373 flags hashpspool,creating stripe_width 0
pool 18 'wwj-test-pool-2' replicated size 3 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 374 flags hashpspool,creating stripe_width 0
pool 19 'wwj-test-pool-3' replicated size 3 min_size 1 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 375 flags hashpspool,creating stripe_width 0
pool 20 'wwj-test-pool-4' replicated size 3 min_size 1 crush_rule 5 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 376 flags hashpspool,creating stripe_width 0

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd tree
2023-10-16T01:56:55.460+0000 7fdc7ee41700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T01:56:56.011+0000 7fdc7ee41700 -1 WARNING: all dangerous and experimental features are enabled.
ID   CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
-13         0.87918  root root-szjd-yfq-pm-os01-bconest-8
 -8         0.87918      host szjd-yfq-pm-os01-bconest-8
 40    hdd  0.09769          osd.40                                up   1.00000  1.00000
 41    hdd  0.09769          osd.41                                up   1.00000  1.00000
 43    hdd  0.09769          osd.43                                up   1.00000  1.00000
 44    hdd  0.09769          osd.44                                up   1.00000  1.00000
 45    hdd  0.09769          osd.45                                up   1.00000  1.00000
 46    hdd  0.09769          osd.46                                up   1.00000  1.00000
 47    hdd  0.09769          osd.47                                up   1.00000  1.00000
 48    hdd  0.09769          osd.48                                up   1.00000  1.00000
 49    hdd  0.09769          osd.49                                up   1.00000  1.00000
-12         0.97687  root root-szjd-yfq-pm-os01-bconest-7
 -7         0.97687      host szjd-yfq-pm-os01-bconest-7
 30    hdd  0.09769          osd.30                                up   1.00000  1.00000
 31    hdd  0.09769          osd.31                                up   1.00000  1.00000
 32    hdd  0.09769          osd.32                                up   1.00000  1.00000
 33    hdd  0.09769          osd.33                                up   1.00000  1.00000
 34    hdd  0.09769          osd.34                                up   1.00000  1.00000
 35    hdd  0.09769          osd.35                                up   1.00000  1.00000
 36    hdd  0.09769          osd.36                                up   1.00000  1.00000
 37    hdd  0.09769          osd.37                                up   1.00000  1.00000
 38    hdd  0.09769          osd.38                                up   1.00000  1.00000
 39    hdd  0.09769          osd.39                                up   1.00000  1.00000
-11         0.97687  root root-szjd-yfq-pm-os01-bconest-6
 -6         0.97687      host szjd-yfq-pm-os01-bconest-6
 20    hdd  0.09769          osd.20                                up   1.00000  1.00000
 21    hdd  0.09769          osd.21                                up   1.00000  1.00000
 22    hdd  0.09769          osd.22                                up   1.00000  1.00000
 23    hdd  0.09769          osd.23                                up   1.00000  1.00000
 24    hdd  0.09769          osd.24                                up   1.00000  1.00000
 25    hdd  0.09769          osd.25                                up   1.00000  1.00000
 26    hdd  0.09769          osd.26                                up   1.00000  1.00000
 27    hdd  0.09769          osd.27                                up   1.00000  1.00000
 28    hdd  0.09769          osd.28                                up   1.00000  1.00000
 29    hdd  0.09769          osd.29                                up   1.00000  1.00000
-10               0  root root-szjd-yfq-pm-os01-bconest-5
 -5               0      host szjd-yfq-pm-os01-bconest-5
 -9         1.07455  root root-szjd-yfq-pm-os01-bconest-4
 -3         1.07455      host szjd-yfq-pm-os01-bconest-4
  0    hdd  0.09769          osd.0                                 up   1.00000  1.00000
  1    hdd  0.09769          osd.1                                 up   1.00000  1.00000
  2    hdd  0.09769          osd.2                                 up   1.00000  1.00000
  3    hdd  0.09769          osd.3                                 up   1.00000  1.00000
  4    hdd  0.09769          osd.4                                 up   1.00000  1.00000
  5    hdd  0.09769          osd.5                                 up   1.00000  1.00000
  6    hdd  0.09769          osd.6                                 up   1.00000  1.00000
  7    hdd  0.09769          osd.7                                 up   1.00000  1.00000
  8    hdd  0.09769          osd.8                                 up   1.00000  1.00000
  9    hdd  0.09769          osd.9                                 up   1.00000  1.00000
 42    hdd  0.09769          osd.42                                up   1.00000  1.00000
 -1               0  root default
As the results show, after szjd-yfq-pm-os01-bconest-5, PGs cannot be created on the remaining normal nodes either. For example, on node szjd-yfq-pm-os01-bconest-6, osd.20 through osd.29 are all up, yet the PGs are stuck in the creating state.
Updated by wenjuan wang 7 months ago
By setting debug_mon=20 and analyzing the mon log, we found the cause: the mon starts creating PGs after receiving the pool-creation command, but because of the anomaly on node szjd-yfq-pm-os01-bconest-5, the 1024 PGs mapped to that node got stuck at the update_pending_pgs() stage. As a result, pending_creatings.pgs.size() (1024) is no longer less than mon_osd_max_creating_pgs (default 1024), which blocks the creation of PGs on the subsequent normal nodes.
creating_pgs_t OSDMonitor::update_pending_pgs(const OSDMap::Incremental& inc,
                                              const OSDMap& nextmap)
{
  .........
  // process queue
  unsigned max = std::max<int64_t>(1, g_conf()->mon_osd_max_creating_pgs);
  const auto total = pending_creatings.pgs.size();
  while (pending_creatings.pgs.size() < max &&
         !pending_creatings.queue.empty()) {
    auto p = pending_creatings.queue.begin();
    int64_t poolid = p->first;
    dout(10) << __func__ << " pool " << poolid
             << " created " << p->second.created
             << " modified " << p->second.modified
             << " [" << p->second.start << "-" << p->second.end << ")"
             << dendl;
    int64_t n = std::min<int64_t>(max - pending_creatings.pgs.size(),
                                  p->second.end - p->second.start);
    ps_t first = p->second.start;
    ps_t end = first + n;
    for (ps_t ps = first; ps < end; ++ps) {
      const pg_t pgid{ps, static_cast<uint64_t>(poolid)};
      // NOTE: use the *current* epoch as the PG creation epoch so that the
      // OSD does not have to generate a long set of PastIntervals.
      pending_creatings.pgs.emplace(
        pgid,
        creating_pgs_t::pg_create_info(inc.epoch, p->second.modified));
      dout(10) << __func__ << " adding " << pgid << dendl;
    }
    p->second.start = end;
    if (p->second.done()) {
      dout(10) << __func__ << " done with queue for " << poolid << dendl;
      pending_creatings.queue.erase(p);
    } else {
      dout(10) << __func__ << " pool " << poolid
               << " now [" << p->second.start << "-" << p->second.end << ")"
               << dendl;
    }
  }
  dout(10) << __func__ << " queue remaining: "
           << pending_creatings.queue.size() << " pools" << dendl;
}
Therefore, we tried raising the mon_osd_max_creating_pgs configuration, for instance setting mon_osd_max_creating_pgs = 1000000.
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph daemon mon.a config show | grep mon_osd_max_creating_pgs
    "mon_osd_max_creating_pgs": "1024",
[root@szjd-yfq-pm-os01-bconest-4 build]# ceph daemon mon.a config set mon_osd_max_creating_pgs 1000000
{
    "success": "mon_osd_max_creating_pgs = '1000000' (not observed, change may require restart) "
}

for id in {0..4}; do ceph osd pool create wwj-test-pool-$id 1024 1024 replicated wwj_test_crush_rule-$id 0 1 off; done
for id in {0..4}; do ceph osd pool set wwj-test-pool-$id size 3 --yes-i-really-mean-it; done

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph -s
2023-10-16T02:13:22.945+0000 7faddbdec700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T02:13:22.948+0000 7faddbdec700 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     0aacb1cc-5684-4aab-87f1-a11f067adf7b
    health: HEALTH_WARN
            2 mgr modules have failed dependencies

  services:
    mon: 1 daemons, quorum a (age 2d)
    mgr: x(active, since 2d)
    osd: 40 osds: 40 up (since 16m), 40 in (since 16m)

  data:
    pools:   5 pools, 5120 pgs
    objects: 0 objects, 0 B
    usage:   446 MiB used, 3.9 TiB / 3.9 TiB avail
    pgs:     20.000% pgs unknown
             4096 active+clean
             1024 unknown

[root@szjd-yfq-pm-os01-bconest-4 build]# ceph osd pool ls detail
2023-10-16T02:13:29.255+0000 7f2c89060700 -1 WARNING: all dangerous and experimental features are enabled.
2023-10-16T02:13:29.264+0000 7f2c89060700 -1 WARNING: all dangerous and experimental features are enabled.
pool 21 'wwj-test-pool-0' replicated size 3 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 485 flags hashpspool stripe_width 0
pool 22 'wwj-test-pool-1' replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 486 flags hashpspool,creating stripe_width 0
pool 23 'wwj-test-pool-2' replicated size 3 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 487 flags hashpspool stripe_width 0
pool 24 'wwj-test-pool-3' replicated size 3 min_size 1 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 488 flags hashpspool stripe_width 0
pool 25 'wwj-test-pool-4' replicated size 3 min_size 1 crush_rule 5 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 489 flags hashpspool stripe_width 0
We found that raising mon_osd_max_creating_pgs resolved the problem: PGs can now be created on all the normal nodes, and only the PGs mapped to the abnormal node remain unknown.
However, we have a concern: will relaxing the mon_osd_max_creating_pgs limit impact the mon? After a PG completes peering, it registers with the mon, so could a very large number of simultaneous creations cause a storm on the mon?