Bug #59656

pg_upmap_primary timeout

Added by Kevin NGUETCHOUANG 10 months ago. Updated 7 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
Administration/Usability
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,
I created a Ceph cluster using cephadm with the Reef release: 10 nodes, 3 mon nodes and 8 OSD nodes. On top of that, I mounted a CephFS, and everything was working well.

My problem started when I tried to manage PG primaries using pg_upmap_primary. This is where problems appeared. When I delete a file whose PGs are subject to a pg_upmap_primary, the CephFS filesystem gets stuck and performs no further I/O operations.
I sincerely don't know what is happening.
`ceph health detail` indicated "HEALTH_OK" during the whole process.
No crashes were reported during the whole process, and there are no error logs either.

I've been stuck on this for about 7 days and have no idea how to fix it.

Thank you for reading.

ceph-volume.log View (57 KB) Kevin NGUETCHOUANG, 05/08/2023 08:07 PM

ceph-client.ceph-exporter.node0-link-1.log View (560 KB) Kevin NGUETCHOUANG, 05/08/2023 08:07 PM

om (8.09 KB) Kevin NGUETCHOUANG, 05/08/2023 08:27 PM

History

#1 Updated by Venky Shankar 10 months ago

  • Project changed from CephFS to RADOS
  • Category changed from Administration/Usability to Administration/Usability

#2 Updated by Radoslaw Zarzynski 10 months ago

  • Assignee set to Laura Flores

This is a fresh Reef feature added in https://github.com/ceph/ceph/pull/49178.
CCing Laura, who was involved here.

#3 Updated by Laura Flores 10 months ago

Hello Kevin, thanks for reporting this issue.

A few questions:
1. What is the version of your cluster?
2. In what way is your filesystem stuck? Can you provide some logs or log messages that point to the problem?
3. If possible, can you attach a copy of your osdmap? You can do so by running `ceph osd getmap -o my_osdmap`.
4. Did you apply the pg-upmap-primary command manually, or did you apply it using the output from the osdmaptool in read balance mode?

Any more details you can provide about your problem would be helpful, including steps to reproduce.

#4 Updated by Kevin NGUETCHOUANG 10 months ago

Hello Laura, thank you for answering me.

1. I'm using the Reef version (v18).
2. This is where the problem begins: I don't have any logs, and after 360s I get the Ceph health warning "1 MDSs report slow requests".
3. OK, I will do so; my cluster is upgrading right now.
4. I used the output from the osdmaptool in read balance mode.

I don't have any errors most of the time, but on some very rare occasions I get the error "Error EINVAL: osd.* is not acting set for pg <pgid>".
The rest of the time I don't get any error messages at all; the cluster just gets stuck on any I/O operation to the selected read pool.

What I tried was to reconstruct the map with the desired PGs given by the osdmaptool's output and apply the new map with the new primary OSDs.

#5 Updated by Laura Flores 10 months ago

Thanks Kevin, "Error EINVAL: osd.* is not in acting set for pg <pgid>" helps, as it points me to the area of the code where the problem happens. If you have OSD logs from around the time of the failure, can you attach those too?

#6 Updated by Kevin NGUETCHOUANG 10 months ago

How can I get the OSD logs?

#7 Updated by Laura Flores 10 months ago

Kevin NGUETCHOUANG wrote:

How can I get the OSD logs?

All Ceph logs are available by default under "/var/log/ceph/<your cluster id>". If you saw an error message like "Error EINVAL: osd.0 is not in acting set for pg <pgid>", it would then be helpful to send the log for osd.0.

If the tracker complains that your log is too big to attach, you can attach a compressed version by:
  1. Saving a copy (e.g. osd.0.copy)
  2. Running `gzip osd.0.copy`

The compressed copy that you can attach would be named "osd.0.copy.gz".
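The two steps above can be sketched in the shell. This is a minimal illustration: the log content here is a stand-in, and the source path you would really copy from is the /var/log/ceph/<your cluster id> location mentioned above.

```shell
# Sketch of steps 1-2 above. The printf creates a stand-in log file;
# on a real cluster you would instead do something like:
#   cp /var/log/ceph/<your cluster id>/ceph-osd.0.log osd.0.copy
printf 'example log line\n' > osd.0.copy   # 1. save a copy
gzip osd.0.copy                            # 2. compress it (replaces it with osd.0.copy.gz)
ls osd.0.copy.gz                           # this is the file to attach
```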

#8 Updated by Kevin NGUETCHOUANG 10 months ago

OK, for now I have no errors; I will send the logs when I face the "not acting set" error again.
Can you please look into why the I/Os get stuck, since this behavior is the more frequent one?

#9 Updated by Laura Flores 10 months ago

It's difficult to say without the logs. Even if there are no errors explicitly presenting themselves, something off about the behavior may be observable in the OSD logs.

#10 Updated by Kevin NGUETCHOUANG 10 months ago

These are the only logs I have. I really don't think they contain any valuable information, but maybe I'm wrong.

#11 Updated by Laura Flores 10 months ago

Thanks Kevin. Your osdmap will also be helpful whenever you get a chance.
I will need some time to evaluate what's happening here, but I will keep the tracker updated.

Also, please update the tracker if you see any more error messages.

#12 Updated by Kevin NGUETCHOUANG 10 months ago

  • File om added

This is the map on which I tried to apply the osdmaptool.

#13 Updated by Laura Flores 10 months ago

Thank you Kevin! I appreciate it.

#14 Updated by Laura Flores 10 months ago

Hi Kevin,

I am working to reproduce this issue on my end, but I also have some tricks you can try to generate OSD and monitor logs on your end. My guess as to why you don't have OSD or monitor logs is that you don't have debug logging enabled.

If enabled indefinitely, debug logging can take up space over time. So, here's how you can enable debug logging temporarily, then revert back to your current settings.

To get your current debug_osd settings, you can run this command for each OSD. Record what the settings are for each OSD so you can change the setting back later, if you prefer:
`ceph daemon osd.<osd id> config get debug_osd`

To get your current debug_mon settings, you can run this command for each monitor. Record what the settings are for each monitor so you can change the setting back later, if you prefer:
`ceph daemon mon.<mon id> config get debug_mon`

To enable logging on each OSD, run:
`ceph tell osd.* config set debug_osd 20`.

To enable logging on each monitor, run:
`ceph tell mon.* config set debug_mon 20`

Do this while you are experiencing symptoms of this bug. You can check that logs are generated by checking under /var/log/ceph/<your cluster id>.

If primary upmaps are the problem here, you may also want to try removing the upmaps with:
`ceph osd rm-pg-upmap-primary <pgid>`.
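If several upmaps need removing, the pgids can be pulled out of `ceph osd dump`, which lists them as `pg_upmap_primary <pgid> <osd>` lines like the osdmap dumps later in this thread. A minimal sketch, using sample lines in place of the live command:

```shell
# Sample lines standing in for `ceph osd dump` output. On a live cluster,
# pipe the real command through the same awk and feed each printed pgid
# to `ceph osd rm-pg-upmap-primary <pgid>`.
printf 'pg_upmap_primary 3.0 2\npg_upmap_primary 3.2 2\n' |
  awk '/^pg_upmap_primary/ {print $2}'
# prints:
# 3.0
# 3.2
```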

Keep the logs running and see if this improves your stuck filesystem.

Once you are satisfied with the logging, you can set the debug settings back to what they originally were with:
`ceph tell osd.* config set debug_osd <the setting you recorded earlier>`
`ceph tell mon.* config set debug_mon <the setting you recorded earlier>`
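For reference, `config get` prints its answer as JSON, so the value you record needs extracting before it can be fed back to `config set`. A minimal sketch of that extraction, with a sample of the JSON shape standing in for the live command (the "1/5" value is illustrative; your cluster's setting will differ):

```shell
# Sample JSON standing in for the output of
#   ceph daemon osd.<osd id> config get debug_osd
# On a live cluster the output may be pretty-printed over several lines;
# the sed below handles that too, since it prints only the matching line.
sample='{"debug_osd": "1/5"}'
printf '%s\n' "$sample" | sed -n 's/.*"debug_osd": *"\([^"]*\)".*/\1/p'
# prints: 1/5
```

The printed value is what you would later pass back via `ceph tell osd.<osd id> config set debug_osd <saved value>`.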

Let me know if this helps, or if you have any questions!

#15 Updated by Kevin NGUETCHOUANG 10 months ago

Hello Flaura (my shortcut for Flores + Laura), I experienced another malfunction (or not?) of the read balancer.
Based on the source code, the desired distribution of primary OSDs can depend on the primary affinities of the OSDs. So I tried to change the primary affinity of my OSDs, but I always get the same desired distribution regardless of the primary affinities I assigned to my OSDs. The steps I used to reproduce:
1. Assign affinities to my 10 OSDs, with 2 of them having the lowest primary affinity value.
2. Run the osdmaptool with the read-pool parameter, then apply the output provided.
3. Change the affinities in such a way that the OSDs with the lowest primary affinity change.
But when I rerun the osdmaptool, I receive the message that there is no further optimization. I thought the balancer would readapt the distribution based on the new primary affinity values?

#16 Updated by Laura Flores 10 months ago

Hi Kevin,

Yes, the read balancer does take primary affinity into account.

I will walk through an example on a vstart cluster, following the steps you provided to reproduce, to demonstrate how the read balancer readapts based on the new primary affinity values.

First, I got a copy of my osdmap:

$ ./bin/ceph osd getmap -o my_osdmap
got osdmap epoch 64

This is what it looks like:

$ ./bin/osdmaptool my_osdmap --dump
./bin/osdmaptool: osdmap file 'my_osdmap'
epoch 64
fsid 952981b4-1c33-4547-9394-4ba4908c1221
created 2023-05-17T20:43:49.380078+0000
modified 2023-05-17T20:46:08.200080+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7
full_ratio 0.99
backfillfull_ratio 0.99
nearfull_ratio 0.99
require_min_compat_client luminous
min_compat_client jewel
require_osd_release reef
stretch_mode_enabled false

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 3.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 15 pgp_num_target 16 autoscale_mode on last_change 64 lfor 0/0/32 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.69
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/32 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.13
pool 4 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.13
pool 5 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 15 pgp_num_target 32 autoscale_mode on last_change 64 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.31
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 application rgw read_balance_score 1.22
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.13

max_osd 3
osd.0 up   in  weight 1 up_from 6 up_thru 58 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6802/1108676481,v1:172.21.5.151:6803/1108676481] [v2:172.21.5.151:6804/1108676481,v1:172.21.5.151:6805/1108676481] exists,up d6ad325d-f532-4e4c-9d73-cef76dc7d700
osd.1 up   in  weight 1 up_from 10 up_thru 62 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6810/4078894860,v1:172.21.5.151:6811/4078894860] [v2:172.21.5.151:6812/4078894860,v1:172.21.5.151:6813/4078894860] exists,up 28f84961-9442-455b-8dec-01d619938d72
osd.2 up   in  weight 1 up_from 14 up_thru 62 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6818/3858490727,v1:172.21.5.151:6819/3858490727] [v2:172.21.5.151:6820/3858490727,v1:172.21.5.151:6821/3858490727] exists,up 7a4a8af3-30c3-4358-b695-23a707ba4f5c

Next, I'll change the primary affinity on osd.0 to 0 so no primaries will be assigned to it:

$ ./bin/ceph osd primary-affinity osd.0 0
set osd.0 primary-affinity to 0 (802)

$ ./bin/ceph osd getmap -o my_osdmap
got osdmap epoch 97

$ ./bin/osdmaptool my_osdmap --dump
./bin/osdmaptool: osdmap file 'my_osdmap'
epoch 97
fsid 952981b4-1c33-4547-9394-4ba4908c1221
created 2023-05-17T20:43:49.380078+0000
modified 2023-05-17T20:48:33.353125+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7
full_ratio 0.99
backfillfull_ratio 0.99
nearfull_ratio 0.99
require_min_compat_client luminous
min_compat_client jewel
require_osd_release reef
stretch_mode_enabled false

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 2.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 66 lfor 0/0/32 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.25
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/32 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.25
pool 4 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.25
pool 5 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.00
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 application rgw read_balance_score 1.19
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.06

max_osd 3
osd.0 up   in  weight 1 primary_affinity 0 up_from 6 up_thru 92 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6802/1108676481,v1:172.21.5.151:6803/1108676481] [v2:172.21.5.151:6804/1108676481,v1:172.21.5.151:6805/1108676481] exists,up d6ad325d-f532-4e4c-9d73-cef76dc7d700
osd.1 up   in  weight 1 up_from 10 up_thru 96 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6810/4078894860,v1:172.21.5.151:6811/4078894860] [v2:172.21.5.151:6812/4078894860,v1:172.21.5.151:6813/4078894860] exists,up 28f84961-9442-455b-8dec-01d619938d72
osd.2 up   in  weight 1 up_from 14 up_thru 96 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6818/3858490727,v1:172.21.5.151:6819/3858490727] [v2:172.21.5.151:6820/3858490727,v1:172.21.5.151:6821/3858490727] exists,up 7a4a8af3-30c3-4358-b695-23a707ba4f5c

Let's try to improve the read_balance_score of "cephfs.a.data" since it has one of the highest scores. (The .mgr pool's score is high as well, but it only has 1 pg, and we cannot improve it any further.)

First I ran the upmap balancer to make sure that we have balanced capacity (this helps set up the read balancer to calculate the optimal distribution):

$ ./bin/osdmaptool my_osdmap --upmap upmap.out
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: upmap.out
checking for upmap cleanups
upmap, max-count 10, max deviation 5
pools default.rgw.meta .mgr default.rgw.control cephfs.a.meta .rgw.root cephfs.a.data default.rgw.log 
prepared 0/10 changes
Unable to find further optimization, or distribution is already perfect

Then, I ran the read balancer on cephfs.a.data:

$ ./bin/osdmaptool my_osdmap --read read.out --vstart --read-pool cephfs.a.data
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: read.out

---------- BEFORE ------------ 
 osd.1 | primary affinity: 1 | number of prims: 20
 osd.2 | primary affinity: 1 | number of prims: 12

read_balance_score of 'cephfs.a.data': 1.25

---------- AFTER ------------ 
 osd.1 | primary affinity: 1 | number of prims: 16
 osd.2 | primary affinity: 1 | number of prims: 16

read_balance_score of 'cephfs.a.data': 1

num changes: 4

Applied the changes:

$ source read.out
change primary for pg 3.0 to osd.2
change primary for pg 3.2 to osd.2
change primary for pg 3.4 to osd.2
change primary for pg 3.6 to osd.2

Now we'll get a new copy of the osdmap and check how the scores have changed (you can also check the scores with `ceph osd pool ls detail` instead of checking the osdmap every time like I have done):

$ ./bin/ceph osd getmap -o my_osdmap
got osdmap epoch 103

$ ./bin/osdmaptool my_osdmap --dump
./bin/osdmaptool: osdmap file 'my_osdmap'
epoch 103
fsid 952981b4-1c33-4547-9394-4ba4908c1221
created 2023-05-17T20:43:49.380078+0000
modified 2023-05-17T20:57:16.411414+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7
full_ratio 0.99
backfillfull_ratio 0.99
nearfull_ratio 0.99
require_min_compat_client reef
min_compat_client luminous
require_osd_release reef
stretch_mode_enabled false

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 2.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 66 lfor 0/0/32 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.25
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/32 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.00
pool 4 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.25
pool 5 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.00
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 application rgw read_balance_score 1.19
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.06

max_osd 3
osd.0 up   in  weight 1 primary_affinity 0 up_from 6 up_thru 92 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6802/1108676481,v1:172.21.5.151:6803/1108676481] [v2:172.21.5.151:6804/1108676481,v1:172.21.5.151:6805/1108676481] exists,up d6ad325d-f532-4e4c-9d73-cef76dc7d700
osd.1 up   in  weight 1 up_from 10 up_thru 96 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6810/4078894860,v1:172.21.5.151:6811/4078894860] [v2:172.21.5.151:6812/4078894860,v1:172.21.5.151:6813/4078894860] exists,up 28f84961-9442-455b-8dec-01d619938d72
osd.2 up   in  weight 1 up_from 14 up_thru 102 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6818/3858490727,v1:172.21.5.151:6819/3858490727] [v2:172.21.5.151:6820/3858490727,v1:172.21.5.151:6821/3858490727] exists,up 7a4a8af3-30c3-4358-b695-23a707ba4f5c

pg_upmap_primary 3.0 2
pg_upmap_primary 3.2 2
pg_upmap_primary 3.4 2
pg_upmap_primary 3.6 2

Now, I'll change the primary affinity for osd.0 from 0 to 0.5:

$ ./bin/ceph osd primary-affinity osd.0 0.5
set osd.0 primary-affinity to 0.5 (8327682)

$ ./bin/ceph osd getmap -o my_osdmap
got osdmap epoch 105

$ ./bin/osdmaptool my_osdmap --dump
./bin/osdmaptool: osdmap file 'my_osdmap'
epoch 105
fsid 952981b4-1c33-4547-9394-4ba4908c1221
created 2023-05-17T20:43:49.380078+0000
modified 2023-05-17T21:00:51.851517+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7
full_ratio 0.99
backfillfull_ratio 0.99
nearfull_ratio 0.99
require_min_compat_client reef
min_compat_client luminous
require_osd_release reef
stretch_mode_enabled false

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 3.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 66 lfor 0/0/32 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.50
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/32 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.31
pool 4 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.31
pool 5 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.41
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 application rgw read_balance_score 1.50
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.41

max_osd 3
osd.0 up   in  weight 1 primary_affinity 0.5 up_from 6 up_thru 104 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6802/1108676481,v1:172.21.5.151:6803/1108676481] [v2:172.21.5.151:6804/1108676481,v1:172.21.5.151:6805/1108676481] exists,up d6ad325d-f532-4e4c-9d73-cef76dc7d700
osd.1 up   in  weight 1 up_from 10 up_thru 96 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6810/4078894860,v1:172.21.5.151:6811/4078894860] [v2:172.21.5.151:6812/4078894860,v1:172.21.5.151:6813/4078894860] exists,up 28f84961-9442-455b-8dec-01d619938d72
osd.2 up   in  weight 1 up_from 14 up_thru 102 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6818/3858490727,v1:172.21.5.151:6819/3858490727] [v2:172.21.5.151:6820/3858490727,v1:172.21.5.151:6821/3858490727] exists,up 7a4a8af3-30c3-4358-b695-23a707ba4f5c

pg_upmap_primary 3.0 2
pg_upmap_primary 3.2 2
pg_upmap_primary 3.4 2
pg_upmap_primary 3.6 2

The score for cephfs.a.data has gone up now that we changed primary affinity, so we'll rerun the read balancer to improve distribution again.

First, I'll balance capacity:

$ ./bin/osdmaptool my_osdmap --upmap upmap.out
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: upmap.out
checking for upmap cleanups
upmap, max-count 10, max deviation 5
pools default.rgw.meta default.rgw.control .rgw.root cephfs.a.data cephfs.a.meta default.rgw.log .mgr 
prepared 0/10 changes
Unable to find further optimization, or distribution is already perfect

Capacity is good, so I'll run the read balancer:

$ ./bin/osdmaptool my_osdmap --read read.out --vstart --read-pool cephfs.a.data
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: read.out
 Unable to find further optimization, or distribution is already perfect

The read balancer says it's optimized despite the score changing when we tweaked primary affinity. I designed the read balancer so that if the read_balance_score doesn't improve after it tries to optimize, it prints a message saying it can't help any further. However, if you want to check further details on why it couldn't optimize, you can run the command with "debug_osd=10" (or "debug_osd=30" for the most logging):

$ ./bin/osdmaptool my_osdmap --read read.out --vstart --read-pool cephfs.a.data --debug_osd=10
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: read.out
2023-05-17T21:18:12.090+0000 7ffbb6c30100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.31
2023-05-17T21:18:12.090+0000 7ffbb6c30100 10 balance_primaries desired distribution for osd.0 6.4
2023-05-17T21:18:12.090+0000 7ffbb6c30100 10 balance_primaries desired distribution for osd.1 12.8
2023-05-17T21:18:12.090+0000 7ffbb6c30100 10 balance_primaries desired distribution for osd.2 12.8
2023-05-17T21:18:12.091+0000 7ffbb6c30100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.31
2023-05-17T21:18:12.091+0000 7ffbb6c30100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.31
2023-05-17T21:18:12.091+0000 7ffbb6c30100 10 balance_primaries num_changes 0
2023-05-17T21:18:12.092+0000 7ffbb6c30100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.31

This tells us that the read balancer tried moving pgs around, but ultimately, the score was not able to improve from 1.31. The read balancer only provides commands to optimize if the read_balance_score improves.

If we bump osd.0's primary affinity to 1, we see the read balancer was able to optimize since osd.0 can take more pgs:

$ ./bin/ceph osd primary-affinity osd.0 1
set osd.0 primary-affinity to 1 (8655362)

$ ./bin/ceph osd getmap -o my_osdmap
got osdmap epoch 107

$ ./bin/osdmaptool my_osdmap --dump
./bin/osdmaptool: osdmap file 'my_osdmap'
epoch 107
fsid 952981b4-1c33-4547-9394-4ba4908c1221
created 2023-05-17T20:43:49.380078+0000
modified 2023-05-17T21:21:12.747977+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7
full_ratio 0.99
backfillfull_ratio 0.99
nearfull_ratio 0.99
require_min_compat_client reef
min_compat_client luminous
require_osd_release reef
stretch_mode_enabled false

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 3.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 66 lfor 0/0/32 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.50
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/32 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.22
pool 4 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.13
pool 5 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94 lfor 0/0/34 flags hashpspool stripe_width 0 application rgw read_balance_score 1.13
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 application rgw read_balance_score 1.22
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 lfor 0/0/36 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.13

max_osd 3
osd.0 up   in  weight 1 up_from 6 up_thru 106 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6802/1108676481,v1:172.21.5.151:6803/1108676481] [v2:172.21.5.151:6804/1108676481,v1:172.21.5.151:6805/1108676481] exists,up d6ad325d-f532-4e4c-9d73-cef76dc7d700
osd.1 up   in  weight 1 up_from 10 up_thru 96 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6810/4078894860,v1:172.21.5.151:6811/4078894860] [v2:172.21.5.151:6812/4078894860,v1:172.21.5.151:6813/4078894860] exists,up 28f84961-9442-455b-8dec-01d619938d72
osd.2 up   in  weight 1 up_from 14 up_thru 102 down_at 0 last_clean_interval [0,0) [v2:172.21.5.151:6818/3858490727,v1:172.21.5.151:6819/3858490727] [v2:172.21.5.151:6820/3858490727,v1:172.21.5.151:6821/3858490727] exists,up 7a4a8af3-30c3-4358-b695-23a707ba4f5c

pg_upmap_primary 3.0 2
pg_upmap_primary 3.2 2
pg_upmap_primary 3.4 2
pg_upmap_primary 3.6 2

$ ./bin/osdmaptool my_osdmap --upmap upmap.out
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: upmap.out
checking for upmap cleanups
upmap, max-count 10, max deviation 5
pools .rgw.root cephfs.a.meta default.rgw.control .mgr default.rgw.meta default.rgw.log cephfs.a.data 
prepared 0/10 changes
Unable to find further optimization, or distribution is already perfect

$ ./bin/osdmaptool my_osdmap --read read.out --vstart --read-pool cephfs.a.data
./bin/osdmaptool: osdmap file 'my_osdmap'
writing upmap command output to: read.out

---------- BEFORE ------------ 
 osd.0 | primary affinity: 1 | number of prims: 11
 osd.1 | primary affinity: 1 | number of prims: 8
 osd.2 | primary affinity: 1 | number of prims: 13

read_balance_score of 'cephfs.a.data': 1.22

---------- AFTER ------------ 
 osd.0 | primary affinity: 1 | number of prims: 11
 osd.1 | primary affinity: 1 | number of prims: 10
 osd.2 | primary affinity: 1 | number of prims: 11

read_balance_score of 'cephfs.a.data': 1.03

num changes: 3

Running it with debug_osd=10 shows how the score improves in the logging:

2023-05-17T21:24:51.661+0000 7fbf81ea2100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.22
2023-05-17T21:24:51.662+0000 7fbf81ea2100 10 balance_primaries desired distribution for osd.0 10.6667
2023-05-17T21:24:51.662+0000 7fbf81ea2100 10 balance_primaries desired distribution for osd.1 10.6667
2023-05-17T21:24:51.662+0000 7fbf81ea2100 10 balance_primaries desired distribution for osd.2 10.6667
2023-05-17T21:24:51.662+0000 7fbf81ea2100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.22
2023-05-17T21:24:51.663+0000 7fbf81ea2100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.03
2023-05-17T21:24:51.663+0000 7fbf81ea2100 10 balance_primaries num_changes 3
2023-05-17T21:24:51.663+0000 7fbf81ea2100 10 calc_read_balance_score pool cephfs.a.data wl_score: 1.03

Ultimately, my guess is that with the configuration you tried, the read balancer tried to move pgs around but couldn't improve the read_balance_score for the pool you chose. Feel free to paste any of your output like I have done so we can analyze your use case specifically.

#17 Updated by Kevin NGUETCHOUANG 10 months ago

Hello Flaura, thanks for your answer.

Indeed, your example explains a lot of things. I will try to better understand my own outputs with the debug mode, and next time I will get back to you with my output if I don't understand it.

#18 Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to Need More Info

#19 Updated by Kevin NGUETCHOUANG 7 months ago

Can I also have the logs enabled (from the dout function) for PrimaryLogPG if I set debug_osd to 10?

#20 Updated by Laura Flores 7 months ago

Hi Kevin, you can use debug_osd=10 to see high level information, and debug_osd=20 to see all the nitty-gritty stuff.

#21 Updated by Kevin NGUETCHOUANG 7 months ago

Even for dout? Because it works for ldout functions, but with dout I don't see it working.

#22 Updated by Laura Flores 7 months ago

Pretty sure only ldout is used in the OSDMap code.

#23 Updated by Kevin NGUETCHOUANG 7 months ago

Yes, ldout is the only one used in the OSDMap; I was instead talking about other source files, like OSD.cc.

#24 Updated by Laura Flores 7 months ago

Ah, I see. The majority of the primary balancer is OSDMap code, which is made up of ldout functions. So, all the relevant debug information available is from ldout.
