Bug #65013
Status: Open
replica_read not available on most recently updated objects in each PG
Description
In my dev environment, when trying to read from a replica (leveraging crush_location), the OSD rejects the requests and I see the following in the log:
2024-03-20T15:03:25.636-0400 7f176bdc36c0 20 osd.1 pg_epoch: 162 pg[6.1f( v 162'9763 (162'9000,162'9763] local-lis/les=161/162 n=6 ec=74/68 lis/c=161/161 les/c/f=162/162/0 sis=161) [0,1,2] r=1 lpr=161 luod=0'0 lua=162'6507 crt=162'9763 lcod 162'9762 mlcod 162'9761 active mbc={}] PeeringState::can_serve_replica_read can't read object on this replica
This was tested on a branch off main (2996d1320f95540aac255041daf03bd926600ca4).
Updated by Samuel Just about 1 month ago
From that log line, looks like last_update is 162'9763 and mlcod is 162'9761 -- that looks like 2 updates behind?
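For anyone following along, the two fields can be pulled out of a pg debug-log line with a throwaway script (a sketch, not Ceph tooling; the regex just targets the `v epoch'version` and `mlcod epoch'version` tokens shown in the line above):

```python
import re

def lu_mlcod_gap(line):
    """last_update version minus mlcod version, from a pg debug-log line."""
    lu = re.search(r" v \d+'(\d+)", line)
    mlcod = re.search(r"mlcod \d+'(\d+)", line)
    return int(lu.group(1)) - int(mlcod.group(1))

# Abbreviated copy of the log line from the description above:
line = ("pg[6.1f( v 162'9763 (162'9000,162'9763] "
        "lcod 162'9762 mlcod 162'9761 active mbc={}]")
print(lu_mlcod_gap(line))  # 2: matches the "2 updates behind" reading
```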
Updated by Samuel Just about 1 month ago
Is that branch https://github.com/ceph/ceph/pull/56180/files ad0cb1eb1609caa646abbbdf6ebccd4dfda0b417 ?
Updated by Samuel Just about 1 month ago
The following tests are done on ad0cb1eb1609caa646abbbdf6ebccd4dfda0b417 from https://github.com/ceph/ceph/pull/56180/files
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --redirect-output --debug -n --no-restart --osds-per-host 1
./bin/ceph osd primary-affinity 0 0
./bin/ceph osd primary-affinity 1 0
./bin/ceph osd pool create rbd 64 64
./bin/ceph osd pool set rbd size 3
./bin/rbd create rbd/image --size 4G
./bin/rbd create rbd/image2 --size 4G
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
#./bin/rbd bench --io-type write --io-pattern rand --image rbd/image2 --io-threads 16 --io-total 50M
./bin/rbd bench --io-type read --io-pattern rand --image rbd/image --io-threads 16 --io-total=100M --crush_location "host=$(hostname)-0" --rados_replica_read_policy=localize
The above should ensure that all pgs have osd.2 as a primary. The final bench read command should direct all reads to osd.0. The final read command yields the following output for me locally:
bench  type read io_size 4096 io_threads 16 bytes 104857600 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     11904   11955.5   47 MiB/s
    2     23984   12017.7   47 MiB/s
elapsed: 2   ops: 25600   ops/sec: 11995.9   bytes/sec: 47 MiB/s
65013-replica-read/build [sjust/wip-65013-replica-read] » grep -c 'read object on this replica' out/osd.0.log
5401
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.2.log | wc -l
5391
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.1.log | wc -l
0
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.0.log | wc -l
25600
So, ~5k out of ~25k ended up being bounced to the primary. Let's try again, but with some writes to a second image prior to reads:
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --redirect-output --debug -n --no-restart --osds-per-host 1
./bin/ceph osd primary-affinity 0 0
./bin/ceph osd primary-affinity 1 0
./bin/ceph osd pool create rbd 64 64
./bin/ceph osd pool set rbd size 3
./bin/rbd create rbd/image --size 4G
./bin/rbd create rbd/image2 --size 4G
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image2 --io-threads 16 --io-total 50M
./bin/rbd bench --io-type read --io-pattern rand --image rbd/image --io-threads 16 --io-total=100M --crush_location "host=$(hostname)-0" --rados_replica_read_policy=localize
bench  type read io_size 4096 io_threads 16 bytes 104857600 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      9168   9192.93   36 MiB/s
    2     20848   10436.9   41 MiB/s
elapsed: 2   ops: 25600   ops/sec: 10815.1   bytes/sec: 42 MiB/s
65013-replica-read/build [sjust/wip-65013-replica-read] » grep -c 'read object on this replica' out/osd.0.log
4
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.0.log | wc -l
25600
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.1.log | wc -l
0
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.2.log | wc -l
2
Flooding the pgs with writes to another image causes virtually all of the reads to be serviced by the replica.
Updated by Samuel Just about 1 month ago
A 4G image is composed of 1024 4M objects. The above script creates 64 pgs. The best possible case with the current implementation is that the replica's view of min_last_complete_ondisk is one version behind the primary (we don't send an extra message to update it if there isn't another write). Thus, we'd expect every PG to be unable to serve replica reads on at least one object, whichever is most recent. We'd expect 64/1024, or ~6%, of reads to hit those objects and therefore be bounced to the primary. That's a factor of 3-4 off of what we see above (5401/25600, or ~21%).
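Spelling that arithmetic out (a quick sanity check in Python, not Ceph code; 5401 and 25600 are the grep counts from the first run above):

```python
objects = (4 * 1024) // 4    # 4G image in 4M objects = 1024 objects
pgs = 64
blocked_per_pg = 1           # best case: only the newest object per pg blocked

expected_bounce = blocked_per_pg * pgs / objects   # fraction of reads bounced
measured_bounce = 5401 / 25600                     # bounced / total, first run

print(expected_bounce, measured_bounce)  # ~0.06 expected vs ~0.21 measured
```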
A cursory check of several pgs on osd.0 gives us:
65013-replica-read/build [sjust/wip-65013-replica-read] » for i in 0 1 2 3 4 5; grep "pg\[1.$i(" out/osd.0.log | tail -n 1
2024-03-20T22:41:34.784+0000 7f5f86328640 20 osd.0 pg_epoch: 20 pg[1.0( v 20'841 (0'0,20'841] local-lis/les=16/17 n=25 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'841 lcod 20'840 mlcod 20'833 active mbc={}] PeeringState::proc_lease readable_until_ub now 838.371032715s
2024-03-20T22:41:42.812+0000 7f5f85b27640 20 osd.0 pg_epoch: 20 pg[1.1( v 20'933 (0'0,20'933] local-lis/les=16/17 n=23 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'933 lcod 20'932 mlcod 20'925 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.399414062s
2024-03-20T22:41:42.813+0000 7f5f85326640 20 osd.0 pg_epoch: 20 pg[1.2( v 20'1304 (0'0,20'1304] local-lis/les=16/17 n=32 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'1304 lcod 20'1303 mlcod 20'1297 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.399841309s
2024-03-20T22:41:42.838+0000 7f5f84b25640 20 osd.0 pg_epoch: 20 pg[1.3( v 20'1618 (0'0,20'1618] local-lis/les=16/17 n=34 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'1618 lcod 20'1617 mlcod 20'1613 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.425048828s
2024-03-20T22:41:42.784+0000 7f5f84324640 20 osd.0 pg_epoch: 20 pg[1.4( v 20'1152 (0'0,20'1152] local-lis/les=16/17 n=31 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'1152 lcod 20'1151 mlcod 20'1149 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.371276855s
2024-03-20T22:41:42.832+0000 7f5f86328640 20 osd.0 pg_epoch: 20 pg[1.5( v 20'1015 (0'0,20'1015] local-lis/les=16/17 n=31 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'1015 lcod 20'1014 mlcod 20'1011 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.418884277s
pgid   lu       mlcod    gap
1.0    20'841   20'833   8
1.1    20'933   20'925   8
1.2    20'1304  20'1297  7
1.3    20'1618  20'1613  5
1.4    20'1152  20'1149  3
1.5    20'1015  20'1011  4
Mean=5.8
If we're missing an average of 5.8 entries per pg, we'd expect 5.8*64 objects to have reads bounced to the primary, or about 5.8*64/1024 = 36% of reads. This is larger than the ~21% we're measuring, but it's in the ballpark.
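The mean and the resulting estimate, as a quick check (a sketch; the gaps are taken from the table above):

```python
gaps = [8, 8, 7, 5, 3, 4]        # last_update - mlcod per pg, from the table
mean_gap = sum(gaps) / len(gaps)

pgs, objects = 64, 1024
estimate = mean_gap * pgs / objects  # expected fraction of bounced reads

print(f"mean gap {mean_gap:.1f}, estimated bounce fraction {estimate:.0%}")
# mean gap 5.8, estimated bounce fraction 36%
```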
Updated by Samuel Just about 1 month ago
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
This should never have more than 16 IOs in flight at a time. With io-pattern=rand, a single object or pg shouldn't have more than one outstanding write very often. The 5.8 mean gap above is therefore really strange.
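A back-of-the-envelope check of that claim (a sketch assuming 16 independent in-flight writes spread uniformly over 64 pgs, which the random pattern approximates):

```python
n_inflight, n_pgs = 16, 64
p = 1 / n_pgs   # chance a given in-flight write lands in a given pg

# Binomial tail: probability a given pg holds >= 2 of the 16 in-flight writes
p_ge2 = 1 - (1 - p) ** n_inflight \
          - n_inflight * p * (1 - p) ** (n_inflight - 1)
print(f"P(given pg holds >= 2 in-flight writes) = {p_ge2:.1%}")  # about 2.5%
```

So at any instant a pg should rarely have more than one outstanding write, which is why a sustained gap of ~6 entries per pg doesn't fit this model.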
Note: based on some later observations, I'm really not sure that rbd bench actually respects the io-threads param properly. For one thing, there isn't a large difference in throughput between 1 and 16. For another, mlcod with --io-threads=1 lags by much more than with rados bench -t 1. Will check into this more.
Updated by Samuel Just about 1 month ago
In addition to the observation that mlcod is lagging by more than it should, there's a second detail:
./bin/ceph osd down 0
65013-replica-read/build [sjust/wip-65013-replica-read] » for i in 0 1 2 3 4 5; grep "pg\[1.$i(" out/osd.0.log | tail -n 1
2024-03-20T23:13:32.536+0000 7f5f86328640 20 osd.0 pg_epoch: 24 pg[1.0( v 20'841 (0'0,20'841] local-lis/les=23/24 n=25 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'841 lcod 20'840 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.123291016s
2024-03-20T23:13:32.500+0000 7f5f85b27640 20 osd.0 pg_epoch: 24 pg[1.1( v 20'933 (0'0,20'933] local-lis/les=23/24 n=23 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'933 lcod 20'932 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.087158203s
2024-03-20T23:13:32.536+0000 7f5f85326640 20 osd.0 pg_epoch: 24 pg[1.2( v 20'1304 (0'0,20'1304] local-lis/les=23/24 n=32 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'1304 lcod 20'1303 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.123046875s
2024-03-20T23:13:40.536+0000 7f5f84b25640 20 osd.0 pg_epoch: 24 pg[1.3( v 20'1618 (0'0,20'1618] local-lis/les=23/24 n=34 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'1618 lcod 20'1617 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.123291016s
2024-03-20T23:13:40.536+0000 7f5f84324640 20 osd.0 pg_epoch: 24 pg[1.4( v 20'1152 (0'0,20'1152] local-lis/les=23/24 n=31 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'1152 lcod 20'1151 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.123779297s
2024-03-20T23:13:40.538+0000 7f5f86328640 20 osd.0 pg_epoch: 24 pg[1.5( v 20'1015 (0'0,20'1015] local-lis/les=23/24 n=31 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'1015 lcod 20'1014 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.125488281s
It seems we don't initialize mlcod until the very first write. This should be a very easy fix, since the activation-era last_update can't be rolled back; the complication is that the value in question, min_last_complete_ondisk, is an unnecessarily strong bound and requires information about the replicas.
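A minimal model of that fix (pure Python; the class and method names are illustrative, not PeeringState members). The point is just that the activation-era last_update is a safe initial bound, since those entries can no longer be rolled back:

```python
class ReplicaReadBound:
    """Toy model of the replica's read bound (mlcod); not actual Ceph code."""

    def __init__(self):
        self.mlcod = 0               # today: stays at 0'0 after activation

    def on_activate(self, activation_last_update):
        # Proposed: entries up to the activation-era last_update can't be
        # rolled back once the interval is established, so they are a safe
        # (if conservative) initial lower bound for replica reads.
        self.mlcod = activation_last_update

    def on_repop(self, primary_mlcod):
        self.mlcod = max(self.mlcod, primary_mlcod)

    def can_serve_read(self, object_version):
        return object_version <= self.mlcod

r = ReplicaReadBound()
r.on_activate(841)            # e.g. pg 1.0 above, activated at 20'841
print(r.can_serve_read(841))  # True: readable before any post-activation write
```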
Updated by Samuel Just about 1 month ago
Some definitions:
pg_info_t::last_update: most recent update seen by an OSD (primary or replica), should be the same as pg_log_t::head
pg_info_t::last_complete: version v such that all objects with versions <= v are fully recovered and readable (primary or replica)
PeeringState::last_update_ondisk: most recent update completed (most recent version that cannot be divergent) (primary only)
PeeringState::last_complete_ondisk: most recent version of pg_info_t::last_complete committed to disk (primary only)
PeeringState::min_last_complete_ondisk: min last_complete_ondisk across all replicas, mainly used for log trimming
This last one is the bound we're using to permit replica reads. On each write, we send the current min_last_complete_ondisk value from the primary to the replicas.
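That piggyback scheme can be modeled in a few lines (a deliberately simplified sketch: one replica, instant commits, no extra flush message; pure Python, not Ceph code). It shows the best-case behavior described above: the newest write is never replica-readable until the next write arrives.

```python
class Primary:
    """Toy model of mlcod piggybacking on repops; not actual Ceph code."""

    def __init__(self):
        self.mlcod = 0           # min last_complete_ondisk known to primary
        self.replica_mlcod = 0   # bound currently held by the replica

    def write(self, version):
        # The repop for this write carries the mlcod computed *before* this
        # write's commit is known, so the replica's bound is always behind.
        self.replica_mlcod = self.mlcod
        self.mlcod = version     # replica acks; primary advances its mlcod

p = Primary()
for v in range(1, 6):            # five writes, versions 1..5
    p.write(v)

# The replica can serve reads only for versions <= its bound:
print(p.replica_mlcod)           # 4: version 5 must bounce to the primary
```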
Updated by Yehuda Sadeh about 1 month ago
Reproduced the issue again on a new dev environment.
Created env:
$ ./src/mstart.sh c2 -n --debug --osds-per-host 1
start radosgw (off a branch that adds the read from replica capability):
$ radosgw -c /home/yehuda/ceph/build/run/c2/ceph.conf --rgw-frontends="beast port=8001" -n client.rgw.8001 --pid-file=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.pid --admin-socket=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.asok --debug-ms=1 --debug-rgw=20 --rgw-gc-processor-period=120 --rgw-gc-processor-max-time=120 --rgw-gc-obj-min-wait=60 --rgw-enable-usage-log=true --rgw-gc-max-trim-chunk=16 --rgw-gc-max-concurrent-io=8 --rgw-sync-lease-period=30 --rgw-lc-debug-interval=10 --rgw-data-sync-poll-interval=5 --crush-location=host=crestview-1 --rados-replica-read-policy=localize --log-file=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.log
uploaded a few objects (should be 10 rados objects for each):
$ for i in `seq 0 10`; do ./s3 put buck/blabla$i --in-file=40M; echo $i; done
reading these objects:
$ for i in `seq 0 10`; do ./s3 get buck/blabla$i > /dev/null & done
Now, checking the radosgw log I see a lot of the following:
2024-03-21T10:05:35.498-0400 7ff76f4546c0 1 -- 192.168.68.60:0/1299758790 <== osd.1 v2:192.168.68.60:6810/803208549 20086 ==== osd_op_reply(37592 1835f00c-d8f7-4e9f-9bda-beda8d11fdf7.4375.1__shadow_.VE9tc2ZswMKVxcQTrrpHuxOuC0hxxLc_5 [read 0~4194304] v0'0 uv0 ondisk = -11 ((11) Resource temporarily unavailable)) v8 ==== 230+0+0 (crc 0 0 0) 0x55654929ef00 con 0x55654e13a400
2024-03-21T10:05:35.498-0400 7ff76f4546c0 1 -- 192.168.68.60:0/1299758790 --> [v2:192.168.68.60:6802/1191469188,v1:192.168.68.60:6803/1191469188] -- osd_op(unknown.0.0:37610 10.8 10:10160bb0:::1835f00c-d8f7-4e9f-9bda-beda8d11fdf7.4375.1__shadow_.VE9tc2ZswMKVxcQTrrpHuxOuC0hxxLc_5:head [read 0~4194304] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected+supports_pool_eio e141) v9 -- 0x556556e9f400 con 0x55654e2c3680
It seems like around half of the total ops reading these objects are getting bounced:
$ grep RETR radosgw.8001.log | grep -v call | wc -l
204
$ for i in `seq 0 10`; do ./s3 get buck/blabla$i > /dev/null & done
$ grep RETR radosgw.8001.log | grep -v call | wc -l
256
from osd logs, requests that can/can't be satisfied from replica:
$ grep can.*safe osd.1.log | wc -l
411
$ grep can.t.read.obj osd.1.log | wc -l
30670
Actual successful read-from-replica ops on the rgw data pool were limited to a small number of pgs:
$ grep can.*safe osd.1.log | sed 's/.*pg\[//g' | awk '{print $1}' | grep ^10 | sort | uniq
10.10( 10.12( 10.15( 10.1e( 10.20( 10.39( 10.3f( 10.8( 10.f(
Updated by Yehuda Sadeh about 1 month ago
Also note, pgs from the data pool that couldn't serve from replica:
$ grep can.t.read.object osd.1.log | grep pg.10 | sed 's/.*pg\[//g' | awk '{print $1}' | sort | uniq
10.10( 10.12( 10.13( 10.15( 10.18( 10.19( 10.1b( 10.1e( 10.2( 10.20( 10.21( 10.24( 10.25( 10.29( 10.2a( 10.2b( 10.2d( 10.2e( 10.2f( 10.3( 10.30( 10.39( 10.3a( 10.3b( 10.3d( 10.3e( 10.3f( 10.4( 10.5( 10.8( 10.f(
from osd logs min_last_complete_ondisk is updated on the following pgs (from the data pool) -- for which this osd is the primary:
$ grep last_complete_ondisk.is.updated osd.1.log | sed 's/.*pg\[//g' | grep ^10 | awk '{print $1}' | sort | uniq
10.0( 10.16( 10.17( 10.1a( 10.1d( 10.1f( 10.22( 10.23( 10.26( 10.27( 10.28( 10.32( 10.33( 10.34( 10.35( 10.37( 10.7( 10.9( 10.b( 10.c( 10.d( 10.e(
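Cross-checking the three lists (a quick sketch; pgids copied from the grep output above with the trailing parens stripped) shows a consistent picture: every pg that successfully served a replica read also bounced some reads, and the pgs updating min_last_complete_ondisk, i.e. the ones where this osd is primary, overlap with neither.

```python
served = {"10.10", "10.12", "10.15", "10.1e", "10.20",
          "10.39", "10.3f", "10.8", "10.f"}
bounced = {"10.10", "10.12", "10.13", "10.15", "10.18", "10.19", "10.1b",
           "10.1e", "10.2", "10.20", "10.21", "10.24", "10.25", "10.29",
           "10.2a", "10.2b", "10.2d", "10.2e", "10.2f", "10.3", "10.30",
           "10.39", "10.3a", "10.3b", "10.3d", "10.3e", "10.3f", "10.4",
           "10.5", "10.8", "10.f"}
primary = {"10.0", "10.16", "10.17", "10.1a", "10.1d", "10.1f", "10.22",
           "10.23", "10.26", "10.27", "10.28", "10.32", "10.33", "10.34",
           "10.35", "10.37", "10.7", "10.9", "10.b", "10.c", "10.d", "10.e"}

print(served <= bounced)             # True: every serving pg also bounced reads
print(primary & (served | bounced))  # set(): primary pgs never replica-read
```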
Updated by Samuel Just about 1 month ago
'last_complete_ondisk is updated to' only happens on the primary, in PeeringState::calc_min_last_complete_ondisk(); the mechanism on the replica is different: it simply gets the value on repop.
Updated by Samuel Just about 1 month ago
- Related to Bug #65085: rados: replica mlcod tends to lag by two cycles rather than one limiting replica reads added
Updated by Samuel Just about 1 month ago
- Related to Bug #65086: rados: replicas do not initialize their mlcod value upon activation, replica reads are unavailable until the pg has seen at least one write added
Updated by Samuel Just about 1 month ago
- Related to Bug #65090: rados: the object most recently written on a pg won't be available for replica read until another write occurs added
Updated by Samuel Just about 1 month ago
- Subject changed from reading for replica (mostly) doesn't work to replica_read not available on most recently updated objects in each PG
- Description updated (diff)