Bug #65013
Status: Open
replica_read not available on most recently updated objects in each PG
Description
In my dev environment, when trying to read from a replica (leveraging crush_location), the OSD rejects the requests and I see the following in the log:
2024-03-20T15:03:25.636-0400 7f176bdc36c0 20 osd.1 pg_epoch: 162 pg[6.1f( v 162'9763 (162'9000,162'9763] local-lis/les=161/162 n=6 ec=74/68 lis/c=161/161 les/c/f=162/162/0 sis=161) [0,1,2] r=1 lpr=161 luod=0'0 lua=162'6507 crt=162'9763 lcod 162'9762 mlcod 162'9761 active mbc={}] PeeringState::can_serve_replica_read can't read object on this replica
This was tested on a branch off main (2996d1320f95540aac255041daf03bd926600ca4).
Updated by Samuel Just about 1 month ago
From that log line, looks like last_update is 162'9763 and mlcod is 162'9761 -- that looks like 2 updates behind?
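For anyone following along, the two fields can be pulled out of a pg debug-log line with a throwaway script (a sketch, not Ceph tooling; the regex just targets the `v epoch'version` and `mlcod epoch'version` tokens shown in the line above):

```python
import re

def lu_mlcod_gap(line):
    """last_update version minus mlcod version, from a pg debug-log line."""
    lu = re.search(r" v \d+'(\d+)", line)
    mlcod = re.search(r"mlcod \d+'(\d+)", line)
    return int(lu.group(1)) - int(mlcod.group(1))

# Abbreviated copy of the log line from the description above:
line = ("pg[6.1f( v 162'9763 (162'9000,162'9763] "
        "lcod 162'9762 mlcod 162'9761 active mbc={}]")
print(lu_mlcod_gap(line))  # 2: matches the "2 updates behind" reading
```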
Updated by Samuel Just about 1 month ago
Is that branch https://github.com/ceph/ceph/pull/56180/files ad0cb1eb1609caa646abbbdf6ebccd4dfda0b417 ?
Updated by Samuel Just about 1 month ago
The following tests are done on ad0cb1eb1609caa646abbbdf6ebccd4dfda0b417 from https://github.com/ceph/ceph/pull/56180/files
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --redirect-output --debug -n --no-restart --osds-per-host 1
./bin/ceph osd primary-affinity 0 0
./bin/ceph osd primary-affinity 1 0
./bin/ceph osd pool create rbd 64 64
./bin/ceph osd pool set rbd size 3
./bin/rbd create rbd/image --size 4G
./bin/rbd create rbd/image2 --size 4G
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
#./bin/rbd bench --io-type write --io-pattern rand --image rbd/image2 --io-threads 16 --io-total 50M
./bin/rbd bench --io-type read --io-pattern rand --image rbd/image --io-threads 16 --io-total=100M --crush_location "host=$(hostname)-0" --rados_replica_read_policy=localize
The above should ensure that all pgs have osd.2 as a primary. The final bench read command should direct all reads to osd.0. The final read command yields the following output for me locally:
bench  type read io_size 4096 io_threads 16 bytes 104857600 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     11904   11955.5   47 MiB/s
    2     23984   12017.7   47 MiB/s
elapsed: 2   ops: 25600   ops/sec: 11995.9   bytes/sec: 47 MiB/s
65013-replica-read/build [sjust/wip-65013-replica-read] » grep -c 'read object on this replica' out/osd.0.log
5401
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.2.log | wc -l
5391
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.1.log | wc -l
0
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.0.log | wc -l
25600
So, ~5k out of ~25k ended up being bounced to the primary. Let's try again, but with some writes to a second image prior to reads:
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --redirect-output --debug -n --no-restart --osds-per-host 1
./bin/ceph osd primary-affinity 0 0
./bin/ceph osd primary-affinity 1 0
./bin/ceph osd pool create rbd 64 64
./bin/ceph osd pool set rbd size 3
./bin/rbd create rbd/image --size 4G
./bin/rbd create rbd/image2 --size 4G
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image2 --io-threads 16 --io-total 50M
./bin/rbd bench --io-type read --io-pattern rand --image rbd/image --io-threads 16 --io-total=100M --crush_location "host=$(hostname)-0" --rados_replica_read_policy=localize
bench  type read io_size 4096 io_threads 16 bytes 104857600 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      9168   9192.93   36 MiB/s
    2     20848   10436.9   41 MiB/s
elapsed: 2   ops: 25600   ops/sec: 10815.1   bytes/sec: 42 MiB/s
65013-replica-read/build [sjust/wip-65013-replica-read] » grep -c 'read object on this replica' out/osd.0.log
4
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.0.log | wc -l
25600
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.1.log | wc -l
0
65013-replica-read/build [sjust/wip-65013-replica-read] » grep 'do_op osd_op.*\[read' out/osd.2.log | wc -l
2
Flooding the pgs with writes to another image causes virtually all of the reads to be serviced by the replica.
Updated by Samuel Just about 1 month ago
A 4G image is composed of 1024 4M objects. The above script creates 64 pgs. The best possible case with the current implementation is that the replica's view of min_last_complete_ondisk is one version behind the primary (we don't send an extra message to update it if there isn't another write). Thus, we'd expect every PG to be unable to serve replica reads on at least one object, whichever is most recent. We'd expect 64/1024, or ~6%, of reads to hit those objects and therefore be bounced to the primary. That's a factor of 3-4 off of what we see above (5401/25600, or ~21%).
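Spelling that arithmetic out (a quick sanity check in Python, not Ceph code; 5401 and 25600 are the grep counts from the first run above):

```python
objects = (4 * 1024) // 4    # 4G image in 4M objects = 1024 objects
pgs = 64
blocked_per_pg = 1           # best case: only the newest object per pg blocked

expected_bounce = blocked_per_pg * pgs / objects   # fraction of reads bounced
measured_bounce = 5401 / 25600                     # bounced / total, first run

print(expected_bounce, measured_bounce)  # ~0.06 expected vs ~0.21 measured
```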
A cursory check of several pgs on osd.0 gives us:
65013-replica-read/build [sjust/wip-65013-replica-read] » for i in 0 1 2 3 4 5; grep "pg\[1.$i(" out/osd.0.log | tail -n 1
2024-03-20T22:41:34.784+0000 7f5f86328640 20 osd.0 pg_epoch: 20 pg[1.0( v 20'841 (0'0,20'841] local-lis/les=16/17 n=25 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'841 lcod 20'840 mlcod 20'833 active mbc={}] PeeringState::proc_lease readable_until_ub now 838.371032715s
2024-03-20T22:41:42.812+0000 7f5f85b27640 20 osd.0 pg_epoch: 20 pg[1.1( v 20'933 (0'0,20'933] local-lis/les=16/17 n=23 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'933 lcod 20'932 mlcod 20'925 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.399414062s
2024-03-20T22:41:42.813+0000 7f5f85326640 20 osd.0 pg_epoch: 20 pg[1.2( v 20'1304 (0'0,20'1304] local-lis/les=16/17 n=32 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'1304 lcod 20'1303 mlcod 20'1297 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.399841309s
2024-03-20T22:41:42.838+0000 7f5f84b25640 20 osd.0 pg_epoch: 20 pg[1.3( v 20'1618 (0'0,20'1618] local-lis/les=16/17 n=34 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'1618 lcod 20'1617 mlcod 20'1613 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.425048828s
2024-03-20T22:41:42.784+0000 7f5f84324640 20 osd.0 pg_epoch: 20 pg[1.4( v 20'1152 (0'0,20'1152] local-lis/les=16/17 n=31 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,1,0] r=2 lpr=16 luod=0'0 lua=0'0 crt=20'1152 lcod 20'1151 mlcod 20'1149 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.371276855s
2024-03-20T22:41:42.832+0000 7f5f86328640 20 osd.0 pg_epoch: 20 pg[1.5( v 20'1015 (0'0,20'1015] local-lis/les=16/17 n=31 ec=16/16 lis/c=16/16 les/c/f=17/17/0 sis=16) [2,0,1] r=1 lpr=16 luod=0'0 lua=0'0 crt=20'1015 lcod 20'1014 mlcod 20'1011 active mbc={}] PeeringState::proc_lease readable_until_ub now 846.418884277s
pgid   lu       mlcod    gap
1.0    20'841   20'833   8
1.1    20'933   20'925   8
1.2    20'1304  20'1297  7
1.3    20'1618  20'1613  5
1.4    20'1152  20'1149  3
1.5    20'1015  20'1011  4
Mean=5.8
If we're missing an average of 5.8 entries per pg, we'd expect 5.8*64 objects to have reads bounced to the primary, or about 5.8*64/1024 = 36% of reads. This is larger than the ~21% we're measuring, but it's in the ballpark.
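The mean and the resulting estimate, as a quick check (a sketch; the gaps are taken from the table above):

```python
gaps = [8, 8, 7, 5, 3, 4]        # last_update - mlcod per pg, from the table
mean_gap = sum(gaps) / len(gaps)

pgs, objects = 64, 1024
estimate = mean_gap * pgs / objects  # expected fraction of bounced reads

print(f"mean gap {mean_gap:.1f}, estimated bounce fraction {estimate:.0%}")
# mean gap 5.8, estimated bounce fraction 36%
```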
Updated by Samuel Just about 1 month ago
./bin/rbd bench --io-type write --io-pattern rand --image rbd/image --io-threads 16 --io-total 250M
This should never have more than 16 IOs in flight at a time. With io-pattern=rand, a single object or pg shouldn't have more than one outstanding write very often. The 5.8 mean gap above is therefore really strange.
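A back-of-the-envelope check of that claim (a sketch assuming 16 independent in-flight writes spread uniformly over 64 pgs, which the random pattern approximates):

```python
n_inflight, n_pgs = 16, 64
p = 1 / n_pgs   # chance a given in-flight write lands in a given pg

# Binomial tail: probability a given pg holds >= 2 of the 16 in-flight writes
p_ge2 = 1 - (1 - p) ** n_inflight \
          - n_inflight * p * (1 - p) ** (n_inflight - 1)
print(f"P(given pg holds >= 2 in-flight writes) = {p_ge2:.1%}")  # about 2.5%
```

So at any instant a pg should rarely have more than one outstanding write, which is why a sustained gap of ~6 entries per pg doesn't fit this model.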
Note: based on some later observations, I'm really not sure that rbd bench actually respects the io-threads param properly. For one thing, there isn't a large difference in throughput between 1 and 16. For another, mlcod with --io-threads=1 lags by much more than with rados bench -t 1. Will check into this more.
Updated by Samuel Just about 1 month ago
In addition to the observation that mlcod is lagging by more than it should, there's a second detail:
./bin/ceph osd down 0
65013-replica-read/build [sjust/wip-65013-replica-read] » for i in 0 1 2 3 4 5; grep "pg\[1.$i(" out/osd.0.log | tail -n 1
2024-03-20T23:13:32.536+0000 7f5f86328640 20 osd.0 pg_epoch: 24 pg[1.0( v 20'841 (0'0,20'841] local-lis/les=23/24 n=25 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'841 lcod 20'840 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.123291016s
2024-03-20T23:13:32.500+0000 7f5f85b27640 20 osd.0 pg_epoch: 24 pg[1.1( v 20'933 (0'0,20'933] local-lis/les=23/24 n=23 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'933 lcod 20'932 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.087158203s
2024-03-20T23:13:32.536+0000 7f5f85326640 20 osd.0 pg_epoch: 24 pg[1.2( v 20'1304 (0'0,20'1304] local-lis/les=23/24 n=32 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'1304 lcod 20'1303 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2756.123046875s
2024-03-20T23:13:40.536+0000 7f5f84b25640 20 osd.0 pg_epoch: 24 pg[1.3( v 20'1618 (0'0,20'1618] local-lis/les=23/24 n=34 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'1618 lcod 20'1617 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.123291016s
2024-03-20T23:13:40.536+0000 7f5f84324640 20 osd.0 pg_epoch: 24 pg[1.4( v 20'1152 (0'0,20'1152] local-lis/les=23/24 n=31 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,1,0] r=2 lpr=23 luod=0'0 crt=20'1152 lcod 20'1151 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.123779297s
2024-03-20T23:13:40.538+0000 7f5f86328640 20 osd.0 pg_epoch: 24 pg[1.5( v 20'1015 (0'0,20'1015] local-lis/les=23/24 n=31 ec=16/16 lis/c=23/23 les/c/f=24/24/0 sis=23) [2,0,1] r=1 lpr=23 luod=0'0 crt=20'1015 lcod 20'1014 mlcod 0'0 active mbc={}] PeeringState::proc_lease readable_until_ub now 2764.125488281s
It seems we don't initialize mlcod until the very first write. This should be a very easy fix, since the activation-era last_update can't be rolled back; the complication is that the value in question, min_last_complete_ondisk, is an unnecessarily strong bound and requires information about the replicas.
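A minimal model of that fix (pure Python; the class and method names are illustrative, not PeeringState members). The point is just that the activation-era last_update is a safe initial bound, since those entries can no longer be rolled back:

```python
class ReplicaReadBound:
    """Toy model of the replica's read bound (mlcod); not actual Ceph code."""

    def __init__(self):
        self.mlcod = 0               # today: stays at 0'0 after activation

    def on_activate(self, activation_last_update):
        # Proposed: entries up to the activation-era last_update can't be
        # rolled back once the interval is established, so they are a safe
        # (if conservative) initial lower bound for replica reads.
        self.mlcod = activation_last_update

    def on_repop(self, primary_mlcod):
        self.mlcod = max(self.mlcod, primary_mlcod)

    def can_serve_read(self, object_version):
        return object_version <= self.mlcod

r = ReplicaReadBound()
r.on_activate(841)            # e.g. pg 1.0 above, activated at 20'841
print(r.can_serve_read(841))  # True: readable before any post-activation write
```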
Updated by Samuel Just about 1 month ago
Some definitions:
pg_info_t::last_update: most recent update seen by an OSD (primary or replica), should be the same as pg_log_t::head
pg_info_t::last_complete: version v such that all objects with versions <= v are fully recovered and readable (primary or replica)
PeeringState::last_update_ondisk: most recent update completed (most recent version that cannot be divergent) (primary only)
PeeringState::last_complete_ondisk: most recent version of pg_info_t::last_complete committed to disk (primary only)
PeeringState::min_last_complete_ondisk: min last_complete_ondisk across all replicas, mainly used for log trimming
This last one is the bound we're using to permit replica reads. On each write, we send the current min_last_complete_ondisk value from the primary to the replicas.
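That piggyback scheme can be modeled in a few lines (a deliberately simplified sketch: one replica, instant commits, no extra flush message; pure Python, not Ceph code). It shows the best-case behavior described above: the newest write is never replica-readable until the next write arrives.

```python
class Primary:
    """Toy model of mlcod piggybacking on repops; not actual Ceph code."""

    def __init__(self):
        self.mlcod = 0           # min last_complete_ondisk known to primary
        self.replica_mlcod = 0   # bound currently held by the replica

    def write(self, version):
        # The repop for this write carries the mlcod computed *before* this
        # write's commit is known, so the replica's bound is always behind.
        self.replica_mlcod = self.mlcod
        self.mlcod = version     # replica acks; primary advances its mlcod

p = Primary()
for v in range(1, 6):            # five writes, versions 1..5
    p.write(v)

# The replica can serve reads only for versions <= its bound:
print(p.replica_mlcod)           # 4: version 5 must bounce to the primary
```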
Updated by Yehuda Sadeh about 1 month ago
Reproduced the issue again on a new dev environment.
Created env:
$ ./src/mstart.sh c2 -n --debug --osds-per-host 1
start radosgw (off a branch that adds the read from replica capability):
$ radosgw -c /home/yehuda/ceph/build/run/c2/ceph.conf --rgw-frontends="beast port=8001" -n client.rgw.8001 --pid-file=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.pid --admin-socket=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.asok --debug-ms=1 --debug-rgw=20 --rgw-gc-processor-period=120 --rgw-gc-processor-max-time=120 --rgw-gc-obj-min-wait=60 --rgw-enable-usage-log=true --rgw-gc-max-trim-chunk=16 --rgw-gc-max-concurrent-io=8 --rgw-sync-lease-period=30 --rgw-lc-debug-interval=10 --rgw-data-sync-poll-interval=5 --crush-location=host=crestview-1 --rados-replica-read-policy=localize --log-file=/home/yehuda/ceph/build/run/c2/out/radosgw.8001.log
uploaded a few objects (should be 10 rados objects for each):
$ for i in `seq 0 10`; do ./s3 put buck/blabla$i --in-file=40M; echo $i; done
reading these objects:
$ for i in `seq 0 10`; do ./s3 get buck/blabla$i > /dev/null & done
Now, checking the radosgw log I see a lot of the following:
2024-03-21T10:05:35.498-0400 7ff76f4546c0 1 -- 192.168.68.60:0/1299758790 <== osd.1 v2:192.168.68.60:6810/803208549 20086 ==== osd_op_reply(37592 1835f00c-d8f7-4e9f-9bda-beda8d11fdf7.4375.1__shadow_.VE9tc2ZswMKVxcQTrrpHuxOuC0hxxLc_5 [read 0~4194304] v0'0 uv0 ondisk = -11 ((11) Resource temporarily unavailable)) v8 ==== 230+0+0 (crc 0 0 0) 0x55654929ef00 con 0x55654e13a400
2024-03-21T10:05:35.498-0400 7ff76f4546c0 1 -- 192.168.68.60:0/1299758790 --> [v2:192.168.68.60:6802/1191469188,v1:192.168.68.60:6803/1191469188] -- osd_op(unknown.0.0:37610 10.8 10:10160bb0:::1835f00c-d8f7-4e9f-9bda-beda8d11fdf7.4375.1__shadow_.VE9tc2ZswMKVxcQTrrpHuxOuC0hxxLc_5:head [read 0~4194304] snapc 0=[] RETRY=1 ondisk+retry+read+known_if_redirected+supports_pool_eio e141) v9 -- 0x556556e9f400 con 0x55654e2c3680
It seems like around half of the total ops reading these objects are getting bounced:
$ grep RETR radosgw.8001.log | grep -v call | wc -l
204
$ for i in `seq 0 10`; do ./s3 get buck/blabla$i > /dev/null & done
$ grep RETR radosgw.8001.log | grep -v call | wc -l
256
from osd logs, requests that can/can't be satisfied from replica:
$ grep can.*safe osd.1.log | wc -l
411
$ grep can.t.read.obj osd.1.log | wc -l
30670
Actual successful read-from-replica ops on the rgw data pool were limited to a small number of pgs:
$ grep can.*safe osd.1.log | sed 's/.*pg\[//g' | awk '{print $1}' | grep ^10 | sort | uniq
10.10( 10.12( 10.15( 10.1e( 10.20( 10.39( 10.3f( 10.8( 10.f(
Updated by Yehuda Sadeh about 1 month ago
Also note, pgs from the data pool that couldn't serve from replica:
$ grep can.t.read.object osd.1.log | grep pg.10 | sed 's/.*pg\[//g' | awk '{print $1}' | sort | uniq
10.10( 10.12( 10.13( 10.15( 10.18( 10.19( 10.1b( 10.1e( 10.2( 10.20( 10.21( 10.24( 10.25( 10.29( 10.2a( 10.2b( 10.2d( 10.2e( 10.2f( 10.3( 10.30( 10.39( 10.3a( 10.3b( 10.3d( 10.3e( 10.3f( 10.4( 10.5( 10.8( 10.f(
from osd logs min_last_complete_ondisk is updated on the following pgs (from the data pool) -- for which this osd is the primary:
$ grep last_complete_ondisk.is.updated osd.1.log | sed 's/.*pg\[//g' | grep ^10 | awk '{print $1}' | sort | uniq
10.0( 10.16( 10.17( 10.1a( 10.1d( 10.1f( 10.22( 10.23( 10.26( 10.27( 10.28( 10.32( 10.33( 10.34( 10.35( 10.37( 10.7( 10.9( 10.b( 10.c( 10.d( 10.e(
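Cross-checking the three lists (a quick sketch; pgids copied from the grep output above with the trailing parens stripped) shows a consistent picture: every pg that successfully served a replica read also bounced some reads, and the pgs updating min_last_complete_ondisk, i.e. the ones where this osd is primary, overlap with neither.

```python
served = {"10.10", "10.12", "10.15", "10.1e", "10.20",
          "10.39", "10.3f", "10.8", "10.f"}
bounced = {"10.10", "10.12", "10.13", "10.15", "10.18", "10.19", "10.1b",
           "10.1e", "10.2", "10.20", "10.21", "10.24", "10.25", "10.29",
           "10.2a", "10.2b", "10.2d", "10.2e", "10.2f", "10.3", "10.30",
           "10.39", "10.3a", "10.3b", "10.3d", "10.3e", "10.3f", "10.4",
           "10.5", "10.8", "10.f"}
primary = {"10.0", "10.16", "10.17", "10.1a", "10.1d", "10.1f", "10.22",
           "10.23", "10.26", "10.27", "10.28", "10.32", "10.33", "10.34",
           "10.35", "10.37", "10.7", "10.9", "10.b", "10.c", "10.d", "10.e"}

print(served <= bounced)             # True: every serving pg also bounced reads
print(primary & (served | bounced))  # set(): primary pgs never replica-read
```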
Updated by Samuel Just about 1 month ago
'last_complete_ondisk is updated to' only happens on the primary, in PeeringState::calc_min_last_complete_ondisk(); the mechanism on the replica is different: it simply gets the value on repop.
Updated by Samuel Just about 1 month ago
- Related to Bug #65085: rados: replica mlcod tends to lag by two cycles rather than one limiting replica reads added
Updated by Samuel Just about 1 month ago
- Related to Bug #65086: rados: replicas do not initialize their mlcod value upon activation, replica reads are unavailable until the pg has seen at least one write added
Updated by Samuel Just about 1 month ago
- Related to Bug #65090: rados: the object most recently written on a pg won't be available for replica read until another write occurs added
Updated by Samuel Just about 1 month ago
- Subject changed from reading for replica (mostly) doesn't work to replica_read not available on most recently updated objects in each PG
- Description updated (diff)