Bug #36406
Cache-tier forward mode hang in luminous (again)
% Done: 0%
Backport: mimic,luminous
Regression: Yes
Severity: 2 - major
Description
Similar to https://tracker.ceph.com/issues/23296
Commands run to reproduce (in a vstart.sh cluster):
./bin/ceph osd pool create pool 16 16
./bin/rados -p pool stat doesnotexist # (2) No such file or directory
./bin/ceph osd pool create newpool 8 8
./bin/ceph osd tier add newpool pool --force-nonempty
./bin/ceph osd tier cache-mode pool forward --yes-i-really-mean-it
./bin/ceph osd tier set-overlay newpool pool
./bin/rados -p pool stat doesnotexist # hangs...
Clients will hang either forever or until a timeout occurs.
This worked fine in Kraken and Jewel, when I last used this feature as a kind of live migration.
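For reference, a minimal librados sketch of the client call that hangs, assuming the vstart.sh cluster and the "pool"/"newpool" layout created by the commands above (the ./ceph.conf path is an assumption from the vstart environment, not something stated in the report):

/*
 * Minimal librados sketch of the hanging call, assuming the vstart.sh
 * cluster and the "pool" base pool from the reproduction above.
 * Expected result: -2 (ENOENT).  Observed with the overlay in forward
 * mode: the call never returns.
 */
#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    uint64_t size;
    time_t mtime;
    int ret;

    rados_create(&cluster, NULL);                 /* connect as client.admin */
    rados_conf_read_file(cluster, "./ceph.conf"); /* vstart.sh config (assumed path) */
    rados_connect(cluster);
    rados_ioctx_create(cluster, "pool", &io);     /* base pool from the repro */

    /* Equivalent of `rados -p pool stat doesnotexist`; hangs here. */
    ret = rados_stat(io, "doesnotexist", &size, &mtime);
    printf("rados_stat returned %d\n", ret);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}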
Related issues
Copied to Backport #36657: luminous: Cache-tier forward mode hang in luminous (again)
Copied to Backport #36658: mimic: Cache-tier forward mode hang in luminous (again)
History
#1 Updated by Iain Bucław about 5 years ago
In the logs, it looks like the client/server enters an infinite loop.
2018-10-11 18:24:01.417299 7f2061967700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(1 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.417464 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 2 ==== osd_op(client.304115.0:2 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (3568045529 0 0) 0x557444431380 con 0x55744407e800
2018-10-11 18:24:01.417527 7f206596f700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(2 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.417658 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 3 ==== osd_op(client.304115.0:3 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (2269016457 0 0) 0x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.417706 7f2061967700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(3 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.417828 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 4 ==== osd_op(client.304115.0:4 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (3058673081 0 0) 0x55744424f040 con 0x55744407e800
2018-10-11 18:24:01.417876 7f206596f700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(4 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.417984 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 5 ==== osd_op(client.304115.0:5 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (538609961 0 0) 0x557444431380 con 0x55744407e800
2018-10-11 18:24:01.418028 7f2061967700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(5 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.418146 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 6 ==== osd_op(client.304115.0:6 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (292248857 0 0) 0x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.418243 7f206596f700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(6 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.418369 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 7 ==== osd_op(client.304115.0:7 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1123623241 0 0) 0x55744424f040 con 0x55744407e800
2018-10-11 18:24:01.418448 7f2061967700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(7 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.418574 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 8 ==== osd_op(client.304115.0:8 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1938355577 0 0) 0x557444431380 con 0x55744407e800
2018-10-11 18:24:01.418625 7f206596f700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(8 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.418735 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 9 ==== osd_op(client.304115.0:9 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1807683224 0 0) 0x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.418792 7f2061967700 1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(9 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.418913 7f20831d2700 1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 10 ==== osd_op(client.304115.0:10 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1523505832 0 0) 0x55744424f040 con 0x55744407e800
...
#2 Updated by Iain Bucław about 5 years ago
Iain Bucław wrote:
In the logs, it looks like the client/server enters an infinite loop.
[...]
These are the client logs:
2018-10-11 18:43:27.045176 7fe1ab694fc0 1 Processor -- start
2018-10-11 18:43:27.045234 7fe1ab694fc0 1 -- - start start
2018-10-11 18:43:27.045806 7fe1ab694fc0 1 -- - --> 172.17.0.1:40000/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x559d186619c0 con 0
2018-10-11 18:43:27.045820 7fe1ab694fc0 1 -- - --> 172.17.0.1:40002/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x559d18661dd0 con 0
2018-10-11 18:43:27.046030 7fe197fff700 1 -- 10.240.105.71:0/527386479 learned_addr learned my addr 10.240.105.71:0/527386479
2018-10-11 18:43:27.046815 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 1 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe17c002500 con 0x559d186711b0
2018-10-11 18:43:27.047228 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 1 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe18c002570 con 0x559d18674810
2018-10-11 18:43:27.047360 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3383190558 0 0) 0x7fe17c002850 con 0x559d186711b0
2018-10-11 18:43:27.047643 7fe1967fc700 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40002/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe180001e10 con 0
2018-10-11 18:43:27.047666 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (321158953 0 0) 0x7fe18c0028c0 con 0x559d18674810
2018-10-11 18:43:27.047713 7fe1967fc700 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe180003580 con 0
2018-10-11 18:43:27.049285 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1644107686 0 0) 0x7fe18c002db0 con 0x559d18674810
2018-10-11 18:43:27.049410 7fe1967fc700 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fe180002400 con 0
2018-10-11 18:43:27.049434 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1880296171 0 0) 0x7fe17c000e50 con 0x559d186711b0
2018-10-11 18:43:27.049502 7fe1967fc700 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40002/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fe180009170 con 0
2018-10-11 18:43:27.049880 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 580+0+0 (3609376686 0 0) 0x7fe18c001280 con 0x559d18674810
2018-10-11 18:43:27.049950 7fe1967fc700 1 -- 10.240.105.71:0/527386479 >> 172.17.0.1:40002/0 conn(0x559d186711b0 :-1 s=STATE_OPEN pgs=14 cs=1 l=1).mark_down
2018-10-11 18:43:27.050012 7fe1967fc700 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({monmap=0+}) v2 -- 0x559d186622e0 con 0
2018-10-11 18:43:27.050073 7fe1ab694fc0 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({mgrmap=0+}) v2 -- 0x559d186790e0 con 0
2018-10-11 18:43:27.050175 7fe1ab694fc0 1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({osdmap=0}) v2 -- 0x559d18679840 con 0
2018-10-11 18:43:27.050399 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 5 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe18c0016c0 con 0x559d18674810
2018-10-11 18:43:27.050441 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 6 ==== mgrmap(e 61) v1 ==== 317+0+0 (1691338673 0 0) 0x7fe18c001aa0 con 0x559d18674810
2018-10-11 18:43:27.050632 7fe1967fc700 1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 7 ==== osd_map(160..160 src has 1..160) v3 ==== 4080+0+0 (1017012160 0 0) 0x7fe180002400 con 0x559d18674810
2018-10-11 18:43:27.050986 7fe1ab694fc0 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:1 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] ondisk+read+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.052922 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 1 ==== osd_op_reply(1 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (44106256 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.052984 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:2 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053265 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 2 ==== osd_op_reply(2 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (2086169348 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053299 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:3 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=2 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053554 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 3 ==== osd_op_reply(3 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (996246354 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053586 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:4 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=3 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053859 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 4 ==== osd_op_reply(4 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (108452704 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053887 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:5 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=4 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054099 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 5 ==== osd_op_reply(5 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (3037877246 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054126 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:6 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=5 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054336 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 6 ==== osd_op_reply(6 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (2282032076 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054363 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:7 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=6 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054623 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 7 ==== osd_op_reply(7 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (3476874138 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054652 7fe19ccc3700 1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:8 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=7 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
I killed the process when the RETRY counter reached over 4000.
#3 Updated by Iain Bucław about 5 years ago
Iain Bucław wrote:
Similar to https://tracker.ceph.com/issues/23296
Looking at the fix for the other issue:
https://github.com/ceph/ceph/pull/20766
Applying the CEPH_OSD_FLAG_IGNORE_CACHE flag also fixes the infinite loop for this issue.
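As an illustration of that flag from the client side (a hedged sketch, not the fix adopted upstream): librados exposes LIBRADOS_OPERATION_IGNORE_CACHE, which the Objecter translates into CEPH_OSD_FLAG_IGNORE_CACHE on the op, so a stat issued as a read op with that flag should not be bounced between the tiers. The ioctx is assumed to be the "pool" ioctx from the reproduction above.

/*
 * Sketch of a client-side workaround: issue the stat as a read op with
 * LIBRADOS_OPERATION_IGNORE_CACHE so the OSD skips the cache-tier
 * redirect and returns -ENOENT directly for a missing object.
 */
#include <rados/librados.h>

static int stat_ignore_cache(rados_ioctx_t io, const char *oid)
{
    uint64_t size = 0;
    time_t mtime = 0;
    int prval = 0;

    rados_read_op_t op = rados_create_read_op();
    rados_read_op_stat(op, &size, &mtime, &prval);      /* queue the stat */
    int ret = rados_read_op_operate(op, io, oid,
                                    LIBRADOS_OPERATION_IGNORE_CACHE);
    rados_release_read_op(op);

    return ret ? ret : prval;                           /* -ENOENT if the object is missing */
}

This only demonstrates that skipping the cache-tier logic breaks the redirect loop; it does not address why forward mode loops in the first place.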
#4 Updated by John Spray about 5 years ago
- Project changed from mgr to RADOS
#5 Updated by Iain Bucław about 5 years ago
#6 Updated by John Spray about 5 years ago
- Status changed from New to Fix Under Review
#7 Updated by Sage Weil about 5 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to mimic,luminous
#8 Updated by Nathan Cutler about 5 years ago
- Copied to Backport #36657: luminous: Cache-tier forward mode hang in luminous (again) added
#9 Updated by Nathan Cutler about 5 years ago
- Copied to Backport #36658: mimic: Cache-tier forward mode hang in luminous (again) added
#10 Updated by Nathan Cutler almost 5 years ago
- Status changed from Pending Backport to Resolved