Project

General

Profile

Actions

Bug #36406

closed

Cache-tier forward mode hang in luminous (again)

Added by Iain Bucław over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Similar to https://tracker.ceph.com/issues/23296

Commands ran to reproduce (in vstart.sh)

./bin/ceph osd pool create pool 16 16
./bin/rados -p pool stat doesnotexist   # (2) No such file or directory

./bin/ceph osd pool create newpool 8 8
./bin/ceph osd tier add newpool pool --force-nonempty
./bin/ceph osd tier cache-mode pool forward --yes-i-really-mean-it
./bin/ceph osd tier set-overlay newpool pool
./bin/rados -p pool stat doesnotexist   # hangs...

Clients will hang either forever or until timeout occurs.

This worked fine in Kraken or Jewel when I last used this feature as kind of live migration.


Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #36657: luminous: Cache-tier forward mode hang in luminous (again)ResolvedJonathan BrielmaierActions
Copied to RADOS - Backport #36658: mimic: Cache-tier forward mode hang in luminous (again)ResolvedJonathan BrielmaierActions
Actions #1

Updated by Iain Bucław over 5 years ago

In the logs, it looks like the client/server enters an infinite loop.

2018-10-11 18:24:01.417299 7f2061967700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(1 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.417464 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 2 ==== osd_op(client.304115.0:2 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (3568045529 0 0) 
0x557444431380 con 0x55744407e800
2018-10-11 18:24:01.417527 7f206596f700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(2 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.417658 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 3 ==== osd_op(client.304115.0:3 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (2269016457 0 0) 
0x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.417706 7f2061967700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(3 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.417828 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 4 ==== osd_op(client.304115.0:4 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (3058673081 0 0) 
0x55744424f040 con 0x55744407e800
2018-10-11 18:24:01.417876 7f206596f700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(4 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.417984 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 5 ==== osd_op(client.304115.0:5 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (538609961 0 0) 0
x557444431380 con 0x55744407e800
2018-10-11 18:24:01.418028 7f2061967700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(5 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.418146 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 6 ==== osd_op(client.304115.0:6 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (292248857 0 0) 0
x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.418243 7f206596f700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(6 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.418369 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 7 ==== osd_op(client.304115.0:7 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1123623241 0 0) 
0x55744424f040 con 0x55744407e800
2018-10-11 18:24:01.418448 7f2061967700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(7 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557444431380 con 0
2018-10-11 18:24:01.418574 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 8 ==== osd_op(client.304115.0:8 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1938355577 0 0) 
0x557444431380 con 0x55744407e800
2018-10-11 18:24:01.418625 7f206596f700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(8 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x557443c2f400 con 0
2018-10-11 18:24:01.418735 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 9 ==== osd_op(client.304115.0:9 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1807683224 0 0) 
0x557443c2f400 con 0x55744407e800
2018-10-11 18:24:01.418792 7f2061967700  1 -- 10.240.105.71:6805/28328 --> 10.240.105.71:0/3747596705 -- osd_op_reply(9 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55744424f040 con 0
2018-10-11 18:24:01.418913 7f20831d2700  1 -- 10.240.105.71:6805/28328 <== client.304115 10.240.105.71:0/3747596705 10 ==== osd_op(client.304115.0:10 8.8 8.4dab5ea8 (undecoded) ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e149) v8 ==== 221+0+0 (1523505832 0 0
) 0x55744424f040 con 0x55744407e800
...
Actions #2

Updated by Iain Bucław over 5 years ago

Iain Bucław wrote:

In the logs, it looks like the client/server enters an infinite loop.

[...]

These are the client logs

2018-10-11 18:43:27.045176 7fe1ab694fc0  1  Processor -- start
2018-10-11 18:43:27.045234 7fe1ab694fc0  1 -- - start start
2018-10-11 18:43:27.045806 7fe1ab694fc0  1 -- - --> 172.17.0.1:40000/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x559d186619c0 con 0
2018-10-11 18:43:27.045820 7fe1ab694fc0  1 -- - --> 172.17.0.1:40002/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x559d18661dd0 con 0
2018-10-11 18:43:27.046030 7fe197fff700  1 -- 10.240.105.71:0/527386479 learned_addr learned my addr 10.240.105.71:0/527386479
2018-10-11 18:43:27.046815 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 1 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe17c002500 con 0x559d186711b0
2018-10-11 18:43:27.047228 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 1 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe18c002570 con 0x559d18674810
2018-10-11 18:43:27.047360 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3383190558 0 0) 0x7fe17c002850 con 0x559d186711b0
2018-10-11 18:43:27.047643 7fe1967fc700  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40002/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe180001e10 con 0
2018-10-11 18:43:27.047666 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (321158953 0 0) 0x7fe18c0028c0 con 0x559d18674810
2018-10-11 18:43:27.047713 7fe1967fc700  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe180003580 con 0
2018-10-11 18:43:27.049285 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1644107686 0 0) 0x7fe18c002db0 con 0x559d18674810
2018-10-11 18:43:27.049410 7fe1967fc700  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fe180002400 con 0
2018-10-11 18:43:27.049434 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.2 172.17.0.1:40002/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1880296171 0 0) 0x7fe17c000e50 con 0x559d186711b0
2018-10-11 18:43:27.049502 7fe1967fc700  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40002/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fe180009170 con 0
2018-10-11 18:43:27.049880 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 580+0+0 (3609376686 0 0) 0x7fe18c001280 con 0x559d18674810
2018-10-11 18:43:27.049950 7fe1967fc700  1 -- 10.240.105.71:0/527386479 >> 172.17.0.1:40002/0 conn(0x559d186711b0 :-1 s=STATE_OPEN pgs=14 cs=1 l=1).mark_down
2018-10-11 18:43:27.050012 7fe1967fc700  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({monmap=0+}) v2 -- 0x559d186622e0 con 0
2018-10-11 18:43:27.050073 7fe1ab694fc0  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({mgrmap=0+}) v2 -- 0x559d186790e0 con 0
2018-10-11 18:43:27.050175 7fe1ab694fc0  1 -- 10.240.105.71:0/527386479 --> 172.17.0.1:40000/0 -- mon_subscribe({osdmap=0}) v2 -- 0x559d18679840 con 0
2018-10-11 18:43:27.050399 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 5 ==== mon_map magic: 0 v1 ==== 361+0+0 (3515841928 0 0) 0x7fe18c0016c0 con 0x559d18674810
2018-10-11 18:43:27.050441 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 6 ==== mgrmap(e 61) v1 ==== 317+0+0 (1691338673 0 0) 0x7fe18c001aa0 con 0x559d18674810
2018-10-11 18:43:27.050632 7fe1967fc700  1 -- 10.240.105.71:0/527386479 <== mon.0 172.17.0.1:40000/0 7 ==== osd_map(160..160 src has 1..160) v3 ==== 4080+0+0 (1017012160 0 0) 0x7fe180002400 con 0x559d18674810
2018-10-11 18:43:27.050986 7fe1ab694fc0  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:1 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] ondisk+read+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.052922 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 1 ==== osd_op_reply(1 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (44106256 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.052984 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:2 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053265 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 2 ==== osd_op_reply(2 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (2086169348 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053299 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:3 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=2 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053554 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 3 ==== osd_op_reply(3 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (996246354 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053586 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:4 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=3 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.053859 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 4 ==== osd_op_reply(4 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (108452704 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.053887 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:5 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=4 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054099 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 5 ==== osd_op_reply(5 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (3037877246 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054126 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:6 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=5 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054336 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 6 ==== osd_op_reply(6 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (2282032076 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054363 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:7 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=6 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0
2018-10-11 18:43:27.054623 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 <== osd.1 10.240.105.71:6805/4490 7 ==== osd_op_reply(7 doesnotexist [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory) redirect: { object , locator{@9} }) v8 ==== 204+0+0 (3476874138 0 0) 0x7fe184002ed0 con 0x559d1867b5b0
2018-10-11 18:43:27.054652 7fe19ccc3700  1 -- 10.240.105.71:0/527386479 --> 10.240.105.71:6805/4490 -- osd_op(unknown.0.0:8 8.8 8:157ad5b2:::doesnotexist:head [stat] snapc 0=[] RETRY=7 ondisk+retry+read+ignore_overlay+redirected+known_if_redirected e160) v8 -- 0x559d1867fcb0 con 0

I killed the process when the RETY counter reached over 4000.

Actions #3

Updated by Iain Bucław over 5 years ago

Iain Bucław wrote:

Similar to https://tracker.ceph.com/issues/23296

Looking at the fix for the other issue.

https://github.com/ceph/ceph/pull/20766

Also applying the CEPH_OSD_FLAG_IGNORE_CACHE flag fixes infinite loop for this issue.

Actions #4

Updated by John Spray over 5 years ago

  • Project changed from mgr to RADOS
Actions #6

Updated by John Spray over 5 years ago

  • Status changed from New to Fix Under Review
Actions #7

Updated by Sage Weil over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to mimic,luminous
Actions #8

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #36657: luminous: Cache-tier forward mode hang in luminous (again) added
Actions #9

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #36658: mimic: Cache-tier forward mode hang in luminous (again) added
Actions #10

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF