Bug #42388

mimic: Test failure: test_full_different_file (tasks.cephfs.test_full.TestQuotaFull)

Added by Venky Shankar over 4 years ago. Updated over 4 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Started seeing this failure frequently in mimic test runs.

Things look fine initially:

2019-10-19T14:40:52.036 INFO:tasks.cephfs.test_full:Writing 32MB should fill this cluster
2019-10-19T14:40:52.107 INFO:teuthology.orchestra.run.smithi156:Running:
2019-10-19T14:40:52.107 INFO:teuthology.orchestra.run.smithi156:> cd /home/ubuntu/cephtest/mnt.0 && sudo dd if=/dev/urandom of=large_file_a bs=1M conv=fdatasync count=16 seek=0
2019-10-19T14:40:52.691 INFO:teuthology.orchestra.run.smithi156.stderr:16+0 records in
2019-10-19T14:40:52.691 INFO:teuthology.orchestra.run.smithi156.stderr:16+0 records out
2019-10-19T14:40:52.691 INFO:teuthology.orchestra.run.smithi156.stderr:16777216 bytes (17 MB) copied, 0.50137 s, 33.5 MB/s
2019-10-19T14:40:52.766 INFO:teuthology.orchestra.run.smithi156:Running:
2019-10-19T14:40:52.766 INFO:teuthology.orchestra.run.smithi156:> cd /home/ubuntu/cephtest/mnt.0 && sudo dd if=/dev/urandom of=large_file_b bs=1M conv=fdatasync count=16 seek=0
2019-10-19T14:40:53.395 INFO:teuthology.orchestra.run.smithi156.stderr:16+0 records in
2019-10-19T14:40:53.395 INFO:teuthology.orchestra.run.smithi156.stderr:16+0 records out
2019-10-19T14:40:53.395 INFO:teuthology.orchestra.run.smithi156.stderr:16777216 bytes (17 MB) copied, 0.549403 s, 30.5 MB/s
2019-10-19T14:40:53.397 INFO:tasks.cephfs.test_full:Writing file B succeeded (full status will happen soon)

At this point the cluster is full (the test waits until the pool flags include `full,full_quota`); any further I/O should then fail with ENOSPC, but that does not happen:

2019-10-19T14:41:04.707 INFO:teuthology.orchestra.run.smithi156:Running:
2019-10-19T14:41:04.707 INFO:teuthology.orchestra.run.smithi156:> cd /home/ubuntu/cephtest/mnt.0 && sudo dd if=/dev/urandom of=large_file_b bs=1M conv=fdatasync count=50 seek=16
2019-10-19T14:41:06.420 INFO:teuthology.orchestra.run.smithi156.stderr:50+0 records in
2019-10-19T14:41:06.420 INFO:teuthology.orchestra.run.smithi156.stderr:50+0 records out
2019-10-19T14:41:06.421 INFO:teuthology.orchestra.run.smithi156.stderr:52428800 bytes (52 MB) copied, 1.62985 s, 32.2 MB/s
2019-10-19T14:41:06.423 INFO:tasks.cephfs_test_runner:test_full_different_file (tasks.cephfs.test_full.TestQuotaFull) ... FAIL
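
For context, the check the test makes at this point is roughly of the following shape (a simplified sketch, not the actual tasks.cephfs.test_full code; the mount path and dd arguments just mirror the log above):

import subprocess

def write_past_quota_should_fail(mnt="/home/ubuntu/cephtest/mnt.0"):
    # dd past the 32 MiB quota; conv=fdatasync forces the data flush,
    # so the ENOSPC from the full pool is expected to surface here.
    proc = subprocess.run(
        ["sudo", "dd", "if=/dev/urandom", "of=large_file_b",
         "bs=1M", "conv=fdatasync", "count=50", "seek=16"],
        cwd=mnt, capture_output=True)
    assert proc.returncode != 0, "write on a full pool unexpectedly succeeded"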

The monitors have been notified about the pool getting full:

2019-10-19 14:41:02.760 7f0d7064f700  0 log_channel(cluster) log [WRN] : pool 'cephfs_data' is full (reached quota's max_bytes: 32 MiB)
2019-10-19 14:41:02.761 7f0d6de4a700 10 mon.b@0(leader).log v351  logging 2019-10-19 14:41:02.761881 mon.b mon.0 172.21.15.62:6789/0 752 : cluster [WRN] pool 'cephfs_data' is full (reached quota's max_bytes: 32 MiB)
2019-10-19 14:41:02.813 7f0d69e42700  7 mon.b@0(leader).log v160 update_from_paxos applying incremental log 160 07e3-0a-13 0e:29:02.0ba019 mon.b mon.0 172.21.15.62:6789/0 2f0 : cluster [WRN] pool 'cephfs_data' is full (reached quota's max_bytes: 32 MiB)

From the client/objecter logs, the `fsync` goes through cleanly:

2019-10-19 14:41:06.423 7f7e00ea6700 15 client.5860 using return-valued form of _fsync
2019-10-19 14:41:06.423 7f7e00ea6700 10 client.5860 no metadata needs to commit
2019-10-19 14:41:06.423 7f7e00ea6700 15 client.5860 waiting on data to flush
2019-10-19 14:41:06.426 7f7e09eb8700  1 -- 172.21.15.156:0/3578027966 <== osd.0 172.21.15.156:6801/27852 10 ==== osd_op_reply(27 10000000001.00000010 [write 0~2097152] v109'4 uv4 ondisk = 0) v8 ==== 164+0+0 (1878422747 0 0) 0x56007b631600 con 0x56007b423000
2019-10-19 14:41:06.426 7f7e09eb8700 10 client.5860.objecter ms_dispatch 0x56007b12a5a0 osd_op_reply(27 10000000001.00000010 [write 0~2097152] v109'4 uv4 ondisk = 0) v8
2019-10-19 14:41:06.426 7f7e09eb8700 10 client.5860.objecter in handle_osd_op_reply

Ideally, the `fsync` should have requested a new osdmap that marks ops on the pool to fail with ENOSPC (log snippet from a different test):

2019-10-19 14:25:39.768 7fa58e13f700 15 client.4906 using return-valued form of _fsync
2019-10-19 14:25:39.768 7fa58e13f700 10 client.4906 no metadata needs to commit
2019-10-19 14:25:39.768 7fa58e13f700 15 client.4906 waiting on data to flush
2019-10-19 14:25:39.774 7fa592948700  1 -- 172.21.15.156:0/4182787481 <== osd.3 172.21.15.156:6807/27857 23 ==== osd_map(1..40 src has 1..43) v4 ==== 17112+0+0 (4122988117 0 0) 0x559538114f00 con 0x559537325600
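
A quick way to confirm the cluster-side flag state at the time of the write is to read the pool entry out of the osdmap dump (a diagnostic sketch; field names are as emitted by `ceph osd dump --format=json`, worth double-checking against the release in question):

import json, subprocess

def pool_flags(pool="cephfs_data"):
    # "flags_names" should contain "full,full_quota" once the quota is
    # exceeded; the problem above is that the client's osdmap lags
    # behind this state, so the write is not failed with ENOSPC.
    dump = json.loads(subprocess.check_output(
        ["ceph", "osd", "dump", "--format=json"]))
    for p in dump["pools"]:
        if p["pool_name"] == pool:
            return p.get("flags_names", "")
    raise KeyError(pool)
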
Actions #1

Updated by Venky Shankar over 4 years ago

  • Priority changed from Normal to High

There's a similar problem with `test_full_fclose`, where `fclose()` goes through on a full pool (w/ quota).
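
The expectation for that case is roughly the following (an illustrative sketch only, not the suite code; whether the error surfaces at close() depends on the client flushing buffered data there):

import errno, os

def close_on_full_pool_should_fail(path):
    # Buffered data is flushed when the file is closed, so close()
    # (fclose() in the libc case) is where ENOSPC should show up.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, os.urandom(4 * 1024 * 1024))
    try:
        os.close(fd)
    except OSError as e:
        assert e.errno == errno.ENOSPC
    else:
        raise AssertionError("close() on a full pool unexpectedly succeeded")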

Actions #3

Updated by Venky Shankar over 4 years ago

  • Component(FS) deleted (cephfs.pyx)
Actions #5

Updated by Venky Shankar over 4 years ago

  • Priority changed from High to Normal
  • Severity changed from 2 - major to 3 - minor

I could not reproduce this with vstart_runner. One way to sneak in a write (+ fsync) was to not wait for the mons to flag the pool as "full,full_quota" (the test, however, does wait for the pool flags). This seems to have something to do with osd_map propagation.

Adjusting priority/severity for this tracker -- I'll continue to look into this.
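
For reference, the flag wait mentioned above amounts to polling the osdmap until both flags are visible on the pool (a sketch against the standard `ceph` CLI JSON output, not the suite's own helper):

import json, subprocess, time

def wait_for_full_flags(pool="cephfs_data", timeout=60):
    # Poll until both "full" and "full_quota" show up on the pool; a
    # write that sneaks in before this point is the sneak-in
    # reproduction mentioned above, but the test itself does this wait.
    deadline = time.time() + timeout
    while time.time() < deadline:
        dump = json.loads(subprocess.check_output(
            ["ceph", "osd", "dump", "--format=json"]))
        for p in dump["pools"]:
            if p["pool_name"] == pool:
                flags = set(p.get("flags_names", "").split(","))
                if {"full", "full_quota"} <= flags:
                    return
        time.sleep(1)
    raise TimeoutError("pool %r never reached full,full_quota" % pool)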

Actions #6

Updated by Patrick Donnelly over 4 years ago

Venky Shankar wrote:

I could not reproduce this with vstart_runner. One way to sneak in a write (+ fsync) was to not wait for the mons to flag the pool as "full,full_quota" (the test, however, does wait for the pool flags). This seems to have something to do with osd_map propagation.

Adjusting priority/severity for this tracker -- I'll continue to look into this.

I don't recall any change that would cause this, but I've seen failures in these tests semi-recently. Maybe dig up recent tracker tickets for those test cases to see what was related?
