Bug #45434

qa: test_full_fsync (tasks.cephfs.test_full.TestClusterFull) failed

Added by Ramana Raja about 1 year ago. Updated about 1 month ago.

Status:
Triaged
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/yuriw-2020-05-05_20:55:43-kcephfs-wip-yuri-testing-2020-05-05-1439-distro-basic-smithi/5026174/

2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:======================================================================
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuri-testing-2020-05-05-1439/qa/tasks/cephfs/test_full.py", line 356, in test_full_fsync
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:    self._remote_write_test(remote_script)
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuri-testing-2020-05-05-1439/qa/tasks/cephfs/test_full.py", line 222, in _remote_write_test
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:    is_fuse=isinstance(self.mount_a, FuseMount)
2020-05-06T18:04:48.476 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-yuri-testing-2020-05-05-1439/qa/tasks/cephfs/mount.py", line 191, in run_python
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:    p.wait()
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 162, in wait
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 184, in _raise_for_status
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2020-05-06T18:04:48.477 INFO:tasks.cephfs_test_runner:CommandFailedError: Command failed on smithi171 with status 1: 'sudo adjust-ulimits daemon-helper kill python3 -c ...'

The script passed to python3 -c (shell quoting unescaped for readability):

import time
import datetime
import subprocess
import os

# Write some buffered data through before going full, all should be well
print("writing some data through which we expect to succeed")
bytes = 0
f = os.open("/home/ubuntu/cephtest/mnt.0/full_test_file", os.O_WRONLY | os.O_CREAT)
bytes += os.write(f, b'a' * 4096)
os.fsync(f)
print("fsync'ed data successfully, will now attempt to fill fs")

# Okay, now we're going to fill up the filesystem, and then keep
# writing until we see an error from fsync.  As long as we're doing
# buffered IO, the error should always only appear from fsync and not
# from write
full = False

for n in range(0, int(373 * 1.1)):
    try:
        bytes += os.write(f, b'x' * 1024 * 1024)
        print("wrote bytes via buffered write, moving on to fsync")
    except OSError as e:
        print("Unexpected error %s from write() instead of fsync()" % e)
        raise

    try:
        os.fsync(f)
        print("fsync'ed successfully")
    except OSError as e:
        print("Reached fullness after %.2f MB" % (bytes / (1024.0 * 1024.0)))
        full = True
        break
    else:
        print("Not full yet after %.2f MB" % (bytes / (1024.0 * 1024.0)))

    if n > 373 * 0.9:
        # Be cautious in the last region where we expect to hit
        # the full condition, so that we don't overshoot too dramatically
        print("sleeping a bit as we've exceeded 90% of our expected full ratio")
        time.sleep(15.0)

if not full:
    raise RuntimeError("Failed to reach fullness after writing %d bytes" % bytes)

# close() should not raise an error because we already caught it in
# fsync.  There shouldn't have been any more writeback errors
# since then because all IOs got cancelled on the full flag.
print("calling close")
os.close(f)
print("close() did not raise error")

os.unlink("/home/ubuntu/cephtest/mnt.0/full_test_file")

And I see this earlier in the teuthology.log:

2020-05-06T18:04:34.297 INFO:teuthology.orchestra.run.smithi171.stdout:Not full yet after 338.00 MB
2020-05-06T18:04:34.297 INFO:teuthology.orchestra.run.smithi171.stdout:sleeping a bit as we've exceeded 90% of our expected full ratio
2020-05-06T18:04:34.297 INFO:teuthology.orchestra.run.smithi171.stdout:Unexpected error [Errno 28] No space left on device from write() instead of fsync()
2020-05-06T18:04:34.298 INFO:teuthology.orchestra.run.smithi171.stderr:Traceback (most recent call last):
2020-05-06T18:04:34.298 INFO:teuthology.orchestra.run.smithi171.stderr:  File "<string>", line 23, in <module>
2020-05-06T18:04:34.298 INFO:teuthology.orchestra.run.smithi171.stderr:OSError: [Errno 28] No space left on device
2020-05-06T18:04:34.527 INFO:teuthology.orchestra.run.smithi171.stderr:daemon-helper: command failed with exit status 1
2020-05-06T18:04:34.546 DEBUG:teuthology.orchestra.run:got remote process result: 1
2020-05-06T18:04:34.548 INFO:tasks.cephfs_test_runner:test_full_fsync (tasks.cephfs.test_full.TestClusterFull) ... ERROR


Related issues

Duplicated by CephFS - Bug #45145: qa/test_full: failed to open 'large_file_a': No space left on device Duplicate

History

#1 Updated by Ramana Raja about 1 year ago

  • Description updated (diff)

#4 Updated by Patrick Donnelly 12 months ago

/ceph/teuthology-archive/pdonnell-2020-06-12_09:37:27-kcephfs-wip-pdonnell-testing-20200612.063208-distro-basic-smithi/5141828/teuthology.log

Looks like only on the testing branch.

#5 Updated by Patrick Donnelly 12 months ago

  • Subject changed from octopus: test_full_fsync (tasks.cephfs.test_full.TestClusterFull) failed to qa: test_full_fsync (tasks.cephfs.test_full.TestClusterFull) failed
  • Status changed from New to Triaged
  • Priority changed from Normal to Urgent
  • Target version set to v16.0.0
  • Backport set to octopus,nautilus
  • Component(FS) qa-suite added
  • Labels (FS) qa added

#6 Updated by Patrick Donnelly 8 months ago

  • Labels (FS) qa-failure added

#8 Updated by Patrick Donnelly 6 months ago

  • Assignee deleted (Zheng Yan)

#9 Updated by Patrick Donnelly 5 months ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport changed from octopus,nautilus to pacific,octopus,nautilus

#11 Updated by Patrick Donnelly about 2 months ago

  • Duplicated by Bug #45145: qa/test_full: failed to open 'large_file_a': No space left on device added

#12 Updated by Xiubo Li about 2 months ago

Ramana Raja wrote:

see this in octopus testing, https://pulpito.ceph.com/yuriw-2021-02-09_00:31:50-kcephfs-wip-yuri2-testing-2021-02-08-1048-octopus-testing-basic-gibba/5868956/

The full ratio settings are:

        mon:
          debug mon: 20
          debug ms: 1
          debug paxos: 20
          mon op complaint time: 120
          mon osd backfillfull ratio: 0.6
          mon osd full ratio: 0.7
          mon osd nearfull ratio: 0.6

The osd.6:

2021-02-09T18:11:05.853+0000 7f04f42a8700 20 osd.6 79 check_full_status cur ratio 0.00224925, physical ratio 0.00224925, new state none
...
2021-02-09T18:11:10.557+0000 7f04f42a8700 20 osd.6 79 check_full_status cur ratio 0.512345, physical ratio 0.512345, new state none
...
2021-02-09T18:11:12.257+0000 7f04f42a8700 20 osd.6 79 check_full_status cur ratio 0.670046, physical ratio 0.670046, new state backfillfull
2021-02-09T18:11:12.257+0000 7f04f42a8700 10 osd.6 79 check_full_status none -> backfillfull
...
2021-02-09T18:11:16.358+0000 7f04f42a8700 20 osd.6 81 check_full_status cur ratio 0.885941, physical ratio 0.885941, new state full
2021-02-09T18:11:16.358+0000 7f04f42a8700 10 osd.6 81 check_full_status backfillfull -> full
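The state transitions in the osd.6 log above follow from comparing the OSD's usage ratio against the configured thresholds. A minimal sketch of that comparison (a hypothetical helper, not the actual check_full_status implementation), using the ratios from this run:

```python
def full_state(ratio, nearfull=0.6, backfillfull=0.6, full=0.7):
    """Return the fullness state implied by a usage ratio.

    Thresholds default to the values from this run's config:
    mon osd nearfull/backfillfull ratio 0.6, full ratio 0.7.
    """
    if ratio >= full:
        return "full"
    if ratio >= backfillfull:
        return "backfillfull"
    if ratio >= nearfull:
        return "nearfull"
    return "none"

# Ratios observed in the osd.6 log above:
for r in (0.00224925, 0.512345, 0.670046, 0.885941):
    print(r, "->", full_state(r))
```

This reproduces the logged sequence: none, none, backfillfull, full.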

The test logs and time:

2021-02-09T18:11:07.170 INFO:tasks.cephfs.test_full:pool capacity 391560288, 373MB should be enough to fill it
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:> sudo adjust-ulimits daemon-helper kill python3 -c '
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:> import time
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:> import datetime
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:> import subprocess
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:> import os
2021-02-09T18:11:07.170 DEBUG:teuthology.orchestra.run.gibba023:>

...

2021-02-09T18:11:11.680 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 238.00 MB
2021-02-09T18:11:11.680 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:11.680 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:11.680 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 239.00 MB
2021-02-09T18:11:11.680 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:11.681 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:11.681 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 240.00 MB
2021-02-09T18:11:11.681 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:11.681 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:11.681 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 241.00 MB
2021-02-09T18:11:13.208 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:13.209 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:13.209 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 242.00 MB
2021-02-09T18:11:13.209 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:13.209 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:13.209 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 243.00 MB

...

2021-02-09T18:11:13.239 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:13.239 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:16.939 INFO:tasks.ceph.mon.a.gibba004.stderr:2021-02-09T18:11:16.935+0000 7f62cd2a3700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2021-02-09T18:11:43.546 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 322.00 MB

...

2021-02-09T18:11:43.550 INFO:teuthology.orchestra.run.gibba023.stdout:sleeping a bit as we've exceeded 90% of our expected full ratio
2021-02-09T18:11:43.550 INFO:teuthology.orchestra.run.gibba023.stdout:wrote bytes via buffered write, moving on to fsync
2021-02-09T18:11:43.550 INFO:teuthology.orchestra.run.gibba023.stdout:fsync'ed successfully
2021-02-09T18:11:43.550 INFO:teuthology.orchestra.run.gibba023.stdout:Not full yet after 338.00 MB
2021-02-09T18:11:43.551 INFO:teuthology.orchestra.run.gibba023.stdout:sleeping a bit as we've exceeded 90% of our expected full ratio
2021-02-09T18:11:43.551 INFO:teuthology.orchestra.run.gibba023.stdout:Unexpected error [Errno 28] No space left on device from write() instead of fsync()
2021-02-09T18:11:43.551 INFO:teuthology.orchestra.run.gibba023.stderr:Traceback (most recent call last):
2021-02-09T18:11:43.551 INFO:teuthology.orchestra.run.gibba023.stderr:  File "<string>", line 23, in <module>
2021-02-09T18:11:43.551 INFO:teuthology.orchestra.run.gibba023.stderr:OSError: [Errno 28] No space left on device
...

Still not sure why only osd.6 is filling up so quickly; all the other OSDs work well.

#14 Updated by Xiubo Li about 2 months ago

osdstat of osd.6:

            {
                "osd": 6,
                "up_from": 13,
                "seq": 55834574933,
                "num_pgs": 9,
                "num_osds": 1,
                "num_per_pool_osds": 0,
                "num_per_pool_omap_osds": 0,
                "kb": 195312,
                "kb_used": 173045,
                "kb_used_data": 0,
                "kb_used_omap": 0,
                "kb_used_meta": 0,
                "kb_avail": 22266,
                "statfs": {
                    "total": 200000000,
                    "available": 22800986,
                    "internally_reserved": 0,
                    "allocated": 0,
                    "data_stored": 0,
                    "data_compressed": 0,
                    "data_compressed_allocated": 0,
                    "data_compressed_original": 0,
                    "omap_allocated": 0,
                    "internal_metadata": 0
                },

We can see that the disk usage is "kb_used" / "kb" = 173045 / 195312 = 0.886, which is above the 0.7 full ratio, so the state will be set to full. The 173045 KB used is about 169 MB.
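The ratio can be checked directly from the osdstat fields above (a quick sanity-check script, not part of the test suite):

```python
# Fields taken from the osdstat dump for osd.6 above.
kb = 195312
kb_used = 173045

ratio = kb_used / kb        # usage ratio, above the 0.7 full ratio
used_mb = kb_used / 1024.0  # used space in MB

print("ratio = %.4f, used = %.1f MB" % (ratio, used_mb))
```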

And the pool "cephfs_data" pg dump:

pgid     objects      osds      disk used

12.0     13           [4,6]     54525952B/52MB
12.1     8            [6,3]     33554432B/32MB
12.2     10           [4,3]     41943040B/40MB
12.3     8            [0,7]     30412800B/29MB
12.4     11           [3,6]     46137344B/44MB
12.5     10           [2,6]     41943040B/40MB
12.6     11           [1,0]     46137344B/44MB
12.7     14           [1,7]     58720256B/56MB

From the above we can see that osd.6 had used 168 MB for the cephfs_data pool. The ratio is 168 / 191 = 0.8796, so the state should be full.
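The 168 MB figure can be reproduced by summing the per-PG sizes for the PGs whose acting set includes osd.6 (a quick check, with the numbers copied from the pg dump above):

```python
# (pgid, acting osds, bytes) copied from the cephfs_data pg dump above.
pgs = [
    ("12.0", [4, 6], 54525952),
    ("12.1", [6, 3], 33554432),
    ("12.2", [4, 3], 41943040),
    ("12.3", [0, 7], 30412800),
    ("12.4", [3, 6], 46137344),
    ("12.5", [2, 6], 41943040),
    ("12.6", [1, 0], 46137344),
    ("12.7", [1, 7], 58720256),
]

# Sum the bytes of every PG that maps to osd.6.
osd6_bytes = sum(size for _, osds, size in pgs if 6 in osds)
print(osd6_bytes, "bytes =", osd6_bytes // (1024 * 1024), "MB")  # 168 MB
```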

And the pool "cephfs_metadata" pg dump:

pgid     objects      osds      disk used

11.0     4            [1,3]     0
11.1     0            [6,3]     0
11.2     0            [2,3]     0
11.3     3            [6,7]     34
11.4     5            [7,6]     568438
11.5     4            [5,7]     160
11.6     2            [2,0]     0
11.7     4            [1,5]     575

From the above we can see that osd.6 had also used some space for the cephfs_metadata pool.

All in all, osd.6 had already reached the osd full ratio, while some other OSDs, like osd.2, had only used 40 MB, and no data was written to osd.5 at all :-( . The CRUSH rule did not seem to balance the disk usage well.

@Patrick, @Ramana

Is there any option we could use to improve the CRUSH rule's logic so that disk usage is balanced better?

#15 Updated by Xiubo Li about 2 months ago

  • Assignee set to Xiubo Li

#16 Updated by Xiubo Li about 2 months ago

  • Pull request ID set to 41025
