Bug #42313

"No space left on device" errors following 41a13eca480e38cfeeba7a180b4516b90598c39b

Added by Yuri Weinstein over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Crash signature (v1):
Crash signature (v2):

Description

We are seeing lots of jobs fail due to this issue.
It's blocking point release testing:

http://pulpito.ceph.com/yuriw-2019-10-11_12:58:22-rbd-wip-yuri6-testing-2019-10-10-2057-mimic-distro-basic-smithi/
http://pulpito.ceph.com/yuriw-2019-10-11_19:41:35-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi/

2019-10-13T09:31:17.097 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:17.095566 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
2019-10-13T09:31:22.117 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:22.115362 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 1 full osd(s) (OSD_FULL)
2019-10-13T09:31:27.632 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:27.630483 7f189f9b9700 -1 log_channel(cluster) log [ERR] : Health check failed: mon b is very low on available space (MON_DISK_CRIT)
2019-10-13T09:31:28.672 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:28.670676 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL)
2019-10-13T09:31:32.980 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:32.978369 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: mons a,b,c are very low on available space (MON_DISK_CRIT)
2019-10-13T09:31:34.061 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:34.053935 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 2 full osd(s) (OSD_FULL)
2019-10-13T09:31:40.153 INFO:tasks.ceph.osd.0.smithi167.stderr:2019-10-13 09:31:40.151323 7fd400753700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
2019-10-13T09:31:40.328 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:40.326090 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2019-10-13T09:31:40.494 INFO:tasks.ceph.osd.3.smithi167.stderr:2019-10-13 09:31:40.492169 7fda29d9c700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full
2019-10-13T09:31:40.900 INFO:tasks.ceph.osd.1.smithi167.stderr:2019-10-13 09:31:40.898979 7f10785bb700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:41.279 INFO:tasks.ceph.osd.4.smithi174.stderr:2019-10-13 09:31:41.277991 7f8651337700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:41.525 INFO:tasks.ceph.osd.2.smithi167.stderr:2019-10-13 09:31:41.523075 7f1df905a700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:42.493 INFO:tasks.ceph.osd.7.smithi174.stderr:2019-10-13 09:31:42.491132 7febd19f8700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:43.281 INFO:tasks.ceph.osd.5.smithi174.stderr:2019-10-13 09:31:43.279310 7f9c80f21700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
2019-10-13T09:31:44.399 INFO:tasks.ceph.osd.6.smithi174.stderr:2019-10-13 09:31:44.398004 7f4c2bf61700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full
#1 Updated by Yuri Weinstein over 4 years ago

  • Description updated (diff)
#2 Updated by David Galloway over 4 years ago

Oct 11 11:24:54 <dgalloway>     jdillaman, i'm looking at the "no space left on device" error you were messaging about yesterday
Oct 11 11:25:18 <dgalloway>     ceph-cm-ansible is configured to create 4x ~90GB logical volumes for OSDs and a 15GB logical volume for journals
Oct 11 11:25:35 <dgalloway>     i see the 15GB mounted at /var/lib/ceph and the other 4 LVs aren't in use at all
Oct 11 11:25:58 <dgalloway>     that seems wrong
Oct 11 11:27:07 <dgalloway>     i SSHed to a random smithi and it's also set up that way.  making me think something in a yaml somewhere got misconfigured

If the tests are writing data to /var/lib/ceph/osd/$id and there's no volume mounted there, of course /var/lib/ceph is going to fill up, and quickly.

I was wondering if https://github.com/ceph/teuthology/pull/1332 might be related.
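
As a sanity check, one quick way to confirm that on a node is to test whether each OSD data directory is its own mount point rather than a plain directory on the filesystem backing /var/lib/ceph. A minimal sketch (hypothetical helper, not part of teuthology or ceph-cm-ansible):

import glob
import os

# Minimal sketch (not teuthology code): report whether each OSD data
# directory is a dedicated mount point or just a plain directory whose
# writes will fill the filesystem backing /var/lib/ceph.
for path in sorted(glob.glob("/var/lib/ceph/osd/ceph-*")):
    if os.path.ismount(path):
        print(f"{path}: separate mount, OK")
    else:
        print(f"{path}: NOT a mount point, writes land on /var/lib/ceph")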

#3 Updated by Nathan Cutler over 4 years ago

Due to some recent change [1], what used to be:

> ls -l /dev/disk/by-id/wwn-*
lrwxrwxrwx 1 root root  9 Oct 11 12:22 /dev/disk/by-id/wwn-0x5000c50091e316b4 -> ../../sda
lrwxrwxrwx 1 root root 10 Oct 11 12:22 /dev/disk/by-id/wwn-0x5000c50091e316b4-part1 -> ../../sda1

now shows up in teuthology.log like this:

> ls -l '/dev/disk/by-id/wwn-*'
ls: cannot access '/dev/disk/by-id/wwn-*': No such file or directory

Note the single quotes around the device path, which includes a glob character. I tried the following experiment on my laptop:

$ ls -l '*'
ls: cannot access '*': No such file or directory

[1] https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b being the obvious candidate, but I've been staring at the code and I can't figure out how the quoting is getting triggered.
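
For what it's worth, the behavior difference is easy to reproduce outside teuthology. A minimal Python sketch (not teuthology's actual code) showing how quoting the glob argument before handing the command to a shell produces exactly the error above:

import shlex
import subprocess

args = ["ls", "-l", "/dev/disk/by-id/wwn-*"]

# Unquoted: the shell expands the glob before ls runs, so ls sees the
# actual /dev/disk/by-id/wwn-... symlinks (the old, working behavior).
subprocess.run(" ".join(args), shell=True)

# Quoted: shlex.quote() wraps the argument in single quotes, so ls is
# handed the literal string '/dev/disk/by-id/wwn-*' and fails with
# "No such file or directory", as seen in the buggy teuthology.log.
subprocess.run(" ".join(shlex.quote(a) for a in args), shell=True)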

#4 Updated by Nathan Cutler over 4 years ago

Looking at http://pulpito.ceph.com, I can see that some users are getting

> ls -l /dev/disk/by-id/wwn-*

while others (notably yuriw) are getting the buggy version:

ls -l '/dev/disk/by-id/wwn-*'

This might indicate that the buggy code is being loaded from the teuthology checkout in the user's virtualenv. That would make it easy to verify (just run a job with https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b and one without).
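
One cheap check is to ask git whether the suspect commit is present in the teuthology clone a given virtualenv was installed from. A sketch, with the clone path an assumption to adjust:

import subprocess

# Sketch: is the suspect commit an ancestor of HEAD in this clone?
# The clone path is hypothetical; point it at your own checkout.
COMMIT = "41a13eca480e38cfeeba7a180b4516b90598c39b"
CLONE = "/home/ubuntu/teuthology"  # assumed location

rc = subprocess.run(
    ["git", "-C", CLONE, "merge-base", "--is-ancestor", COMMIT, "HEAD"]
).returncode
print("commit present (expect quoted globs)" if rc == 0
      else "commit absent (expect unquoted globs)")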

#5 Updated by Nathan Cutler over 4 years ago

Testing that hypothesis here: http://pulpito.ceph.com/smithfarm-2019-10-15_14:05:10-rbd:mirror-thrash-mimic-distro-basic-smithi/

(My teuthology clone does not have https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90598c39b in it.)

And, sure enough:

2019-10-15T14:32:24.660 INFO:teuthology.orchestra.run.smithi012:Running:
2019-10-15T14:32:24.660 INFO:teuthology.orchestra.run.smithi012:> ls -l /dev/disk/by-id/wwn-*
#6 Updated by Kyrylo Shatskyy over 4 years ago

David Galloway wrote:

[...]

If the tests are writing data to /var/lib/ceph/osd/$id and there's no volume mounted there, of course /var/lib/ceph is going to fill up, and quickly.

I was wondering if https://github.com/ceph/teuthology/pull/1332 might be related.

Why do you think it is related? PR 1332 is not merged. And if you think 41a13ec is related to this issue, that is not true either, because the log in the description shows that this patch had not yet been merged at the time that run was executed.

#8 Updated by David Galloway over 4 years ago

Kyrylo Shatskyy wrote:

Why do you think it is related? PR 1332 is not merged. And if you think 41a13ec is related to this issue, that is not true either, because the log in the description shows that this patch had not yet been merged at the time that run was executed.

I'm wondering whether PR 1332 will fix the issue described in this bug, not the other way around.

#9 Updated by David Galloway over 4 years ago

I just checked a log from last month and see:

2019-09-11T10:08:58.047 INFO:tasks.ceph:fs option selected, checking for scratch devs
2019-09-11T10:08:58.047 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1']
2019-09-11T10:08:58.047 INFO:teuthology.orchestra.run.smithi116:Running:
2019-09-11T10:08:58.047 INFO:teuthology.orchestra.run.smithi116:> ls -l '/dev/disk/by-id/wwn-*'
2019-09-11T10:08:58.119 INFO:teuthology.orchestra.run.smithi116.stderr:ls: cannot access '/dev/disk/by-id/wwn-*': No such file or directory
2019-09-11T10:08:58.120 DEBUG:teuthology.orchestra.run:got remote process result: 2
2019-09-11T10:08:58.120 INFO:teuthology.misc:Failed to get wwn devices! Using /dev/sd* devices...
2019-09-11T10:08:58.120 INFO:tasks.ceph:dev map: {'osd.1': '/dev/vg_nvme/lv_4', 'osd.3': '/dev/vg_nvme/lv_2', 'osd.2': '/dev/vg_nvme/lv_1'}

So the LVs we use on the smithi nodes were being used correctly back then. http://qa-proxy.ceph.com/teuthology/prsrivas-2019-09-11_08:36:08-rgw-wip-rgw-omap-offload-distro-basic-smithi/4298148/teuthology.log

I suspect "No space left on device" is a side effect of the problem this PR fixes: https://github.com/ceph/teuthology/pull/1332

I'm not comfortable reviewing and merging a PR I don't understand though.

EDIT: PR 1332 DOES NOT fix "no space left on device"

http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-16_15:24:52-rbd-wip-yuri6-testing-2019-10-10-2057-mimic-distro-basic-smithi/4416722/teuthology.log

The LVs are recognized, but they never make it into the dev map, so they don't get formatted or mounted.

2019-10-16T15:42:06.527 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1']
2019-10-16T15:42:06.527 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:06.527 INFO:teuthology.orchestra.run.smithi044:> ls -l /dev/disk/by-id/dm-name-*
2019-10-16T15:42:06.606 INFO:teuthology.orchestra.run.smithi044.stdout:lrwxrwxrwx 1 root root 10 Oct 16 15:39 /dev/disk/by-id/dm-name-vg_nvme-lv_1 -> ../../dm-4
2019-10-16T15:42:06.607 INFO:teuthology.orchestra.run.smithi044.stdout:lrwxrwxrwx 1 root root 10 Oct 16 15:39 /dev/disk/by-id/dm-name-vg_nvme-lv_2 -> ../../dm-3
2019-10-16T15:42:06.607 INFO:teuthology.orchestra.run.smithi044.stdout:lrwxrwxrwx 1 root root 10 Oct 16 15:39 /dev/disk/by-id/dm-name-vg_nvme-lv_3 -> ../../dm-2
2019-10-16T15:42:06.607 INFO:teuthology.orchestra.run.smithi044.stdout:lrwxrwxrwx 1 root root 10 Oct 16 15:39 /dev/disk/by-id/dm-name-vg_nvme-lv_4 -> ../../dm-1
2019-10-16T15:42:06.607 INFO:teuthology.orchestra.run.smithi044.stdout:lrwxrwxrwx 1 root root 10 Oct 16 15:39 /dev/disk/by-id/dm-name-vg_nvme-lv_5 -> ../../dm-0
2019-10-16T15:42:06.607 INFO:tasks.ceph:dev map: {}
2019-10-16T15:42:06.608 INFO:tasks.ceph:Generating config...
2019-10-16T15:42:06.614 INFO:tasks.ceph:[global] ms inject socket failures = 5000
2019-10-16T15:42:06.614 INFO:tasks.ceph:[client] rbd default features = 125
2019-10-16T15:42:06.615 INFO:tasks.ceph:[client] rbd cache = True
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] debug ms = 1
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] debug journal = 20
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] osd shutdown pgref assert = True
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] debug osd = 25
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] debug filestore = 20
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] osd objectstore = filestore
2019-10-16T15:42:06.615 INFO:tasks.ceph:[osd] osd sloppy crc = True
2019-10-16T15:42:06.616 INFO:tasks.ceph:[mon] debug mon = 20
2019-10-16T15:42:06.616 INFO:tasks.ceph:[mon] debug paxos = 20
2019-10-16T15:42:06.616 INFO:tasks.ceph:[mon] debug ms = 1
2019-10-16T15:42:06.616 INFO:tasks.ceph:Setting up mon.a...
2019-10-16T15:42:06.616 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:06.616 INFO:teuthology.orchestra.run.smithi044:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --create-keyring /etc/ceph/ceph.keyring
2019-10-16T15:42:06.760 INFO:teuthology.orchestra.run.smithi044.stdout:creating /etc/ceph/ceph.keyring
2019-10-16T15:42:06.763 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:06.763 INFO:teuthology.orchestra.run.smithi044:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --gen-key --name=mon. /etc/ceph/ceph.keyring
2019-10-16T15:42:06.808 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:06.808 INFO:teuthology.orchestra.run.smithi044:> sudo chmod 0644 /etc/ceph/ceph.keyring
2019-10-16T15:42:06.893 DEBUG:teuthology.misc:Ceph mon addresses: [('a', '172.21.15.44:6789'), ('c', '172.21.15.44:6790'), ('b', '172.21.15.39:6789')]
2019-10-16T15:42:06.893 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:06.893 INFO:teuthology.orchestra.run.smithi044:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage monmaptool --create --clobber --add a 172.21.15.44:6789 --add c 172.21.15.44:6790 --add b 172.21.15.39:6789 --print /home/ubuntu/cephtest/ceph.monmap
2019-10-16T15:42:06.991 INFO:teuthology.orchestra.run.smithi044.stdout:monmaptool: monmap file /home/ubuntu/cephtest/ceph.monmap
2019-10-16T15:42:06.992 INFO:teuthology.orchestra.run.smithi044.stdout:monmaptool: generated fsid 9c51fdd3-14bb-4b92-96b9-ffaede16be28
2019-10-16T15:42:06.992 INFO:teuthology.orchestra.run.smithi044.stdout:epoch 0
2019-10-16T15:42:06.992 INFO:teuthology.orchestra.run.smithi044.stdout:fsid 9c51fdd3-14bb-4b92-96b9-ffaede16be28
2019-10-16T15:42:06.992 INFO:teuthology.orchestra.run.smithi044.stdout:last_changed 2019-10-16 15:42:06.993945
2019-10-16T15:42:06.992 INFO:teuthology.orchestra.run.smithi044.stdout:created 2019-10-16 15:42:06.993945
2019-10-16T15:42:06.993 INFO:teuthology.orchestra.run.smithi044.stdout:0: 172.21.15.39:6789/0 mon.b
2019-10-16T15:42:06.993 INFO:teuthology.orchestra.run.smithi044.stdout:1: 172.21.15.44:6789/0 mon.a
2019-10-16T15:42:06.993 INFO:teuthology.orchestra.run.smithi044.stdout:2: 172.21.15.44:6790/0 mon.c
2019-10-16T15:42:06.993 INFO:teuthology.orchestra.run.smithi044.stdout:monmaptool: writing epoch 0 to /home/ubuntu/cephtest/ceph.monmap (3 monitors)
2019-10-16T15:42:06.994 INFO:tasks.ceph:Writing /etc/ceph/ceph.conf for FSID 9c51fdd3-14bb-4b92-96b9-ffaede16be28...
2019-10-16T15:42:06.996 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:06.996 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /etc/ceph && sudo chmod 0755 /etc/ceph && sudo python -c 'import shutil, sys; shutil.copyfileobj(sys.stdin, file(sys.argv[1], "wb"))' /etc/ceph/ceph.conf && sudo chmod 0644 /etc/ceph/ceph.conf
2019-10-16T15:42:07.001 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:07.001 INFO:teuthology.orchestra.run.smithi044:> sudo mkdir -p /etc/ceph && sudo chmod 0755 /etc/ceph && sudo python -c 'import shutil, sys; shutil.copyfileobj(sys.stdin, file(sys.argv[1], "wb"))' /etc/ceph/ceph.conf && sudo chmod 0644 /etc/ceph/ceph.conf
2019-10-16T15:42:07.070 INFO:teuthology.orchestra.run.smithi168:Running:
2019-10-16T15:42:07.071 INFO:teuthology.orchestra.run.smithi168:> sudo mkdir -p /etc/ceph && sudo chmod 0755 /etc/ceph && sudo python -c 'import shutil, sys; shutil.copyfileobj(sys.stdin, file(sys.argv[1], "wb"))' /etc/ceph/ceph.conf && sudo chmod 0644 /etc/ceph/ceph.conf
2019-10-16T15:42:07.124 INFO:tasks.ceph:Creating admin key on mon.a...
2019-10-16T15:42:07.124 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:07.124 INFO:teuthology.orchestra.run.smithi044:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --gen-key --name=client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *' /etc/ceph/ceph.keyring
2019-10-16T15:42:07.237 INFO:tasks.ceph:Copying monmap to all nodes...
2019-10-16T15:42:07.310 DEBUG:teuthology.orchestra.remote:smithi044:/etc/ceph/ceph.keyring is 216B
2019-10-16T15:42:07.331 DEBUG:teuthology.orchestra.remote:smithi044:/home/ubuntu/cephtest/ceph.monmap is 357B
2019-10-16T15:42:07.342 INFO:tasks.ceph:Sending monmap to node ubuntu@smithi039.front.sepia.ceph.com
2019-10-16T15:42:07.342 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:07.342 INFO:teuthology.orchestra.run.smithi039:> sudo sh -c 'cat > /etc/ceph/ceph.keyring' && sudo chmod 0644 /etc/ceph/ceph.keyring
2019-10-16T15:42:07.398 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:07.399 INFO:teuthology.orchestra.run.smithi039:> cat > /home/ubuntu/cephtest/ceph.monmap
2019-10-16T15:42:07.512 INFO:tasks.ceph:Sending monmap to node ubuntu@smithi044.front.sepia.ceph.com
2019-10-16T15:42:07.512 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:07.512 INFO:teuthology.orchestra.run.smithi044:> sudo sh -c 'cat > /etc/ceph/ceph.keyring' && sudo chmod 0644 /etc/ceph/ceph.keyring
2019-10-16T15:42:07.571 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:07.571 INFO:teuthology.orchestra.run.smithi044:> cat > /home/ubuntu/cephtest/ceph.monmap
2019-10-16T15:42:07.684 INFO:tasks.ceph:Sending monmap to node ubuntu@smithi168.front.sepia.ceph.com
2019-10-16T15:42:07.684 INFO:teuthology.orchestra.run.smithi168:Running:
2019-10-16T15:42:07.684 INFO:teuthology.orchestra.run.smithi168:> sudo sh -c 'cat > /etc/ceph/ceph.keyring' && sudo chmod 0644 /etc/ceph/ceph.keyring
2019-10-16T15:42:07.742 INFO:teuthology.orchestra.run.smithi168:Running:
2019-10-16T15:42:07.742 INFO:teuthology.orchestra.run.smithi168:> cat > /home/ubuntu/cephtest/ceph.monmap
2019-10-16T15:42:07.857 INFO:tasks.ceph:Setting up mon nodes...
2019-10-16T15:42:07.858 INFO:tasks.ceph:Setting up mgr nodes...
2019-10-16T15:42:07.858 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:07.858 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /var/lib/ceph/mgr/ceph-y && sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --create-keyring --gen-key --name=mgr.y /var/lib/ceph/mgr/ceph-y/keyring
2019-10-16T15:42:07.940 INFO:teuthology.orchestra.run.smithi039.stdout:creating /var/lib/ceph/mgr/ceph-y/keyring
2019-10-16T15:42:07.943 INFO:teuthology.orchestra.run.smithi044:Running:
2019-10-16T15:42:07.943 INFO:teuthology.orchestra.run.smithi044:> sudo mkdir -p /var/lib/ceph/mgr/ceph-x && sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --create-keyring --gen-key --name=mgr.x /var/lib/ceph/mgr/ceph-x/keyring
2019-10-16T15:42:07.986 INFO:teuthology.orchestra.run.smithi044.stdout:creating /var/lib/ceph/mgr/ceph-x/keyring
2019-10-16T15:42:07.989 INFO:tasks.ceph:Setting up mds nodes...
2019-10-16T15:42:07.989 INFO:tasks.ceph_client:Setting up client nodes...
2019-10-16T15:42:07.989 INFO:teuthology.orchestra.run.smithi168:Running:
2019-10-16T15:42:07.990 INFO:teuthology.orchestra.run.smithi168:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-authtool --create-keyring --gen-key --name=client.0 /etc/ceph/ceph.client.0.keyring && sudo chmod 0644 /etc/ceph/ceph.client.0.keyring
2019-10-16T15:42:08.066 INFO:teuthology.orchestra.run.smithi168.stdout:creating /etc/ceph/ceph.client.0.keyring
2019-10-16T15:42:08.079 INFO:tasks.ceph:Running mkfs on osd nodes...
2019-10-16T15:42:08.079 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi039.front.sepia.ceph.com'): {}, Remote(name='ubuntu@smithi044.front.sepia.ceph.com'): {}}
2019-10-16T15:42:08.079 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:08.079 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /var/lib/ceph/osd/ceph-4
2019-10-16T15:42:08.096 INFO:tasks.ceph:{}
2019-10-16T15:42:08.097 INFO:tasks.ceph:{}
2019-10-16T15:42:08.097 INFO:tasks.ceph:osd.4
2019-10-16T15:42:08.097 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:08.097 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /var/lib/ceph/osd/ceph-5
2019-10-16T15:42:08.184 INFO:tasks.ceph:{}
2019-10-16T15:42:08.184 INFO:tasks.ceph:{}
2019-10-16T15:42:08.184 INFO:tasks.ceph:osd.5
2019-10-16T15:42:08.185 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:08.185 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /var/lib/ceph/osd/ceph-6
2019-10-16T15:42:08.267 INFO:tasks.ceph:{}
2019-10-16T15:42:08.268 INFO:tasks.ceph:{}
2019-10-16T15:42:08.268 INFO:tasks.ceph:osd.6
2019-10-16T15:42:08.268 INFO:teuthology.orchestra.run.smithi039:Running:
2019-10-16T15:42:08.268 INFO:teuthology.orchestra.run.smithi039:> sudo mkdir -p /var/lib/ceph/osd/ceph-7
2019-10-16T15:42:08.355 INFO:tasks.ceph:{}
2019-10-16T15:42:08.356 INFO:tasks.ceph:{}
2019-10-16T15:42:08.356 INFO:tasks.ceph:osd.7
#10 Updated by David Galloway over 4 years ago

This is what it should look like:

2019-09-11T10:08:59.262 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi103.front.sepia.ceph.com'): {'osd.0': '/dev/vg_nvme/lv_4'}, Remote(name='ubuntu@smithi116.front.sepia.ceph.com'): {'osd.1': '/dev/vg_nvme/lv_4', 'osd.3': '/dev/vg_nvme/lv_2', 'osd.2': '/dev/vg_nvme/lv_1'}}
2019-09-11T10:08:59.262 INFO:teuthology.orchestra.run.smithi103:Running:
2019-09-11T10:08:59.262 INFO:teuthology.orchestra.run.smithi103:> sudo mkdir -p /var/lib/ceph/osd/ceph-0
2019-09-11T10:08:59.279 INFO:tasks.ceph:{'osd.0': '/dev/vg_nvme/lv_4'}
2019-09-11T10:08:59.279 INFO:tasks.ceph:{}
2019-09-11T10:08:59.279 INFO:tasks.ceph:osd.0
2019-09-11T10:08:59.279 INFO:tasks.ceph:['mkfs.xfs', '-f', '-i', 'size=2048'] on /dev/vg_nvme/lv_4 on ubuntu@smithi103.front.sepia.ceph.com
2019-09-11T10:08:59.279 INFO:teuthology.orchestra.run.smithi103:Running:
2019-09-11T10:08:59.280 INFO:teuthology.orchestra.run.smithi103:> yes | sudo mkfs.xfs -f -i size=2048 /dev/vg_nvme/lv_4
2019-09-11T10:08:59.828 INFO:teuthology.orchestra.run.smithi103.stdout:meta-data=/dev/vg_nvme/lv_4      isize=2048   agcount=4, agsize=5859072 blks
2019-09-11T10:08:59.828 INFO:teuthology.orchestra.run.smithi103.stdout:         =                       sectsz=512   attr=2, projid32bit=1
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:data     =                       bsize=4096   blocks=23436288, imaxpct=25
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:         =                       sunit=0      swidth=0 blks
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:log      =internal log           bsize=4096   blocks=11443, version=2
2019-09-11T10:08:59.829 INFO:teuthology.orchestra.run.smithi103.stdout:         =                       sectsz=512   sunit=0 blks, lazy-count=1
2019-09-11T10:08:59.830 INFO:teuthology.orchestra.run.smithi103.stdout:realtime =none                   extsz=4096   blocks=0, rtextents=0
2019-09-11T10:08:59.830 INFO:tasks.ceph:mount /dev/vg_nvme/lv_4 on ubuntu@smithi103.front.sepia.ceph.com -o noatime
2019-09-11T10:08:59.831 INFO:teuthology.orchestra.run.smithi103:Running:
2019-09-11T10:08:59.831 INFO:teuthology.orchestra.run.smithi103:> sudo mount -t xfs -o noatime /dev/vg_nvme/lv_4 /var/lib/ceph/osd/ceph-0
#11 Updated by David Galloway over 4 years ago

So things start to go wrong with the remote_to_roles_to_dev function.

Good:

2019-09-11T10:08:59.262 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi103.front.sepia.ceph.com'): {'osd.0': '/dev/vg_nvme/lv_4'}, Remote(name='ubuntu@smithi116.front.sepia.ceph.com'): {'osd.1': '/dev/vg_nvme/lv_4', 'osd.3': '/dev/vg_nvme/lv_2', 'osd.2': '/dev/vg_nvme/lv_1'}}

Bad:

2019-10-16T15:42:08.079 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi039.front.sepia.ceph.com'): {}, Remote(name='ubuntu@smithi044.front.sepia.ceph.com'): {}}

I suspect this broke things: https://github.com/ceph/ceph/pull/30792
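
To make the failure mode concrete, here is a rough sketch (not the real tasks/ceph.py code) of what the dev map drives: each OSD role is looked up in the per-remote mapping, and only mapped roles get a mkfs and a mount, so when remote_to_roles_to_dev comes back empty every OSD silently ends up writing to a plain directory under /var/lib/ceph on the system disk.

import subprocess

def prepare_osd(remote_to_roles_to_dev, remote, role):
    # Rough sketch (not the real teuthology logic) of the per-OSD setup.
    osd_id = role.split(".")[1]
    mnt = "/var/lib/ceph/osd/ceph-{}".format(osd_id)
    subprocess.run(["sudo", "mkdir", "-p", mnt], check=True)

    dev = remote_to_roles_to_dev.get(remote, {}).get(role)
    if dev is None:
        # Bad case (this bug): no scratch device mapped to the role, so the
        # OSD data dir is just a directory on the root filesystem.
        print("{}: no device mapped, data will fill /var/lib/ceph".format(role))
        return

    # Good case, mirroring the 2019-09-11 log above: format the LV and
    # mount it at the OSD data directory.
    subprocess.run(["sudo", "mkfs.xfs", "-f", "-i", "size=2048", dev], check=True)
    subprocess.run(["sudo", "mount", "-t", "xfs", "-o", "noatime", dev, mnt], check=True)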

#12 Updated by Nathan Cutler over 4 years ago

Here's an emergency fix we could try: https://github.com/ceph/teuthology/pull/1334

#13 Updated by Kyrylo Shatskyy over 4 years ago

Nathan Cutler wrote:

Here's an emergency fix we could try: https://github.com/ceph/teuthology/pull/1334

I can confirm that, from my point of view, this temporary workaround is the most likely working solution for now.

#14 Updated by Kyrylo Shatskyy over 4 years ago

Kyrylo Shatskyy wrote:

Nathan Cutler wrote:

Here's an emergency fix we could try: https://github.com/ceph/teuthology/pull/1334

I can confirm that, from my point of view, this temporary workaround is the most likely working solution for now.

However, I still don't understand how the log in the bug description relates to the issue, because the fix in #1334 is not related to it; I can only suggest that it is a coincidence.

#15 Updated by Jason Dillaman over 4 years ago

  • Description updated (diff)
#17 Updated by Jason Dillaman over 4 years ago

  • Project changed from sepia to teuthology
  • Status changed from New to Fix Under Review
#18 Updated by Nathan Cutler over 4 years ago

  • Subject changed from "No space left on device" errors to "No space left on device" errors following 41a13eca480e38cfeeba7a180b4516b90598c39b
  • Status changed from Fix Under Review to Resolved