Feature #16562

rados put: use the FULL_TRY flag to report errors when cluster is full

Added by Matthew Sure almost 8 years ago. Updated almost 7 years ago.

Status: New
Priority: Normal
Assignee: -
Category: ceph cli
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello,
I'm brand new to Ceph. I have 3 OSDs (these are slow Pine64 devices using USB thumb drives; just some testing I'm doing to learn Ceph). I created a single pool with a max quota of 1 GB (ceph osd pool set-quota ubuntublockdev max_bytes 1073741824). I then attempted to copy a 500 MB data file into the pool 3 times, curious what the response would be once the quota was reached. I expected some sort of error message, similar to what mv/cp/rsync would give when a disk is out of space. What I got instead was a hanging process.

root@pine1:~/my-cluster# time rados put 500mbdata1 /root/500mb.data --pool=ubuntublockdev

real    4m37.586s
user    0m0.480s
sys    0m1.740s
root@pine1:~/my-cluster# time rados put 500mbdata1 /root/500mb.data --pool=ubuntublockdev

real    4m34.522s
user    0m0.620s
sys    0m1.600s

root@pine1:~/my-cluster# time rados put 500mbdata2 /root/500mb.data --pool=ubuntublockdev
<has not returned in over 14 hours>

root@pine2:/var/log/ceph# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.01176 root default
-2 0.00269     host pine4
 0 0.00269         osd.0       up  1.00000          1.00000
-3 0.00639     host pine3
 1 0.00639         osd.1       up  1.00000          1.00000
-4 0.00269     host pine2
 2 0.00269         osd.2       up  1.00000          1.00000

root@pine2:/var/log/ceph# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
 0 0.00269  1.00000  2814M 34696k 2513M  1.20 0.07  67
 1 0.00639  1.00000  6676M  1035M 5375M 15.51 0.92 113
 2 0.00269  1.00000  2814M  1003M 1545M 35.63 2.12  76
              TOTAL 12306M  2072M 9434M 16.84
MIN/MAX VAR: 0.07/2.12  STDDEV: 14.13

root@pine2:/var/log/ceph# ceph -s
    cluster 34c9464a-c307-4fa5-bbe8-0f654813982e
     health HEALTH_WARN
            pool 'ubuntublockdev' is full
     monmap e4: 4 mons at {pine1=10.10.10.21:6789/0,pine2=10.10.10.14:6789/0,pine3=10.10.10.13:6789/0,pine4=10.10.10.23:6789/0}
            election epoch 10, quorum 0,1,2,3 pine3,pine2,pine1,pine4
     osdmap e19: 3 osds: 3 up, 3 in
            flags sortbitwise
      pgmap v436: 128 pgs, 2 pools, 1032 MB data, 3 objects
            2072 MB used, 9434 MB / 12306 MB avail
                 128 active+clean

Logs:

2016-06-30 05:23:32.599870 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v418: 128 pgs: 128 active+clean; 1000 MB data, 2108 MB used, 9398 MB / 12306 MB avail
2016-06-30 05:23:33.667407 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v419: 128 pgs: 128 active+clean; 1004 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 82068 B/s wr, 0 op/s
2016-06-30 05:23:37.598989 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v420: 128 pgs: 128 active+clean; 1004 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 819 kB/s wr, 0 op/s
2016-06-30 05:23:38.674872 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v421: 128 pgs: 128 active+clean; 1016 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 2458 kB/s wr, 0 op/s
2016-06-30 05:23:42.616913 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v422: 128 pgs: 128 active+clean; 1016 MB data, 2116 MB used, 9390 MB / 12306 MB avail; 2457 kB/s wr, 0 op/s
2016-06-30 05:23:43.682803 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v423: 128 pgs: 128 active+clean; 1024 MB data, 2128 MB used, 9378 MB / 12306 MB avail; 1632 kB/s wr, 0 op/s
2016-06-30 05:23:45.023717 7fa84073e0  0 log_channel(cluster) log [WRN] : pool 'ubuntublockdev' is full (reached quota's max_bytes: 1024M)
2016-06-30 05:23:45.092333 7fa94073e0  1 mon.pine3@0(leader).osd e19 e19: 3 osds: 3 up, 3 in
2016-06-30 05:23:45.118943 7fa94073e0  0 log_channel(cluster) log [INF] : osdmap e19: 3 osds: 3 up, 3 in
2016-06-30 05:23:45.180069 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v424: 128 pgs: 128 active+clean; 1024 MB data, 2128 MB used, 9378 MB / 12306 MB avail; 3206 kB/s wr, 0 op/s
2016-06-30 05:23:47.593415 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v425: 128 pgs: 128 active+clean; 1024 MB data, 2140 MB used, 9366 MB / 12306 MB avail
2016-06-30 05:23:48.650748 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v426: 128 pgs: 128 active+clean; 1032 MB data, 2040 MB used, 9466 MB / 12306 MB avail; 2354 kB/s wr, 0 op/s
2016-06-30 05:23:52.545141 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v427: 128 pgs: 128 active+clean; 1032 MB data, 2040 MB used, 9466 MB / 12306 MB avail; 1638 kB/s wr, 0 op/s
2016-06-30 05:23:53.529835 7fa84073e0  0 log_channel(cluster) log [INF] : HEALTH_WARN; pool 'ubuntublockdev' is full

The line that says HEALTH_WARN repeats every hour, on the hour. Again, the original rados put command still has not returned nor errored.

root@pine2:/var/log/ceph# ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

root@pine2:/var/log/ceph# uname -a
Linux pine2 3.10.102-0-pine64-longsleep #7 SMP PREEMPT Fri Jun 17 21:30:48 CEST 2016 aarch64 aarch64 aarch64 GNU/Linux

root@pine2:/var/log/ceph# cat /etc/debian_version
stretch/sid
Actions #1

Updated by Matthew Sure almost 8 years ago

After submitting the above, I ctrl-c'd the rados put command.

root@pine1:~/my-cluster# time rados put 500mbdata2 /root/500mb.data --pool=ubuntublockdev

^C

real    784m0.548s
user    0m4.110s
sys    0m2.120s
root@pine1:~/my-cluster#
root@pine1:~/my-cluster# rados -p ubuntublockdev ls
500mbdata
500mbdata2
500mbdata1

So, additionally, Ceph believes the object is present too, even though the quota is full.

ceph pg dump --> http://pastebin.com/WddPjH11

root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata' -> pg 1.ac3c1a46 (1.6) -> up ([2,1], p2) acting ([2,1], p2)
root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata1
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata1' -> pg 1.9ee55333 (1.33) -> up ([1,2], p1) acting ([1,2], p1)
root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata2
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata2' -> pg 1.a2ebfe26 (1.26) -> up ([0,1], p0) acting ([0,1], p0)
Actions #2

Updated by Matthew Sure almost 8 years ago

I'm unable to delete objects. At least this gives some error/warning message that things are wrong, but it still hung and I had to ctrl-c it.

root@pine1:~/my-cluster# time rados rm 500mbdata --pool=ubuntublockdev
2016-06-30 18:31:55.271180 7f92bcd000  0 client.4394.objecter  FULL, paused modify 0x558d004650 tid 0
^C

real    8m58.841s
user    0m0.100s
sys    0m0.030s
Actions #3

Updated by Matthew Sure almost 8 years ago

Had to remove the quota in order to remove the files.

Actions #4

Updated by Nathan Cutler almost 8 years ago

  • Target version deleted (519)
Actions #5

Updated by Greg Farnum almost 7 years ago

  • Assignee set to David Zafman

David, can you confirm this is fixed after your work on full states?

Actions #6

Updated by Josh Durgin almost 7 years ago

  • Status changed from New to Resolved

'rados rm' now has a --force-full option that lets deletes proceed.
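For librados users, here is a rough, self-contained sketch of the equivalent delete through the C API. This is not the rados tool's actual code: it assumes --force-full corresponds to the LIBRADOS_OPERATION_FULL_FORCE operation flag, and it reuses the pool and object names from this report.

/* Sketch only: delete an object even though the pool has hit its quota.
 * Assumes the CLI's --force-full maps to LIBRADOS_OPERATION_FULL_FORCE.
 * Build with: cc rm_full_force.c -lrados -o rm_full_force
 * Error handling is kept minimal for brevity. */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rados_write_op_t op;
    int ret;

    rados_create(&cluster, NULL);            /* default (admin) client */
    rados_conf_read_file(cluster, NULL);     /* default ceph.conf search path */
    if (rados_connect(cluster) < 0)
        return 1;
    rados_ioctx_create(cluster, "ubuntublockdev", &ioctx);

    op = rados_create_write_op();
    rados_write_op_remove(op);
    /* FULL_FORCE asks the OSDs to let this delete through despite the full state. */
    ret = rados_write_op_operate(op, ioctx, "500mbdata", NULL,
                                 LIBRADOS_OPERATION_FULL_FORCE);
    if (ret < 0)
        fprintf(stderr, "remove failed: %s\n", strerror(-ret));
    rados_release_write_op(op);

    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}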

Actions #7

Updated by Matthew Sure almost 7 years ago

But does this resolve the original issue of rados NOT giving error messages when space has been filled up? Or does rados continue to hang indefinitely when space is filled up?

Actions #8

Updated by Josh Durgin almost 7 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from rados put hangs indefinitely on pool is full to rados put: use the FULL_TRY flag to report errors when cluster is full
  • Category changed from OSD to ceph cli
  • Status changed from Resolved to New
  • Assignee deleted (David Zafman)

The behavior of blocking when full is intentionally the default for rados operations - too many consumers of the interface can't handle an out of space error.

It is possible to make the 'rados put' command report that condition, though - it just needs to use the LIBRADOS_OPERATION_FULL_TRY flag. That seems like a reasonable thing for the CLI to do, so I'll re-open this for that change.
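For illustration, a rough sketch of what that looks like through the librados C API (the rados tool itself is C++ and would use the equivalent ObjectWriteOperation flag). The pool and object names below come from this report, and the error values are expectations rather than verified output: with the flag set, a write against a full pool should come back with something like -EDQUOT (pool quota reached) or -ENOSPC (cluster full) instead of blocking.

/* Sketch only (not the rados tool's code): write an object and report the
 * full condition instead of blocking, via LIBRADOS_OPERATION_FULL_TRY.
 * Build with: cc put_full_try.c -lrados -o put_full_try */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rados_write_op_t op;
    const char payload[] = "some test data";
    int ret;

    rados_create(&cluster, NULL);            /* default (admin) client */
    rados_conf_read_file(cluster, NULL);     /* default ceph.conf search path */
    if (rados_connect(cluster) < 0)
        return 1;
    rados_ioctx_create(cluster, "ubuntublockdev", &ioctx);

    op = rados_create_write_op();
    rados_write_op_write_full(op, payload, sizeof(payload));
    /* FULL_TRY: if the pool or cluster is full, fail the op (e.g. -EDQUOT or
     * -ENOSPC) rather than pausing it until space appears. */
    ret = rados_write_op_operate(op, ioctx, "500mbdata2", NULL,
                                 LIBRADOS_OPERATION_FULL_TRY);
    if (ret < 0)
        fprintf(stderr, "write failed: %s\n", strerror(-ret));
    rados_release_write_op(op);

    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}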

Actions #9

Updated by Matthew Sure almost 7 years ago

They can't handle an "out of space" error message, but they can handle an infinite hang? So you're saying that if you plugged an external drive into your computer, copied a large file to it, and it ran out of space, you would rather it NOT give you an error message, but instead simply sit there forever, waiting for the copy to finish?

Actions #10

Updated by Nathan Cutler almost 7 years ago

How can you be sure the Ceph client will hang forever? If more space is added to the cluster, perhaps the operation will complete?

The comparison with an external hard drive is apples-and-oranges IMHO.

Actions #11

Updated by Matthew Sure almost 7 years ago

If I don't add more space, because I can't, then it will hang forever, hence the original reasoning behind this bug report.

If I understand you correctly, what you are proposing is that, in my data center, I have 100TB of Ceph-based storage available for my hosts. I slice that up and present 100 1TB mounts to each of my app/db servers. Suddenly, my entire app has hung because Ceph is NOT returning proper disk-full errors: my DBs are full and my apps have used up all local disk space for files.

I do not have the option to add more disks/OSDs to my infrastructure without literally going and buying more, which could take days/weeks. Would you be OK with your apps/DBs hanging there until more space was added?

Pretend this Ceph pool is an NFS mount, an SMB mount, an iSCSI mount, a Fibre Channel mount, or any other kind of mount there is for any disk: once it is full, you would get a message stating that fact. Ceph is the only system I've encountered where "disk full" actually means "let's wait forever and see if more space happens to appear."

You cannot possibly build a production, mission-critical system on "let's wait forever." Ceph needs to return a proper error message when the mounted disk is full, just like every other infrastructure option currently in existence.

Actions #12

Updated by Josh Durgin almost 7 years ago

It depends on the interface above it. The block layer in Linux, for example, has no way to indicate 'out of space' - every error turns into EIO, which results in filesystem corruption. CephFS does report ENOSPC, since filesystems have an interface through which this can be reported.

If your application is using librados, you can use the FULL_TRY flag to get an error in this case as well.
