Feature #16562
rados put: use the FULL_TRY flag to report errors when cluster is full
Description
Hello,
I'm brand new to Ceph. I have 3 OSDs (these are slow Pine64 devices using USB thumb drives; just some testing I'm doing to learn Ceph). I created a single pool with a max quota of 1GB (ceph osd pool set-quota ubuntublockdev max_bytes 1073741824). I then attempted to copy a 500MB data file into the pool 3 times, curious what the response would be when the quota was reached. I expected some sort of error message, similar to what mv/cp/rsync would give when a disk is out of space. What I got instead was a hanging process.
root@pine1:~/my-cluster# time rados put 500mbdata1 /root/500mb.data --pool=ubuntublockdev

real    4m37.586s
user    0m0.480s
sys     0m1.740s
root@pine1:~/my-cluster# time rados put 500mbdata1 /root/500mb.data --pool=ubuntublockdev

real    4m34.522s
user    0m0.620s
sys     0m1.600s
root@pine1:~/my-cluster# time rados put 500mbdata2 /root/500mb.data --pool=ubuntublockdev
<has not returned in over 14 hours>

root@pine2:/var/log/ceph# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.01176 root default
-2 0.00269     host pine4
 0 0.00269         osd.0       up  1.00000          1.00000
-3 0.00639     host pine3
 1 0.00639         osd.1       up  1.00000          1.00000
-4 0.00269     host pine2
 2 0.00269         osd.2       up  1.00000          1.00000

root@pine2:/var/log/ceph# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0 0.00269  1.00000  2814M 34696k  2513M  1.20 0.07  67
 1 0.00639  1.00000  6676M  1035M  5375M 15.51 0.92 113
 2 0.00269  1.00000  2814M  1003M  1545M 35.63 2.12  76
              TOTAL 12306M  2072M  9434M 16.84
MIN/MAX VAR: 0.07/2.12  STDDEV: 14.13

root@pine2:/var/log/ceph# ceph -s
    cluster 34c9464a-c307-4fa5-bbe8-0f654813982e
     health HEALTH_WARN
            pool 'ubuntublockdev' is full
     monmap e4: 4 mons at {pine1=10.10.10.21:6789/0,pine2=10.10.10.14:6789/0,pine3=10.10.10.13:6789/0,pine4=10.10.10.23:6789/0}
            election epoch 10, quorum 0,1,2,3 pine3,pine2,pine1,pine4
     osdmap e19: 3 osds: 3 up, 3 in
            flags sortbitwise
      pgmap v436: 128 pgs, 2 pools, 1032 MB data, 3 objects
            2072 MB used, 9434 MB / 12306 MB avail
                 128 active+clean
Logs:
2016-06-30 05:23:32.599870 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v418: 128 pgs: 128 active+clean; 1000 MB data, 2108 MB used, 9398 MB / 12306 MB avail
2016-06-30 05:23:33.667407 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v419: 128 pgs: 128 active+clean; 1004 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 82068 B/s wr, 0 op/s
2016-06-30 05:23:37.598989 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v420: 128 pgs: 128 active+clean; 1004 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 819 kB/s wr, 0 op/s
2016-06-30 05:23:38.674872 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v421: 128 pgs: 128 active+clean; 1016 MB data, 2108 MB used, 9398 MB / 12306 MB avail; 2458 kB/s wr, 0 op/s
2016-06-30 05:23:42.616913 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v422: 128 pgs: 128 active+clean; 1016 MB data, 2116 MB used, 9390 MB / 12306 MB avail; 2457 kB/s wr, 0 op/s
2016-06-30 05:23:43.682803 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v423: 128 pgs: 128 active+clean; 1024 MB data, 2128 MB used, 9378 MB / 12306 MB avail; 1632 kB/s wr, 0 op/s
2016-06-30 05:23:45.023717 7fa84073e0  0 log_channel(cluster) log [WRN] : pool 'ubuntublockdev' is full (reached quota's max_bytes: 1024M)
2016-06-30 05:23:45.092333 7fa94073e0  1 mon.pine3@0(leader).osd e19 e19: 3 osds: 3 up, 3 in
2016-06-30 05:23:45.118943 7fa94073e0  0 log_channel(cluster) log [INF] : osdmap e19: 3 osds: 3 up, 3 in
2016-06-30 05:23:45.180069 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v424: 128 pgs: 128 active+clean; 1024 MB data, 2128 MB used, 9378 MB / 12306 MB avail; 3206 kB/s wr, 0 op/s
2016-06-30 05:23:47.593415 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v425: 128 pgs: 128 active+clean; 1024 MB data, 2140 MB used, 9366 MB / 12306 MB avail
2016-06-30 05:23:48.650748 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v426: 128 pgs: 128 active+clean; 1032 MB data, 2040 MB used, 9466 MB / 12306 MB avail; 2354 kB/s wr, 0 op/s
2016-06-30 05:23:52.545141 7fa94073e0  0 log_channel(cluster) log [INF] : pgmap v427: 128 pgs: 128 active+clean; 1032 MB data, 2040 MB used, 9466 MB / 12306 MB avail; 1638 kB/s wr, 0 op/s
2016-06-30 05:23:53.529835 7fa84073e0  0 log_channel(cluster) log [INF] : HEALTH_WARN; pool 'ubuntublockdev' is full
The line that says HEALTH_WARN repeats every hour, on the hour. Again, the original rados put command still has not returned or produced an error.
root@pine2:/var/log/ceph# ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
root@pine2:/var/log/ceph# uname -a
Linux pine2 3.10.102-0-pine64-longsleep #7 SMP PREEMPT Fri Jun 17 21:30:48 CEST 2016 aarch64 aarch64 aarch64 GNU/Linux
root@pine2:/var/log/ceph# cat /etc/debian_version
stretch/sid
Updated by Matthew Sure almost 8 years ago
After submitting the above, I ctrl-c'd the rados put process:
root@pine1:~/my-cluster# time rados put 500mbdata2 /root/500mb.data --pool=ubuntublockdev
^C
real    784m0.548s
user    0m4.110s
sys     0m2.120s
root@pine1:~/my-cluster#
root@pine1:~/my-cluster# rados -p ubuntublockdev ls
500mbdata
500mbdata2
500mbdata1
So, additionally, Ceph believes the object is present, even though the quota is full.
ceph pg dump --> http://pastebin.com/WddPjH11
root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata' -> pg 1.ac3c1a46 (1.6) -> up ([2,1], p2) acting ([2,1], p2)
root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata1
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata1' -> pg 1.9ee55333 (1.33) -> up ([1,2], p1) acting ([1,2], p1)
root@pine1:~/my-cluster# ceph osd map ubuntublockdev 500mbdata2
osdmap e19 pool 'ubuntublockdev' (1) object '500mbdata2' -> pg 1.a2ebfe26 (1.26) -> up ([0,1], p0) acting ([0,1], p0)
Updated by Matthew Sure almost 8 years ago
I'm unable to delete objects. At least this is giving me some error/warning message that things are wrong. Still hung, though; I had to ctrl-c it.
root@pine1:~/my-cluster# time rados rm 500mbdata --pool=ubuntublockdev
2016-06-30 18:31:55.271180 7f92bcd000  0 client.4394.objecter  FULL, paused modify 0x558d004650 tid 0
^C
real    8m58.841s
user    0m0.100s
sys     0m0.030s
Updated by Matthew Sure almost 8 years ago
Had to remove the quota in order to remove the files.
Updated by Greg Farnum almost 7 years ago
- Assignee set to David Zafman
David, can you confirm this is fixed after your work on full states?
Updated by Josh Durgin almost 7 years ago
- Status changed from New to Resolved
'rados rm' now has a --force-full option that lets deletes proceed.
Updated by Matthew Sure almost 7 years ago
But does this resolve the original issue of rados NOT giving error messages when space has been filled up? Or does rados continue to hang indefinitely when space is filled up?
Updated by Josh Durgin almost 7 years ago
- Tracker changed from Bug to Feature
- Subject changed from rados put hangs indefinitely on pool is full to rados put: use the FULL_TRY flag to report errors when cluster is full
- Category changed from OSD to ceph cli
- Status changed from Resolved to New
- Assignee deleted (David Zafman)
The behavior of blocking when full is intentionally the default for rados operations - too many consumers of the interface can't handle an out of space error.
It is possible to make the 'rados put' command report that condition, though - it just needs to use the LIBRADOS_OPERATION_FULL_TRY flag. That seems like a reasonable thing for the CLI to do, so I'll re-open this for that change.
Updated by Matthew Sure almost 7 years ago
They can't handle an "out of space" error message, but can handle an infinite hang? So you're saying that if you plugged an external drive into your computer, copied a large file to it, and it ran out of space, you would rather it NOT give you an error message, but instead simply sit there forever, waiting for the copy to finish?
Updated by Nathan Cutler almost 7 years ago
How can you be sure the Ceph client will hang forever? If more space is added to the cluster, perhaps the operation will complete?
The comparison with an external hard drive is apples-and-oranges IMHO.
Updated by Matthew Sure almost 7 years ago
If I don't add more space, because I can't, then it will hang forever, hence the original reasoning behind this bug report.
If I understand you correctly, what you are proposing is that in my data center, I have 100TB of ceph-based storage available for my hosts. I slice that up and present 100 1TB mounts to each of my app/db servers. Suddenly, my entire app has hung because ceph is NOT returning proper disk full errors because my db's are full and my apps have used up all local disk space for files.
I do not have the option to add more disks/osd's to my infrastructure without going and literally buying more which could take days/weeks. You would be OK with your apps/dbs hanging there until more space was added?
Pretend this Ceph mount is an NFS mount, an SMB mount, an iSCSI mount, a Fibre Channel mount, or any other kind of mount there is for any disk: once it is full, you would get a message stating that fact. Ceph is the only system I've encountered where "disk full" actually means "let's wait forever and see if more space happens to appear."
You cannot possibly build a production, mission-critical system on "let's wait forever." Ceph needs to return a proper error message when the mounted disk is full, just like every other infrastructure option currently in existence.
Updated by Josh Durgin almost 7 years ago
It depends on the interface above it. The block layer in Linux, for example, has no way to indicate 'out of space': every error turns into EIO, which results in fs corruption. CephFS does report ENOSPC, since filesystems have an interface through which this can be reported.
If your application is using librados, you can use the FULL_TRY flag to get an error in this case as well.
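For librados consumers, the FULL_TRY approach described above can be sketched as follows. This is a minimal, illustrative example, not the actual rados CLI change: it assumes a reachable cluster with a ceph.conf in the default search path, and reuses the pool and object names from this report. It passes LIBRADOS_OPERATION_FULL_TRY to the write so that a quota-full pool fails with -EDQUOT (or a full cluster with -ENOSPC) instead of the client blocking indefinitely.

```c
/* Sketch: write an object with LIBRADOS_OPERATION_FULL_TRY so a full
 * pool returns an error instead of hanging.  Pool/object names are
 * taken from this report; adjust for your cluster. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    const char data[] = "hello";

    /* Connect using the default ceph.conf and client.admin keyring. */
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "failed to connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "ubuntublockdev", &io) < 0) {
        fprintf(stderr, "failed to open pool\n");
        rados_shutdown(cluster);
        return 1;
    }

    rados_write_op_t op = rados_create_write_op();
    rados_write_op_write_full(op, data, sizeof(data));

    /* FULL_TRY asks the OSDs to process the op even when a full flag is
     * set; the op then fails with -EDQUOT (pool quota) or -ENOSPC
     * (cluster full) rather than being paused client-side. */
    int ret = rados_write_op_operate(op, io, "500mbdata2", NULL,
                                     LIBRADOS_OPERATION_FULL_TRY);
    if (ret == -EDQUOT || ret == -ENOSPC)
        fprintf(stderr, "write failed, pool is full: %s\n", strerror(-ret));
    else if (ret < 0)
        fprintf(stderr, "write failed: %s\n", strerror(-ret));

    rados_release_write_op(op);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}
```

Without the flag (passing 0 as the final argument to rados_write_op_operate), the objecter pauses the op while the full flag is set, which is exactly the indefinite hang reported in this ticket.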