bluestore: full osd will not start. _do_alloc_write failed to reserve 0x10000, etc.
Excited to see how fast I could get into trouble with kraken, I created a small test cluster with 3x 32gb bluestore OSD's.
I set up the cluster by piecing bits of info from different sources, I'm not clear on whether this is was the correct way to set up bluestore, but it did work great until I filled it to the brim with a test file.
the error dump is over 5000 lines so I attached it in a text file.
an oddity of my setup is how ceph-disk prepare allocated my space. On a 32gb disk I ran the following command:
ceph-disk prepare --bluestore /dev/vdb --block.db /dev/vdb --block.wal /dev/vdb
and it produced this partition table:
@Model: Virtio Block Device (virtblk)
Disk /dev/vdb: 34.4GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 106MB 105MB xfs ceph data
3 106MB 21.6GB 21.5GB ceph block.db
4 21.6GB 22.2GB 604MB ceph block.wal
2 22.2GB 34.4GB 12.2GB ceph block@
I noticed the db is way too big, but I rolled with it. My ceph.conf does not define the sizes for any of the partitions because I wanted to see what the defaults would produce. I could not find any guidelines on what size the partitions are supposed to be.
looking at the block partition and seeing that it's 12GB, with 3x OSD's, I expected to run out of space after about 15GB worth of writes (my pool size was 2). and that is what happened.
not surprisingly, 2 of my 3 OSD's will not start. they were the first ones to fill up and at which point there were not enough OSD's up to process further requests.
if this was filestore, I assume that I could manually delete some pg's to get the OSD's to start, but I have no idea how to handle this in bluestore.
one thing that worries me the most is that ceph df reports the following, counting the data, db and wal in the total available space. I would assume that it should only count the data partition in totaling up the space:
2017-01-20 03:12:46.158064 7f89d50d2700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
SIZE AVAIL RAW USED %RAW USED
96272M 65717M 30555M 31.74
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 0 0 32603M 0
cephfs_metadata 1 1004k 0 32603M 20
cephfs_data 2 15671M 24.55 32603M 3918
I hope I've given enough info and I'm sorry to bother you if this bug is already known.
- Status changed from New to Verified
- Assignee set to Sage Weil
- Priority changed from Normal to Immediate
I think the root cause here is that the space reporting should not include the db partition, because that space cannot be used for (normal) object data. The meant the OSD failed to realize it was full. Normally teh cluster full flag or the per-osd failsafe should have kicked in well before this happend.
You can repair with ceph-objectstore-tool's --op fuse command, which will let you mount the offline osd and delete stuff. rmdir on the object directories will delete them (even though they have a bunch of pseudofiles in them). once the colleciton is empty, you can rmdir the collection too.
#3 Updated by Heath Jepson 4 months ago
I have yet to spend time to figure out how to tell ceph-disk what size to make the partitions (whether through the command itself or via ceph.conf), but is 8MB per 1GB of data a reasonable size for the db partition?
If I would have sized the db at 256MB for my 32GB OSD, I assume the full ratio should have kicked in and prevented this situation in the first place. Is this correct?
Even though I could easily just kill and re-create my OSD's, I'd prefer to go through the exercise of recovering with minimal data loss, but after deleting some collections from my OSD I encountered a new crash.
attached full log, but here are some of the important-looking lines:
-4> 2017-01-24 11:35:17.958934 7f051969a940 5 osd.1 pg_epoch: 80 pg[0.4(unlocked)] enter Initial
-3> 2017-01-24 11:35:17.959144 7f051969a940 5 osd.1 pg_epoch: 80 pg[0.4( empty local-les=34 n=0 ec=1 les/c/f 27/27/0 80/80/80)  r=0 lpr=0 pi=26-79/21 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.000209 0 0.000000
-2> 2017-01-24 11:35:17.959173 7f051969a940 5 osd.1 pg_epoch: 80 pg[0.4( empty local-les=34 n=0 ec=1 les/c/f 27/27/0 80/80/80)  r=0 lpr=0 pi=26-79/21 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
-1> 2017-01-24 11:35:17.959190 7f051969a940 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , clear_divergent_priors: 0
0> 2017-01-24 11:35:17.978212 7f051969a940 -1 /build/ceph-11.2.0/src/osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f051969a940 time 2017-01-24 11:35:17.975784
/build/ceph-11.2.0/src/osd/PG.cc: 2995: FAILED assert(values.size() == 2)
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x56477e26a152]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x41a) [0x56477dc1f19a]
3: (OSD::load_pgs()+0x7c7) [0x56477db92177]
4: (OSD::init()+0x2135) [0x56477db9e075]
5: (main()+0x2b74) [0x56477dace2c4]
6: (__libc_start_main()+0xf5) [0x7f051609db45]
7: (()+0x4059e6) [0x56477db4c9e6]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.