Bug #63132
qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning
Description
/a/yuriw-2023-10-06_22:26:38-fs-wip-yuri3-testing-2023-10-06-0948-quincy-distro-default-smithi/7415802
Probably not related to cephfs as such, but let's debug and hand it over to the rados folks if nothing stands out.
Updated by Kotresh Hiremath Ravishankar 7 months ago
The test sets up a cluster with a single 1 GB OSD. It creates a cephfs subvolume and writes an 800 MB file. While writing the 800 MB file, the OSD crashed.
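For orientation, a rough, paraphrased sketch of what the workunit drives (not the verbatim script; the subvolume name and the $CEPH_MNT mount-point variable are illustrative):

# Paraphrased shape of qa/workunits/fs/full/subvolume_snapshot_rm.sh (illustrative only)
ceph fs subvolume create cephfs sub_0
subvol_path=$(ceph fs subvolume getpath cephfs sub_0)

# Fill most of the 1 GB OSD with an 800 MB file (this is the dd visible in the log below).
sudo dd if=/dev/urandom of="$CEPH_MNT$subvol_path/800MB_file-1" \
    status=progress bs=1M count=800 conv=fdatasync || true

# The workunit then waits for the cluster to report OSD_FULL; this is where the
# job stalls when the OSD aborts instead of turning full.
while ! ceph health detail | grep -q OSD_FULL; do sleep 5; done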
The df on the cephfs mount before the write:
2023-10-07T05:10:19.014 INFO:tasks.workunit.client.0.smithi156.stderr:+ echo 'Before write'
2023-10-07T05:10:19.015 INFO:tasks.workunit.client.0.smithi156.stderr:+ df /home/ubuntu/cephtest/mnt.0
2023-10-07T05:10:19.015 INFO:tasks.workunit.client.0.smithi156.stdout:Filesystem 1K-blocks Used Available Use% Mounted on
2023-10-07T05:10:19.016 INFO:tasks.workunit.client.0.smithi156.stdout:ceph-fuse 675840 0 675840 0% /home/ubuntu/cephtest/mnt.0
The OSD df output before the write:
2023-10-07T05:10:19.416 INFO:tasks.workunit.client.0.smithi156.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout: 0 ssd 0.00099 1.00000 1 GiB 310 MiB 7.3 MiB 0 B 302 MiB 714 MiB 30.23 1.00 97 up
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout: TOTAL 1 GiB 310 MiB 7.3 MiB 0 B 302 MiB 714 MiB 30.23
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout:MIN/MAX VAR: 1.00/1.00 STDDEV: 0
The OSD crash while writing the 800 MB file:
2023-10-07T05:10:19.429 INFO:tasks.workunit.client.0.smithi156.stderr:+ ignore_failure sudo dd if=/dev/urandom of=/home/ubuntu/cephtest/mnt.0/volumes/_nogroup/sub_0/6b940132-4581-44d4-97d9-dfea73d938c2/800MB_file-1 status=progress bs=1M count=800 conv=fdatasync
2023-10-07T05:10:19.429 INFO:tasks.workunit.client.0.smithi156.stderr:+ sudo dd if=/dev/urandom of=/home/ubuntu/cephtest/mnt.0/volumes/_nogroup/sub_0/6b940132-4581-44d4-97d9-dfea73d938c2/800MB_file-1 status=progress bs=1M count=800 conv=fdatasync
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_alloc_write failed to allocate 0x400000 allocated 0x 251000 min_alloc_size 0x1000 available 0x 0
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_write _do_alloc_write failed with (28) No space left on device
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 2, counting from 0)
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) ENOSPC from bluestore, misconfigured cluster
2023-10-07T05:10:23.411 INFO:tasks.ceph.osd.0.smithi156.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1298-gc656b96b/rpm/el8/BUILD/ceph-17.2.6-1298-gc656b96b/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f689489d700 time 2023-10-07T05:10:23.407401+0000
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1298-gc656b96b/rpm/el8/BUILD/ceph-17.2.6-1298-gc656b96b/src/os/bluestore/BlueStore.cc: 14614: ceph_abort_msg("unexpected error")
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: ceph version 17.2.6-1298-gc656b96b (c656b96b5bee9f08d23a3059b37bb78b669c2895) quincy (stable)
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x55eadad815ad]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x9fb) [0x55eadb480f8b]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2df) [0x55eadb49212f]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x53) [0x55eadb04c5a3]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7c2) [0x55eadb2a3272]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55eadafc41bd]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 7: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55eadb0213e5]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 8: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55eadb028a4d]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55eadae84060]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55eadb18c9cd]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x112f) [0x55eadaebed3f]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55eadb6022f5]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55eadb604a14]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 14: /lib64/libpthread.so.0(+0x81ca) [0x7f68bb0cb1ca]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 15: clone()
The OSD is not expected to crash; it is supposed to return ENOSPC.
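For context, the OSD_FULL behaviour the test relies on is driven by the cluster full ratios; when they trip, clients should get ENOSPC while the OSD itself stays up. A quick way to inspect those thresholds on a test cluster (standard commands; the defaults quoted in the comments are the usual ones, not values taken from this run):

# Cluster-wide thresholds kept in the OSDMap (defaults are typically 0.95 / 0.90 / 0.85)
ceph osd dump | grep ratio

# Per-OSD last-resort threshold before the OSD refuses writes outright (default ~0.97)
ceph config get osd osd_failsafe_full_ratio

# The health warning the workunit polls for
ceph health detail | grep OSD_FULL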
Updated by Venky Shankar 7 months ago
Ah, so it's the way the test config is set up. We should fix that in the fs qa suite.
Updated by Venky Shankar 7 months ago
- Related to Bug #62501: pacific(?): qa: mgr-osd-full causes OSD aborts due to ENOSPC (fs/full/subvolume_snapshot_rm.sh) added
Updated by Venky Shankar 7 months ago
- Status changed from New to Triaged
- Assignee set to Kotresh Hiremath Ravishankar
Updated by Kotresh Hiremath Ravishankar 6 months ago
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
2023-11-04T07:49:50.239 INFO:tasks.workunit.client.0.smithi062.stderr:+ ceph osd df
2023-11-04T07:49:50.621 INFO:tasks.workunit.client.0.smithi062.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-11-04T07:49:50.622 INFO:tasks.workunit.client.0.smithi062.stdout: 0 ssd 0.00099 1.00000 1 GiB 27 MiB 736 KiB 1 KiB 26 MiB 997 MiB 2.65 1.00 97 up
2023-11-04T07:49:50.622 INFO:tasks.workunit.client.0.smithi062.stdout: TOTAL 1 GiB 27 MiB 736 KiB 1.6 KiB 26 MiB 997 MiB 2.65
On quincy: Available is 729 MiB
2023-10-07T05:07:29.602 INFO:tasks.workunit.client.0.smithi156.stderr:+ ceph osd df
2023-10-07T05:07:29.998 INFO:tasks.workunit.client.0.smithi156.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-10-07T05:07:29.998 INFO:tasks.workunit.client.0.smithi156.stdout: 0 ssd 0.00099 1.00000 1 GiB 295 MiB 720 KiB 0 B 294 MiB 729 MiB 28.81 1.00 97 up
2023-10-07T05:07:29.999 INFO:tasks.workunit.client.0.smithi156.stdout: TOTAL 1 GiB 295 MiB 720 KiB 0 B 294 MiB 729 MiB 28.81
For the same test suite configuration of a 1 GB OSD, how can it be different between main and release branches?
Because of this difference, the test fails in release branches.
Venky/Patrick,
Any idea?
Updated by Venky Shankar 6 months ago
Kotresh Hiremath Ravishankar wrote:
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
[...]
On quincy: Available is 729 MiB
[...]
Do you see any config difference? The main branch could be using a different allocator. I would start from that...
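One way to follow that up (a sketch; the commands are standard, the option names are just the likely suspects to compare between branches):

# On the test node, dump everything the running OSD has changed from built-in defaults
sudo ceph daemon osd.0 config diff

# Or pull just the allocator/RocksDB knobs from the cluster's view of the daemon
ceph config show osd.0 | grep -E 'bluestore_allocator|bluefs_allocator|bluestore_rocksdb_options'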
Updated by Rishabh Dave 6 months ago
- Related to Bug #63539: fs/full/subvolume_clone.sh: Health check failed: 1 full osd(s) (OSD_FULL) added
Updated by Kotresh Hiremath Ravishankar 6 months ago
Venky Shankar wrote:
Kotresh Hiremath Ravishankar wrote:
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
[...]
On quincy: Available is 729 MiB
[...]
Do you see any config difference? The main branch could be using a different allocator. I would start from that...
OK, I didn't go into the OSD configuration. I will check that out.
Updated by Radoslaw Zarzynski 6 months ago
A speculation: in main we compress RocksDB (bluestore_rocksdb_options; see https://github.com/ceph/ceph/pull/53343).
The patch isn't in reef yet.
Ruling it out as the cause would make further thinking easier.
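A quick check to confirm or rule that out on a live run (assuming the compression setting, if enabled, shows up as a compression=... token in the options string):

# On main (with PR 53343) this should print a compression=... token; on quincy/reef it should print nothing
ceph config get osd bluestore_rocksdb_options | tr ',' '\n' | grep -i compression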
Updated by Kotresh Hiremath Ravishankar 6 months ago
Copying the slack communication here
-------------------------------------
Kotresh H R
@Radoslaw Zarzynski
Thanks for the input. It could be the reason. Is the PR https://github.com/ceph/ceph/pull/53343 planned to be backported to release branches? If not, I might consider modifying the test cases.
5:39
I also need to validate that it is the reason. I will do that.
5:41
Well, I see in the commit message that it's said it would be the default from Squid, so no backport to reef/quincy is expected?
rzarzynski
5:43 PM
it's going to be in squid if it passes extra testing for regressions
Updated by Venky Shankar 5 months ago
I suggested running a test with rocksdb compression disabled to check whether the space utilization shoots up and trips the test.
Updated by Kotresh Hiremath Ravishankar 5 months ago
I reverted the rocksdb compression PR and ran the tests.
I observed that the space utilization on main is unchanged with and without compression before writing any data.
2023-12-01T07:42:49.720 INFO:tasks.workunit.client.0.smithi069.stdout:Before write
2023-12-01T07:42:49.720 INFO:tasks.workunit.client.0.smithi069.stderr:+ df -h
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:Filesystem Size Used Avail Use% Mounted on
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 3.2G 1.4M 3.2G 1% /run
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:/dev/sda1 916G 11G 860G 2% /
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 16G 32K 16G 1% /dev/shm
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 5.0M 0 5.0M 0% /run/lock
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 3.2G 0 3.2G 0% /run/user/1000
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:/dev/mapper/vg_nvme-lv_5 15G 72M 15G 1% /var/lib/ceph
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:ceph-fuse 944M 0 944M 0% /home/ubuntu/cephtest/mnt.0
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stderr:+ ceph osd df
...
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout: 0 ssd 0.00099 1.00000 1 GiB 27 MiB 736 KiB 1 KiB 26 MiB 997 MiB 2.65 1.00 97 up
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout: TOTAL 1 GiB 27 MiB 736 KiB 1.7 KiB 26 MiB 997 MiB 2.65
Updated by Kotresh Hiremath Ravishankar 3 months ago
Copying Adam's reply into this tracker: (https://tracker.ceph.com/issues/62501#note-10)
----------------
The bug is caused by a difference in behaviour of BlueFS.
This PR is the final trigger: https://github.com/ceph/ceph/pull/51774 (quincy variant)
ORIGINAL:
Before that, BlueFS DID NOT preallocate space for the RocksDB Write-Ahead Log (WAL).
Settings that we had in configuration:
bluestore_rocksdb_options="max_write_buffer_number=4...write_buffer_size=268435456..."
did nothing to preallocate (ceph-bluestore-tool dump-bluefs-log):
0x1000d7: op_file_update file(ino 36 size 0x4b38e1 ... allocated 500000 ... (note size almost as big as allocated)
0x100311: op_dir_link db.wal/000038.log to 36
WITH PR 51774:
Now BlueFS preallocates log as per RocksDB request:
0x1000d7: op_file_update file(ino 36 size 0x325bd9 ... allocated 11a00000 ... (allocated is ~290MB)
0x100312: op_dir_link db.wal/000038.log to 36
REFERENCE IN ORIGINAL/MAIN
We have different settings that we created in the process of optimizing RocksDB performance, starting with PR #47221.
bluestore_rocksdb_options="max_write_buffer_number=64...write_buffer_size=16777216..."
With that, we preallocate just a little:
0x10017d: op_file_update file(ino 31 size 0x31ae8c ... allocated 1200000 ... (allocated is ~18MB)
0x100354: op_dir_link db.wal/000038.log to 31
So the change is between allocating max 4x 256MB WALs
vs allocating max 64x 16MB WALs
==================================================================================
Taking it all into account, I recommend just expanding the device size from 1GB to 2GB.
-----------------
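To tie Adam's numbers back to the osd df outputs earlier in this tracker: on quincy, write_buffer_size=268435456 (256 MiB) leads BlueFS to preallocate 0x11a00000 bytes (~282 MiB) for a single WAL file, which lines up with the ~295 MiB RAW USE / 729 MiB AVAIL seen on the 1 GiB OSD before any data is written. On main, the 16 MiB buffers result in only ~0x1200000 bytes (~18 MiB) of preallocation, matching the 27 MiB RAW USE / 997 MiB AVAIL there.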
Updated by Kotresh Hiremath Ravishankar 3 months ago
- Status changed from Triaged to In Progress
- Pull request ID set to 55435
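For illustration only, a minimal sketch of the kind of change Adam's recommendation implies, doubling the device size the full-OSD tests provision. It assumes the 1 GiB device comes from a "bluestore block size" override, as is typical for these tests; the file and exact values touched by PR 55435 may differ:

# Hypothetical override; in teuthology this would live in the suite's yaml
# ("conf: osd: ..."), shown here as the equivalent ceph.conf fragment.
cat >> ceph.conf <<'EOF'
[osd]
# previously: bluestore block size = 1073741824   (1 GiB)
bluestore block size = 2147483648
EOF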
Updated by Kotresh Hiremath Ravishankar 3 months ago
- Related to Bug #64012: qa: Command failed qa/workunits/fs/full/subvolume_clone.sh added
Updated by Venky Shankar 2 months ago
- Status changed from In Progress to Pending Backport
- Backport changed from reef,quincy to reef,quincy,squid
Updated by Backport Bot 2 months ago
- Copied to Backport #64581: quincy: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added
Updated by Backport Bot 2 months ago
- Copied to Backport #64582: reef: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added
Updated by Backport Bot 2 months ago
- Copied to Backport #64583: squid: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added