Bug #63132
qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning
Description
/a/yuriw-2023-10-06_22:26:38-fs-wip-yuri3-testing-2023-10-06-0948-quincy-distro-default-smithi/7415802
Probably not related to cephfs as such, but let's debug and hand it over to the rados folks if nothing stands out.
Updated by Kotresh Hiremath Ravishankar 7 months ago
The test sets up a cluster with a single 1 GB OSD. It creates a cephfs subvolume and writes an 800 MB file. While writing the 800 MB file, the OSD crashed.
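For orientation, a rough, paraphrased sketch of what the workunit drives (not the verbatim script; the subvolume name and the $CEPH_MNT mount-point variable are illustrative):

# Paraphrased shape of qa/workunits/fs/full/subvolume_snapshot_rm.sh (illustrative only)
ceph fs subvolume create cephfs sub_0
subvol_path=$(ceph fs subvolume getpath cephfs sub_0)

# Fill most of the 1 GB OSD with an 800 MB file (this is the dd visible in the log below).
sudo dd if=/dev/urandom of="$CEPH_MNT$subvol_path/800MB_file-1" \
    status=progress bs=1M count=800 conv=fdatasync || true

# The workunit then waits for the cluster to report OSD_FULL; this is where the
# job stalls when the OSD aborts instead of turning full.
while ! ceph health detail | grep -q OSD_FULL; do sleep 5; done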
The df on the cephfs mount before the write:
2023-10-07T05:10:19.014 INFO:tasks.workunit.client.0.smithi156.stderr:+ echo 'Before write'
2023-10-07T05:10:19.015 INFO:tasks.workunit.client.0.smithi156.stderr:+ df /home/ubuntu/cephtest/mnt.0
2023-10-07T05:10:19.015 INFO:tasks.workunit.client.0.smithi156.stdout:Filesystem 1K-blocks Used Available Use% Mounted on
2023-10-07T05:10:19.016 INFO:tasks.workunit.client.0.smithi156.stdout:ceph-fuse 675840 0 675840 0% /home/ubuntu/cephtest/mnt.0
The OSD df output before the write:
2023-10-07T05:10:19.416 INFO:tasks.workunit.client.0.smithi156.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout: 0 ssd 0.00099 1.00000 1 GiB 310 MiB 7.3 MiB 0 B 302 MiB 714 MiB 30.23 1.00 97 up
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout: TOTAL 1 GiB 310 MiB 7.3 MiB 0 B 302 MiB 714 MiB 30.23
2023-10-07T05:10:19.417 INFO:tasks.workunit.client.0.smithi156.stdout:MIN/MAX VAR: 1.00/1.00 STDDEV: 0
The OSD crash while writing the 800 MB file:
2023-10-07T05:10:19.429 INFO:tasks.workunit.client.0.smithi156.stderr:+ ignore_failure sudo dd if=/dev/urandom of=/home/ubuntu/cephtest/mnt.0/volumes/_nogroup/sub_0/6b940132-4581-44d4-97d9-dfea73d938c2/800MB_file-1 status=progress bs=1M count=800 conv=fdatasync
2023-10-07T05:10:19.429 INFO:tasks.workunit.client.0.smithi156.stderr:+ sudo dd if=/dev/urandom of=/home/ubuntu/cephtest/mnt.0/volumes/_nogroup/sub_0/6b940132-4581-44d4-97d9-dfea73d938c2/800MB_file-1 status=progress bs=1M count=800 conv=fdatasync
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_alloc_write failed to allocate 0x400000 allocated 0x 251000 min_alloc_size 0x1000 available 0x 0
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_write _do_alloc_write failed with (28) No space left on device
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 2, counting from 0)
2023-10-07T05:10:23.407 INFO:tasks.ceph.osd.0.smithi156.stderr:2023-10-07T05:10:23.406+0000 7f689489d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) ENOSPC from bluestore, misconfigured cluster
2023-10-07T05:10:23.411 INFO:tasks.ceph.osd.0.smithi156.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1298-gc656b96b/rpm/el8/BUILD/ceph-17.2.6-1298-gc656b96b/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f689489d700 time 2023-10-07T05:10:23.407401+0000
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1298-gc656b96b/rpm/el8/BUILD/ceph-17.2.6-1298-gc656b96b/src/os/bluestore/BlueStore.cc: 14614: ceph_abort_msg("unexpected error")
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: ceph version 17.2.6-1298-gc656b96b (c656b96b5bee9f08d23a3059b37bb78b669c2895) quincy (stable)
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x55eadad815ad]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x9fb) [0x55eadb480f8b]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2df) [0x55eadb49212f]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x53) [0x55eadb04c5a3]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7c2) [0x55eadb2a3272]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55eadafc41bd]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 7: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55eadb0213e5]
2023-10-07T05:10:23.412 INFO:tasks.ceph.osd.0.smithi156.stderr: 8: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55eadb028a4d]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55eadae84060]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55eadb18c9cd]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x112f) [0x55eadaebed3f]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55eadb6022f5]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55eadb604a14]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 14: /lib64/libpthread.so.0(+0x81ca) [0x7f68bb0cb1ca]
2023-10-07T05:10:23.413 INFO:tasks.ceph.osd.0.smithi156.stderr: 15: clone()
The OSD is not expected to crash; it is supposed to return ENOSPC.
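For context, the OSD_FULL behaviour the test relies on is driven by the cluster full ratios; when they trip, clients should get ENOSPC while the OSD itself stays up. A quick way to inspect those thresholds on a test cluster (standard commands; the defaults quoted in the comments are the usual ones, not values taken from this run):

# Cluster-wide thresholds kept in the OSDMap (defaults are typically 0.95 / 0.90 / 0.85)
ceph osd dump | grep ratio

# Per-OSD last-resort threshold before the OSD refuses writes outright (default ~0.97)
ceph config get osd osd_failsafe_full_ratio

# The health warning the workunit polls for
ceph health detail | grep OSD_FULL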
Updated by Venky Shankar 7 months ago
Ah, so it's the way the test config is set up. We should fix that in the fs qa suite.
Updated by Venky Shankar 7 months ago
- Related to Bug #62501: pacific(?): qa: mgr-osd-full causes OSD aborts due to ENOSPC (fs/full/subvolume_snapshot_rm.sh) added
Updated by Venky Shankar 7 months ago
- Status changed from New to Triaged
- Assignee set to Kotresh Hiremath Ravishankar
Updated by Kotresh Hiremath Ravishankar 6 months ago
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
2023-11-04T07:49:50.239 INFO:tasks.workunit.client.0.smithi062.stderr:+ ceph osd df
2023-11-04T07:49:50.621 INFO:tasks.workunit.client.0.smithi062.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-11-04T07:49:50.622 INFO:tasks.workunit.client.0.smithi062.stdout: 0 ssd 0.00099 1.00000 1 GiB 27 MiB 736 KiB 1 KiB 26 MiB 997 MiB 2.65 1.00 97 up
2023-11-04T07:49:50.622 INFO:tasks.workunit.client.0.smithi062.stdout: TOTAL 1 GiB 27 MiB 736 KiB 1.6 KiB 26 MiB 997 MiB 2.65
On quincy: Available is 729 MiB
2023-10-07T05:07:29.602 INFO:tasks.workunit.client.0.smithi156.stderr:+ ceph osd df
2023-10-07T05:07:29.998 INFO:tasks.workunit.client.0.smithi156.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-10-07T05:07:29.998 INFO:tasks.workunit.client.0.smithi156.stdout: 0 ssd 0.00099 1.00000 1 GiB 295 MiB 720 KiB 0 B 294 MiB 729 MiB 28.81 1.00 97 up
2023-10-07T05:07:29.999 INFO:tasks.workunit.client.0.smithi156.stdout: TOTAL 1 GiB 295 MiB 720 KiB 0 B 294 MiB 729 MiB 28.81
For the same test suite configuration of a 1 GB OSD, how can it be different between main and release branches?
Because of this difference, the test fails in release branches.
Venky/Patrick,
Any idea?
Updated by Venky Shankar 6 months ago
Kotresh Hiremath Ravishankar wrote:
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
[...]
On quincy: Available is 729 MiB
[...]
Do you see any config difference? The main branch could be using a different allocator. I would start from that...
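One way to follow that up (a sketch; the commands are standard, the option names are just the likely suspects to compare between branches):

# On the test node, dump everything the running OSD has changed from built-in defaults
sudo ceph daemon osd.0 config diff

# Or pull just the allocator/RocksDB knobs from the cluster's view of the daemon
ceph config show osd.0 | grep -E 'bluestore_allocator|bluefs_allocator|bluestore_rocksdb_options'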
Updated by Rishabh Dave 6 months ago
- Related to Bug #63539: fs/full/subvolume_clone.sh: Health check failed: 1 full osd(s) (OSD_FULL) added
Updated by Kotresh Hiremath Ravishankar 6 months ago
Venky Shankar wrote:
Kotresh Hiremath Ravishankar wrote:
It is strange that the tests are passing in the main branch.
On the main branch - Available is 997 MiB
[...]
On quincy: Available is 729 MiB
[...]
Do you see any config difference? The main branch could be using a different allocator. I would start from that...
OK, I didn't go into the OSD configuration. I will check that out.
Updated by Radoslaw Zarzynski 6 months ago
A speculation: in main we compress RocksDB (bluestore_rocksdb_options; see https://github.com/ceph/ceph/pull/53343).
The patch isn't in reef yet.
Ruling it out as the cause would make further thinking easier.
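A quick check to confirm or rule that out on a live run (assuming the compression setting, if enabled, shows up as a compression=... token in the options string):

# On main (with PR 53343) this should print a compression=... token; on quincy/reef it should print nothing
ceph config get osd bluestore_rocksdb_options | tr ',' '\n' | grep -i compression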
Updated by Kotresh Hiremath Ravishankar 6 months ago
Copying the slack communication here
-------------------------------------
Kotresh H R
@Radoslaw Zarzynski
Thanks for the input. It could be the reason. Is the PR https://github.com/ceph/ceph/pull/53343 planned to be backported to release branches? If not, I might consider modifying the test cases.
5:39
I also need to validate that it is the reason. I will do that.
5:41
Well, I see in the commit message that it's said it would be the default from Squid, so no backport to reef/quincy is expected?
rzarzynski
5:43 PM
it's going to be in squid if it passes extra testing for regressions
Updated by Venky Shankar 5 months ago
I suggested running a test with rocksdb compression disabled to check whether the space utilization shoots up and trips the test.
Updated by Kotresh Hiremath Ravishankar 5 months ago
I reverted the rocksdb compression PR and ran the tests.
I observed that the space utilization on main is unchanged with and without compression before writing any data.
2023-12-01T07:42:49.720 INFO:tasks.workunit.client.0.smithi069.stdout:Before write
2023-12-01T07:42:49.720 INFO:tasks.workunit.client.0.smithi069.stderr:+ df -h
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:Filesystem Size Used Avail Use% Mounted on
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 3.2G 1.4M 3.2G 1% /run
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:/dev/sda1 916G 11G 860G 2% /
2023-12-01T07:42:49.740 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 16G 32K 16G 1% /dev/shm
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 5.0M 0 5.0M 0% /run/lock
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:tmpfs 3.2G 0 3.2G 0% /run/user/1000
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:/dev/mapper/vg_nvme-lv_5 15G 72M 15G 1% /var/lib/ceph
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stdout:ceph-fuse 944M 0 944M 0% /home/ubuntu/cephtest/mnt.0
2023-12-01T07:42:49.741 INFO:tasks.workunit.client.0.smithi069.stderr:+ ceph osd df
...
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout:ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout: 0 ssd 0.00099 1.00000 1 GiB 27 MiB 736 KiB 1 KiB 26 MiB 997 MiB 2.65 1.00 97 up
2023-12-01T07:42:50.095 INFO:tasks.workunit.client.0.smithi069.stdout: TOTAL 1 GiB 27 MiB 736 KiB 1.7 KiB 26 MiB 997 MiB 2.65
Updated by Kotresh Hiremath Ravishankar 3 months ago
Copying Adam's reply into this tracker: (https://tracker.ceph.com/issues/62501#note-10)
----------------
The bug is caused by a difference in behaviour of BlueFS.
This PR is the final trigger: https://github.com/ceph/ceph/pull/51774 (quincy variant)
ORIGINAL:
Before that, BlueFS DID NOT preallocate space for the RocksDB Write-Ahead Log (WAL).
Settings that we had in configuration:
bluestore_rocksdb_options="max_write_buffer_number=4...write_buffer_size=268435456..."
did nothing to preallocate (ceph-bluestore-tool dump-bluefs-log):
0x1000d7: op_file_update file(ino 36 size 0x4b38e1 ... allocated 500000 ... (note size almost as big as allocated)
0x100311: op_dir_link db.wal/000038.log to 36
WITH PR 51774:
Now BlueFS preallocates log as per RocksDB request:
0x1000d7: op_file_update file(ino 36 size 0x325bd9 ... allocated 11a00000 ... (allocated is ~290MB)
0x100312: op_dir_link db.wal/000038.log to 36
REFERENCE IN ORIGINAL/MAIN
We have different settings that we created in the process of optimizing RocksDB performance, starting with PR #47221.
bluestore_rocksdb_options="max_write_buffer_number=64...write_buffer_size=16777216..."
With that, we preallocate just a little:
0x10017d: op_file_update file(ino 31 size 0x31ae8c ... allocated 1200000 ... (allocated is ~18MB)
0x100354: op_dir_link db.wal/000038.log to 31
So the change is between allocating max 4x 256MB WALs
vs allocating max 64x 16MB WALs
==================================================================================
Taking it all into account, I recommend just expanding the device size from 1GB to 2GB.
-----------------
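To tie Adam's numbers back to the osd df outputs earlier in this tracker: on quincy, write_buffer_size=268435456 (256 MiB) leads BlueFS to preallocate 0x11a00000 bytes (~282 MiB) for a single WAL file, which lines up with the ~295 MiB RAW USE / 729 MiB AVAIL seen on the 1 GiB OSD before any data is written. On main, the 16 MiB buffers result in only ~0x1200000 bytes (~18 MiB) of preallocation, matching the 27 MiB RAW USE / 997 MiB AVAIL there.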
Updated by Kotresh Hiremath Ravishankar 3 months ago
- Status changed from Triaged to In Progress
- Pull request ID set to 55435
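For illustration only, a minimal sketch of the kind of change Adam's recommendation implies, doubling the device size the full-OSD tests provision. It assumes the 1 GiB device comes from a "bluestore block size" override, as is typical for these tests; the file and exact values touched by PR 55435 may differ:

# Hypothetical override; in teuthology this would live in the suite's yaml
# ("conf: osd: ..."), shown here as the equivalent ceph.conf fragment.
cat >> ceph.conf <<'EOF'
[osd]
# previously: bluestore block size = 1073741824   (1 GiB)
bluestore block size = 2147483648
EOF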
Updated by Kotresh Hiremath Ravishankar 3 months ago
- Related to Bug #64012: qa: Command failed qa/workunits/fs/full/subvolume_clone.sh added
Updated by Venky Shankar 2 months ago
- Status changed from In Progress to Pending Backport
- Backport changed from reef,quincy to reef,quincy,squid
Updated by Backport Bot 2 months ago
- Copied to Backport #64581: quincy: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added
Updated by Backport Bot 2 months ago
- Copied to Backport #64582: reef: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added
Updated by Backport Bot 2 months ago
- Copied to Backport #64583: squid: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added