Bug #62501 (closed)

pacific(?): qa: mgr-osd-full causes OSD aborts due to ENOSPC (fs/full/subvolume_snapshot_rm.sh)

Added by Patrick Donnelly 8 months ago. Updated 3 months ago.

Status: Duplicate
Priority: Normal
Category: Testing
Target version:
% Done: 0%
Source:
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Probably a misconfiguration allowing the OSDs to actually run out of space during the test, instead of the OSDs refusing further writes due to the pool full flag.
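
For context (an editorial aside, not part of the original report): a healthy cluster is expected to set the full flag and refuse client writes well before BlueStore ever sees a literal ENOSPC from its device. A minimal sketch of those guard rails, using upstream defaults rather than whatever the mgr-osd-full task actually configures:

    # Hedged sketch, not the qa-suite configuration: the ratios that normally
    # stop client I/O before the raw device is exhausted.
    ceph osd set-nearfull-ratio 0.85                   # raises the OSD_NEARFULL warning
    ceph osd set-full-ratio 0.95                       # sets the full flag; further writes are refused
    ceph config set osd osd_failsafe_full_ratio 0.97   # OSD-side last-resort cutoff
    ceph osd df                                        # per-OSD utilisation vs. these ratios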

2023-08-17T15:30:28.064 INFO:tasks.ceph.osd.0.smithi175.stderr:2023-08-17T15:30:28.054+0000 7f5d0503d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_alloc_write failed to allocate 0x400000 allocated 0x 2a000 min_alloc_size 0x1000 available 0x 0
2023-08-17T15:30:28.065 INFO:tasks.ceph.osd.0.smithi175.stderr:2023-08-17T15:30:28.054+0000 7f5d0503d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _do_write _do_alloc_write failed with (28) No space left on device
2023-08-17T15:30:28.065 INFO:tasks.ceph.osd.0.smithi175.stderr:2023-08-17T15:30:28.054+0000 7f5d0503d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 2, counting from 0)
2023-08-17T15:30:28.065 INFO:tasks.ceph.osd.0.smithi175.stderr:2023-08-17T15:30:28.054+0000 7f5d0503d700 -1 bluestore(/var/lib/ceph/osd/ceph-0) ENOSPC from bluestore, misconfigured cluster
2023-08-17T15:30:28.076 INFO:tasks.ceph.osd.0.smithi175.stderr:/build/ceph-16.2.13-531-ge78f2dc9/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f5d0503d700 time 2023-08-17T15:30:28.061403+0000
2023-08-17T15:30:28.077 INFO:tasks.ceph.osd.0.smithi175.stderr:/build/ceph-16.2.13-531-ge78f2dc9/src/os/bluestore/BlueStore.cc: 13506: ceph_abort_msg("unexpected error")
2023-08-17T15:30:28.077 INFO:tasks.ceph.osd.0.smithi175.stderr: ceph version 16.2.13-531-ge78f2dc9 (e78f2dc97637da188e6292122efedae3d18948ca) pacific (stable)
2023-08-17T15:30:28.078 INFO:tasks.ceph.osd.0.smithi175.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x55e13f5500cb]
2023-08-17T15:30:28.078 INFO:tasks.ceph.osd.0.smithi175.stderr: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x95a) [0x55e13fc270aa]
2023-08-17T15:30:28.078 INFO:tasks.ceph.osd.0.smithi175.stderr: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2fd) [0x55e13fc297dd]
2023-08-17T15:30:28.078 INFO:tasks.ceph.osd.0.smithi175.stderr: 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x55e13f805ec8]
2023-08-17T15:30:28.079 INFO:tasks.ceph.osd.0.smithi175.stderr: 5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7e9) [0x55e13fa28f39]
2023-08-17T15:30:28.079 INFO:tasks.ceph.osd.0.smithi175.stderr: 6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xecf) [0x55e13f75e2bf]
2023-08-17T15:30:28.079 INFO:tasks.ceph.osd.0.smithi175.stderr: 7: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf95) [0x55e13f7c58b5]
2023-08-17T15:30:28.079 INFO:tasks.ceph.osd.0.smithi175.stderr: 8: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x36b9) [0x55e13f7c9d39]
2023-08-17T15:30:28.080 INFO:tasks.ceph.osd.0.smithi175.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe61) [0x55e13f7d4d81]
2023-08-17T15:30:28.080 INFO:tasks.ceph.osd.0.smithi175.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55e13f624b50]
2023-08-17T15:30:28.080 INFO:tasks.ceph.osd.0.smithi175.stderr: 11: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x66) [0x55e13f8dca06]
2023-08-17T15:30:28.080 INFO:tasks.ceph.osd.0.smithi175.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8aa) [0x55e13f651e7a]
2023-08-17T15:30:28.081 INFO:tasks.ceph.osd.0.smithi175.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x55e13fd8c5d3]
2023-08-17T15:30:28.081 INFO:tasks.ceph.osd.0.smithi175.stderr: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55e13fd8f614]
2023-08-17T15:30:28.081 INFO:tasks.ceph.osd.0.smithi175.stderr: 15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5d26a07609]
2023-08-17T15:30:28.081 INFO:tasks.ceph.osd.0.smithi175.stderr: 16: clone()
...
2023-08-17T17:15:36.189 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-08-17T17:15:36.190 INFO:tasks.workunit:Stopping ['fs/full/subvolume_snapshot_rm.sh'] on client.0...
2023-08-17T17:15:36.190 DEBUG:teuthology.orchestra.run.smithi175:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-08-17T17:15:36.428 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_e78f2dc97637da188e6292122efedae3d18948ca/qa/tasks/workunit.py", line 129, in task
    p.spawn(_run_tests, ctx, refspec, role, tests,
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_e78f2dc97637da188e6292122efedae3d18948ca/qa/tasks/workunit.py", line 423, in _run_tests
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/orchestra/remote.py", line 522, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/full/subvolume_snapshot_rm.sh) on smithi175 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e78f2dc97637da188e6292122efedae3d18948ca TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/full/subvolume_snapshot_rm.sh'

/teuthology/yuriw-2023-08-17_14:36:27-fs-wip-yuri7-testing-2023-08-16-1309-pacific-distro-default-smithi/7371810/teuthology.log


Related issues (2: 1 open, 1 closed)

Related to CephFS - Bug #63132: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning (Pending Backport, Kotresh Hiremath Ravishankar)

Related to CephFS - Bug #63539: fs/full/subvolume_clone.sh: Health check failed: 1 full osd(s) (OSD_FULL) (Duplicate)

#1

Updated by Patrick Donnelly 8 months ago

  • Description updated (diff)
#2

Updated by Patrick Donnelly 8 months ago

  • Subject changed from "pacific(?): qa: mgr-osd-full causes OSD aborts due to ENOSPC" to "pacific(?): qa: mgr-osd-full causes OSD aborts due to ENOSPC (fs/full/subvolume_snapshot_rm.sh)"
#3

Updated by Venky Shankar 8 months ago

  • Category set to Testing
  • Assignee set to Dhairya Parmar
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
  • Labels (FS) qa, qa-failure added

Dhairya, please take this one.

#4

Updated by Venky Shankar 7 months ago

  • Related to Bug #63132: qa: subvolume_snapshot_rm.sh stalls when waiting for OSD_FULL warning added
#6

Updated by Rishabh Dave 5 months ago

  • Related to Bug #63539: fs/full/subvolume_clone.sh: Health check failed: 1 full osd(s) (OSD_FULL) added
#10

Updated by Adam Kupczyk 3 months ago

The bug is caused by a difference in BlueFS behaviour.
This PR is the final trigger: https://github.com/ceph/ceph/pull/51774 (quincy variant).

ORIGINAL:
Before that, BlueFS did NOT preallocate space for the RocksDB write-ahead log (WAL).
The settings we had in the configuration:
bluestore_rocksdb_options="max_write_buffer_number=4...write_buffer_size=268435456..."
did nothing to trigger preallocation (ceph-bluestore-tool dump-bluefs-log):
0x1000d7: op_file_update file(ino 36 size 0x4b38e1 ... allocated 500000 ... (note size almost as big as allocated)
0x100311: op_dir_link db.wal/000038.log to 36

WITH PR 51774:
Now BlueFS preallocates the log according to the RocksDB request:
0x1000d7: op_file_update file(ino 36 size 0x325bd9 ... allocated 11a00000 ... (allocated is ~290MB)
0x100312: op_dir_link db.wal/000038.log to 36

REFERENCE IN ORIGINAL/MAIN
On main we have different settings, created in the process of optimizing RocksDB performance, starting with PR #47221.
bluestore_rocksdb_options="max_write_buffer_number=64...write_buffer_size=16777216..."
With that, we preallocate just a little:
0x10017d: op_file_update file(ino 31 size 0x31ae8c ... allocated 1200000 ... (allocated is ~18MB)
0x100354: op_dir_link db.wal/000038.log to 31

So the change is between allocating at most 4x 256MB WALs versus at most 64x 16MB WALs.
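
(Editorial aside.) The excerpts above come from ceph-bluestore-tool's BlueFS log dump; converting the quoted hex "allocated" figures makes the annotations concrete, keeping in mind the qa job's OSD device is only 1GB:

    # Quick shell check of the sizes quoted from the bluefs log dumps above.
    printf 'before PR 51774:         %d MiB\n' $(( 0x500000   / 1048576 ))   # 5 MiB, size nearly equals allocated
    printf 'with PR 51774 (4x256MB): %d MiB\n' $(( 0x11a00000 / 1048576 ))   # 282 MiB preallocated, the "~290MB"
    printf 'main (64x16MB buffers):  %d MiB\n' $(( 0x1200000  / 1048576 ))   # 18 MiB preallocated, the "~18MB"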

==================================================================================
Taking all of this into account, I recommend simply expanding the device size from 1GB to 2GB.
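
A hedged illustration of that recommendation (assumption: the fs/full jobs simulate small OSDs with file-backed BlueStore block devices whose size is governed by bluestore_block_size; the suite's actual yaml fragment is not quoted in this tracker):

    # Hypothetical sketch only; the real change would live in the qa suite's
    # ceph.conf overrides and must be in place before the OSDs are created.
    ceph config set osd bluestore_block_size 2147483648   # 2 GiB instead of the current 1 GiB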

#11

Updated by Kotresh Hiremath Ravishankar 3 months ago

  • Status changed from New to Duplicate

Duplicate of https://tracker.ceph.com/issues/63132
Ideally, tracker #63132 should have been closed as a duplicate of this tracker, since this one is older.
But the PR was sent referring to tracker #63132, so this one is being marked as the duplicate.
