Feature #23242 (closed)
ceph-objectstore-tool command to trim the pg log
Description
ceph-objectstore-tool command to trim the pg log
The goal of this feature is a command that trims the pg log without consuming so much memory and CPU that the OSD node becomes unstable.
Red Hat bug - https://bugzilla.redhat.com/show_bug.cgi?id=1552094
Updated by David Zafman about 6 years ago
- Status changed from New to In Progress
- Assignee set to David Zafman
When testing the log trimming code on master, the OSD crashes like this:
2018-03-06 16:19:07.413 7f0a38ee4700 20 filestore(/home/dzafman/ceph/build/dev/osd1) sync_entry(4039): woke after 5.001077
2018-03-06 16:19:07.413 7f0a38ee4700 10 journal commit_start max_applied_seq 98, open_ops 0
2018-03-06 16:19:07.413 7f0a38ee4700 10 journal commit_start blocked, all open_ops have completed
2018-03-06 16:19:07.413 7f0a38ee4700 10 journal commit_start committing 98, still blocked
2018-03-06 16:19:07.413 7f0a38ee4700 10 journal commit_start
2018-03-06 16:19:07.413 7f0a38ee4700 15 filestore(/home/dzafman/ceph/build/dev/osd1) sync_entry(4070): committing 98
2018-03-06 16:19:07.413 7f0a38ee4700 10 journal commit_started committing 98, unblocking
2018-03-06 16:19:07.413 7f0a38ee4700 20 filestore dbobjectmap: seq is 2
2018-03-06 16:19:07.413 7f0a38ee4700 15 genericfilestorebackend(/home/dzafman/ceph/build/dev/osd1) syncfs: doing a full sync (syncfs(2) if possible)
2018-03-06 16:19:07.417 7f0a31ed6700 10 monclient: tick
2018-03-06 16:19:07.417 7f0a31ed6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2018-03-06 16:18:37.420105)
2018-03-06 16:19:07.417 7f0a31ed6700 10 log_client log_queue is 1 last_log 2 sent 0 num 1 unsent 2 sending 2
2018-03-06 16:19:07.417 7f0a31ed6700 -1 /home/dzafman/ceph/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f0a31ed6700 time 2018-03-06 16:19:07.420182
/home/dzafman/ceph/src/common/LogClient.cc: 292: FAILED assert(num_unsent <= log_queue.size())
ceph version 13.0.1-2394-g485c784 (485c784c8f3f91e64c58f52dca852e43e99cb48e) mimic (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f0a507f61d2]
2: (LogClient::_get_mon_log_message()+0xdd5) [0x7f0a507b4965]
3: (LogClient::get_mon_log_message(bool)+0x43) [0x7f0a507b4a43]
4: (MonClient::send_log(bool)+0x1c) [0x7f0a5084859c]
5: (MonClient::tick()+0x612) [0x7f0a50852882]
6: (Context::complete(int)+0x9) [0x563605744579]
7: (SafeTimer::timer_thread()+0x20f) [0x7f0a507f2a3f]
8: (SafeTimerThread::entry()+0xd) [0x7f0a507f407d]
9: (()+0x76ba) [0x7f0a4f3106ba]
10: (clone()+0x6d) [0x7f0a4e59a82d]
Updated by David Zafman about 6 years ago
From PG::log_weirdness():
2018-03-06 16:18:57.413 7f0a593b9dc0 -1 log_channel(cluster) log [ERR] : 1.0 log bound mismatch, info (tail,head] (12'22,12'41] actual [12'22,12'41]
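To make the message above easier to read, here is a minimal Python sketch that parses it and checks the bounds. It assumes the interval notation in the message is meant literally: the info range (tail,head] excludes the tail, so the first actual entry should be strictly newer than the tail. The names Eversion, parse_bounds and bounds_consistent are hypothetical helpers for illustration, not Ceph code, and the real check in PG::log_weirdness() may differ.

```python
import re
from typing import NamedTuple


class Eversion(NamedTuple):
    """Toy stand-in for an epoch'version pair such as 12'22 (hypothetical helper)."""
    epoch: int
    version: int


def parse_eversion(text: str) -> Eversion:
    # "12'22" -> Eversion(epoch=12, version=22)
    epoch, version = text.split("'")
    return Eversion(int(epoch), int(version))


def parse_bounds(line: str):
    """Extract (tail, head, first, last) from a 'log bound mismatch' message."""
    ev = r"(\d+'\d+)"
    m = re.search(rf"\({ev},{ev}\] actual \[{ev},{ev}\]", line)
    if m is None:
        raise ValueError("not a log bound mismatch line")
    return tuple(parse_eversion(g) for g in m.groups())


def bounds_consistent(tail, head, first, last) -> bool:
    # Assumption: (tail,head] means the tail itself is NOT a log entry,
    # so the first entry must be strictly after tail and the last must equal head.
    return tail < first and last == head


LINE = "1.0 log bound mismatch, info (tail,head] (12'22,12'41] actual [12'22,12'41]"
tail, head, first, last = parse_bounds(LINE)
consistent = bounds_consistent(tail, head, first, last)
```

Under that assumption, the reported line is inconsistent because an entry still exists at 12'22, which is the tail itself.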
Updated by David Zafman about 6 years ago
The assert(num_unsent <= log_queue.size()) probably isn't directly related to this feature. After the log is trimmed using ceph-objectstore-tool, the log_weirdness() function uses clog too early in the OSD start-up process. We need to fix the "log bound mismatch" for this feature. A separate tracker should be filed for the crash, since early use of clog shouldn't crash the OSD.
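One way to read the "log bound mismatch" requirement is that a trim must drop the entry at the new tail as well as everything older. The following toy model illustrates that reading only. It treats versions as (epoch, version) tuples in an in-memory list, ignores the on-disk representation entirely, and trim_log is a hypothetical name, not the function in ceph-objectstore-tool.

```python
def trim_log(entries, new_tail):
    """Return (tail, kept) for a sorted list of (epoch, version) tuples.

    Assumption: the log covers the half-open range (tail, head], so every
    entry at or before new_tail is dropped, including new_tail itself.
    """
    kept = [v for v in entries if v > new_tail]
    return new_tail, kept


# Entries 12'20 .. 12'41, trimmed so that the tail becomes 12'22.
entries = [(12, v) for v in range(20, 42)]
tail, kept = trim_log(entries, (12, 22))

# After trimming, the oldest surviving entry is strictly newer than the tail.
oldest_is_after_tail = kept[0] > tail
```

Keeping entries with `v >= new_tail` instead would leave 12'22 in the log and reproduce the bounds shown in the error message.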
Updated by David Zafman about 6 years ago
- Related to Bug #23269: Early use of clog in OSD startup crashes OSD added
Updated by Vikhyat Umrao about 6 years ago
- Source set to Support
- Backport set to luminous,jewel
Updated by Vikhyat Umrao about 6 years ago
Updated by David Zafman about 6 years ago
- Copied to Backport #23275: luminous: ceph-objectstore-tool command to trim the pg log added
Updated by David Zafman about 6 years ago
- Status changed from In Progress to Pending Backport
Updated by Nathan Cutler about 6 years ago
- Copied to Backport #23307: jewel: ceph-objectstore-tool command to trim the pg log added
Updated by Nathan Cutler about 6 years ago
- Status changed from Pending Backport to Resolved