Bug #50675
[pwl ssd] cache larger than 4G will corrupt itself (Closed)
Description
Unlike in rwl mode, where head and tail pointers are log entry indexes and the number of log entries is limited to a million, in ssd mode head and tail pointers are log entry offsets on media. To accommodate ssd mode, m_first_valid_entry and m_first_free_entry were changed to uint64_t, but GenericLogEntry::log_entry_index remains uint32_t -- and despite its name, it is used to store media offsets in ssd mode. Some local variables in ssd/WriteLog.cc (e.g. initial_first_valid_entry and first_valid_entry in retire_entries()), cut-and-pasted from rwl/WriteLog.cc, are also uint32_t.
The end result is data corruption, once the log reaches the 4G boundary.
ssd/* and the common bits used by ssd mode need to be audited and new test cases added.
Updated by Ilya Dryomov almost 3 years ago
- Related to Bug #50670: [pwl ssd] head / tail pointer corruption added
Updated by Ilya Dryomov almost 3 years ago
- Status changed from New to In Progress
- Assignee set to Ilya Dryomov
Updated by Kefu Chai almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 42046
Updated by Ilya Dryomov almost 3 years ago
- Assignee changed from Ilya Dryomov to CONGMIN YIN
Updated by Ilya Dryomov over 2 years ago
- Has duplicate Bug #52081: rbd persistent SSD cache crash at retire_entries added
Updated by Deepika Upadhyay over 2 years ago
@CONGMIN YIN are the new test cases added for this use case or should we be adding them?
Updated by CONGMIN YIN over 2 years ago
Deepika Upadhyay wrote:
@CONGMIN YIN are the new test cases added for this use case or should we be adding them?
Sorry, I didn't notice the message. Ilya found this problem through code inspection. After the fix, writing dedicated unit tests for these functions would add little value, since the change was simply widening 32-bit fields to 64 bits. Instead, could we change the default cache size in the teuthology test case from 1GB to a value greater than 4GB, such as 8GB? That way similar problems may be caught by the teuthology tests in the future. If you agree, I will add a commit for it.
Updated by Deepika Upadhyay over 2 years ago
We can add a separate yaml fragment for 8GB testing, e.g.:
rbd/persistent-writeback-cache/4-pool/cache-1GB.yaml and
rbd/persistent-writeback-cache/4-pool/cache-8GB.yaml
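The proposed 8GB fragment might look like the sketch below. This is an illustration only, assuming the rbd_persistent_cache_size / rbd_persistent_cache_mode client config options and the usual teuthology overrides layout; the actual fragment merged into qa/ may differ.

```yaml
# Hypothetical qa/suites/rbd/persistent-writeback-cache/4-pool/cache-8GB.yaml
# Sized above the 4 GiB boundary so 32-bit offset truncation bugs surface.
overrides:
  ceph:
    conf:
      client:
        rbd_persistent_cache_mode: ssd
        rbd_persistent_cache_size: 8589934592   # 8 GiB
```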
Updated by CONGMIN YIN over 2 years ago
The effective_pool_size is 70% of the configured size, so 8GB is enough.
Updated by Deepika Upadhyay over 2 years ago
pacific backport: https://github.com/ceph/ceph/pull/43918
Updated by Deepika Upadhyay over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #53264: pacific: [pwl ssd] cache larger than 4G will corrupt itself added
Updated by Ilya Dryomov about 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".