Bug #50675

closed

[pwl ssd] cache larger than 4G will corrupt itself

Added by Ilya Dryomov almost 3 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Unlike in rwl mode where head and tail pointers are log entry indexes and the number of log entries is limited to a million, in ssd mode head and tail pointers are log entry offsets on media. To accommodate ssd mode, m_first_valid_entry and m_first_free_entry were changed to uint64_t, but GenericLogEntry::log_entry_index remains uint32_t -- and despite its name, it is used to store media offsets in ssd mode. Some local variables in ssd/WriteLog.cc (e.g. initial_first_valid_entry and first_valid_entry in retire_entries()) are also uint32_t, cut and pasted from rwl/WriteLog.cc.

The end result is data corruption, once the log reaches the 4G boundary.
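
The truncation described above can be illustrated with a minimal sketch. This is not the Ceph code: `NarrowLogEntry` and `round_trip_offset` are hypothetical names, and the only thing taken from the report is that a 64-bit media offset ends up in a `uint32_t` field.

```cpp
#include <cstdint>

// Hypothetical stand-in for a log entry whose index field is 32 bits wide,
// mirroring the width of GenericLogEntry::log_entry_index in the report.
struct NarrowLogEntry {
  uint32_t log_entry_index;
};

// Stores a 64-bit media offset in the 32-bit field and reads it back.
// The conversion keeps only the low 32 bits of the offset.
constexpr uint64_t round_trip_offset(uint64_t media_offset) {
  NarrowLogEntry entry{static_cast<uint32_t>(media_offset)};
  return entry.log_entry_index;
}

constexpr uint64_t kFourGiB = 1ULL << 32;

// Offsets below 4 GiB survive the round trip unchanged.
static_assert(round_trip_offset(kFourGiB - 4096) == kFourGiB - 4096,
              "offsets below 4 GiB fit in 32 bits");

// An offset just past 4 GiB wraps around and aliases an offset near the
// start of the log, which is how entries could end up pointing at the
// wrong location on media.
static_assert(round_trip_offset(kFourGiB + 4096) == 4096,
              "offsets past 4 GiB lose their upper bits");
```

Both assertions hold at compile time, so the wraparound is silent: nothing fails until the log actually grows past the 4G boundary.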

ssd/* and the common bits used by ssd mode need to be audited and new test cases added.


Related issues 3 (0 open, 3 closed)

Related to rbd - Bug #50670: [pwl ssd] head / tail pointer corruption (Resolved, CONGMIN YIN)
Has duplicate rbd - Bug #52081: rbd persistent SSD cache crash at retire_entries (Duplicate, Ilya Dryomov)
Copied to rbd - Backport #53264: pacific: [pwl ssd] cache larger than 4G will corrupt itself (Resolved, Deepika Upadhyay)
Actions #1

Updated by Ilya Dryomov almost 3 years ago

  • Related to Bug #50670: [pwl ssd] head / tail pointer corruption added
Actions #2

Updated by Ilya Dryomov almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Ilya Dryomov
Actions #4

Updated by Kefu Chai almost 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 42046
Actions #5

Updated by Ilya Dryomov almost 3 years ago

  • Assignee changed from Ilya Dryomov to CONGMIN YIN
Actions #6

Updated by Ilya Dryomov over 2 years ago

  • Has duplicate Bug #52081: rbd persistent SSD cache crash at retire_entries added
Actions #7

Updated by Deepika Upadhyay over 2 years ago

  • Backport set to pacific
Actions #8

Updated by Deepika Upadhyay over 2 years ago

@CONGMIN YIN are the new test cases added for this use case or should we be adding them?

Actions #9

Updated by CONGMIN YIN over 2 years ago

Deepika Upadhyay wrote:

@CONGMIN YIN are the new test cases added for this use case or should we be adding them?

Sorry, I didn't notice the message. Ilya found this problem through observation. After the fix, writing test cases for these functions is clearly meaningless, because the 32-bit variables have been changed to 64 bits. Can we change the default cache size in the teuthology test case from 1GB to a value greater than 4GB, such as 8GB, so that similar problems may be found by the teuthology tests in the future? If you agree, I will add a commit to modify it.

Actions #10

Updated by Deepika Upadhyay over 2 years ago

We could have a separate yaml fragment for 8GB testing, maybe. What cache size is actually desired?

rbd/persistent-writeback-cache/4-pool/cache-1GB.yaml and
rbd/persistent-writeback-cache/4-pool/cache-8GB.yaml

Actions #11

Updated by CONGMIN YIN over 2 years ago

The effective_pool_size is 70% of the configured size, so 8GB is enough.
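
As a rough check of that arithmetic, here is a small sketch. It assumes binary units (1GB = 2^30 bytes) and uses only the 70% ratio stated in the comment; the helper name `effective_size` is hypothetical and is not the Ceph implementation.

```cpp
#include <cstdint>

constexpr uint64_t kGiB = 1ULL << 30;
constexpr uint64_t kFourGiB = 4 * kGiB;

// Assumption: the usable log space is 70% of the configured cache size.
// Integer arithmetic is used, so the result is rounded down slightly.
constexpr uint64_t effective_size(uint64_t configured_bytes) {
  return configured_bytes / 10 * 7;
}

// A 1GB cache never gets near the 4G boundary, so a test with that size
// cannot exercise offsets that overflow 32 bits.
static_assert(effective_size(1 * kGiB) < kFourGiB,
              "1GB cache stays below the 4G boundary");

// An 8GB cache leaves roughly 5.6GB of effective space, which is past
// the 4G boundary.
static_assert(effective_size(8 * kGiB) > kFourGiB,
              "8GB cache crosses the 4G boundary");
```

Under these assumptions a test configured with 8GB can reach offsets beyond 4G, provided enough data is written to fill the log that far.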

Actions #13

Updated by Deepika Upadhyay over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #14

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53264: pacific: [pwl ssd] cache larger than 4G will corrupt itself added
Actions #15

Updated by Ilya Dryomov about 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
