Bug #50675: [pwl ssd] cache larger than 4G will corrupt itself - rbd - Ceph

closed

Added by Ilya Dryomov almost 3 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: CONGMIN YIN
Backport: pacific
Severity: 3 - minor
Pull request ID: 42046

Description

Unlike in rwl mode where head and tail pointers are log entry indexes and the number of log entries is limited to a million, in ssd mode head and tail pointers are log entry offsets on media. To accommodate ssd mode, m_first_valid_entry and m_first_free_entry were changed to uint64_t, but GenericLogEntry::log_entry_index remains uint32_t -- and despite its name, it is used to store media offsets in ssd mode. Some local variables in ssd/WriteLog.cc (e.g. initial_first_valid_entry and first_valid_entry in retire_entries()) are also uint32_t, cut and pasted from rwl/WriteLog.cc.

The end result is data corruption once the log reaches the 4G boundary: any media offset at or above 2^32 bytes is truncated when it is stored in a 32-bit field.
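
The truncation described above can be illustrated with a minimal sketch. The structs and helper functions here (NarrowEntry, WideEntry, store_narrow, store_wide, demo_truncation) are hypothetical stand-ins and not the actual Ceph definitions; only the field name log_entry_index and the 32-bit versus 64-bit widths come from the report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins, not the real Ceph structs: one entry keeps the
// 32-bit field described in the report, the other widens it to 64 bits.
struct NarrowEntry { uint32_t log_entry_index; };
struct WideEntry   { uint64_t log_entry_index; };

// Storing a 64-bit media offset in a 32-bit field keeps only the low 32 bits.
inline NarrowEntry store_narrow(uint64_t media_offset) {
  return NarrowEntry{static_cast<uint32_t>(media_offset)};
}

// With a 64-bit field the offset is stored unchanged.
inline WideEntry store_wide(uint64_t media_offset) {
  return WideEntry{media_offset};
}

// Small self-check showing where the narrow field starts to lose information.
inline void demo_truncation() {
  const uint64_t four_g = 1ULL << 32;
  // Below 4G the offset survives intact.
  assert(store_narrow(4096).log_entry_index == 4096u);
  // Past 4G the stored value aliases a low offset.
  assert(store_narrow(four_g + 4096).log_entry_index == 4096u);
  // The widened field keeps the full offset.
  assert(store_wide(four_g + 4096).log_entry_index == four_g + 4096);
}
```

An offset just past the boundary comes back looking like a valid offset near the start of the log, which is consistent with the silent corruption reported here rather than an immediate failure.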

ssd/* and the common bits used by ssd mode need to be audited and new test cases added.

Related issues 3 (0 open — 3 closed)

Updated by Ilya Dryomov almost 3 years ago

Related to Bug #50670: [pwl ssd] head / tail pointer corruption added

Updated by Ilya Dryomov almost 3 years ago

Status changed from New to In Progress
Assignee set to Ilya Dryomov

Updated by CONGMIN YIN almost 3 years ago

https://github.com/ceph/ceph/pull/41968/commits/cc3db42bdb63814328a14c4f8887df7929f40b95

Updated by Kefu Chai almost 3 years ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 42046

Updated by Ilya Dryomov almost 3 years ago

Assignee changed from Ilya Dryomov to CONGMIN YIN

Updated by Ilya Dryomov over 2 years ago

Has duplicate Bug #52081: rbd persistent SSD cache crash at retire_entries added

Updated by Deepika Upadhyay over 2 years ago

Backport set to pacific

Updated by Deepika Upadhyay over 2 years ago

@CONGMIN YIN are the new test cases added for this use case or should we be adding them?

Updated by CONGMIN YIN over 2 years ago

Deepika Upadhyay wrote:

@CONGMIN YIN are the new test cases added for this use case or should we be adding them?

Sorry, I didn't notice the message. Ilya found this problem by inspection. Now that the 32-bit variables have been changed to 64 bits, writing test cases for these functions would be meaningless. Could we instead change the default cache size in the teuthology test case from 1GB to a value greater than 4GB, such as 8GB? That way, similar problems may be caught by the teuthology tests in the future. If you agree, I will add a commit to change it.

#10

Updated by Deepika Upadhyay over 2 years ago

We could add a separate yaml fragment for 8GB testing. How large a cache is actually needed?

rbd/persistent-writeback-cache/4-pool/cache-1GB.yaml and
rbd/persistent-writeback-cache/4-pool/cache-8GB.yaml

#11

Updated by CONGMIN YIN over 2 years ago

The effective_pool_size is 70% of the configured size, so 8GB is enough: the usable log is then about 5.6GB, which still crosses the 4G boundary.

#12

Updated by Deepika Upadhyay over 2 years ago

pacific backport: https://github.com/ceph/ceph/pull/43918

#13

Updated by Deepika Upadhyay over 2 years ago

Status changed from Fix Under Review to Pending Backport

#14

Updated by Backport Bot over 2 years ago

Copied to Backport #53264: pacific: [pwl ssd] cache larger than 4G will corrupt itself added

#15

Updated by Ilya Dryomov about 2 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
