Bug #40341

multisite: failed assert(cursor) in mdlog trimming

Added by Casey Bodley almost 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Q/A
Tags: multisite
Backport: luminous mimic nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If the metadata master zone doesn't have a local copy of every period in the period history, radosgw can crash during mdlog trimming:

/builddir/build/BUILD/ceph-12.2.8/src/rgw/rgw_sync.cc: 2387: FAILED assert(cursor)
 ceph version 12.2.8-128.el7cp (030358773c5213a14c1444a5147258672b2dc15f) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7ff642ce7f50]
 2: (PurgePeriodLogsCR::operate()+0xa25) [0x5555715996a5]
 3: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x5555713a2b4e]
 4: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x41b) [0x5555713a567b]
 5: (RGWSyncLogTrimThread::process()+0x1c5) [0x5555714734e5]
 6: (RGWRadosThread::Worker::entry()+0x123) [0x555571404453]
 7: (()+0x7dd5) [0x7ff64b718dd5]
 8: (clone()+0x6d) [0x7ff63fb9fead]

The problem can be identified at startup by errors like these:

2019-06-11 03:26:41.526812 7fe217523000  1 error read_lastest_epoch .rgw.root:periods.9c817b51-79ce-4c88-afe4-5ad70d112e30.latest_epoch
2019-06-11 03:26:41.526821 7fe217523000  0 failed to use_latest_epoch period id 9c817b51-79ce-4c88-afe4-5ad70d112e30 realm  id  : (2) No such file or directory
2019-06-11 03:26:41.526837 7fe217523000  1 rgw period puller: metadata master failed to read period 9c817b51-79ce-4c88-afe4-5ad70d112e30 from local storage: (2) No such file or directory
2019-06-11 03:26:41.526839 7fe217523000  1 failed to read period id=9c817b51-79ce-4c88-afe4-5ad70d112e30 for mdlog history: (2) No such file or directory


Related issues

Copied to rgw - Backport #43134: nautilus: multisite: failed assert(cursor) in mdlog trimming Resolved
Copied to rgw - Backport #43633: mimic: multisite: failed assert(cursor) in mdlog trimming Resolved
Copied to rgw - Backport #43634: luminous: multisite: failed assert(cursor) in mdlog trimming Rejected

History

#1 Updated by Casey Bodley almost 5 years ago

  • Source set to Q/A

#2 Updated by fang yuxiang over 4 years ago

Is there any progress on this?

#3 Updated by Shilpa MJ over 4 years ago

  • Status changed from New to 7
  • Pull request ID set to 31873

#4 Updated by Shilpa MJ over 4 years ago

  • Copied to Backport #43134: nautilus: multisite: failed assert(cursor) in mdlog trimming added

#5 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 7 to Fix Under Review

#6 Updated by Casey Bodley about 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43633: mimic: multisite: failed assert(cursor) in mdlog trimming added

#8 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43634: luminous: multisite: failed assert(cursor) in mdlog trimming added

#9 Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
