Bug #21749

PurgeQueue corruption in 12.2.1

Added by John Spray over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From "[ceph-users] how to debug (in order to repair) damaged MDS (rank)?"

Log snippet during MDS startup:

 2017-10-10 13:21:55.421122 7f3f2990d700  1 mds.6.journaler.pq(ro) recover start
 2017-10-10 13:21:55.421124 7f3f2990d700  1 mds.6.journaler.pq(ro) read_head
 2017-10-10 13:21:55.421231 7f3f2990d700  0 mds.6.cache creating system inode with ino:0x1
 2017-10-10 13:21:55.422532 7f3f2a10e700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
 2017-10-10 13:21:55.422548 7f3f2a10e700 10 mds.6.cache.ino(0x106) _fetched got 0 and 536
 2017-10-10 13:21:55.422556 7f3f2a10e700 10 mds.6.cache.ino(0x106)  magic is 'ceph fs volume v011' (expecting 'ceph fs volume v011')
 2017-10-10 13:21:55.422584 7f3f2a10e700 10  mds.6.cache.snaprealm(0x106 seq 1 0x55b192f65c00) open_parents [1,head]
 2017-10-10 13:21:55.422593 7f3f2a10e700 20 mds.6.cache.ino(0x106) decode_snap_blob snaprealm(0x106 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55b192f65c00)
 2017-10-10 13:21:55.422598 7f3f2a10e700 10 mds.6.cache.ino(0x106) _fetched [inode 0x106 [...2,head] ~mds6/ auth v19 snaprealm=0x55b192f65c00 f(v0 10=0+10) n(v3 rc2017-10-03 22:56:32.400835 b6253 88=11+77)/n(v0 11=0+11) (iversion lock) 0x55b193176700]
 2017-10-10 13:21:55.831091 7f3f2b110700  1 mds.6.journaler.pq(ro) _finish_read_head loghead(trim 104857600, expire 108687220, write 108868115, stream_format 1).  probing for end of log (from 108868115)...
 2017-10-10 13:21:55.831107 7f3f2b110700  1 mds.6.journaler.pq(ro) probing for end of the log
 2017-10-10 13:21:55.841213 7f3f2b110700  1 mds.6.journaler.pq(ro) _finish_probe_end write_pos = 134217728 (header had 108868115). recovered.
 2017-10-10 13:21:55.841234 7f3f2b110700  4 mds.6.purge_queue operator(): open complete
 2017-10-10 13:21:55.841236 7f3f2b110700  4 mds.6.purge_queue operator(): recovering write_pos
 2017-10-10 13:21:55.841239 7f3f2b110700 10 mds.6.journaler.pq(ro) _prefetch
 2017-10-10 13:21:55.841241 7f3f2b110700 10 mds.6.journaler.pq(ro) _prefetch 41943040 requested_pos 108868115 < target 134217728 (150811155), prefetching 25349613
 2017-10-10 13:21:55.841246 7f3f2b110700 10 mds.6.journaler.pq(ro) _issue_read reading 108868115~25349613, read pointers 108868115/108868115/134217728
 2017-10-10 13:21:55.841564 7f3f2b110700 10 mds.6.journaler.pq(ro) wait_for_readable at 108868115 onreadable 0x55b193232840
 2017-10-10 13:21:55.842864 7f3f2b110700 10 mds.6.journaler.pq(ro) _finish_read got 108868115~183789
 2017-10-10 13:21:55.842882 7f3f2b110700 10 mds.6.journaler.pq(ro) _assimilate_prefetch 108868115~183789
 2017-10-10 13:21:55.842886 7f3f2b110700 10 mds.6.journaler.pq(ro) _assimilate_prefetch read_buf now 108868115~183789, read pointers 108868115/109051904/134217728
 2017-10-10 13:21:55.842965 7f3f2b110700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
 2017-10-10 13:21:55.842979 7f3f2b110700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos
 2017-10-10 13:21:55.842983 7f3f2b110700 10 mds.beacon.mds9 set_want_state: up:replay -> down:damaged

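The numbers in the snippet are internally consistent: the purge queue journal header records write_pos 108868115, but probing the backing RADOS objects finds data out to 134217728, so the journaler re-reads the bytes in between to recover the true write position. Only 183789 bytes come back, ending exactly at an object boundary, decoding past that point fails with -22 (EINVAL), and the rank marks itself damaged. A minimal sketch re-deriving these figures from the log (the 4 MiB object size is the default journal layout and an assumption here, not stated in the ticket):

    # Values copied from the log above; the 4 MiB object size is the default
    # CephFS journal layout and is an assumption, not stated in the ticket.
    OBJ_SIZE = 4 * 1024 * 1024            # assumed journal object size (4 MiB)

    header_write_pos = 108868115          # "header had 108868115"
    probed_write_pos = 134217728          # "_finish_probe_end write_pos = 134217728"
    bytes_returned   = 183789             # "_finish_read got 108868115~183789"

    # The probe lands on an exact object boundary, well past the header's write_pos.
    assert probed_write_pos % OBJ_SIZE == 0                      # 134217728 = 32 * 4 MiB
    print(probed_write_pos - header_write_pos)                   # 25349613, the "prefetching" length

    # The read stops at the very next object boundary, so nothing beyond it can
    # be decoded and _recover fails with -22 (EINVAL).
    assert (header_write_pos + bytes_returned) % OBJ_SIZE == 0   # 109051904 = 26 * 4 MiB
    print(header_write_pos + bytes_returned)                     # 109051904, the received read pointer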

Related issues

Duplicates: CephFS - Bug #19593: purge queue and standby replay mds (Resolved, 04/12/2017)

History

#1 Updated by Daniel Baumann over 6 years ago

I have saved all the information/logs/objects; feel free to ask for any of them, or for anything else.

Regards,
Daniel

#2 Updated by Zheng Yan over 6 years ago

Likely caused by http://tracker.ceph.com/issues/19593.

Ping 'yanzheng' at ceph@OFTC and I will help you recover the FS.

#3 Updated by Daniel Baumann over 6 years ago

Hi Yan,

yes, we had 3 MDS running in standby-replay mode (I switched them to standby now).
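
(For reference, standby-replay in luminous was configured per MDS daemon in ceph.conf, so moving a daemon back to a plain standby amounts to removing or disabling that option and restarting the daemon. A minimal sketch; the daemon name is taken from the log above purely for illustration and is not the actual configuration from this cluster:)

    [mds.mds9]
        # "true" makes this daemon follow an active rank as standby-replay;
        # removing the option or setting it to false leaves it as a plain standby.
        mds standby replay = false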

Thanks for the offer of help with recovery; I was already able to bring it back by removing the objects in the purge queue.

(for reference)
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021386.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021390.html

Regards,
Daniel
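
For anyone in the same state: the concrete recovery steps are in the two ceph-users posts linked above. Purely as an illustration of the kind of inspection involved, a sketch using the python-rados bindings is below. The pool name and the "506." object prefix (purge queue inode 0x500 + rank, here rank 6) are assumptions, not taken from this ticket, and any removal should only happen after copies of the objects have been saved.

    import rados

    # Illustrative sketch only; the pool name and object prefix are assumptions,
    # and the authoritative procedure is in the ceph-users posts linked above.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('cephfs_metadata')   # assumed metadata pool name
        try:
            # List the purge queue objects and their sizes before touching anything.
            pq_objects = sorted(o.key for o in ioctx.list_objects()
                                if o.key.startswith('506.'))
            for name in pq_objects:
                size, mtime = ioctx.stat(name)
                print(name, size)
            # Removing a damaged object would be ioctx.remove_object(name),
            # done only after saving copies of everything listed above.
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()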

#4 Updated by Zheng Yan over 6 years ago

  • Status changed from New to Duplicate

dup of #19593

#5 Updated by Patrick Donnelly over 6 years ago

  • Duplicates Bug #19593: purge queue and standby replay mds added
