Project

General

Profile

Bug #5257

Ceph OSD bobtail->cuttlefish upgrade goes backward in time with osdmap

Added by Faidon Liambotis over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I upgraded ceph on one of my boxes (12 osds) and the osds came up, printing "183140 PGs are upgrading". During that time, the cluster has those osds as up, the pgs as degraded/not recovering and a bunch of "slow requests" are getting logged. Based on the numbers I'm seeing, the upgrade is going to take quite a while.

debug-osd=10 has one of those osds as:

2013-06-05 18:37:50.120126 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14347 objects done
2013-06-05 18:37:50.274509 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14411 objects done
2013-06-05 18:37:50.281523 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14475 objects done
2013-06-05 18:37:50.288334 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14539 objects done
2013-06-05 18:37:50.295061 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14603 objects done
2013-06-05 18:37:50.301661 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14646 objects done
2013-06-05 18:37:50.448612 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14710 objects done
2013-06-05 18:37:50.452574 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14774 objects done
2013-06-05 18:37:50.456327 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14838 objects done
2013-06-05 18:37:50.460033 7fb7f8907780  1 osd.50 pg_epoch: 183140 pg[3.1212( v 183140'66684 (180299'65683,183140'66684] local-les=181898 n=16706 ec=14 les/c 181898/181952 181875/181875/179540) [] r=0 lpr=0 pi=155667-181874/107 lcod 0'0 mlcod 0'0 inactive] Updating snap_mapper from main collection, 14902 objects done

History

#1 Updated by Faidon Liambotis over 10 years ago

This happened after a while:

2013-06-05 18:55:33.520155 mon.0 10.64.0.167:6789/0 882 : [INF] osd.48 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520204 mon.0 10.64.0.167:6789/0 883 : [INF] osd.49 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520233 mon.0 10.64.0.167:6789/0 884 : [INF] osd.50 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520254 mon.0 10.64.0.167:6789/0 885 : [INF] osd.51 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520275 mon.0 10.64.0.167:6789/0 886 : [INF] osd.52 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520295 mon.0 10.64.0.167:6789/0 887 : [INF] osd.53 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520320 mon.0 10.64.0.167:6789/0 888 : [INF] osd.54 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520347 mon.0 10.64.0.167:6789/0 889 : [INF] osd.55 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520368 mon.0 10.64.0.167:6789/0 890 : [INF] osd.56 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520389 mon.0 10.64.0.167:6789/0 891 : [INF] osd.57 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520410 mon.0 10.64.0.167:6789/0 892 : [INF] osd.58 marked down after no pg stats for 900.373212seconds
2013-06-05 18:55:33.520431 mon.0 10.64.0.167:6789/0 893 : [INF] osd.59 marked down after no pg stats for 900.373212seconds

But in the end it recovered:

2013-06-05 19:19:27.396993 mon.0 10.64.0.167:6789/0 1988 : [INF] pgmap v8300803: 16760 pgs: 106 active, 16650 active+clean, 3 active+recovery_wait, 1 peering; 44456 GB data, 137 TB used, 118 TB / 256 T
B avail; 4/822314693 degraded (0.000%)
2013-06-05 19:19:28.570372 mon.0 10.64.0.167:6789/0 1989 : [INF] pgmap v8300804: 16760 pgs: 105 active, 16651 active+clean, 4 active+recovery_wait; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 30
92B/s rd, 0B/s wr, 4op/s; 4/822314693 degraded (0.000%)
2013-06-05 19:19:29.730863 mon.0 10.64.0.167:6789/0 1990 : [INF] pgmap v8300805: 16760 pgs: 104 active, 16652 active+clean, 4 active+recovery_wait; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 39
20B/s rd, 0B/s wr, 5op/s; 4/822314693 degraded (0.000%)
2013-06-05 19:19:31.941478 mon.0 10.64.0.167:6789/0 1991 : [INF] pgmap v8300806: 16760 pgs: 104 active, 16652 active+clean, 4 active+recovery_wait; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 60
6B/s rd, 0op/s; 4/822314693 degraded (0.000%)
2013-06-05 19:19:33.120122 mon.0 10.64.0.167:6789/0 1992 : [INF] pgmap v8300807: 16760 pgs: 104 active, 16653 active+clean, 3 active+recovery_wait; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 18
30B/s rd, 282KB/s wr, 5op/s; 3/822314696 degraded (0.000%);  recovering 0 o/s, 0B/s
2013-06-05 19:19:34.309884 mon.0 10.64.0.167:6789/0 1993 : [INF] pgmap v8300808: 16760 pgs: 16760 active+clean; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 10933B/s rd, 404KB/s wr, 21op/s;  reco
vering 1 o/s, 14481B/s
2013-06-05 19:19:35.499321 mon.0 10.64.0.167:6789/0 1994 : [INF] pgmap v8300809: 16760 pgs: 16760 active+clean; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 8183B/s rd, 40915B/s wr, 16op/s;  reco

I have 132 more OSDs to upgrade though :)

#2 Updated by Faidon Liambotis over 10 years ago

Just to give a sense of the size of the issue:

2013-06-05 18:35:28.121399 7fb7f8907780 -1 osd.50 183140 PGs are upgrading
2013-06-05 19:17:26.876481 7fb7f8907780 10 osd.50 183140 load_pgs done

During which logs were flooded with slow requests, on a relatively idle workload. Note that the cluster has 144 osds, and ~16k pgs.

#3 Updated by Sage Weil over 10 years ago

are you sure 'nodown' wasn't set? this upgrade happens in load_pgs(), long before the osd sends a message to the mon to add itself to the osdmap. it should have been down during this period.

oh, maybe you restarted all osds at once and there was nobody to mark each other down?

#4 Updated by Faidon Liambotis over 10 years ago

I'm sure nodown wasn't set. I didn't restart all mons at once, just 12 (one box) out of 141.

This is what "grep osd.50" has:

2013-06-05 18:22:29.666308 mon.0 10.64.0.167:6789/0 69 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.116 10.64.32.15:6857/28618
2013-06-05 18:22:30.197649 mon.0 10.64.0.167:6789/0 80 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.8 10.64.0.173:6857/13543
2013-06-05 18:22:30.385626 mon.0 10.64.0.167:6789/0 91 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.133 10.64.0.178:6803/3737
2013-06-05 18:22:30.602108 mon.0 10.64.0.167:6789/0 102 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.97 10.64.32.14:6805/2712
2013-06-05 18:22:30.918930 mon.0 10.64.0.167:6789/0 109 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.5 10.64.0.173:6827/28864
2013-06-05 18:22:30.919910 mon.0 10.64.0.167:6789/0 115 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.114 10.64.32.15:6806/15335
2013-06-05 18:22:31.054675 mon.0 10.64.0.167:6789/0 131 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.128 10.64.32.16:6812/25828
2013-06-05 18:22:31.056202 mon.0 10.64.0.167:6789/0 138 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.117 10.64.32.15:6860/28820
2013-06-05 18:22:31.301086 mon.0 10.64.0.167:6789/0 151 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.10 10.64.0.173:6811/11301
2013-06-05 18:22:31.481864 mon.0 10.64.0.167:6789/0 162 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.135 10.64.0.178:6827/3796
2013-06-05 18:22:31.502976 mon.0 10.64.0.167:6789/0 174 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.24 10.64.0.175:6810/8333
2013-06-05 18:22:31.521458 mon.0 10.64.0.167:6789/0 184 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.23 10.64.0.174:6861/18535
2013-06-05 18:22:31.705898 mon.0 10.64.0.167:6789/0 194 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.122 10.64.32.16:6810/1006
2013-06-05 18:22:31.726825 mon.0 10.64.0.167:6789/0 205 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.38 10.64.0.176:6831/22256
2013-06-05 18:22:31.726867 mon.0 10.64.0.167:6789/0 206 : [INF] osd.50 10.64.32.10:6833/17728 failed (14 reports from 14 peers after 2013-06-05 18:22:51.726815 >= grace 20.000000)
2013-06-05 18:22:31.919562 mon.0 10.64.0.167:6789/0 256 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.37 10.64.0.176:6808/22232
2013-06-05 18:22:31.972086 mon.0 10.64.0.167:6789/0 268 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.105 10.64.32.14:6854/2966
2013-06-05 18:22:31.979197 mon.0 10.64.0.167:6789/0 279 : [DBG] osd.50 10.64.32.10:6833/17728 reported failed by osd.22 10.64.0.174:6802/29161
2013-06-05 18:55:33.520233 mon.0 10.64.0.167:6789/0 884 : [INF] osd.50 marked down after no pg stats for 900.373212seconds
2013-06-05 19:05:57.653950 mon.0 10.64.0.167:6789/0 1213 : [INF] osd.50 out (down for 623.917719)
2013-06-05 19:17:32.843971 mon.0 10.64.0.167:6789/0 1898 : [INF] osd.50 10.64.32.10:6800/11959 boot

And here's the relevant "grep osdmap" snippet too:

2013-06-05 18:22:42.051615 mon.0 10.64.0.167:6789/0 300 : [INF] osdmap e183147: 144 osds: 129 up, 141 in
2013-06-05 18:25:32.737961 mon.0 10.64.0.167:6789/0 6 : [INF] osdmap e183140: 144 osds: 141 up, 141 in
2013-06-05 18:55:33.840988 mon.0 10.64.0.167:6789/0 894 : [INF] osdmap e183141: 144 osds: 129 up, 141 in

#5 Updated by Faidon Liambotis over 10 years ago

There seems to have been a monitor election (without me doing anything) exactly before the "141 up":

2013-06-05 18:25:32.536017 mon.0 10.64.0.167:6789/0 3 : [INF] mon.ms-fe1001@0 won leader election with quorum 0,2

Context:

2013-06-05 18:23:31.717335 mon.0 10.64.0.167:6789/0 345 : [INF] pgmap v8295656: 16760 pgs: 12477 active+clean, 369 peering, 3895 active+degraded, 19 active+clean+scrubbing; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 63931535/822315176 degraded (7.775
%)
2013-06-05 18:23:33.063877 mon.0 10.64.0.167:6789/0 346 : [INF] pgmap v8295657: 16760 pgs: 12477 active+clean, 357 peering, 3907 active+degraded, 19 active+clean+scrubbing; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 64131310/822315176 degraded (7.799
%)
2013-06-05 18:23:35.605623 mon.2 10.64.32.10:6789/0 1 : [INF] mon.ms-be1005 calling new monitor election
2013-06-05 18:23:45.730221 mon.2 10.64.32.10:6789/0 2 : [INF] mon.ms-be1005 calling new monitor election
2013-06-05 18:24:06.806455 mon.2 10.64.32.10:6789/0 3 : [INF] mon.ms-be1005 calling new monitor election
2013-06-05 18:25:23.010752 mon.0 10.64.0.167:6789/0 1 : [INF] mon.ms-fe1001 calling new monitor election
2013-06-05 18:25:27.448939 mon.0 10.64.0.167:6789/0 2 : [INF] mon.ms-fe1001 calling new monitor election
2013-06-05 18:25:32.536017 mon.0 10.64.0.167:6789/0 3 : [INF] mon.ms-fe1001@0 won leader election with quorum 0,2
2013-06-05 18:25:32.685281 mon.0 10.64.0.167:6789/0 4 : [INF] pgmap v8298963: 16760 pgs: 16752 active+clean, 8 active+clean+scrubbing; 44456 GB data, 137 TB used, 118 TB / 256 TB avail
2013-06-05 18:25:32.699598 mon.0 10.64.0.167:6789/0 5 : [INF] mdsmap e1: 0/0/1 up
2013-06-05 18:25:32.737961 mon.0 10.64.0.167:6789/0 6 : [INF] osdmap e183140: 144 osds: 141 up, 141 in
2013-06-05 18:25:32.803069 mon.0 10.64.0.167:6789/0 7 : [INF] monmap e17: 3 mons at {ms-be1005=10.64.32.10:6789/0,ms-fe1001=10.64.0.167:6789/0,ms-fe1003=10.64.16.150:6789/0}
2013-06-05 18:25:33.532222 mon.0 10.64.0.167:6789/0 8 : [INF] pgmap v8298964: 16760 pgs: 16424 active+clean, 328 active+degraded, 8 active+clean+scrubbing; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 273B/s wr, 0op/s; 5485801/822315134 degraded (0.667
%)
2013-06-05 18:25:35.138931 mon.0 10.64.0.167:6789/0 9 : [INF] pgmap v8298965: 16760 pgs: 14902 active+clean, 1850 active+degraded, 8 active+clean+scrubbing; 44456 GB data, 137 TB used, 118 TB / 256 TB avail; 1052B/s wr, 0op/s; 30852729/822315143 degraded (3.
752%)

#6 Updated by Sage Weil over 10 years ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

do you have a complete mon log for this? if not, can you capture one the next time around? that osd should have been marked down earlier.

also, the out of order osdmap epochs looks very fishy!

#7 Updated by Faidon Liambotis over 10 years ago

I just upgraded another box and I'm not observing the same behavior. OSDs are now down while PGs are upgrading. This could be the out of order osdmap epoch in combination with complications with the mon upgrade that happened the same time, cf. #5256.

#8 Updated by Sage Weil over 10 years ago

  • Subject changed from Ceph OSD bobtail->cuttlefish upgrade takes a long time, blocks requests to Ceph OSD bobtail->cuttlefish upgrade goes backward in time with osdmap
  • Category set to Monitor
  • Status changed from Need More Info to 12
  • Priority changed from High to Urgent

#9 Updated by Sage Weil over 10 years ago

from the logs on cephdrop, this looks like a non-deterministic store conversion maybe? the quorum 0,1 is happily chugging along:

 -9418> 2013-06-05 18:23:34.464053 7f1979387700  1 -- 10.64.0.167:6789/0 --> mon.1 10.64.16.150:6789/0 -- paxos(commit lc <b>15484179</b> fc 0 pn 1250900 opn 0) v3 -- ?+0 0x4936a00

then we hear from mon.2:
 -9353> 2013-06-05 18:23:35.605439 7f1979387700  1 -- 10.64.0.167:6789/0 <== mon.2 10.64.32.10:6789/0 2 ==== mon_probe(reply c9da36e1-694a-4166-b346-9d8d4d1d1ac1 name ms-be1005 paxos( fc 15490813 lc 15490832 )) v4 ==== 587+0+0 (3601808406 0 0) 0x3196600 con 0x3bda420

with a paxos version way in the future!

#10 Updated by Sage Weil over 10 years ago

Faidon: can you send a tarball of your mon dirs? IIRC the old files are still present post-conversion, so we should be able to see why the states didn't line up.

#11 Updated by Faidon Liambotis over 10 years ago

cephdrop:ceph-5257-mondirs.tar.bz2, fetched earlier today.

#12 Updated by Sage Weil over 10 years ago

the problem:

mon.1 and .2 had newer data, mon.0 had older data.
mon.0 converts, waits to join quorum
mon.1 converts, restarts? *
mon.1 restarts, zeros it stuff and forgets last_commited *

mon.1 syncs with mon.0 even tho it is older
mon.2 starts, has newer version, everyone syncs with mon.2, and forgets the weird interim history

#13 Updated by Sage Weil over 10 years ago

the mon.1 log snippet

2013-06-05 18:15:16.596583 7fd87338f700 -1 mon.ms-fe1003@1(leader) e17 *** Got Signal Terminated ***
2013-06-05 18:15:16.596610 7fd87338f700  1 mon.ms-fe1003@1(leader) e17 shutdown
2013-06-05 18:17:21.384435 7faebcf8b780  0 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635), process ceph-mon, pid 5806
2013-06-05 18:17:21.385155 7faebcf8b780  1 unable to open monitor store at /var/lib/ceph/mon/ceph-ms-fe1003
2013-06-05 18:17:21.385165 7faebcf8b780  1 check for old monitor store format
2013-06-05 18:17:21.385212 7faebcf8b780  1 store(/var/lib/ceph/mon/ceph-ms-fe1003) mount
2013-06-05 18:17:21.385279 7faebcf8b780  1 found old GV monitor store format -- should convert!
2013-06-05 18:17:21.423070 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:21.542914 7faebcf8b780  1 store(/var/lib/ceph/mon/ceph-ms-fe1003) mount
2013-06-05 18:17:22.435903 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:23.448413 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:24.462108 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:25.475052 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:26.489288 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:27.501180 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:28.515549 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:29.529175 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:30.543491 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:31.555943 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:32.570406 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:33.583390 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:34.595585 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:17:35.609602 7faeb9e03700 -1 asok(0x304e000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:19:42.154020 7f2b1d52f780  0 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635), process ceph-mon, pid 5947
2013-06-05 18:19:42.194620 7f2b1a3a7700 -1 asok(0x1d37000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:19:42.551699 7f2b1d52f780 -1 there is an on-going (maybe aborted?) conversion.
2013-06-05 18:19:42.551719 7f2b1d52f780 -1 you should check what happened
2013-06-05 18:19:42.551803 7f2b1d52f780 -1 found errors while attempting to convert the monitor store: (17) File exists
2013-06-05 18:19:42.648250 7f2b1d52f780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
2013-06-05 18:19:42.648798 7f2b1d52f780  1 mon.ms-fe1003@-1(probing) e17 preinit fsid c9da36e1-694a-4166-b346-9d8d4d1d1ac1
2013-06-05 18:19:42.711604 7f2b1d52f780  0 mon.ms-fe1003@-1(probing).osd e183140 crush map has features 262144, adjusting msgr requires
2013-06-05 18:19:42.711614 7f2b1d52f780  0 mon.ms-fe1003@-1(probing).osd e183140 crush map has features 262144, adjusting msgr requires
2013-06-05 18:19:42.711616 7f2b1d52f780  0 mon.ms-fe1003@-1(probing).osd e183140 crush map has features 262144, adjusting msgr requires
2013-06-05 18:19:42.711618 7f2b1d52f780  0 mon.ms-fe1003@-1(probing).osd e183140 crush map has features 262144, adjusting msgr requires
2013-06-05 18:19:42.712576 7f2b1d52f780  0 mon.ms-fe1003@-1(probing) e17  my rank is now 1 (was -1)
2013-06-05 18:19:42.713222 7f2b172a0700  0 -- 10.64.16.150:6789/0 >> 10.64.32.10:6789/0 pipe(0x1d49280 sd=26 :0 s=1 pgs=0 cs=0 l=0).fault
2013-06-05 18:19:42.714907 7f2b1709e700  0 -- 10.64.16.150:6789/0 >> 10.64.0.167:6789/0 pipe(0x1edd280 sd=27 :6789 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state wait
2013-06-05 18:19:48.096957 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 28 bytes epoch 17) v1 and sending client elsewhere
2013-06-05 18:19:48.097000 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 28 bytes epoch 17) v1 and sending client elsewhere
2013-06-05 18:19:48.097013 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 27 bytes epoch 17) v1 and sending client elsewhere
2013-06-05 18:19:48.097024 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 27 bytes epoch 17) v1 and sending client elsewhere
2013-06-05 18:19:48.097035 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 27 bytes epoch 17) v1 and sending client elsewhere
2013-06-05 18:19:48.097046 7f2b18ba4700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 discarding message auth(proto 0 27 bytes epoch 17) v1 and sending client elsewhere

#14 Updated by Joao Eduardo Luis over 10 years ago

Sage Weil wrote:

the problem:

mon.1 and .2 had newer data, mon.0 had older data.
mon.0 converts, waits to join quorum
mon.1 converts, restarts? (1)
mon.1 restarts, zeros it stuff and forgets last_commited
(2)
mon.1 syncs with mon.0 even tho it is older (3)
mon.2 starts, has newer version, everyone syncs with mon.2, and forgets the weird interim history

(1) - my guess is that there was user intervention here, leading to an incomplete conversion

(2) - ceph-mon should have just exited upon finding an aborted conversion; it didn't because of #5255. Then it went about to check its on-disk versions, and if the paxos keys hadn't been converted yet it would find the first and last committed versions to be missing (thus assuming them as zero upon reading them), which triggered the sync. What strikes me as impressive is that it managed to convert the monmap, allowing it to find other monitors in the cluster.

(3) - considering the monitor read 'lc 0' and 'fc 0' due to (2), it would happily sync with anyone.

I don't think the sync itself was the culprit here.

#15 Updated by Sage Weil over 10 years ago

  • Status changed from 12 to Resolved

The prefork fix is backported to cuttlefish, so closing this one out then.

Also available in: Atom PDF