Bug #5256
Upgraded bobtail->cuttlefish mon crashes, then can't resume the conversion
Status: Closed
Description
-6> 2013-06-05 18:25:33.071576 7fcd873a7700 1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 sync_requester_abort no longer a sync requester
-5> 2013-06-05 18:25:33.071600 7fcd873a7700 1 -- 10.64.16.150:6789/0 --> mon.0 10.64.0.167:6789/0 -- mon_probe(probe c9da36e1-694a-4166-b346-9d8d4d1d1ac1 name ms-fe1003) v4 -- ?+0 0x4788600
-4> 2013-06-05 18:25:33.071634 7fcd873a7700 1 -- 10.64.16.150:6789/0 --> mon.2 10.64.32.10:6789/0 -- mon_probe(probe c9da36e1-694a-4166-b346-9d8d4d1d1ac1 name ms-fe1003) v4 -- ?+0 0x4788000
-3> 2013-06-05 18:25:33.071714 7fcd873a7700 1 -- 10.64.16.150:6789/0 <== mon.2 10.64.32.10:6789/0 359 ==== mon_sync( chunk bl 987152 bytes last_key ( osdmap,full_182463 ) ) v1 ==== 987343+0+0 (2285625880 0 0) 0x4f02840 con 0x1b3e6e0
-2> 2013-06-05 18:25:33.071754 7fcd873a7700 1 mon.ms-fe1003@1(probing) e17 handle_sync_chunk stray message -- drop it.
-1> 2013-06-05 18:25:33.071758 7fcd873a7700 1 -- 10.64.16.150:6789/0 <== osd.132 10.64.0.178:6805/27958 1 ==== auth(proto 0 28 bytes epoch 17) v1 ==== 58+0+0 (2383461374 0 0) 0x3732900 con 0x231f580
 0> 2013-06-05 18:25:33.072854 7fcd87ba8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fcd87ba8700
 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635)
 1: /usr/bin/ceph-mon() [0x599a3a]
 2: (()+0xfcb0) [0x7fcd8c112cb0]
 3: (Monitor::C_HeartbeatInterval::finish(int)+0x53) [0x4d06a3]
 4: (Context::complete(int)+0xa) [0x4cb61a]
 5: (SafeTimer::timer_thread()+0x425) [0x649695]
 6: (SafeTimerThread::entry()+0xd) [0x64a2cd]
 7: (()+0x7e9a) [0x7fcd8c10ae9a]
 8: (clone()+0x6d) [0x7fcd8a6baccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mon.ms-fe1003.log
--- end dump of recent events ---
2013-06-05 18:25:33.245545 7ff10f703780 0 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635), process ceph-mon, pid 11154
2013-06-05 18:25:33.281636 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:33.459061 7ff10f703780 -1 there is an on-going (maybe aborted?) conversion.
2013-06-05 18:25:33.459073 7ff10f703780 -1 you should check what happened
2013-06-05 18:25:33.459771 7ff10f703780 -1 found errors while attempting to convert the monitor store: (17) File exists
2013-06-05 18:25:33.549295 7ff10f703780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
2013-06-05 18:25:33.761693 7ff10f703780 -1 obtain_monmap unable to find a monmap
2013-06-05 18:25:33.761711 7ff10f703780 -1 unable to obtain a monmap: (2) No such file or directory
2013-06-05 18:25:33.761725 7ff10f703780 0 mon.ms-fe1003 does not exist in monmap, will attempt to join an existing cluster
2013-06-05 18:25:33.761943 7ff10f703780 -1 no public_addr or public_network specified, and mon.ms-fe1003 not present in monmap or ceph.conf
2013-06-05 18:25:33.762389 7ff10f703780 1 mon.ms-fe1003@-1(probing) e0 preinit fsid 00000000-0000-0000-0000-000000000000
2013-06-05 18:25:33.762455 7ff10f703780 -1 mon.ms-fe1003@-1(probing) e0 error: cluster_uuid file exists with value 'c9da36e1-694a-4166-b346-9d8d4d1d1ac1', != our uuid 00000000-0000-0000-0000-000000000000
2013-06-05 18:25:34.297132 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:35.313140 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:36.327476 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:37.339902 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:38.354432 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:39.369904 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:40.384275 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
Updated by Joao Eduardo Luis almost 11 years ago
- Category set to Monitor
- Status changed from New to In Progress
- Assignee set to Joao Eduardo Luis
Updated by Faidon Liambotis almost 11 years ago
Cephdrop now has 5256-ceph-mon.ms-be1005.log.gz, 5256-ceph-mon.ms-fe1001.log.gz, 5256-ceph-mon.ms-fe1003.log.gz.
The timeline of events as I recall them (could be wrong) is:
1) upgrade ms-fe1003; quorum remains with the other two, still at 0.56.6
2) upgrade ms-fe1001
- wait until a cuttlefish quorum is established
3) upgrade ms-be1005; this host also has OSDs
- ms-fe1003 crashes and is unable to recover
- the effects of #5257 (out of order osdmap epoch) are observed
(the last two could have happened in the opposite order)
Updated by Joao Eduardo Luis almost 11 years ago
Okay, here's the likely order of events in this case:
- the monitor was converting when it was killed for some reason.
- the monitor was restarted, must have found an on-going conversion (as those flags are still there on the later crash) but due to the preforker bug it didn't exit as it should; instead, it kept on going.
- given the monitor must have converted the monmap portion of the store, it was able to start; however, it's likely it was killed while converting the paxos portion (as it usually takes the longest), and that must have triggered a sync upon starting.
- said sync was aborted for some reason not present in the log, triggering a wide store clear; due to a bug there, the monitor ended up removing not only the store contents (by design) but also the backup monmap we stashed before starting the sync.
- monitor crashed due to the C_HeartbeatInterval segfault (yet another bug).
- monitor was unable to restart due to lacking a monmap.
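The store-clear step above is where the monmap was lost. A toy model of that bug and its intended fix (the key names and `clear_store` helper are hypothetical, for illustration only; this is not Ceph's actual LevelDB-backed monitor store code):

```python
# Toy model of the sync-abort store clear. Key names are made up;
# the point is only that a blanket clear must exclude the backup.

def clear_store(store, preserve_prefixes=()):
    """Delete every key except those under a preserved prefix."""
    for key in list(store):
        if not any(key.startswith(p) for p in preserve_prefixes):
            del store[key]

store = {
    "paxos/100": b"...",
    "osdmap/full_182463": b"...",
    "backup/monmap": b"<monmap stashed before sync>",
}

# Buggy behaviour: the wide clear also drops the stashed monmap, so the
# next start finds nothing ("obtain_monmap unable to find a monmap").
buggy = dict(store)
clear_store(buggy)
print("backup/monmap" in buggy)   # False

# Intended behaviour: the stashed backup survives the clear.
fixed = dict(store)
clear_store(fixed, preserve_prefixes=("backup/",))
print(sorted(fixed))              # ['backup/monmap']
```

This matches the log above: after the aborted sync, the monitor reports "obtain_monmap unable to find a monmap" and cannot rejoin.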
So these would be the 3 bugs present in this ticket (in order of occurrence):
- the monitor started even though its store was marked as having an on-going conversion; this stems from the Preforker bug for which a fix has already been merged upstream;
- a sync bug that led to the monmap being lost upon the failed sync attempt; and
- the C_HeartbeatInterval segmentation fault, which I haven't been able to track down yet.
The first bug was addressed by Sage last week, but the fix didn't make it into 0.61.3 (the release notes mention this bug). The second bug should be fixed by a small rework of the monitor's sync code in wip-mon-sync-fixes. The third bug is still being tracked down.
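The first bug's symptom is visible in the startup log: the daemon prints "there is an on-going (maybe aborted?) conversion" yet keeps going. A minimal sketch of the startup guard that should have stopped it (the `converting` marker filename and function names are hypothetical; this models the intent, not the actual ceph-mon/Preforker code):

```python
import os
import sys
import tempfile

def conversion_in_progress(mon_data_dir):
    """Return True if a previous store conversion never finished.
    'converting' is a hypothetical marker name used for illustration."""
    return os.path.exists(os.path.join(mon_data_dir, "converting"))

def start_mon(mon_data_dir):
    """Refuse to start while a conversion marker is present. The Preforker
    bug effectively swallowed this failure, so the monitor kept running
    on a half-converted store instead of exiting non-zero."""
    if conversion_in_progress(mon_data_dir):
        sys.stderr.write("there is an on-going (maybe aborted?) conversion\n")
        return 1   # should propagate to the parent as a fatal exit code
    return 0

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "converting"), "w").close()
    print(start_mon(d))   # 1: startup correctly refused
```

With the guard working, the half-converted store would never reach the probing/sync path that exposed the other two bugs.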
Updated by Sage Weil almost 11 years ago
- Status changed from In Progress to Resolved