Bug #5256

Upgraded bobtail->cuttlefish mon crashes, then can't resume the conversion

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Urgent
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


    -6> 2013-06-05 18:25:33.071576 7fcd873a7700  1 mon.ms-fe1003@1(synchronizing sync( requester state chunks )) e17 sync_requester_abort no longer a sync requester
    -5> 2013-06-05 18:25:33.071600 7fcd873a7700  1 -- 10.64.16.150:6789/0 --> mon.0 10.64.0.167:6789/0 -- mon_probe(probe c9da36e1-694a-4166-b346-9d8d4d1d1ac1 name ms-fe1003) v4 -- ?+0 0x4788600
    -4> 2013-06-05 18:25:33.071634 7fcd873a7700  1 -- 10.64.16.150:6789/0 --> mon.2 10.64.32.10:6789/0 -- mon_probe(probe c9da36e1-694a-4166-b346-9d8d4d1d1ac1 name ms-fe1003) v4 -- ?+0 0x4788000
    -3> 2013-06-05 18:25:33.071714 7fcd873a7700  1 -- 10.64.16.150:6789/0 <== mon.2 10.64.32.10:6789/0 359 ==== mon_sync( chunk bl 987152 bytes last_key ( osdmap,full_182463 ) ) v1 ==== 987343+0+0 (2285625880 0 0) 0x4f02840 con 0x1b3e6e0
    -2> 2013-06-05 18:25:33.071754 7fcd873a7700  1 mon.ms-fe1003@1(probing) e17 handle_sync_chunk stray message -- drop it.
    -1> 2013-06-05 18:25:33.071758 7fcd873a7700  1 -- 10.64.16.150:6789/0 <== osd.132 10.64.0.178:6805/27958 1 ==== auth(proto 0 28 bytes epoch 17) v1 ==== 58+0+0 (2383461374 0 0) 0x3732900 con 0x231f580
     0> 2013-06-05 18:25:33.072854 7fcd87ba8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fcd87ba8700

 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635)
 1: /usr/bin/ceph-mon() [0x599a3a]
 2: (()+0xfcb0) [0x7fcd8c112cb0]
 3: (Monitor::C_HeartbeatInterval::finish(int)+0x53) [0x4d06a3]
 4: (Context::complete(int)+0xa) [0x4cb61a]
 5: (SafeTimer::timer_thread()+0x425) [0x649695]
 6: (SafeTimerThread::entry()+0xd) [0x64a2cd]
 7: (()+0x7e9a) [0x7fcd8c10ae9a]
 8: (clone()+0x6d) [0x7fcd8a6baccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.ms-fe1003.log
--- end dump of recent events ---
2013-06-05 18:25:33.245545 7ff10f703780  0 ceph version 0.61.2-58-g7d549cb (7d549cb82ab8ebcf1cc104fc557d601b486c7635), process ceph-mon, pid 11154
2013-06-05 18:25:33.281636 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:33.459061 7ff10f703780 -1 there is an on-going (maybe aborted?) conversion.
2013-06-05 18:25:33.459073 7ff10f703780 -1 you should check what happened
2013-06-05 18:25:33.459771 7ff10f703780 -1 found errors while attempting to convert the monitor store: (17) File exists
2013-06-05 18:25:33.549295 7ff10f703780 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={4=}
2013-06-05 18:25:33.761693 7ff10f703780 -1 obtain_monmap unable to find a monmap
2013-06-05 18:25:33.761711 7ff10f703780 -1 unable to obtain a monmap: (2) No such file or directory
2013-06-05 18:25:33.761725 7ff10f703780  0 mon.ms-fe1003 does not exist in monmap, will attempt to join an existing cluster
2013-06-05 18:25:33.761943 7ff10f703780 -1 no public_addr or public_network specified, and mon.ms-fe1003 not present in monmap or ceph.conf
2013-06-05 18:25:33.762389 7ff10f703780  1 mon.ms-fe1003@-1(probing) e0 preinit fsid 00000000-0000-0000-0000-000000000000
2013-06-05 18:25:33.762455 7ff10f703780 -1 mon.ms-fe1003@-1(probing) e0 error: cluster_uuid file exists with value 'c9da36e1-694a-4166-b346-9d8d4d1d1ac1', != our uuid 00000000-0000-0000-0000-000000000000
2013-06-05 18:25:34.297132 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:35.313140 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:36.327476 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:37.339902 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:38.354432 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:39.369904 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined
2013-06-05 18:25:40.384275 7ff10c57b700 -1 asok(0x2463000) AdminSocket: request 'mon_status' not defined

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5292: mon: monitor crashing due to not being in the monmap (no monmap to be in) (Resolved, Joao Eduardo Luis, 06/10/2013)

Actions #1

Updated by Joao Eduardo Luis almost 11 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Joao Eduardo Luis
Actions #2

Updated by Faidon Liambotis almost 11 years ago

Cephdrop now has 5256-ceph-mon.ms-be1005.log.gz, 5256-ceph-mon.ms-fe1001.log.gz, 5256-ceph-mon.ms-fe1003.log.gz.

The timeline of events as I recall them (could be wrong) is:
1) upgrade ms-fe1003; quorum remains with the other two, still at 0.56.6
2) upgrade ms-fe1001
- wait until a cuttlefish quorum is established
3) upgrade ms-be1005 (this host also has OSDs)
- ms-fe1003 crashes and is unable to recover
- the effects of #5257 (out-of-order osdmap epoch) are observed
(the last two could have happened in the opposite order)

Actions #3

Updated by Joao Eduardo Luis almost 11 years ago

Okay, here's the likely order of events in this case:

  • the monitor was converting when it was killed for some reason.
  • the monitor was restarted and must have found an on-going conversion (those flags are still present at the later crash), but due to the preforker bug it didn't exit as it should have; instead, it kept going.
  • given the monitor must have converted the monmap portion of the store, it was able to start; however, it was likely killed while converting the paxos portion (which usually takes the longest), and that must have triggered a sync upon starting.
  • said sync was aborted for some reason not present in the log, triggering a store-wide clear; due to a bug there, the monitor ended up removing not only the store contents (by design) but also the backup monmap we stashed before starting the sync (that part is the bug).
  • the monitor crashed due to the C_HeartbeatInterval segfault (yet another bug).
  • the monitor was then unable to restart because it lacked a monmap.
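
The first two bullets describe a startup gate that should have fired but didn't. Here is a minimal C++ sketch of that failure mode; `Store`, `preinit`, `start_buggy`, and `start_fixed` are hypothetical names for illustration, not actual Ceph code. The pre-init check correctly reports the unfinished conversion, but the buggy daemonizer drops the result, so the half-converted monitor keeps running.

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for the monitor store: a set of marker keys.
struct Store {
  std::set<std::string> markers;
  bool has(const std::string& m) const { return markers.count(m) > 0; }
};

// Pre-init check: refuse to start on a store with an unfinished conversion.
int preinit(const Store& s) {
  if (s.has("mon_convert_in_progress"))
    return -17;  // EEXIST, mirroring "(17) File exists" in the log above
  return 0;
}

// Buggy daemonizer: ignores the child's init result and reports success,
// so the process carries on against a half-converted store.
int start_buggy(const Store& s) {
  (void)preinit(s);  // error silently dropped
  return 0;
}

// Fixed daemonizer: propagates the init error so startup aborts.
int start_fixed(const Store& s) {
  return preinit(s);
}
```

The fix is simply to propagate the pre-init result up through the forking wrapper instead of discarding it.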

So these would be the 3 bugs present in this ticket (in order of occurrence):

  1. the monitor started even though its store was marked as having an on-going conversion; this results from the Preforker bug, for which a fix has already been merged upstream;
  2. a sync bug that led to the monmap being lost upon the failed sync attempt; and
  3. the C_HeartbeatInterval segmentation fault, which I haven't been able to track down yet.

The first bug was addressed by Sage last week, but didn't make it into 0.61.3 (the release notes mention this bug). The second bug should be fixed by a small rework of the monitor's sync code in wip-mon-sync-fixes. The third bug is still being tracked down.
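
The backtrace in the description points at Monitor::C_HeartbeatInterval::finish() running from SafeTimer. A minimal C++ sketch of that general class of bug follows; `Monitor` and `HeartbeatContext` here are illustrative stand-ins, not Ceph's actual types. A timer completion holds a raw pointer into monitor state; if teardown doesn't cancel the event first, finish() dereferences a stale pointer and segfaults.

```cpp
#include <cassert>

// Stand-in for the monitor state a heartbeat callback touches.
struct Monitor {
  int heartbeats = 0;
  void tick() { ++heartbeats; }
};

// Stand-in for a timer completion context. The raw pointer dangles
// if the monitor state is torn down before the event is cancelled.
struct HeartbeatContext {
  Monitor* mon;
  // Guarded finish(): returns false if the event was cancelled,
  // instead of dereferencing a stale pointer.
  bool finish() {
    if (!mon) return false;
    mon->tick();
    return true;
  }
  // What cancel_event() must guarantee before teardown.
  void cancel() { mon = nullptr; }
};
```

The invariant being illustrated: whoever tears down the monitor (e.g. during a sync abort and bootstrap) must cancel pending timer events before the state they reference goes away.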

Actions #4

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent
Actions #5

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Resolved