Bug #6834


nightlies: monitor crashed in emperor

Added by Tamilarasi muthamizhan over 10 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

logs: ubuntu@teuthology:/a/teuthology-2013-11-18_19:31:27-upgrade-parallel-next-testing-basic-plana/107772

2013-11-19T15:18:39.840 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: test/librados/aio.cc:153: Failure
2013-11-19T15:18:39.840 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: Value of: test_data.init()
2013-11-19T15:18:39.841 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]:   Actual: "create_one_pool(test-rados-api-plana31-666-1) failed: error rados_connect failed with error -110" 
2013-11-19T15:18:39.841 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: Expected: "" 
2013-11-19T15:18:39.841 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: [  FAILED  ] LibRadosAio.SimpleWrite (304099 ms)
2013-11-19T15:18:39.841 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: [ RUN      ] LibRadosAio.SimpleWritePP
2013-11-19T15:23:40.910 INFO:teuthology.task.ceph.osd.1.err:[10.214.131.16]: daemon-helper: command crashed with signal 6
2013-11-19T15:23:40.910 INFO:teuthology.task.ceph.osd.2.err:[10.214.131.16]: daemon-helper: command crashed with signal 6
2013-11-19T15:23:42.215 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: test/librados/aio.cc:181: Failure
2013-11-19T15:23:42.215 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: Value of: test_data.init()
2013-11-19T15:23:42.215 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]:   Actual: "create_one_pool(test-rados-api-plana31-666-2) failed: error cluster.connect failed with error -110" 
2013-11-19T15:23:42.215 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: Expected: "" 
2013-11-19T15:23:42.216 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: [  FAILED  ] LibRadosAio.SimpleWritePP (302325 ms)
2013-11-19T15:23:42.216 INFO:teuthology.task.workunit.client.0.out:[10.214.131.9]: [ RUN      ] LibRadosAio.WaitForSafe
2013-11-19T15:28:19.001 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]: terminate called after throwing an instance of 'ceph::buffer::bad_alloc'
2013-11-19T15:28:19.001 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:   what():  buffer::bad_alloc
2013-11-19T15:28:19.001 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]: *** Caught signal (Aborted) **
2013-11-19T15:28:19.002 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  in thread 7fd3278e8700
2013-11-19T15:28:19.505 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  ceph version 0.72-205-g703f9a0 (703f9a09e2449712a99f0865db982cb0c66d820d)
2013-11-19T15:28:19.505 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  1: ceph-mon() [0x81b77a]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  2: (()+0xfcb0) [0x7fd32be44cb0]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  3: (gsignal()+0x35) [0x7fd32a543425]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  4: (abort()+0x17b) [0x7fd32a546b8b]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd32ae9669d]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  6: (()+0xb5846) [0x7fd32ae94846]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  7: (()+0xb5873) [0x7fd32ae94873]
2013-11-19T15:28:19.506 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  8: (()+0xb596e) [0x7fd32ae9496e]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  9: (ceph::buffer::create_page_aligned(unsigned int)+0xda) [0x735eca]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  10: (ceph::buffer::list::append(char const*, unsigned int)+0x49) [0x736009]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  11: (pg_stat_t::encode(ceph::buffer::list&) const+0x33) [0x7bb523]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  12: (PGMonitor::encode_pending(MonitorDBStore::Transaction*)+0x489) [0x5f3ba9]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  13: (PaxosService::propose_pending()+0x35d) [0x59aacd]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  14: (PGMonitor::check_osd_map(unsigned int)+0x1560) [0x601c20]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  15: (PGMonitor::on_active()+0xb6) [0x602076]
2013-11-19T15:28:19.507 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  16: (PaxosService::_active()+0x432) [0x59ea82]
2013-11-19T15:28:19.508 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  17: (Context::complete(int)+0x9) [0x570b99]
2013-11-19T15:28:19.509 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  18: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x95) [0x5734b5]
2013-11-19T15:28:19.509 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  19: (Paxos::handle_accept(MMonPaxos*)+0x86d) [0x59466d]
2013-11-19T15:28:19.509 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  20: (Paxos::dispatch(PaxosServiceMessage*)+0x27b) [0x5976eb]
2013-11-19T15:28:19.509 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  21: (Monitor::dispatch(MonSession*, Message*, bool)+0x558) [0x56ee48]
2013-11-19T15:28:19.509 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  22: (Monitor::_ms_dispatch(Message*)+0x204) [0x56f254]
2013-11-19T15:28:19.510 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  23: (Monitor::ms_dispatch(Message*)+0x32) [0x589262]
2013-11-19T15:28:19.510 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  24: (DispatchQueue::entry()+0x549) [0x7f16a9]
2013-11-19T15:28:19.510 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  25: (DispatchQueue::DispatchThread::entry()+0xd) [0x71e51d]
2013-11-19T15:28:19.510 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  26: (()+0x7e9a) [0x7fd32be3ce9a]
2013-11-19T15:28:19.510 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]:  27: (clone()+0x6d) [0x7fd32a6013fd]
2013-11-19T15:28:20.343 INFO:teuthology.task.ceph.mon.a.err:[10.214.131.16]: 2013-11-19 15:28:19.501135 7fd3278e8700 -1 *** Caught signal (Aborted) **
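
Note: the error -110 returned by rados_connect()/cluster.connect() above is -ETIMEDOUT, and the ~304 s the failing tests ran before giving up is consistent with the default 300 s client mount timeout; the client simply could not reach a monitor quorum after the mon aborted. A minimal sketch (not the test's actual create_one_pool() helper; the client id and conf lookup are illustrative) of how that error surfaces through the librados C++ API:

#include <rados/librados.hpp>
#include <cstdio>
#include <cstring>

int main() {
  librados::Rados cluster;
  // "admin" -> connect as client.admin; the id here is illustrative.
  int err = cluster.init("admin");
  if (err < 0) {
    std::fprintf(stderr, "init failed: %s\n", std::strerror(-err));
    return 1;
  }
  // Pick up monitor addresses/keys from the default ceph.conf locations.
  cluster.conf_read_file(NULL);
  // Blocks while opening a monitor session; if no monitor quorum answers,
  // it eventually gives up and returns -110 (-ETIMEDOUT), the error the
  // create_one_pool() calls above reported.
  err = cluster.connect();
  if (err < 0) {
    std::fprintf(stderr, "connect failed: %s (%d)\n", std::strerror(-err), err);
    return 1;
  }
  cluster.shutdown();
  return 0;
}

(Builds with g++ and -lrados; run against the cluster's ceph.conf.)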

ubuntu@teuthology:/a/teuthology-2013-11-18_19:31:27-upgrade-parallel-next-testing-basic-plana/107772$ cat config.yaml 
archive_path: /var/lib/teuthworker/archive/teuthology-2013-11-18_19:31:27-upgrade-parallel-next-testing-basic-plana/107772
description: upgrade-parallel/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/more.yaml 5-workload/readwrite.yaml
  6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml distro/ubuntu_12.04.yaml}
email: null
job_id: '107772'
kernel: &id001
  kdb: true
  sha1: 68174f0c97e7c0561aa844059569e3cbf0a43de1
last_in_suite: false
machine_type: plana
name: teuthology-2013-11-18_19:31:27-upgrade-parallel-next-testing-basic-plana
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug ms: 1
        debug osd: 5
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 703f9a09e2449712a99f0865db982cb0c66d820d
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
  install:
    ceph:
      sha1: 703f9a09e2449712a99f0865db982cb0c66d820d
  s3tests:
    branch: next
  workunit:
    sha1: 703f9a09e2449712a99f0865db982cb0c66d820d
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - client.0
  - mon.c
targets:
  ubuntu@plana24.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo6/Qy7WndiDde6qjAsEfYeTL0XjBdqukKtDiEReu57As/y3sremLi9BX3xwMh4Z6y8KSzmENWNTqaq2ysyhv+H+LwsRMCt0HeY6lRP7zdO9nEUNY+/WCPmJiRlph7TjVCUxcHy6l6bfx/LBWthTyQSNbXYWsKHNuUsvNWeTDDEhMQUytDlkPzO3gCwpnHHYFoyI/eOltAcsTlgCbTcjjGfNhoC0VpYI9/zjCubohXyf6qOfgDuSTis1QhASa/PODMJlBTiAYD/Y4Ad9bkmS++433sycnFHi6h0ZCVRAbw350uxzzbdZy3+r5v5jItpoCUhz7Cgr4SEuNTVQM0rbW9
  ubuntu@plana31.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDq2i6+YCh22Z7z0r+7zM12TylDnJ+WIKp/XLn+TY3SpahJSrvAPHOe8QVapSBSQFebWJhF+s20ONozhVYUhGv8LjbgsTcI1pRvlk5J9w8u2CjVDtbnzyhbCqbxUZLnhjleb26LaukM96DJCsEBuWL6+hZeB153Ky7kgehU9Kp/W8MygBFE73F271YhAt/seCEI6KPRUSPleLkIlr8jEfP5mpuv4gFpo+lA7xKuLNsn4tjiNFOMHG99kJZMYcGmvmVEhxeu0I4iKL/NLbwuf780o0idfj2Ss6ZvpwOot2UHTh64x6LOSOaB/5Op2/A2sv4hnhhAMPcXt1qw+IUqmbo3
tasks:
- internal.lock_machines:
  - 2
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph: null
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test.sh
teuthology_branch: next
verbose: true

#1 Updated by Greg Farnum over 10 years ago

  • Assignee changed from Greg Farnum to Tamilarasi muthamizhan
  • Priority changed from Urgent to Normal

From this backtrace it looks like either there was a hardware problem or the monitor was using so much memory that it couldn't get any more. Figuring that out will require backtraces, logs, etc., and they don't appear to be available here.
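
For illustration, the abort path in the trace (frames 3-9) is the standard C++ out-of-memory failure mode: ceph::buffer::create_page_aligned() throws ceph::buffer::bad_alloc, nothing on the monitor's dispatch thread catches it, so std::terminate() runs the GNU verbose terminate handler, which calls abort() and raises SIGABRT (the "command crashed with signal 6" above). A minimal stand-alone sketch of that mechanism, not Ceph's actual buffer code:

#include <cstdlib>   // posix_memalign
#include <cstddef>   // std::size_t
#include <cstdint>   // SIZE_MAX
#include <new>       // std::bad_alloc

// Stand-in for ceph::buffer::create_page_aligned(): an aligned allocation
// whose failure is reported by throwing, the way ceph::buffer::bad_alloc is.
static void* create_page_aligned(std::size_t len) {
  void* ptr = NULL;
  if (posix_memalign(&ptr, 4096, len) != 0)  // ENOMEM once memory is exhausted
    throw std::bad_alloc();
  return ptr;
}

int main() {
  // Simulate exhaustion with an impossibly large request. The exception
  // propagates out uncaught: std::terminate() -> abort() -> SIGABRT,
  // producing a "Caught signal (Aborted)" backtrace shaped like the one
  // in this ticket.
  void* p = create_page_aligned(SIZE_MAX / 2);
  (void)p;
  return 0;  // never reached
}

In the actual trace the throw happens while PGMonitor::encode_pending() is appending pg_stat_t encodings to a bufferlist (frames 9-12).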

#2 Updated by Joao Eduardo Luis over 10 years ago

A user came forward with high monitor memory usage on ticket #6810. I wonder if this is the same thing, and the OS was somehow unable to allocate the necessary memory.

Tamil, if you happen to be able to reproduce this, reliably even, please let me know.

#3 Updated by Tamilarasi muthamizhan over 10 years ago

I was not able to reproduce this on my local cluster, though.

Looks like it did happen last night as well:

ubuntu@teuthology:/a/teuthology-2013-11-21_19:40:02-upgrade-parallel-master-testing-basic-plana/112928

#4 Updated by Tamilarasi muthamizhan over 10 years ago

  • Assignee changed from Tamilarasi muthamizhan to Joao Eduardo Luis
#5 Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Can't reproduce

Either it showed up again and has been associated with another issue, or it has been fixed.
