Bug #8036
leveldb: throws std::bad_alloc on 14.04
Status: Closed
Description
2014-04-08T02:10:47.043 INFO:teuthology.orchestra.run.err:[10.214.138.56]: marked in osd.0.
2014-04-08T02:10:47.275 INFO:teuthology.task.thrashosds.thrasher:Added osd 0
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:in_osds: [4, 1, 2, 0] out_osds: [5, 3] dead_osds: [] live_osds: [1, 4, 2, 3, 5, 0]
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:fixing pg num pool unique_pool_0
2014-04-08T02:10:52.277 DEBUG:teuthology.orchestra.run:Running [10.214.138.56]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2014-04-08T02:10:58.356 INFO:teuthology.orchestra.run.err:[10.214.138.56]: Traceback (most recent call last):
2014-04-08T02:10:58.356 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/bin/ceph", line 830, in <module>
2014-04-08T02:10:58.361 INFO:teuthology.orchestra.run.err:[10.214.138.56]: sys.exit(main())
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/bin/ceph", line 590, in main
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: conffile=conffile)
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/lib/python2.7/dist-packages/rados.py", line 208, in __init__
2014-04-08T02:10:58.701 INFO:teuthology.orchestra.run.err:[10.214.138.56]: self.librados = CDLL(librados_path)
2014-04-08T02:10:58.701 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
2014-04-08T02:10:59.537 INFO:teuthology.orchestra.run.err:[10.214.138.56]: self._handle = _dlopen(self._name, mode)
2014-04-08T02:10:59.537 INFO:teuthology.orchestra.run.err:[10.214.138.56]: OSError: librados.so.2: cannot map zero-fill pages: Cannot allocate memory
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: terminate called after throwing an instance of 'std::bad_alloc'
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: what(): std::bad_alloc
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: *** Caught signal (Aborted) **
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 13: (std::string::reserve(unsigned long)+0x34) [0x7fec052e9014]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 14: (std::string::append(unsigned long, char)+0x46) [0x7fec052e93d6]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 15: (leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*)+0x75) [0x7fec05567295]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 16: (leveldb::TableBuilder::Flush()+0x5c) [0x7fec0556740c]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 17: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0xb7) [0x7fec05567597]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 18: (leveldb::BuildTable(std::string const&, leveldb::Env*, leveldb::Options const&, leveldb::TableCache*, leveldb::Iterator*, leveldb::FileMetaData*)+0x27e) [0x7fec05543bee]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 19: (leveldb::DBImpl::WriteLevel0Table(leveldb::MemTable*, leveldb::VersionEdit*, leveldb::Version*)+0x104) [0x7fec05549704]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 20: (leveldb::DBImpl::CompactMemTable()+0xe3) [0x7fec0554aec3]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 21: (leveldb::DBImpl::BackgroundCompaction()+0x36) [0x7fec0554be16]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 22: (leveldb::DBImpl::BackgroundCall()+0x62) [0x7fec0554c9b2]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 23: (()+0x38b3b) [0x7fec0556ab3b]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 24: (()+0x8182) [0x7fec066a6182]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 25: (clone()+0x6d) [0x7fec04a4730d]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2014-04-08 09:12:01.463118 7febfedec700 -1 *** Caught signal (Aborted) **
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]:
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:14:21.939 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 92, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/thrashosds.py", line 172, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph_manager.py", line 153, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.138.56 with status 1: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps/177687
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rados_api_tests.yaml 6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
email: null
job_id: '177687'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  s3tests:
    branch: master
  workunit:
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
targets:
  ubuntu@vpm031.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDq7gmPqczEb6bxQUuUXQFnR2z6vfoN2b7ICm7PljWcJH5vvT3dyid6rrbKq/I8zHWFYa7uBu0VEztFc1VkCwqpQwhrWnDM6xni7mKGLwMHfYX8+6BVCIqjesmQIaISRYFYIAiOeiHJFdmP+5B2hrQPkagvW59pqHESqJACjxHQ6FmOnUxk5oTNQSQJVIbxsYzqodh5jX46ZVrbDHb1v+YjBU2wieyJuA9Pua7g5seOOoeJ2e+ty2nlRjfhpmwZvXh0wMZhBbOaNUVJYouMx3l92a0bGYD/PXdcdC/bBFFHGTKI7BaA4snhR8pkI8hKosbckOFxXcFzFtfHkEYsEssH
  ubuntu@vpm032.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDK/wagN/I7tt/S7YeIvefzygjStwb2VyzJCjuXSpm9gnOWVC7xKMGG4oHM30pV/+C0VWYRePZqbPGO9+Qf5CDffuYVMJCTBOlGtHB7KyDxaoFBY4CKWrg2st/uDxXaoNkE1c8MgVglFOsOtmWS4lAPlbff0OL2a6FcnTRidXDo+5zvqWg1WArPGghNTzwJ73jk9zACFaiisQZx8Hd+ZM6Gz7V8SmcXEkNEHp9fJJsTWy+rh1b0yQTCKWvsJjj1O0ykwPdB/cnHigzuzPPJOxgpNWiRoswo74lC2d5iUd4yB9Vfirpj2/a60/r/CWP2Fy16lG6Xo1C+U3AkEY14cdvB
  ubuntu@vpm033.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZGaf59QU1K0RVezxHArei9Y+UDyau5D7V8GqBYOMRMJ9E90vYvNw2dJZZI3C1Oj0SNc/BdjAlfpW/aRrYQ2xx8bCyvY3m6u3pocqO2EYfU8/wEaOc5THzsJvz6zxkKdhGl3BSs1w38qIwvxZAxDbAqelexzVdnQ1AAIkOXDU++uueTqPcvNFOzXegfbMoMp7yql2dbYUExkNTWJPhGRCSYa0zGKdiGTPOUqInsWkamaQPZy3SzMgB8Xjxs8E5joxggy+TxDMyP2VYH4gMgJIwfI2sHTv2/H6pJKeDEMyh1vduh5k1S+jtSu+TiTtD9rnlpQvrDf2Tz7Mi+vLe/3mT
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 120
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.17019
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rados_api_tests.yaml 6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
duration: 14274.8196849823
failure_reason: 'Command failed on 10.214.138.56 with status 1: ''adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Updated by Sage Weil about 10 years ago
- Subject changed from "ceph pg dump" (coredump) in upgrade:dumpling-x:stress-split-firefly-distro-basic-vps to leveldb: throws std::bad_alloc on 14.04
- Status changed from New to 12
- Priority changed from Normal to High
- Source changed from other to Q/A
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: terminate called after throwing an instance of 'std::bad_alloc'
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: what(): std::bad_alloc
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: *** Caught signal (Aborted) **
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 13: (std::string::reserve(unsigned long)+0x34) [0x7fec052e9014]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 14: (std::string::append(unsigned long, char)+0x46) [0x7fec052e93d6]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 15: (leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*)+0x75) [0x7fec05567295]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 16: (leveldb::TableBuilder::Flush()+0x5c) [0x7fec0556740c]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 17: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0xb7) [0x7fec05567597]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 18: (leveldb::BuildTable(std::string const&, leveldb::Env*, leveldb::Options const&, leveldb::TableCache*, leveldb::Iterator*, leveldb::FileMetaData*)+0x27e) [0x7fec05543bee]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 19: (leveldb::DBImpl::WriteLevel0Table(leveldb::MemTable*, leveldb::VersionEdit*, leveldb::Version*)+0x104) [0x7fec05549704]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 20: (leveldb::DBImpl::CompactMemTable()+0xe3) [0x7fec0554aec3]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 21: (leveldb::DBImpl::BackgroundCompaction()+0x36) [0x7fec0554be16]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 22: (leveldb::DBImpl::BackgroundCall()+0x62) [0x7fec0554c9b2]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 23: (()+0x38b3b) [0x7fec0556ab3b]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 24: (()+0x8182) [0x7fec066a6182]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 25: (clone()+0x6d) [0x7fec04a4730d]
Updated by Joao Eduardo Luis about 10 years ago
- Assignee set to Joao Eduardo Luis
Updated by Joao Eduardo Luis about 10 years ago
core is corrupted:
BFD: Warning: /home/ubuntu/joao/issues/8036/177687/remote/ubuntu@vpm033.front.sepia.ceph.com/coredump/1396948322.7626.core is truncated: expected core file size >= 758149120, found: 99594240
The coredump should have been 758MB in size; only 99MB made it. The size of the core leads me to believe the mon crashed due to ENOMEM (having the std::bad_alloc error also helps). It even looks like teuthology itself was unable to allocate memory:
OSError: librados.so.2: cannot map zero-fill pages: Cannot allocate memory
Now it would be wonderful to figure out how we ran out of memory though.
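For context, BFD's truncation warning comes from comparing the core's size on disk against what its own ELF program headers promise. A minimal sketch of that check (my own helper names; only little-endian ELF64 cores are handled, and this is an approximation of what BFD actually does):

```python
import os
import struct

PT_LOAD = 1

def expected_core_size(path):
    """Return the furthest byte any PT_LOAD segment claims to occupy
    (p_offset + p_filesz), i.e. the minimum size the core should have."""
    with open(path, 'rb') as f:
        ident = f.read(16)
        if ident[:4] != b'\x7fELF' or ident[4] != 2:
            raise ValueError('not an ELF64 file')
        # ELF64 header fields after the 16-byte e_ident
        hdr = struct.unpack('<HHIQQQIHHHHHH', f.read(48))
        e_phoff, e_phentsize, e_phnum = hdr[4], hdr[8], hdr[9]
        end = 0
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            # first 40 bytes of an ELF64 program header
            p_type, _flags, p_offset, _va, _pa, p_filesz = struct.unpack(
                '<IIQQQQ', f.read(40))
            if p_type == PT_LOAD:
                end = max(end, p_offset + p_filesz)
        return end

def is_truncated(path):
    """True when the file on disk is smaller than its headers promise,
    which is what produces BFD's 'expected core file size >= N' warning."""
    return os.path.getsize(path) < expected_core_size(path)
```

A core that stops growing mid-dump because the VM ran out of memory or disk would fail this check in exactly the way reported above.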
Updated by Josh Durgin about 10 years ago
This was run on VMs, which have much less memory than the usual physical machines.
Updated by Yuri Weinstein about 10 years ago
It's good practice to run tests on scaled-down machines; the question then is whether we fix bugs, like this one, that are related to memory size.
Updated by Joao Eduardo Luis about 10 years ago
It would be interesting to know why the monitor's virtual memory usage got to 700MB, although a portion of that should go to libs and friends. On the other hand, it's not uncommon for a monitor's memory to grow, especially during compaction, as appears to be the case here.
According to the config file for that run, the monitor should have been running on a server along with another monitor, an mds, and 3 OSDs. I don't know how much memory a VM typically has, but if they're running with "much less" memory than physical machines, then it's understandable how memory could simply run out during allocation peaks.
Also, I would think that while testing on scaled-down machines is good practice, someone trying this sort of software on a lower-end machine would also scale the deployment within reason. For instance, running two monitors on a Pi with 512MB of RAM is a big no-no.
I am tempted to consider this an unfortunate side effect of not having enough memory for a greedy deployment. I will, however, take a look at the monitor's stores in the morning, in the hope of confirming that this was a peak due to compaction and that other in-memory maps may have taken their toll as well. Otherwise, we may be leaking memory. I recall we have a way to run valgrind during teuthology runs; do we still run it on these sorts of scaled-down hardware deployments?
Updated by Joao Eduardo Luis about 10 years ago
I have been spending a fair amount of time trying to figure out what may have gone wrong here (and in #8067, which appears to be the same thing), and so far I have come up short.
The mon stores from this run are fairly boring: a few dozen MB in size, nothing out of the ordinary with regard to the number of maps; nothing in the logs pops out either.
I have rerun this test many times now. I initially attempted to add a valgrind override for the monitors in the yaml file to track potential memory leaks, but those runs would just hang waiting for the OSDs to start; I don't have an explanation for that beyond bad timing, or perhaps something in the upgrade suite going wrong when valgrind is in place.
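The exact override isn't attached to this ticket; the kind of snippet being described would look something like the following in the job yaml (the flag list here is illustrative, not the one actually used):

```yaml
# Hypothetical valgrind override for the monitors; flags are illustrative.
overrides:
  ceph:
    valgrind:
      mon: [--tool=memcheck, --leak-check=full, --show-reachable=yes]
```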
I finally dropped valgrind from the yaml file and started monitoring memory usage the good old-fashioned way: htop, and later a script outputting current virtual memory usage for all monitors. The monitors tend to peak at 350-400 MB, which is far from the 800MB reported. A couple of times the test did cause the mons to commit suicide, but that was because the disk they were sitting on dropped to 5% available space (while the mon stores themselves stayed at the expected few dozen MB). I intend to perform some more tests later today with adjusted 'mon data avail crit' values to allow the monitors to run longer without committing suicide; maybe then we'll get something out of a few more runs.
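The monitoring script itself isn't attached; a minimal sketch of the "good old-fashioned way" might look like this (Linux-specific /proc layout; the helper names are mine):

```python
import os
import re

def vm_usage_kb(status_text):
    """Pull VmSize/VmRSS (in kB) out of a /proc/<pid>/status blob."""
    usage = {}
    for key in ('VmSize', 'VmRSS'):
        m = re.search(r'^%s:\s+(\d+)\s+kB' % key, status_text, re.MULTILINE)
        if m:
            usage[key] = int(m.group(1))
    return usage

def report_mons():
    """Print current memory usage for every running ceph-mon process."""
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/comm' % pid) as f:
                if f.read().strip() != 'ceph-mon':
                    continue
            with open('/proc/%s/status' % pid) as f:
                print(pid, vm_usage_kb(f.read()))
        except OSError:
            continue  # process exited while we were scanning

if __name__ == '__main__':
    report_mons()
```

Run in a loop (e.g. under `watch`), this gives the same peak-usage picture as htop without the interactive UI.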
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from 12 to In Progress
Updated by Sage Weil almost 10 years ago
- Status changed from In Progress to Can't reproduce