Bug #8036
leveldb: throws std::bad_alloc on 14.04
Status: Closed
Description
2014-04-08T02:10:47.043 INFO:teuthology.orchestra.run.err:[10.214.138.56]: marked in osd.0.
2014-04-08T02:10:47.275 INFO:teuthology.task.thrashosds.thrasher:Added osd 0
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:in_osds: [4, 1, 2, 0] out_osds: [5, 3] dead_osds: [] live_osds: [1, 4, 2, 3, 5, 0]
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2014-04-08T02:10:52.276 INFO:teuthology.task.thrashosds.thrasher:fixing pg num pool unique_pool_0
2014-04-08T02:10:52.277 DEBUG:teuthology.orchestra.run:Running [10.214.138.56]: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2014-04-08T02:10:58.356 INFO:teuthology.orchestra.run.err:[10.214.138.56]: Traceback (most recent call last):
2014-04-08T02:10:58.356 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/bin/ceph", line 830, in <module>
2014-04-08T02:10:58.361 INFO:teuthology.orchestra.run.err:[10.214.138.56]: sys.exit(main())
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/bin/ceph", line 590, in main
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: conffile=conffile)
2014-04-08T02:10:58.362 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/lib/python2.7/dist-packages/rados.py", line 208, in __init__
2014-04-08T02:10:58.701 INFO:teuthology.orchestra.run.err:[10.214.138.56]: self.librados = CDLL(librados_path)
2014-04-08T02:10:58.701 INFO:teuthology.orchestra.run.err:[10.214.138.56]: File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
2014-04-08T02:10:59.537 INFO:teuthology.orchestra.run.err:[10.214.138.56]: self._handle = _dlopen(self._name, mode)
2014-04-08T02:10:59.537 INFO:teuthology.orchestra.run.err:[10.214.138.56]: OSError: librados.so.2: cannot map zero-fill pages: Cannot allocate memory
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: terminate called after throwing an instance of 'std::bad_alloc'
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: what(): std::bad_alloc
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: *** Caught signal (Aborted) **
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 13: (std::string::reserve(unsigned long)+0x34) [0x7fec052e9014]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 14: (std::string::append(unsigned long, char)+0x46) [0x7fec052e93d6]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 15: (leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*)+0x75) [0x7fec05567295]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 16: (leveldb::TableBuilder::Flush()+0x5c) [0x7fec0556740c]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 17: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0xb7) [0x7fec05567597]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 18: (leveldb::BuildTable(std::string const&, leveldb::Env*, leveldb::Options const&, leveldb::TableCache*, leveldb::Iterator*, leveldb::FileMetaData*)+0x27e) [0x7fec05543bee]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 19: (leveldb::DBImpl::WriteLevel0Table(leveldb::MemTable*, leveldb::VersionEdit*, leveldb::Version*)+0x104) [0x7fec05549704]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 20: (leveldb::DBImpl::CompactMemTable()+0xe3) [0x7fec0554aec3]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 21: (leveldb::DBImpl::BackgroundCompaction()+0x36) [0x7fec0554be16]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 22: (leveldb::DBImpl::BackgroundCall()+0x62) [0x7fec0554c9b2]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 23: (()+0x38b3b) [0x7fec0556ab3b]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 24: (()+0x8182) [0x7fec066a6182]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 25: (clone()+0x6d) [0x7fec04a4730d]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2014-04-08 09:12:01.463118 7febfedec700 -1 *** Caught signal (Aborted) **
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]:
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.478 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.479 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.480 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:14:21.939 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/run_tasks.py", line 92, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/thrashosds.py", line 172, in task
    thrash_proc.do_join()
  File "/home/teuthworker/teuthology-firefly/teuthology/task/ceph_manager.py", line 153, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.138.56 with status 1: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps/177687
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rados_api_tests.yaml 6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
email: null
job_id: '177687'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: vps
name: teuthology-2014-04-07_22:35:16-upgrade:dumpling-x:stress-split-firefly-distro-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: firefly
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  ceph-deploy:
    branch:
      dev: firefly
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 010dff12c38882238591bb042f8e497a1f7ba020
  s3tests:
    branch: master
  workunit:
    sha1: 010dff12c38882238591bb042f8e497a1f7ba020
owner: scheduled_teuthology@teuthology
roles:
- - mon.a
  - mon.b
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
  - mon.c
- - client.0
targets:
  ubuntu@vpm031.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDq7gmPqczEb6bxQUuUXQFnR2z6vfoN2b7ICm7PljWcJH5vvT3dyid6rrbKq/I8zHWFYa7uBu0VEztFc1VkCwqpQwhrWnDM6xni7mKGLwMHfYX8+6BVCIqjesmQIaISRYFYIAiOeiHJFdmP+5B2hrQPkagvW59pqHESqJACjxHQ6FmOnUxk5oTNQSQJVIbxsYzqodh5jX46ZVrbDHb1v+YjBU2wieyJuA9Pua7g5seOOoeJ2e+ty2nlRjfhpmwZvXh0wMZhBbOaNUVJYouMx3l92a0bGYD/PXdcdC/bBFFHGTKI7BaA4snhR8pkI8hKosbckOFxXcFzFtfHkEYsEssH
  ubuntu@vpm032.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDK/wagN/I7tt/S7YeIvefzygjStwb2VyzJCjuXSpm9gnOWVC7xKMGG4oHM30pV/+C0VWYRePZqbPGO9+Qf5CDffuYVMJCTBOlGtHB7KyDxaoFBY4CKWrg2st/uDxXaoNkE1c8MgVglFOsOtmWS4lAPlbff0OL2a6FcnTRidXDo+5zvqWg1WArPGghNTzwJ73jk9zACFaiisQZx8Hd+ZM6Gz7V8SmcXEkNEHp9fJJsTWy+rh1b0yQTCKWvsJjj1O0ykwPdB/cnHigzuzPPJOxgpNWiRoswo74lC2d5iUd4yB9Vfirpj2/a60/r/CWP2Fy16lG6Xo1C+U3AkEY14cdvB
  ubuntu@vpm033.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZGaf59QU1K0RVezxHArei9Y+UDyau5D7V8GqBYOMRMJ9E90vYvNw2dJZZI3C1Oj0SNc/BdjAlfpW/aRrYQ2xx8bCyvY3m6u3pocqO2EYfU8/wEaOc5THzsJvz6zxkKdhGl3BSs1w38qIwvxZAxDbAqelexzVdnQ1AAIkOXDU++uueTqPcvNFOzXegfbMoMp7yql2dbYUExkNTWJPhGRCSYa0zGKdiGTPOUqInsWkamaQPZy3SzMgB8Xjxs8E5joxggy+TxDMyP2VYH4gMgJIwfI2sHTv2/H6pJKeDEMyh1vduh5k1S+jtSu+TiTtD9rnlpQvrDf2Tz7Mi+vLe/3mT
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0: null
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    thrash_primary_affinity: false
    timeout: 1200
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- install.upgrade:
    mon.c: null
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rados/test-upgrade-firefly.sh
- workunit:
    branch: dumpling
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- rgw:
    client.0:
      idle_timeout: 120
- swift:
    client.0:
      rgw_server: client.0
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
teuthology_branch: firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.17019
description: upgrade/dumpling-x/stress-split/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/rados_api_tests.yaml 6-next-mon/monb.yaml 7-workload/rados_api_tests.yaml 8-next-mon/monc.yaml 9-workload/{rados_api_tests.yaml rbd-python.yaml rgw-s3tests.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
duration: 14274.8196849823
failure_reason: 'Command failed on 10.214.138.56 with status 1: ''adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'''
flavor: basic
owner: scheduled_teuthology@teuthology
success: false
Updated by Sage Weil about 10 years ago
- Subject changed from "ceph pg dump" (coredump) in upgrade:dumpling-x:stress-split-firefly-distro-basic-vps to leveldb: throws std::bad_alloc on 14.04
- Status changed from New to 12
- Priority changed from Normal to High
- Source changed from other to Q/A
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: terminate called after throwing an instance of 'std::bad_alloc'
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: what(): std::bad_alloc
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: *** Caught signal (Aborted) **
2014-04-08T02:12:01.166 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: in thread 7febfedec700
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: ceph version 0.79-42-g010dff1 (010dff12c38882238591bb042f8e497a1f7ba020)
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 1: ceph-mon() [0x86967f]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 2: (()+0x10340) [0x7fec066ae340]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 3: (gsignal()+0x39) [0x7fec04982f79]
2014-04-08T02:12:01.472 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 4: (abort()+0x148) [0x7fec04986388]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec0528e6b5]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 6: (()+0x5e836) [0x7fec0528c836]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 7: (()+0x5e863) [0x7fec0528c863]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 8: (()+0x5eaa2) [0x7fec0528caa2]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 9: (()+0x12c6e) [0x7fec068cec6e]
2014-04-08T02:12:01.473 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 10: (tc_new()+0x1e0) [0x7fec068eeb60]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fec052e83b9]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7fec052e8f7b]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 13: (std::string::reserve(unsigned long)+0x34) [0x7fec052e9014]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 14: (std::string::append(unsigned long, char)+0x46) [0x7fec052e93d6]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 15: (leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*)+0x75) [0x7fec05567295]
2014-04-08T02:12:01.474 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 16: (leveldb::TableBuilder::Flush()+0x5c) [0x7fec0556740c]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 17: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0xb7) [0x7fec05567597]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 18: (leveldb::BuildTable(std::string const&, leveldb::Env*, leveldb::Options const&, leveldb::TableCache*, leveldb::Iterator*, leveldb::FileMetaData*)+0x27e) [0x7fec05543bee]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 19: (leveldb::DBImpl::WriteLevel0Table(leveldb::MemTable*, leveldb::VersionEdit*, leveldb::Version*)+0x104) [0x7fec05549704]
2014-04-08T02:12:01.475 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 20: (leveldb::DBImpl::CompactMemTable()+0xe3) [0x7fec0554aec3]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 21: (leveldb::DBImpl::BackgroundCompaction()+0x36) [0x7fec0554be16]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 22: (leveldb::DBImpl::BackgroundCall()+0x62) [0x7fec0554c9b2]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 23: (()+0x38b3b) [0x7fec0556ab3b]
2014-04-08T02:12:01.476 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 24: (()+0x8182) [0x7fec066a6182]
2014-04-08T02:12:01.477 INFO:teuthology.task.ceph.mon.a.err:[10.214.138.56]: 25: (clone()+0x6d) [0x7fec04a4730d]
Updated by Joao Eduardo Luis about 10 years ago
- Assignee set to Joao Eduardo Luis
Updated by Joao Eduardo Luis about 10 years ago
core is corrupted:
BFD: Warning: /home/ubuntu/joao/issues/8036/177687/remote/ubuntu@vpm033.front.sepia.ceph.com/coredump/1396948322.7626.core is truncated: expected core file size >= 758149120, found: 99594240
The coredump should have been 758MB in size; only 99MB made it. The size of the core leads me to believe the mon crashed due to ENOMEM (having the std::bad_alloc error also helps). It even looks like teuthology itself was unable to allocate memory:
OSError: librados.so.2: cannot map zero-fill pages: Cannot allocate memory
Now it would be wonderful to figure out how we ran out of memory though.
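For context, BFD's truncation warning comes from comparing the core's size on disk against what its own ELF program headers promise. A minimal sketch of that check (my own helper names; only little-endian ELF64 cores are handled, and this is an approximation of what BFD actually does):

```python
import os
import struct

PT_LOAD = 1

def expected_core_size(path):
    """Return the furthest byte any PT_LOAD segment claims to occupy
    (p_offset + p_filesz), i.e. the minimum size the core should have."""
    with open(path, 'rb') as f:
        ident = f.read(16)
        if ident[:4] != b'\x7fELF' or ident[4] != 2:
            raise ValueError('not an ELF64 file')
        # ELF64 header fields after the 16-byte e_ident
        hdr = struct.unpack('<HHIQQQIHHHHHH', f.read(48))
        e_phoff, e_phentsize, e_phnum = hdr[4], hdr[8], hdr[9]
        end = 0
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            # first 40 bytes of an ELF64 program header
            p_type, _flags, p_offset, _va, _pa, p_filesz = struct.unpack(
                '<IIQQQQ', f.read(40))
            if p_type == PT_LOAD:
                end = max(end, p_offset + p_filesz)
        return end

def is_truncated(path):
    """True when the file on disk is smaller than its headers promise,
    which is what produces BFD's 'expected core file size >= N' warning."""
    return os.path.getsize(path) < expected_core_size(path)
```

A core that stops growing mid-dump because the VM ran out of memory or disk would fail this check in exactly the way reported above.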
Updated by Josh Durgin about 10 years ago
This was run on VMs, which have much less memory than the usual physical machines.
Updated by Yuri Weinstein about 10 years ago
It's good practice to run tests on scaled-down machines; the question then is whether we fix bugs, like this one, that are related to memory size.
Updated by Joao Eduardo Luis about 10 years ago
It would be interesting to know why the monitor's virtual memory usage got to 700MB, although a portion of that should go to libs and friends. On the other hand, it's not uncommon for a monitor's memory to grow, especially during compaction, as appears to be the case here.
According to the config file for that run, the monitor should have been running on a server along with another monitor, an mds, and 3 OSDs. I don't know how much memory a VM typically has, but if they're running with "much less" memory than physical machines, then it's understandable how memory could simply run out during allocation peaks.
Also, I would think that while testing on scaled-down machines is good practice, someone trying this sort of software on a lower-end machine would also scale the deployment within reason. For instance, running two monitors on a Pi with 512MB of RAM is a big no-no.
I am tempted to consider this an unfortunate side effect of not having enough memory for a greedy deployment. I will, however, take a look at the monitor's stores in the morning, in the hope of confirming that this was a peak due to compaction and that other in-memory maps may have taken their toll as well. Otherwise, we may be leaking memory. I recall we have a way to run valgrind during teuthology runs; do we still run it on these sorts of scaled-down hardware deployments?
Updated by Joao Eduardo Luis about 10 years ago
I have been spending a fair amount of time trying to figure out what may have gone wrong here (and in #8067, which appears to be the same thing), and so far I have come up short.
The mon stores from this run are fairly boring: a few dozen MB in size, nothing out of the ordinary with regard to the number of maps; nothing in the logs pops out either.
I have rerun this test many times now. I initially attempted to add a valgrind override for the monitors in the yaml file to track potential memory leaks, but those runs would just hang waiting for the OSDs to start; I don't have an explanation for that beyond bad timing, or perhaps something in the upgrade suite going wrong when valgrind is in place.
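The exact override isn't attached to this ticket; the kind of snippet being described would look something like the following in the job yaml (the flag list here is illustrative, not the one actually used):

```yaml
# Hypothetical valgrind override for the monitors; flags are illustrative.
overrides:
  ceph:
    valgrind:
      mon: [--tool=memcheck, --leak-check=full, --show-reachable=yes]
```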
I finally dropped valgrind from the yaml file and started monitoring memory usage the good old-fashioned way: htop, and later a script outputting current virtual memory usage for all monitors. The monitors tend to peak at 350-400 MB, which is far from the 800MB reported. A couple of times the test did cause the mons to commit suicide, but that was because the disk they were sitting on dropped to 5% available space (while the mon stores themselves stayed at the expected few dozen MB). I intend to perform some more tests later today with adjusted 'mon data avail crit' values to allow the monitors to run longer without committing suicide; maybe then we'll get something out of a few more runs.
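The monitoring script itself isn't attached; a minimal sketch of the "good old-fashioned way" might look like this (Linux-specific /proc layout; the helper names are mine):

```python
import os
import re

def vm_usage_kb(status_text):
    """Pull VmSize/VmRSS (in kB) out of a /proc/<pid>/status blob."""
    usage = {}
    for key in ('VmSize', 'VmRSS'):
        m = re.search(r'^%s:\s+(\d+)\s+kB' % key, status_text, re.MULTILINE)
        if m:
            usage[key] = int(m.group(1))
    return usage

def report_mons():
    """Print current memory usage for every running ceph-mon process."""
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/comm' % pid) as f:
                if f.read().strip() != 'ceph-mon':
                    continue
            with open('/proc/%s/status' % pid) as f:
                print(pid, vm_usage_kb(f.read()))
        except OSError:
            continue  # process exited while we were scanning

if __name__ == '__main__':
    report_mons()
```

Run in a loop (e.g. under `watch`), this gives the same peak-usage picture as htop without the interactive UI.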
Updated by Joao Eduardo Luis about 10 years ago
- Status changed from 12 to In Progress
Updated by Sage Weil almost 10 years ago
- Status changed from In Progress to Can't reproduce