Bug #14457
tcmalloc oom bug (closed)
Description
Run: http://pulpito.ceph.com/teuthology-2016-01-20_14:48:08-upgrade:infernalis-x-jewel-distro-basic-vps/
Jobs: 35541, 35542
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-20_14:48:08-upgrade:infernalis-x-jewel-distro-basic-vps/35541/teuthology.log
2016-01-20T18:17:07.358 INFO:tasks.ceph:Waiting until ceph osds are all up...
2016-01-20T18:17:07.358 INFO:teuthology.orchestra.run.vpm122:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd dump --format=json'
2016-01-20T18:17:08.771 INFO:tasks.ceph.mon.b.vpm122.stdout:starting mon.b rank 1 at 172.21.2.122:6790/0 mon_data /var/lib/ceph/mon/ceph-b fsid 3fda6002-eea7-4a91-a94b-63e0a1a801c0
2016-01-20T18:17:09.273 INFO:teuthology.misc.health.vpm122.stderr:2016-01-21 02:17:09.269914 7f6a706a8700 -1 asok(0x7f6a68000f80) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.23496.asok': (13) Permission denied
Per IRC chat, we suspect either a missing sudo or changed ceph CLI behavior.
Dan's assessment:
so /var/run/ceph is owned by ceph.ceph and is 770
dmick @ 10:39
it is indeed the ceph osd dump command that's failing to create a socket in /var/run/ceph
/var/run/ceph is also that ownership/permission on one of the LRC hosts, so I doubt it's new
but...it certainly used to be the case that you could run as nonroot as long as you had access to the ceph.conf and keyring files
10:41
hm
and it still is the case there
but that ceph command doesn't try to open a client-admin socket
10:43
so perhaps that's new behavior
interestingly, that message is apparently just a warning, and not fatal
and this would be librados
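The permission reasoning in the chat above can be sketched as a small check. This is only an illustration, assuming plain POSIX mode bits (no ACLs or capabilities); the function name and the numeric uid/gid values are hypothetical and not taken from the test machines.

```python
import stat


def can_create_in_dir(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a process with this uid/gids may create an entry
    (for example a UNIX domain socket) in a directory with the given
    mode and ownership. Creating an entry needs both write and search
    (execute) permission on the directory."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (mode & need) == need


# Hypothetical ids: directory owned by uid/gid 167 with mode 770,
# caller is uid 1000 and not a member of group 167.
print(can_create_in_dir(0o770, 167, 167, 1000, {1000}))  # False
```

Under these assumptions a non-root caller outside the ceph group cannot create the admin socket, which matches the "(13) Permission denied" in the log, while the same call with sudo (uid 0) would succeed.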
Updated by Yuri Weinstein over 8 years ago
Run: http://pulpito.ceph.com/teuthology-2016-01-21_13:17:25-upgrade:infernalis-x-jewel-distro-basic-vps/
Further debugging by Josh (using gdb on the osd 2 process) revealed that several threads of the osd 2 process were stuck in tcmalloc.
So the action here is probably to get the updated tcmalloc installed on the VPS machines, so that we can set the tcmalloc environment variable to mitigate this problem.
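One way to repeat this kind of check is to dump all thread backtraces (for example with `gdb -p <pid> -batch -ex "thread apply all bt"`) and list the threads whose frames mention tcmalloc. The following is a rough sketch, not the procedure Josh used; the function name is hypothetical, and it assumes gdb's usual "Thread N (...)" header lines followed by "#k ..." frame lines.

```python
import re

# Matches gdb's per-thread header, e.g. "Thread 12 (Thread 0x... (LWP 345)):"
THREAD_HEADER = re.compile(r"^Thread (\d+) ")


def threads_in_tcmalloc(bt_text, needle="tcmalloc"):
    """Return the gdb thread numbers whose backtrace frames mention
    `needle`, in the order they appear in the dump."""
    hits = []
    current = None
    for line in bt_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif current is not None and line.lstrip().startswith("#"):
            if needle in line and current not in hits:
                hits.append(current)
    return hits
```

Feeding it the saved gdb output as a string gives a quick list of suspect threads; the full backtraces still need to be read to decide whether those threads are actually stuck.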
Updated by Yuri Weinstein over 8 years ago
- Project changed from Ceph to teuthology
Updated by Dan Mick over 8 years ago
What is this updated tcmalloc you talk about? Do we need something later than the distro package?
Updated by Yuri Weinstein over 8 years ago
- Project changed from teuthology to Ceph
Updated by Samuel Just over 8 years ago
- Subject changed from "failed: AdminSocket::bind_and_listen..Permission denied" in upgrade:infernalis-x-jewel-distro-basic-vps to tcmalloc oom bug
Updated by Yuri Weinstein over 8 years ago
Suspect the same root cause for the failures in this run: http://pulpito.ceph.com/teuthology-2016-01-26_02:10:02-upgrade:hammer-x-jewel-distro-basic-vps/
Updated by Sage Weil about 8 years ago
- Status changed from New to Need More Info
- Priority changed from Urgent to High
Waiting for a VPS with more memory to see whether this is related to low memory.
Updated by Josh Durgin about 8 years ago
- Status changed from Need More Info to Duplicate
I think this was http://tracker.ceph.com/issues/13522
Updated by Sage Weil about 8 years ago
- Related to Bug #13522: Apparent deadlock between tcmalloc getting a stacktrace and dlopen allocating memory added