
Bug #16686

cephfs.test_volume_client.TestVolumeClient failure (with "src/tcmalloc.cc:283] Attempt to free invalid pointer")

Added by Ramana Raja almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature:

Description

Ran into the following test failure when running the VolumeClient
tests with vstart_runner locally on a vstart Ceph cluster,
using the Ceph master branch (HEAD commit c5e5ef7179886c39ab826405fb5454e6640efa83):

2016-07-14 18:40:24,328.328 INFO:__main__:run args=['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2016-07-14 18:40:24,328.328 INFO:__main__:Running ['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2016-07-14 18:40:24,530.530 INFO:__main__:run args=['./bin/ceph', 'daemon', 'mon.a', 'config', 'get', 'mon_pg_warn_max_per_osd']
2016-07-14 18:40:24,530.530 INFO:__main__:Running ['./bin/ceph', 'daemon', 'mon.a', 'config', 'get', 'mon_pg_warn_max_per_osd']
2016-07-14 18:40:24,660.660 INFO:tasks.cephfs.filesystem:_json_asok output: {
"mon_pg_warn_max_per_osd": "300"
}

2016-07-14 18:40:24,660.660 INFO:tasks.cephfs.test_volume_client:max_per_osd 300
2016-07-14 18:40:24,661.661 INFO:tasks.cephfs.test_volume_client:osd_count 3
2016-07-14 18:40:24,661.661 INFO:tasks.cephfs.test_volume_client:max_overall 900
2016-07-14 18:40:24,661.661 INFO:tasks.cephfs.test_volume_client:existing_pg_count 26
2016-07-14 18:40:24,661.661 INFO:tasks.cephfs.test_volume_client:expected_pg_num 87
2016-07-14 18:40:24,661.661 INFO:__main__:run args=['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2016-07-14 18:40:24,661.661 INFO:__main__:Running ['./bin/ceph', 'osd', 'dump', '--format=json-pretty']
2016-07-14 18:40:24,845.845 INFO:__main__:run args=['python', '-c', '\nfrom ceph_volume_client import CephFSVolumeClient, VolumePath\nimport logging\nlog = logging.getLogger("ceph_volume_client")\nlog.addHandler(logging.StreamHandler())\nlog.setLevel(logging.DEBUG)\nvc = CephFSVolumeClient("manila", "./ceph.conf", "ceph", None, None)\nvc.connect()\n\nvp = VolumePath("grpid", "volid")\nvc.create_volume(vp, 10, data_isolated=True)\n\nvc.disconnect()\n ']
2016-07-14 18:40:24,846.846 INFO:__main__:Running ['python', '-c', '\nfrom ceph_volume_client import CephFSVolumeClient, VolumePath\nimport logging\nlog = logging.getLogger("ceph_volume_client")\nlog.addHandler(logging.StreamHandler())\nlog.setLevel(logging.DEBUG)\nvc = CephFSVolumeClient("manila", "./ceph.conf", "ceph", None, None)\nvc.connect()\n\nvp = VolumePath("grpid", "volid")\nvc.create_volume(vp, 10, data_isolated=True)\n\nvc.disconnect()\n ']
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x557694b27cc0
2016-07-14 18:40:25,142.142 INFO:__main__:test_data_isolated (tasks.cephfs.test_volume_client.TestVolumeClient) ... ERROR
2016-07-14 18:40:25,143.143 ERROR:__main__:Traceback (most recent call last):
File "/home/rraja/git/ceph-qa-suite/tasks/cephfs/test_volume_client.py", line 370, in test_data_isolated
guest_entity=guest_entity
File "/home/rraja/git/ceph-qa-suite/tasks/cephfs/test_volume_client.py", line 34, in _volume_client_python
""".format(payload=script, conf_path=client.config_path, vol_prefix=vol_prefix, ns_prefix=ns_prefix))
File "/home/rraja/git/ceph-qa-suite/tasks/cephfs/mount.py", line 134, in run_python
p.wait()
File "/home/rraja/git/ceph-qa-suite/tasks/cephfs/vstart_runner.py", line 113, in wait
raise CommandFailedError(self.args, self.exitstatus)
CommandFailedError: Command failed with status -6: ['python', '-c', '\nfrom ceph_volume_client import CephFSVolumeClient, VolumePath\nimport logging\nlog = logging.getLogger("ceph_volume_client")\nlog.addHandler(logging.StreamHandler())\nlog.setLevel(logging.DEBUG)\nvc = CephFSVolumeClient("manila", "./ceph.conf", "ceph", None, None)\nvc.connect()\n\nvp = VolumePath("grpid", "volid")\nvc.create_volume(vp, 10, data_isolated=True)\n\nvc.disconnect()\n ']

The culprit seems to be:
"src/tcmalloc.cc:283] Attempt to free invalid pointer 0x557694b27cc0"
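For context, the numbers in the log above fit a simple sizing rule, and the "status -6" is the usual negated-signal convention. Below is a minimal sketch; the helper name expected_pg_num and the tenth-of-headroom rule are my reconstruction from the logged values, not the actual test source:

```python
import signal

def expected_pg_num(max_per_osd, osd_count, existing_pg_count):
    # Cap the new data pool at a tenth of the remaining PG headroom
    # across all OSDs (a guess that reproduces the logged numbers).
    max_overall = max_per_osd * osd_count
    return (max_overall - existing_pg_count) // 10

# Values from the log: 300 * 3 = 900 overall, 26 existing -> 87 expected.
print(expected_pg_num(300, 3, 26))  # -> 87

# CommandFailedError's "status -6" is the negated signal number: the
# child python process died from SIGABRT (the tcmalloc abort).
print(-6 == -signal.SIGABRT)  # -> True
```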

vstartrunner_output_with_master.log View - complete vstart_runner output (181 KB) Ramana Raja, 07/14/2016 04:58 PM

History

#1 Updated by Ramana Raja almost 4 years ago

  • Subject changed from cephfs.test_volume_client.TestVolumeClient test failure (due to "src/tcmalloc.cc:283] Attempt to free invalid pointer") to cephfs.test_volume_client.TestVolumeClient failure (due to "src/tcmalloc.cc:283] Attempt to free invalid pointer")

#2 Updated by Ramana Raja almost 4 years ago

  • Subject changed from cephfs.test_volume_client.TestVolumeClient failure (due to "src/tcmalloc.cc:283] Attempt to free invalid pointer") to cephfs.test_volume_client.TestVolumeClient failure (with "src/tcmalloc.cc:283] Attempt to free invalid pointer")

#3 Updated by Ramana Raja almost 4 years ago

The above issue is reproducible with Ceph master
(HEAD commit bf58aeb6d5a6a07d83374c750d177a78ce027c3c) built using
cmake.

However, I don't run into this crash when using autotools to build
Ceph.

#4 Updated by John Spray almost 4 years ago

  • Assignee set to Kefu Chai

Kefu has offered to look at this as a possible link issue (thanks Kefu!)

#5 Updated by Kefu Chai almost 4 years ago

  • Status changed from New to Fix Under Review

#6 Updated by Kefu Chai almost 4 years ago

  • Project changed from fs to Ceph
  • Category deleted (Testing)
  • Status changed from Fix Under Review to In Progress

Moving to the Ceph subproject, as the autotools build passes.

#7 Updated by John Spray almost 4 years ago

Reproduce with a script like this:

from ceph_volume_client import CephFSVolumeClient, VolumePath
import logging
log = logging.getLogger("ceph_volume_client")
log.addHandler(logging.StreamHandler())
log.setLevel(logging.DEBUG)
vc = CephFSVolumeClient("admin", "./ceph.conf", "ceph", "_vols", "_ns")
vc.connect()
vp = VolumePath("grp", "vol")
vc.create_volume(vp, 10, data_isolated=False)
vc.disconnect()

Which I run like this:

 LD_LIBRARY_PATH=`pwd`/lib PYTHONPATH=`pwd`/../src/pybind:`pwd`/lib/cython_modules/lib.linux-x86_64-2.7/ python ./foo.py

from my cmake build/ directory with a vstart cluster.

#8 Updated by Kefu Chai almost 4 years ago

in ceph_volume_client.py, we have

from contextlib import contextmanager
import errno
import fcntl
import json
import logging
import os
import re
import struct
import sys
import threading
import time
import uuid

from ceph_argparse import json_command

import cephfs
import rados

The crash happens when importing the second extension module, "rados".
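A generic way to localize which import kills the interpreter is to try each module in a fresh child process and report its exit status; a status of -6 means the child died from SIGABRT, matching the abort above. This is a sketch of the technique, not code from the ticket; the ceph module names come from this ticket, and stdlib names stand in for the demonstration:

```python
import subprocess
import sys

def probe_imports(modules):
    # Import each module in its own interpreter so one crashing import
    # cannot mask another; collect the child exit statuses.
    results = {}
    for mod in modules:
        proc = subprocess.run([sys.executable, "-c", "import " + mod])
        results[mod] = proc.returncode
    return results

# For the real case one would probe ["cephfs", "rados"]; with stdlib
# modules every child exits cleanly.
print(probe_imports(["json", "os"]))  # -> {'json': 0, 'os': 0}
```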

#9 Updated by Kefu Chai almost 4 years ago

Reverting https://github.com/ceph/ceph/pull/10145 fixes this issue.

#10 Updated by Kefu Chai almost 4 years ago

Backtrace with the libc allocator; the invalid free is hit from the process's exit handlers (frame #5, __run_exit_handlers):

(gdb) bt
#0  0x00007ffff6f271c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff6f2864a in __GI_abort () at abort.c:89
#2  0x00007ffff6f62f4a in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff705bb30 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff6f686b6 in malloc_printerr (action=3, str=0x7ffff705bc88 "double free or corruption (fasttop)", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5004
#4  0x00007ffff6f68e9e in _int_free (av=0x7ffff728fb20 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3865
#5  0x00007ffff6f29b78 in __run_exit_handlers (status=0, listp=0x7ffff728f5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#6  0x00007ffff6f29bc5 in __GI_exit (status=<optimized out>) at exit.c:104
#7  0x00007ffff6f14737 in __libc_start_main (main=0x49d710 <main>, argc=2, argv=0x7fffffffe0a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe098) at ../csu/libc-start.c:325
#8  0x000000000049d639 in _start ()

#11 Updated by Kefu Chai almost 4 years ago

Found it. It's in libcommon; to be specific, it is

const string WorkerPool::name = "AsyncMessenger::WorkerPool";

When libcommon's objects end up linked into more than one loaded shared library, a static global like this can be constructed and destructed once per copy, which matches the double free seen in the exit handlers.

#12 Updated by Kefu Chai almost 4 years ago

  • Status changed from In Progress to Fix Under Review

#13 Updated by Kefu Chai almost 4 years ago

  • Status changed from Fix Under Review to Resolved

#14 Updated by Jeff Layton almost 4 years ago

I believe I hit this as well on my own branch (wip-jlayton-nlink), which has commit 3173d2ea1. Test results are here:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-07-26_12:03:12-fs-wip-jlayton-nlink---basic-mira/334949/

Interestingly:

$ file 1469612731.18395.core 
1469612731.18395.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), too many program header sections (278)

...so I'm not quite sure what program crashed there.

#15 Updated by Jeff Layton almost 4 years ago

Ahh, my home box has a more up-to-date "file" program:

$ file /tmp/1469612731.18395.core
/tmp/1469612731.18395.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python'

Here's the stack trace I see in the core (missing a lot of symbols of course):

(gdb) bt
#0  0x00007f91a62dfcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f91a62e30d8 in __GI_abort () at abort.c:89
#2  0x00007f919ac79114 in tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem) () from /usr/lib/libtcmalloc.so.4
#3  0x00007f919ac6e184 in ?? () from /usr/lib/libtcmalloc.so.4
#4  0x00007f919ac8cdf4 in tc_delete () from /usr/lib/libtcmalloc.so.4
#5  0x00007f919b8d8da6 in ?? ()
#6  0x0000000000d20b08 in ?? ()
#7  0x00007fff5ae2caa0 in ?? ()
#8  0x0000000000e2cb80 in ?? ()
#9  0x00000000016ea060 in ?? ()
#10 0x00007f9199e7af18 in vtable for std::ostream () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f9199e78f50 in vtable for std::ios_base () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x0000000000000006 in ?? ()
#13 0x0000000000000000 in ?? ()
(gdb) info libraries
Undefined info command: "libraries".  Try "help info".
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007f91a66739f0  0x00007f91a66804a1  Yes         /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f91a62c84a0  0x00007f91a640d583  Yes         /lib/x86_64-linux-gnu/libc.so.6
0x00007f91a60a5ed0  0x00007f91a60a69ce  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00007f91a5ea2f10  0x00007f91a5ea3804  Yes         /lib/x86_64-linux-gnu/libutil.so.1
0x00007f91a5c8ae00  0x00007f91a5c9abf8  Yes (*)     /lib/x86_64-linux-gnu/libz.so.1
0x00007f91a5988610  0x00007f91a59f71b6  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00007f91a688cae0  0x00007f91a68a74e0  Yes         /lib64/ld-linux-x86-64.so.2
0x00007f91a54aef70  0x00007f91a54b7612  Yes (*)     /usr/lib/python2.7/lib-dynload/_json.x86_64-linux-gnu.so
0x00007f91a5212f90  0x00007f91a5223642  Yes (*)     /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
0x00007f91a5005650  0x00007f91a5009d38  Yes (*)     /usr/lib/x86_64-linux-gnu/libffi.so.6
0x00007f91a4dc0610  0x00007f91a4dc1c96  Yes (*)     /usr/lib/python2.7/lib-dynload/_hashlib.x86_64-linux-gnu.so
0x00007f91a4a42e00  0x00007f91a4b32cdf  Yes (*)     /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
0x00007f91a47e0660  0x00007f91a47e1d17  Yes (*)     /lib/x86_64-linux-gnu/libuuid.so.1
0x00007f91a45d9470  0x00007f91a45dc935  Yes (*)     /usr/lib/python2.7/lib-dynload/_ssl.x86_64-linux-gnu.so
0x00007f91a4389e80  0x00007f91a43bd66d  Yes (*)     /lib/x86_64-linux-gnu/libssl.so.1.0.0
                                        No          /usr/lib/python2.7/dist-packages/cephfs.x86_64-linux-gnu.so
                                        No          /usr/lib/libcephfs.so.1
0x00007f919b1190c0  0x00007f919b2a7782  Yes (*)     /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
0x00007f919aec7350  0x00007f919aeca33c  Yes         /lib/x86_64-linux-gnu/librt.so.1
0x00007f919ac6c9e0  0x00007f919ac8ad9c  Yes (*)     /usr/lib/libtcmalloc.so.4
0x00007f919aa42ad0  0x00007f919aa51e69  Yes         /lib/x86_64-linux-gnu/libresolv.so.2
0x00007f919a71aa10  0x00007f919a7fb562  Yes (*)     /usr/lib/x86_64-linux-gnu/libnss3.so
0x00007f919a4d36f0  0x00007f919a4f29bf  Yes (*)     /usr/lib/x86_64-linux-gnu/libnspr4.so
0x00007f919a2bcf40  0x00007f919a2c2883  Yes (*)     /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
0x00007f919a0af250  0x00007f919a0afdc3  Yes (*)     /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0
0x00007f9199ea1710  0x00007f9199ea7d82  Yes (*)     /usr/lib/x86_64-linux-gnu/libboost_iostreams.so.1.54.0
0x00007f9199bec620  0x00007f9199c4f803  Yes (*)     /usr/lib/x86_64-linux-gnu/libstdc++.so.6
0x00007f919997dab0  0x00007f919998d985  Yes (*)     /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007f9199761f30  0x00007f91997695c9  Yes (*)     /usr/lib/x86_64-linux-gnu/libunwind.so.8
0x00007f919953f2e0  0x00007f919954da53  Yes (*)     /usr/lib/x86_64-linux-gnu/libnssutil3.so
0x00007f91993303f0  0x00007f9199331ae6  Yes (*)     /usr/lib/x86_64-linux-gnu/libplc4.so
0x00007f919912bf40  0x00007f919912ce78  Yes (*)     /usr/lib/x86_64-linux-gnu/libplds4.so
0x00007f9198f1c3c0  0x00007f9198f2808f  Yes (*)     /lib/x86_64-linux-gnu/libbz2.so.1.0
0x00007f9198cfb0d0  0x00007f9198d0f24d  Yes (*)     /lib/x86_64-linux-gnu/liblzma.so.5
                                        No          /usr/lib/python2.7/dist-packages/rados.x86_64-linux-gnu.so
                                        No          /usr/lib/librados.so.2
0x00007f918fe48580  0x00007f918fe4e273  Yes (*)     /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0
0x00007f918fc41bd0  0x00007f918fc43fbc  Yes (*)     /usr/lib/liburcu-bp.so.1
0x00007f918f9e9a60  0x00007f918f9f7bcc  Yes (*)     /usr/lib/python2.7/lib-dynload/datetime.x86_64-linux-gnu.so
0x00007f918ef662d0  0x00007f918ef91f3c  Yes (*)     /usr/lib/x86_64-linux-gnu/nss/libsoftokn3.so
0x00007f918ecb2d60  0x00007f918ed3da0f  Yes (*)     /usr/lib/x86_64-linux-gnu/libsqlite3.so.0
---Type <return> to continue, or q <return> to quit---
0x00007f918ea2d9c0  0x00007f918ea8232d  Yes (*)     /usr/lib/x86_64-linux-gnu/nss/libfreebl3.so
(*): Shared library is missing debugging information.
(gdb) 

#16 Updated by Ramana Raja almost 4 years ago

Jeff, I couldn't reproduce this issue using your branch, wip-jlayton-nlink, on my local machine. Were you able to reproduce it using http://tracker.ceph.com/issues/16686#note-7 in your local setup?
