Bug #43982: segfault in _dl_catch_error() on "rbd unmap" startup - rbd - Ceph

Bug #43982

closed

segfault in _dl_catch_error() on "rbd unmap" startup

Added by Ilya Dryomov about 4 years ago. Updated over 1 year ago.

Status:

Can't reproduce

Priority:

Low

Assignee:

Ilya Dryomov

Severity:

3 - minor

Description

2020-01-29T15:41:43.139 INFO:tasks.workunit.client.0.smithi101.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh: line 1: 60694 Segmentation fault      sudo rbd unmap /dev/rbd$i
...
CommandFailedError: Command failed (workunit test rbd/krbd_udev_enumerate.sh) on smithi101 with status 139: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=4047f4716eb1c6c64c9bca2769eec8712422cb66 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh'

2020-01-29T15:41:42.408296+00:00 smithi101 kernel: rbd[60880]: segfault at ffffffff8116b7df ip 00007f9cf7e90784 sp 00007ffe611323d0 error 15 in ld-2.17.so[7f9cf7e81000+22000]

Rather mystifying, because error 15 indicates an instruction fetch, in which case the faulting address (ffffffff8116b7df) should be the same as the instruction pointer (00007f9cf7e90784)...
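For reference, a minimal sketch of how the error code and the faulting address can be decoded. It assumes the kernel prints the error code in hexadecimal (so "15" is 0x15) and uses the standard x86 page-fault error-code bit layout; the helper names are made up for illustration.

```python
# Standard x86 page-fault error-code bits (lowest five bits only).
PF_BITS = [
    (0x01, "protection violation (page present)"),
    (0x02, "write access"),
    (0x04, "user-mode access"),
    (0x08, "reserved bit set in page table"),
    (0x10, "instruction fetch"),
]

# Start of the kernel half of the canonical x86-64 address space.
KERNEL_HALF_START = 0xFFFF800000000000


def decode_pf_error(code_hex):
    """Return the names of the bits set in a hex error code string."""
    code = int(code_hex, 16)
    return [name for bit, name in PF_BITS if code & bit]


def is_kernel_address(addr_hex):
    """True if the hex address lies in the kernel half of the address space."""
    return int(addr_hex, 16) >= KERNEL_HALF_START


if __name__ == "__main__":
    # Values copied from the segfault line above.
    print(decode_pf_error("15"))
    print("fault address in kernel half:", is_kernel_address("ffffffff8116b7df"))
    print("ip in kernel half:", is_kernel_address("00007f9cf7e90784"))
```

Under these assumptions 0x15 decodes to a user-mode instruction fetch that hit a protection violation, and the reported fault address lies in the kernel half of the address space while the instruction pointer does not.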

The kernel and gdb agree on the values in rip and rsp:

(gdb) info registers
rax            0x7f9cf8088760    140312152934240
rbx            0x7f9cf80a3f88    140312153046920
rcx            0x7f9cf80a3000    140312153042944
rdx            0x0    0
rsi            0x0    0
rdi            0x7f9cf80a3930    140312153045296
rbp            0x7ffe61132e60    0x7ffe61132e60
rsp            0x7ffe611323d0    0x7ffe611323d0
r8             0x1b    27
r9             0x70000021    1879048225
r10            0x31    49
r11            0x246    582
r12            0x0    0
r13            0x7ffe611325f0    140730527065584
r14            0x0    0
r15            0x7f9cea6cd957    140311924627799
rip            0x7f9cf7e90784    0x7f9cf7e90784 <_dl_catch_error+100>
eflags         0x10206    [ PF IF RF ]
cs             0x33    51
ss             0x2b    43
ds             0x0    0
es             0x0    0
fs             0x0    0
gs             0x0    0

and the faulting instruction is a mov, which, given the value in rsp, should have executed just fine:

   0:   24 58                   and    $0x58,%al
   2:   31 f6                   xor    %esi,%esi
   4:   48 89 44 24 38          mov    %rax,0x38(%rsp)
   9:   e8 88 a1 00 00          callq  0xa196
   e:   85 c0                   test   %eax,%eax
  10:   75 4d                   jne    0x5f
  12:   48 8b 1c 24             mov    (%rsp),%rbx
  16:   48 8d 44 24 40          lea    0x40(%rsp),%rax
  1b:   48 8b 7c 24 28          mov    0x28(%rsp),%rdi
  20:   48 8b 4c 24 20          mov    0x20(%rsp),%rcx
  25:   48 89 03                mov    %rax,(%rbx)
  28:   ff d1                   callq  *%rcx
  2a:*  48 8b 44 24 38          mov    0x38(%rsp),%rax          <--------
  2f:   31 d2                   xor    %edx,%edx
  31:   48 89 03                mov    %rax,(%rbx)
  34:   48 8b 44 24 08          mov    0x8(%rsp),%rax
  39:   48 c7 00 00 00 00 00    movq   $0x0,(%rax)

(gdb) x/g 0x7ffe611323d0 + 0x38
0x7ffe61132408:    0x0000000000000000

Looking at glibc sources:

155    int
156    internal_function
157    _dl_catch_error (const char **objname, const char **errstring,
158             bool *mallocedp, void (*operate) (void *), void *args)
159    {
160      int errcode;
161      struct catch *volatile old;
162      struct catch c;
163      /* We need not handle `receiver' since setting a `catch' is handled
164         before it.  */
165    
166      /* Some systems (e.g., SPARC) handle constructors to local variables
167         inefficient.  So we initialize `c' by hand.  */
168      c.errstring = NULL;
169    
170      struct catch **const catchp = &CATCH_HOOK;
171      old = *catchp;
172      /* Do not save the signal mask.  */
173      errcode = __sigsetjmp (c.env, 0);
174      if (__builtin_expect (errcode, 0) == 0)
175        {
176          *catchp = &c;
177          (*operate) (args);
178          *catchp = old;           <--------
179          *objname = NULL;
180          *errstring = NULL;
181          *mallocedp = false;
182          return 0;
183        }
184    
185      /* We get here only if we longjmp'd out of OPERATE.  */
186      *catchp = old;
187      *objname = c.objname;
188      *errstring = c.errstring;
189      *mallocedp = c.malloced;
190      return errcode == -1 ? 0 : errcode;
191    }

and looking at the stack, operate was openaux:

(gdb) x/g 0x7ffe611323d0 + 0x20
0x7ffe611323f0:    0x00007f9cf7e8dbe0
(gdb) x/i 0x00007f9cf7e8dbe0
   0x7f9cf7e8dbe0 <openaux>:    push   %rbx

58    static void
59    openaux (void *a)
60    {
61      struct openaux_args *args = (struct openaux_args *) a;
62    
63      args->aux = _dl_map_object (args->map, args->name,
64                      (args->map->l_type == lt_executable
65                       ? lt_library : args->map->l_type),
66                      args->trace_mode, args->open_mode,
67                      args->map->l_ns);
68    }
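The two stack reads above (rsp + 0x38 and rsp + 0x20) can be reproduced with a bit of arithmetic. The sketch below is only an illustration: the mapping of stack offsets to the locals of _dl_catch_error() is my reading of the disassembly against the source, not something taken from debug info, and the names SLOTS and slot_addresses are hypothetical.

```python
# rsp at the time of the fault, from "info registers".
RSP = 0x7FFE611323D0

# Assumed mapping of stack offsets to _dl_catch_error() values, inferred
# by lining up the disassembly with dl-error.c (not from DWARF info).
SLOTS = {
    "catchp": 0x00,   # mov (%rsp),%rbx
    "objname": 0x08,  # mov 0x8(%rsp),%rax ; movq $0x0,(%rax)
    "operate": 0x20,  # mov 0x20(%rsp),%rcx ; callq *%rcx
    "args": 0x28,     # mov 0x28(%rsp),%rdi
    "old": 0x38,      # mov 0x38(%rsp),%rax  (the faulting instruction)
    "c": 0x40,        # lea 0x40(%rsp),%rax
}


def slot_addresses(rsp, slots):
    """Return the absolute stack address of each named slot."""
    return {name: rsp + offset for name, offset in slots.items()}


if __name__ == "__main__":
    for name, addr in slot_addresses(RSP, SLOTS).items():
        print(f"{name:8s} {addr:#018x}")
```

The addresses computed for old and operate match the ones gdb printed for the x/g commands (0x7ffe61132408 and 0x7ffe611323f0).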

The stack trace fits:

Core was generated by `rbd unmap /dev/rbd96'.
Program terminated with signal 11, Segmentation fault.
#0  _dl_catch_error (objname=objname@entry=0x7ffe61132e00, errstring=errstring@entry=0x7ffe61132df8, 
    mallocedp=mallocedp@entry=0x7ffe61132df0, operate=operate@entry=0x7f9cf7e8dbe0 <openaux>, args=args@entry=0x7ffe61132e08)
    at dl-error.c:178
178          *catchp = old;
(gdb) bt
#0  _dl_catch_error (objname=objname@entry=0x7ffe61132e00, 
    errstring=errstring@entry=0x7ffe61132df8, 
    mallocedp=mallocedp@entry=0x7ffe61132df0, 
    operate=operate@entry=0x7f9cf7e8dbe0 <openaux>, 
    args=args@entry=0x7ffe61132e08) at dl-error.c:178
#1  0x00007f9cf7e8e41d in _dl_map_object_deps (map=map@entry=0x7f9cf80a4150, 
    preloads=<optimized out>, npreloads=npreloads@entry=0, 
    trace_mode=trace_mode@entry=0, open_mode=open_mode@entry=0)
    at dl-deps.c:256
#2  0x00007f9cf7e84662 in dl_main (phdr=<optimized out>, 
    phdr@entry=0x55c20610f040, phnum=<optimized out>, phnum@entry=10, 
    user_entry=user_entry@entry=0x7ffe61133f98, auxv=<optimized out>)
    at rtld.c:1714
#3  0x00007f9cf7e98fbe in _dl_sysdep_start (
    start_argptr=start_argptr@entry=0x7ffe61134050, 
    dl_main=dl_main@entry=0x7f9cf7e82ee0 <dl_main>) at ../elf/dl-sysdep.c:244
#4  0x00007f9cf7e82bb1 in _dl_start_final (arg=0x7ffe61134050) at rtld.c:400
#5  _dl_start (arg=0x7ffe61134050) at rtld.c:512
#6  0x00007f9cf7e82128 in _start () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000003 in ?? ()
#8  0x00007ffe61135ec6 in ?? ()
#9  0x00007ffe61135eca in ?? ()
#10 0x00007ffe61135ed0 in ?? ()
#11 0x0000000000000000 in ?? ()

No trace of ffffffff8116b7df anywhere in sight. Pretty sure it's not related to rbd, but might be a build issue?

http://qa-proxy.ceph.com/teuthology/dis-2020-01-29_15:09:37-krbd-nautilus-testing-basic-smithi/4717468/teuthology.log

So far observed only on centos 7.

Updated by Ilya Dryomov about 4 years ago

Ah, here is the same-looking segfault, but in sudo!

2020-02-04T21:09:00.065 INFO:tasks.workunit.client.0.smithi115.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh: line 1: 287712 Segmentation fault      sudo rbd unmap /dev/rbd$i

[ 1192.436490] sudo[287712]: segfault at ffffffff8114728c ip 00007f0d42326784 sp 00007ffdc62630a0 error 15 in ld-2.17.so[7f0d42317000+22000]

http://qa-proxy.ceph.com/teuthology/dis-2020-02-04_16:53:40-krbd-nautilus-master-basic-smithi/4732334/teuthology.log

So not a build issue, but our CentOS 7 environment (fog image + updates that get installed when pulling packages) seems unstable...
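One way to back up the "same-looking" claim is to compare where the instruction pointer falls inside the ld-2.17.so mapping in both kernel messages. This is a rough sketch; the regular expression and the ip_offset helper are my own and only cover the log format seen in this ticket.

```python
import re

# Matches the kernel "segfault at ..." message format seen in this ticket.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

RBD_LINE = (
    "2020-01-29T15:41:42.408296+00:00 smithi101 kernel: rbd[60880]: "
    "segfault at ffffffff8116b7df ip 00007f9cf7e90784 sp 00007ffe611323d0 "
    "error 15 in ld-2.17.so[7f9cf7e81000+22000]"
)
SUDO_LINE = (
    "[ 1192.436490] sudo[287712]: segfault at ffffffff8114728c "
    "ip 00007f0d42326784 sp 00007ffdc62630a0 error 15 in "
    "ld-2.17.so[7f0d42317000+22000]"
)


def ip_offset(line):
    """Return (command, object, offset of ip within the mapped object)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    return m["comm"], m["obj"], int(m["ip"], 16) - int(m["base"], 16)


if __name__ == "__main__":
    for line in (RBD_LINE, SUDO_LINE):
        comm, obj, offset = ip_offset(line)
        print(f"{comm}: {obj}+{offset:#x}")
```

Both crashes land at the same offset (0xf784) inside ld-2.17.so, with the same error code, even though the faulting processes are different.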

Updated by Ilya Dryomov over 1 year ago

Status changed from Need More Info to Can't reproduce
Assignee set to Ilya Dryomov
