Bug #43982
closed
segfault in _dl_catch_error() on "rbd unmap" startup
Description
2020-01-29T15:41:43.139 INFO:tasks.workunit.client.0.smithi101.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh: line 1: 60694 Segmentation fault sudo rbd unmap /dev/rbd$i
...
CommandFailedError: Command failed (workunit test rbd/krbd_udev_enumerate.sh) on smithi101 with status 139: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=4047f4716eb1c6c64c9bca2769eec8712422cb66 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh'
2020-01-29T15:41:42.408296+00:00 smithi101 kernel: rbd[60880]: segfault at ffffffff8116b7df ip 00007f9cf7e90784 sp 00007ffe611323d0 error 15 in ld-2.17.so[7f9cf7e81000+22000]
Rather mystifying, because error 15 is an instruction fetch. The faulting address (ffffffff8116b7df) should be the same as the instruction pointer (00007f9cf7e90784)...
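For reference, the kernel prints the page-fault error code in hexadecimal, so "error 15" is 0x15. Below is a minimal Python sketch of decoding it, assuming the standard x86 page-fault error-code bit layout. The names `PF_BITS`, `decode_pf_error` and `is_kernel_half` are made up for illustration, and the kernel-half check assumes 48-bit virtual addressing.

```python
# Hypothetical helpers for illustration; not part of any kernel or Ceph tooling.
# Each entry: (bit mask, meaning if set, meaning if clear or None to stay silent).
PF_BITS = [
    (0x01, "protection violation (page present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return human-readable flags for an x86 page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


def is_kernel_half(addr):
    """True if addr is in the upper canonical half (assumes 48-bit x86-64 addressing)."""
    return addr >= 0xFFFF800000000000


if __name__ == "__main__":
    # "error 15" in the log line is hex.
    print(decode_pf_error(int("15", 16)))
    print("fault address in kernel half:", is_kernel_half(0xFFFFFFFF8116B7DF))
    print("ip in kernel half:", is_kernel_half(0x00007F9CF7E90784))
```

Decoded this way, 0x15 reads as a user-mode instruction fetch on a present page, while the reported fault address sits in the kernel half and the ip sits in user space, which is the mismatch described above.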
The kernel and gdb agree on the values in rip and rsp:
(gdb) info registers
rax            0x7f9cf8088760   140312152934240
rbx            0x7f9cf80a3f88   140312153046920
rcx            0x7f9cf80a3000   140312153042944
rdx            0x0              0
rsi            0x0              0
rdi            0x7f9cf80a3930   140312153045296
rbp            0x7ffe61132e60   0x7ffe61132e60
rsp            0x7ffe611323d0   0x7ffe611323d0
r8             0x1b             27
r9             0x70000021       1879048225
r10            0x31             49
r11            0x246            582
r12            0x0              0
r13            0x7ffe611325f0   140730527065584
r14            0x0              0
r15            0x7f9cea6cd957   140311924627799
rip            0x7f9cf7e90784   0x7f9cf7e90784 <_dl_catch_error+100>
eflags         0x10206          [ PF IF RF ]
cs             0x33             51
ss             0x2b             43
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
and the faulting instruction is a mov, which given the value in rsp should have executed just fine:
   0:   24 58                   and    $0x58,%al
   2:   31 f6                   xor    %esi,%esi
   4:   48 89 44 24 38          mov    %rax,0x38(%rsp)
   9:   e8 88 a1 00 00          callq  0xa196
   e:   85 c0                   test   %eax,%eax
  10:   75 4d                   jne    0x5f
  12:   48 8b 1c 24             mov    (%rsp),%rbx
  16:   48 8d 44 24 40          lea    0x40(%rsp),%rax
  1b:   48 8b 7c 24 28          mov    0x28(%rsp),%rdi
  20:   48 8b 4c 24 20          mov    0x20(%rsp),%rcx
  25:   48 89 03                mov    %rax,(%rbx)
  28:   ff d1                   callq  *%rcx
  2a:*  48 8b 44 24 38          mov    0x38(%rsp),%rax          <--------
  2f:   31 d2                   xor    %edx,%edx
  31:   48 89 03                mov    %rax,(%rbx)
  34:   48 8b 44 24 08          mov    0x8(%rsp),%rax
  39:   48 c7 00 00 00 00 00    movq   $0x0,(%rax)
(gdb) x/g 0x7ffe611323d0 + 0x38
0x7ffe61132408: 0x0000000000000000
Looking at glibc sources:
155 int
156 internal_function
157 _dl_catch_error (const char **objname, const char **errstring,
158                  bool *mallocedp, void (*operate) (void *), void *args)
159 {
160   int errcode;
161   struct catch *volatile old;
162   struct catch c;
163   /* We need not handle `receiver' since setting a `catch' is handled
164      before it. */
165
166   /* Some systems (e.g., SPARC) handle constructors to local variables
167      inefficient. So we initialize `c' by hand. */
168   c.errstring = NULL;
169
170   struct catch **const catchp = &CATCH_HOOK;
171   old = *catchp;
172   /* Do not save the signal mask. */
173   errcode = __sigsetjmp (c.env, 0);
174   if (__builtin_expect (errcode, 0) == 0)
175     {
176       *catchp = &c;
177       (*operate) (args);
178       *catchp = old;          <--------
179       *objname = NULL;
180       *errstring = NULL;
181       *mallocedp = false;
182       return 0;
183     }
184
185   /* We get here only if we longjmp'd out of OPERATE. */
186   *catchp = old;
187   *objname = c.objname;
188   *errstring = c.errstring;
189   *mallocedp = c.malloced;
190   return errcode == -1 ? 0 : errcode;
191 }
and looking at the stack, operate was openaux:
(gdb) x/g 0x7ffe611323d0 + 0x20
0x7ffe611323f0: 0x00007f9cf7e8dbe0
(gdb) x/i 0x00007f9cf7e8dbe0
0x7f9cf7e8dbe0 <openaux>: push %rbx
static void
openaux (void *a)
{
  struct openaux_args *args = (struct openaux_args *) a;

  args->aux = _dl_map_object (args->map, args->name,
                              (args->map->l_type == lt_executable
                               ? lt_library : args->map->l_type),
                              args->trace_mode, args->open_mode,
                              args->map->l_ns);
}
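The stack addresses examined in the gdb sessions above are simply rsp plus the displacement used by each instruction in the disassembly. A small Python sketch of that arithmetic follows. The slot labels are an interpretation from matching the instructions to the _dl_catch_error source, not something gdb reports, and `SLOTS` and `slot_address` are hypothetical names.

```python
RSP = 0x7FFE611323D0  # rsp from the register dump

# Displacements come from the disassembly; the labels are an interpretation
# (assumption) based on lining the instructions up with the C source.
SLOTS = {
    0x20: "operate (loaded into rcx before callq *%rcx)",
    0x28: "args (loaded into rdi)",
    0x38: "old (stored before the call, reloaded by the faulting mov)",
}


def slot_address(rsp, disp):
    """Address of the stack slot at rsp + disp."""
    return rsp + disp


if __name__ == "__main__":
    for disp, label in SLOTS.items():
        print(f"{slot_address(RSP, disp):#x}  rsp+{disp:#x}  {label}")
```

The rsp+0x20 and rsp+0x38 results match the addresses gdb printed for the two x/g commands, so both reads target ordinary, mapped stack memory.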
The stack trace fits:
Core was generated by `rbd unmap /dev/rbd96'.
Program terminated with signal 11, Segmentation fault.
#0 _dl_catch_error (objname=objname@entry=0x7ffe61132e00, errstring=errstring@entry=0x7ffe61132df8, mallocedp=mallocedp@entry=0x7ffe61132df0, operate=operate@entry=0x7f9cf7e8dbe0 <openaux>, args=args@entry=0x7ffe61132e08) at dl-error.c:178
178 *catchp = old;
(gdb) bt
#0 _dl_catch_error (objname=objname@entry=0x7ffe61132e00, errstring=errstring@entry=0x7ffe61132df8, mallocedp=mallocedp@entry=0x7ffe61132df0, operate=operate@entry=0x7f9cf7e8dbe0 <openaux>, args=args@entry=0x7ffe61132e08) at dl-error.c:178
#1 0x00007f9cf7e8e41d in _dl_map_object_deps (map=map@entry=0x7f9cf80a4150, preloads=<optimized out>, npreloads=npreloads@entry=0, trace_mode=trace_mode@entry=0, open_mode=open_mode@entry=0) at dl-deps.c:256
#2 0x00007f9cf7e84662 in dl_main (phdr=<optimized out>, phdr@entry=0x55c20610f040, phnum=<optimized out>, phnum@entry=10, user_entry=user_entry@entry=0x7ffe61133f98, auxv=<optimized out>) at rtld.c:1714
#3 0x00007f9cf7e98fbe in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7ffe61134050, dl_main=dl_main@entry=0x7f9cf7e82ee0 <dl_main>) at ../elf/dl-sysdep.c:244
#4 0x00007f9cf7e82bb1 in _dl_start_final (arg=0x7ffe61134050) at rtld.c:400
#5 _dl_start (arg=0x7ffe61134050) at rtld.c:512
#6 0x00007f9cf7e82128 in _start () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000003 in ?? ()
#8 0x00007ffe61135ec6 in ?? ()
#9 0x00007ffe61135eca in ?? ()
#10 0x00007ffe61135ed0 in ?? ()
#11 0x0000000000000000 in ?? ()
No trace of ffffffff8116b7df anywhere in sight. Pretty sure it's not related to rbd, but might be a build issue?
So far observed only on centos 7.
Updated by Ilya Dryomov about 4 years ago
Ah, here is the same-looking segfault, but in sudo!
2020-02-04T21:09:00.065 INFO:tasks.workunit.client.0.smithi115.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/krbd_udev_enumerate.sh: line 1: 287712 Segmentation fault sudo rbd unmap /dev/rbd$i
[ 1192.436490] sudo[287712]: segfault at ffffffff8114728c ip 00007f0d42326784 sp 00007ffdc62630a0 error 15 in ld-2.17.so[7f0d42317000+22000]
So not a build issue, but our centos 7 environment (fog image + updates that get installed when pulling packages) seems unstable...
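One way to check that the two crashes really are the same: subtract the ld-2.17.so mapping base from the reported ip in each kernel log line and compare the offsets. A minimal Python sketch, using a regex written for this message format (an assumption, not a kernel-provided parser); `SEGFAULT_RE` and `parse_segfault` are hypothetical names.

```python
import re

# Regex for the "segfault at ... ip ... sp ... error ... in obj[base+size]" line.
# Written for the two lines quoted in this report; other formats may not match.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract process name, object, error code and ip offset within the object."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    d = m.groupdict()
    return {
        "comm": d["comm"],
        "obj": d["obj"],
        "error": int(d["err"], 16),
        "offset": int(d["ip"], 16) - int(d["base"], 16),
    }


LINES = [
    "2020-01-29T15:41:42.408296+00:00 smithi101 kernel: rbd[60880]: segfault at "
    "ffffffff8116b7df ip 00007f9cf7e90784 sp 00007ffe611323d0 error 15 in "
    "ld-2.17.so[7f9cf7e81000+22000]",
    "[ 1192.436490] sudo[287712]: segfault at ffffffff8114728c ip 00007f0d42326784 "
    "sp 00007ffdc62630a0 error 15 in ld-2.17.so[7f0d42317000+22000]",
]

if __name__ == "__main__":
    for line in LINES:
        info = parse_segfault(line)
        print(info["comm"], info["obj"], hex(info["offset"]), hex(info["error"]))
```

Both lines resolve to the same offset into ld-2.17.so with the same error code, so rbd and sudo faulted at the same loader instruction.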
Updated by Ilya Dryomov over 1 year ago
- Status changed from Need More Info to Can't reproduce
- Assignee set to Ilya Dryomov