Bug #16414

scrub_test now failing, probably related to 9e2b455f45cca2f31b19c73d761661a9028c22ff

Added by Samuel Just almost 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: High
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268758


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #16454: "tcmalloc: large alloc 4294967296 bytes" in rados-wip-yuri-testing_2016_6_22-distro-basic-smithi (Duplicate, 06/23/2016)

Actions #1

Updated by Samuel Just almost 8 years ago

Another case, the osd segfaults here (also in the other one?).

sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267260/remote

osd.3.log does not contain a segfault, but teuthology.log does.

2016-06-20T13:43:49.310 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.395112 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 shard 3: soid 0:e8020acf:::benchmark_data_smithi001_8057_object2273:head data_digest 0x7d5db345 != known data_digest 0x7cf03004 from auth shard 2, size 4097 != known size 4096
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447117 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 0 missing, 1 inconsistent objects
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447119 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 1 errors
2016-06-20T13:44:34.039 INFO:tasks.ceph.osd.3.smithi053.stderr:*** Caught signal (Segmentation fault) **

Actions #2

Updated by Samuel Just almost 8 years ago

  • Assignee set to David Zafman


Actions #3

Updated by David Zafman almost 8 years ago

I'll look for changes in the logging code. The crash is in the log flush thread itself, which must be why there was no information in ceph-osd.3.log.


Thread 1 (Thread 0x7f9658f6f700 (LWP 20969)):
#0  0x00007f965c1f620b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f965de96d55 in ?? ()
#2  <signal handler called>
#3  0x00007f965dedecef in ceph::log::Log::flush() ()
#4  0x00007f965dededbe in ceph::log::Log::entry() ()
#5  0x00007f965c1ee182 in start_thread (arg=0x7f9658f6f700) at pthread_create.c:312
#6  0x00007f965a10147d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

   0x00007f965dedece4 <+564>:   lea    0x10(%rdi),%rdx
   0x00007f965dedece8 <+568>:   je     0x7f965deded28 <_ZN4ceph3log3Log5flushEv+632>
   0x00007f965dedecea <+570>:   mov    $0xffffffff,%eax
=> 0x00007f965dedecef <+575>:   lock xadd %eax,(%rdx)
   0x00007f965dedecf3 <+579>:   test   %eax,%eax
   0x00007f965dedecf5 <+581>:   jg     0x7f965dedebd9 <_ZN4ceph3log3Log5flushEv+297>
   0x00007f965dedecfb <+587>:   mov    %r13,%rsi
   0x00007f965dedecfe <+590>:   callq  0x7f965d863c30 <_ZNSs4_Rep10_M_destroyERKSaIcE@plt>

Actions #4

Updated by David Zafman almost 8 years ago

This shows a break at the exact crash point. In this case %rdx (0x7f759006e430) holds the __mem value. In the crash, %rdx is 0xfffffffffffffff8 (-8). There haven't been any changes in the flush routine itself, but this could be caused by a double free somehow.

Breakpoint 3, 0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
49        { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
(gdb) bt
#0  0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
#1  __exchange_and_add_dispatch (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:82
#2  _M_dispose (__a=..., this=0x7f759006e420) at /usr/include/c++/4.8/bits/basic_string.h:246
#3  ~basic_string (this=0x7f759006e360, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/basic_string.h:539
#4  ~PrebufferedStreambuf (this=0x7f759006e310, __in_chrg=<optimized out>) at ./common/PrebufferedStreambuf.h:17
#5  ~Entry (this=0x7f759006e2f0, __in_chrg=<optimized out>) at log/Entry.h:16
#6  ceph::log::Log::flush (this=this@entry=0x7f75f16c39c0) at log/Log.cc:286
#7  0x00007f75e6eb9a1e in ceph::log::Log::entry (this=0x7f75f16c39c0) at log/Log.cc:453
#8  0x00007f75e546d182 in start_thread (arg=0x7f75e2c66700) at pthread_create.c:312
#9  0x00007f75e359b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
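
For context, a minimal sketch of the failure mode suggested above, assuming the copy-on-write std::string of libstdc++ 4.8 that the backtrace runs through. The Rep struct, dispose() helper and refcount convention below are illustrative stand-ins, not Ceph or libstdc++ code: the string destructor atomically decrements the shared buffer's refcount (the lock xadd in the earlier disassembly), so disposing of the same Entry/PrebufferedStreambuf twice would run that decrement on memory that has already been freed.

#include <atomic>

struct Rep {
    // libstdc++'s _Rep stores "reference count minus one": 0 means one owner.
    std::atomic<int> refcount{0};
    // character data follows this header in the real layout
};

void dispose(Rep* rep) {
    // Equivalent of __exchange_and_add_dispatch(&_M_refcount, -1) <= 0.
    if (rep->refcount.fetch_add(-1, std::memory_order_acq_rel) <= 0) {
        delete rep;  // last owner: free the shared buffer
    }
}

int main() {
    Rep* rep = new Rep();
    dispose(rep);  // first destruction: refcount goes 0 -> -1, buffer freed
    dispose(rep);  // double dispose: the decrement now touches freed memory,
                   // undefined behaviour; with a stale or corrupted pointer it
                   // faults on the lock xadd exactly like the trace above
    return 0;
}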

Actions #5

Updated by David Zafman almost 8 years ago

Added valgrind to the same run yaml. Unfortunately, valgrind didn't detect anything; instead it became a victim itself.

    valgrind:
      osd:
      - --tool=memcheck

  -768> 2016-06-23 15:52:31.630521 311dc700  5 filestore(/var/lib/ceph/osd/ceph-1) queue_transactions new 0x2fdec580 osr(fuse 0x93bb3b0)
  -767> 2016-06-23 15:52:31.630581 311dc700 10 journal prepare_entry [Transaction(0x312a3dc0)]
  -690> 2016-06-23 15:53:15.616121 311dc700 10 journal  len 4349 -> 8192 (head 40 pre_pad 0 bl 4349 post_pad 3763 tail 40) (bl alignment -1)
  -665> 2016-06-23 15:53:15.712363 311dc700 -1 *** Caught signal (Segmentation fault) **
 in thread 311dc700 thread_name:fusestore

 ceph version 10.2.0-2551-g334ca8c (334ca8cb26fe2a75cd1890eef408d25d5b8565ab)
 1: (()+0x949c92) [0xa51c92]
 2: (()+0x10340) [0xb090340]
 3: (_vgr20180ZZ_libcZdsoZa_memcpyZAZAGLIBCZu2Zd14()+0x2a3) [0x9f54953]
 4: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0xb5b8cc]
 5: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x204) [0xb5cee4]
 6: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x85b) [0x8dd0cb]
 7: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4e9) [0x7e61d9]
 8: (ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*)+0x1b6) [0x767126]
 9: (os_flush(char const*, fuse_file_info*)+0x261) [0x76dd61]
 10: (()+0xdfc7) [0xae52fc7]
 11: (()+0xe241) [0xae53241]
 12: (()+0x14506) [0xae59506]
 13: (()+0x1522b) [0xae5a22b]
 14: (fuse_session_loop()+0xac) [0xae56a6c]
 15: (fuse_loop()+0x1e8) [0xae4f1c8]
 16: (FuseStore::loop()+0x67) [0x768d07]
 17: (FuseStore::FuseThread::entry()+0xd) [0x76eccd]
 18: (()+0x8182) [0xb088182]
 19: (clone()+0x6d) [0xd1b247d]
Actions #6

Updated by Samuel Just almost 8 years ago

  • Has duplicate Bug #16454: "tcmalloc: large alloc 4294967296 bytes" in rados-wip-yuri-testing_2016_6_22-distro-basic-smithi added
Actions #7

Updated by David Zafman almost 8 years ago

Seen in the teuthology.log before a ceph-osd crashed.

2016-06-23 09:46:26,185.185 INFO:teuthology.orchestra.run.smithi030:Running: "sudo cp /tmp/tmp8361_Y '/var/lib/ceph/osd/ceph-1/fuse/0.8_head/all/#0:1e142346:::benchmark_data_smithi030_15137_object22:head#/data'" 
2016-06-23 09:46:26,585.585 INFO:tasks.ceph.osd.1.smithi030.stderr:tcmalloc: large alloc 4294967296 bytes == 0x3a00e000 @  0xa16ec4c 0xa171547 0xa18fa02 0xb5c205 0xb5c4b5 0xb5c4e4 0x7696c9 0xae5165c 0xae517e8 0xae5a389 0xae56a6c 0xae4f1c8 0x768d07 0x76eccd 0xb088182 0xd1b247d (nil)
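
As a side note on that number: 4294967296 is exactly 2^32, a classic signature of a 32-bit length underflowing and then being rounded up to an alignment boundary in 64-bit arithmetic. The following is only an illustrative sketch of that pattern, not the actual FileJournal/bufferlist code; the 4096/4097 sizes are borrowed from the scrub mismatch above.

#include <cstdint>
#include <cstdio>

// Round len up to the next multiple of align (align must be a power of two).
static uint64_t align_up(uint64_t len, uint64_t align) {
    return (len + align - 1) & ~(align - 1);
}

int main() {
    uint32_t expected = 4096;               // size the primary expects
    uint32_t actual   = 4097;               // size on the inconsistent shard
    uint32_t len      = expected - actual;  // unsigned underflow: 0xffffffff
    uint64_t alloc    = align_up(len, 4096);
    std::printf("alloc = %llu bytes\n",     // prints 4294967296, i.e. 2^32
                (unsigned long long)alloc);
    return 0;
}
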
Actions #8

Updated by Samuel Just almost 8 years ago

2/2 of the runs I checked are on btrfs. FuseStore::os_flush is a bit suspicious in that it calls apply_transaction and then flush_commit, but osr is locally allocated (it's a shared_ptr, but apply_transaction does not take a ref). I think that's correct, but it might be worth investigating further (see the sketch below).
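
To illustrate the lifetime pattern being questioned here, a hedged sketch with made-up names, not the real ObjectStore/FuseStore API: the concern is a locally owned sequencer handed to a queueing call by raw pointer, where the callee keeps no reference of its own, so correctness depends entirely on the caller waiting for the queued work to drain before the sequencer goes out of scope.

#include <memory>

struct Sequencer { /* per-caller ordering state */ };

// Stand-in for a queueing call that keeps only the raw pointer it is given.
void queue_transactions_like(Sequencer* osr) {
    // ... enqueue work that another thread will later run against *osr ...
    (void)osr;
}

void os_flush_like() {
    auto osr = std::make_shared<Sequencer>();  // locally allocated and owned
    queue_transactions_like(osr.get());        // callee takes no reference
    // A flush_commit()-style wait here is what makes this safe: it must not
    // return until every queued item that can touch *osr has completed.  If
    // anything queued can outlive this scope, osr is destroyed on return and
    // any later access from a worker thread is a use-after-free.
}

int main() {
    os_flush_like();
    return 0;
}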

Actions #9

Updated by David Zafman almost 8 years ago

  • Assignee changed from David Zafman to Sage Weil

This is the yaml file I used, which does not include a valgrind variant. It reproduced in some form every time I ran with it. It needs 2 machines locked. You'll need to specify a suite-branch that enables the fuse change again. You can try a valgrind memcheck again, but it didn't yield anything useful (see comment #5). Can tcmalloc debugging be enabled?


description: rados/basic/{rados.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml
  mon_kv_backend/leveldb.yaml msgr/async.yaml msgr-failures/few.yaml tasks/scrub_test.yaml}
machine_type: smithi
interactive-on-error: true
openstack:
- volumes:
    count: 3
    size: 10
overrides:
  admin_socket:
    branch: master_2016_6_20
  ceph:
    conf:
      global:
        enable experimental unrecoverable data corrupting features: '*'
        ms inject socket failures: 5000
        ms type: async
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon keyvaluedb: leveldb
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 25
        osd op queue: debug_random
        osd op queue cut off: debug_random
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    - '!= best guess digest'
    - '!= best guess data_digest'
    - '!= best guess omap_digest'
    - '!= known digest'
    - '!= known data_digest'
    - '!= known omap_digest'
    - deep-scrub 0 missing, 1 inconsistent objects
    - deep-scrub 1 errors
    - repair 0 missing, 1 inconsistent objects
    - repair 1 errors, 1 fixed
    - shard [0-9]+ missing
    - deep-scrub 1 missing, 0 inconsistent objects
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  ceph-deploy:
    branch:
      dev-commit: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  workunit:
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
priority: 100
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - osd.3
  - osd.4
  - osd.5
  - client.1
sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
suite: rados
suite_branch: master
#suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
tasks:
- install: null
- ceph: null
- scrub_test: null
teuthology_branch: master
tube: smithi
verbose: true

Actions #10

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to High
Actions #11

Updated by Samuel Just over 7 years ago

  • Assignee changed from Sage Weil to David Zafman

@david Is this still failing?

Actions #12

Updated by David Zafman over 7 years ago

  • Status changed from New to Closed

Not seeing this anymore
