Bug #16414
scrub_test now failing, probably related to 9e2b455f45cca2f31b19c73d761661a9028c22ff
Status: Closed
Added by Samuel Just almost 8 years ago. Updated over 7 years ago.
Description
sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268758
Updated by Samuel Just almost 8 years ago
Another case; the osd segfaults here (also in the other one?).
sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267260/remote
osd.3.log does not contain a segfault, but teuthology.log does.
2016-06-20T13:43:49.310 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.395112 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 shard 3: soid 0:e8020acf:::benchmark_data_smithi001_8057_object2273:head data_digest 0x7d5db345 != known data_digest 0x7cf03004 from auth shard 2, size 4097 != known size 4096
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447117 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 0 missing, 1 inconsistent objects
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447119 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 1 errors
2016-06-20T13:44:34.039 INFO:tasks.ceph.osd.3.smithi053.stderr:*** Caught signal (Segmentation fault) **
Updated by Samuel Just almost 8 years ago
- Assignee set to David Zafman
Updated by David Zafman almost 8 years ago
I'll look for changes in the logging code. The crash is in the logging code itself, which must be why there was no information in the ceph-osd.3.log.
Thread 1 (Thread 0x7f9658f6f700 (LWP 20969)):
#0  0x00007f965c1f620b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f965de96d55 in ?? ()
#2  <signal handler called>
#3  0x00007f965dedecef in ceph::log::Log::flush() ()
#4  0x00007f965dededbe in ceph::log::Log::entry() ()
#5  0x00007f965c1ee182 in start_thread (arg=0x7f9658f6f700) at pthread_create.c:312
#6  0x00007f965a10147d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

   0x00007f965dedece4 <+564>:   lea    0x10(%rdi),%rdx
   0x00007f965dedece8 <+568>:   je     0x7f965deded28 <_ZN4ceph3log3Log5flushEv+632>
   0x00007f965dedecea <+570>:   mov    $0xffffffff,%eax
=> 0x00007f965dedecef <+575>:   lock xadd %eax,(%rdx)
   0x00007f965dedecf3 <+579>:   test   %eax,%eax
   0x00007f965dedecf5 <+581>:   jg     0x7f965dedebd9 <_ZN4ceph3log3Log5flushEv+297>
   0x00007f965dedecfb <+587>:   mov    %r13,%rsi
   0x00007f965dedecfe <+590>:   callq  0x7f965d863c30 <_ZNSs4_Rep10_M_destroyERKSaIcE@plt>
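To make the point about the missing log concrete: below is a minimal sketch of an asynchronous log (illustrative code only, not the actual ceph::log implementation). Callers only enqueue entries; a single background thread writes them to the file. If that flush thread is the one that segfaults, as in the backtrace above, everything still queued is lost, while the signal handler's message to stderr is still captured in teuthology.log.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLog {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::string> q;
  bool stop = false;
  std::thread flusher;

public:
  explicit AsyncLog(const std::string &path)
    : flusher([this, path] {
        std::ofstream out(path, std::ios::app);
        std::unique_lock<std::mutex> l(m);
        while (!stop || !q.empty()) {
          cv.wait(l, [this] { return stop || !q.empty(); });
          while (!q.empty()) {          // a crash anywhere in here loses
            out << q.front() << '\n';   // every entry still in the queue
            q.pop();
          }
          out.flush();
        }
      }) {}

  void submit(std::string line) {       // callers never touch the file
    std::lock_guard<std::mutex> l(m);
    q.push(std::move(line));
    cv.notify_one();
  }

  ~AsyncLog() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_one();
    flusher.join();
  }
};

int main() {
  AsyncLog log("osd.log");
  log.submit("deep-scrub 1 errors");    // reaches osd.log only if the
  return 0;                             // flusher thread survives to write it
}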
Updated by David Zafman almost 8 years ago
This shows a break at the exact crash point. In this case the value of rdx (0x7f759006e430) holds the __mem value. In the crash the value of rdx is 0xfffffffffffffff8 (-8). There haven't been changes in the flush routine itself, but this could be caused by a double free somehow (a minimal model of the refcount release in question follows the backtrace below).
Breakpoint 3, 0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
49	  { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
(gdb) bt
#0  0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
#1  __exchange_and_add_dispatch (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:82
#2  _M_dispose (__a=..., this=0x7f759006e420) at /usr/include/c++/4.8/bits/basic_string.h:246
#3 ~basic_string (this=0x7f759006e360, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/basic_string.h:539
#4 ~PrebufferedStreambuf (this=0x7f759006e310, __in_chrg=<optimized out>) at ./common/PrebufferedStreambuf.h:17
#5 ~Entry (this=0x7f759006e2f0, __in_chrg=<optimized out>) at log/Entry.h:16
#6 ceph::log::Log::flush (this=this@entry=0x7f75f16c39c0) at log/Log.cc:286
#7 0x00007f75e6eb9a1e in ceph::log::Log::entry (this=0x7f75f16c39c0) at log/Log.cc:453
#8 0x00007f75e546d182 in start_thread (arg=0x7f75e2c66700) at pthread_create.c:312
#9 0x00007f75e359b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
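For reference, a minimal standalone model of what the faulting frame is doing, assuming the pre-C++11 copy-on-write string layout (illustrative code, not the libstdc++ or Ceph source): the string destructor's _M_dispose atomically decrements the shared buffer's refcount, which is the lock xadd at the crash PC. If the log Entry (and the string owned by its PrebufferedStreambuf) were destroyed twice, the second decrement would run against freed or reused memory, consistent with the garbage __mem/rdx value above.

// Illustrative model only; the field names and the "0 == sole owner"
// convention mirror the old COW string, but this is not the real code.
#include <atomic>
#include <cstddef>

struct Rep {                          // stand-in for std::string::_Rep
  std::atomic<int> refcount{0};       // 0 means exactly one owner
  std::size_t length{0};
  // character data follows this header in the real layout
};

void dispose(Rep *rep) {
  // equivalent of __exchange_and_add_dispatch(&rep->refcount, -1),
  // the "lock xadd" at the crash PC
  int old = rep->refcount.fetch_add(-1, std::memory_order_acq_rel);
  if (old <= 0)
    delete rep;                       // last owner frees the shared buffer
}

int main() {
  Rep *shared = new Rep;
  dispose(shared);      // correct single release
  // dispose(shared);   // a double release would run the atomic decrement on
  //                    // freed memory -- the kind of corruption suspected here
  return 0;
}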
Updated by David Zafman almost 8 years ago
Added valgrind to the same run yaml. Unfortunately, valgrind didn't detect anything; instead the osd running under it simply became another victim of the crash.
valgrind:
  osd:
  - --tool=memcheck

  -768> 2016-06-23 15:52:31.630521 311dc700  5 filestore(/var/lib/ceph/osd/ceph-1) queue_transactions new 0x2fdec580 osr(fuse 0x93bb3b0)
  -767> 2016-06-23 15:52:31.630581 311dc700 10 journal prepare_entry [Transaction(0x312a3dc0)]
  -690> 2016-06-23 15:53:15.616121 311dc700 10 journal len 4349 -> 8192 (head 40 pre_pad 0 bl 4349 post_pad 3763 tail 40) (bl alignment -1)
  -665> 2016-06-23 15:53:15.712363 311dc700 -1 *** Caught signal (Segmentation fault) **
 in thread 311dc700 thread_name:fusestore

 ceph version 10.2.0-2551-g334ca8c (334ca8cb26fe2a75cd1890eef408d25d5b8565ab)
 1: (()+0x949c92) [0xa51c92]
 2: (()+0x10340) [0xb090340]
 3: (_vgr20180ZZ_libcZdsoZa_memcpyZAZAGLIBCZu2Zd14()+0x2a3) [0x9f54953]
 4: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0xb5b8cc]
 5: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x204) [0xb5cee4]
 6: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x85b) [0x8dd0cb]
 7: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4e9) [0x7e61d9]
 8: (ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*)+0x1b6) [0x767126]
 9: (os_flush(char const*, fuse_file_info*)+0x261) [0x76dd61]
 10: (()+0xdfc7) [0xae52fc7]
 11: (()+0xe241) [0xae53241]
 12: (()+0x14506) [0xae59506]
 13: (()+0x1522b) [0xae5a22b]
 14: (fuse_session_loop()+0xac) [0xae56a6c]
 15: (fuse_loop()+0x1e8) [0xae4f1c8]
 16: (FuseStore::loop()+0x67) [0x768d07]
 17: (FuseStore::FuseThread::entry()+0xd) [0x76eccd]
 18: (()+0x8182) [0xb088182]
 19: (clone()+0x6d) [0xd1b247d]
Updated by Samuel Just almost 8 years ago
- Has duplicate Bug #16454: "tcmalloc: large alloc 4294967296 bytes" in rados-wip-yuri-testing_2016_6_22-distro-basic-smithi added
Updated by David Zafman almost 8 years ago
Seen in the teuthology.log before a ceph-osd crashed.
2016-06-23 09:46:26,185.185 INFO:teuthology.orchestra.run.smithi030:Running: "sudo cp /tmp/tmp8361_Y '/var/lib/ceph/osd/ceph-1/fuse/0.8_head/all/#0:1e142346:::benchmark_data_smithi030_15137_object22:head#/data'"
2016-06-23 09:46:26,585.585 INFO:tasks.ceph.osd.1.smithi030.stderr:tcmalloc: large alloc 4294967296 bytes == 0x3a00e000 @ 0xa16ec4c 0xa171547 0xa18fa02 0xb5c205 0xb5c4b5 0xb5c4e4 0x7696c9 0xae5165c 0xae517e8 0xae5a389 0xae56a6c 0xae4f1c8 0x768d07 0x76eccd 0xb088182 0xd1b247d (nil)
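Not a diagnosis, only an observation about that number: 4294967296 bytes is exactly 2^32, i.e. what a wrapped 32-bit length looks like once it is carried into a 64-bit allocation size. A trivial snippet just to show the arithmetic (nothing below is taken from the buffer code):

#include <cstdint>
#include <cstdio>

int main() {
  // 4294967296 == 1ULL << 32: one more than the largest 32-bit value,
  // i.e. an overflowed 32-bit length widened into a 64-bit size_t.
  uint64_t huge = static_cast<uint64_t>(UINT32_MAX) + 1;
  std::printf("%llu\n", static_cast<unsigned long long>(huge));  // 4294967296
  return 0;
}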
Updated by Samuel Just almost 8 years ago
Both of the instances I checked (2/2) are btrfs. FuseStore::os_flush is a bit suspicious in that it calls apply_transaction and then flush_commit, but osr is locally allocated (it's a shared_ptr, but apply_transaction does not take a ref). I think that's correct, but it might be worth investigating further.
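For illustration, a standalone sketch of the lifetime pattern in question, with made-up names (this is not the FuseStore/FileStore code): the caller owns the sequencer through a local shared_ptr, hands the store a raw pointer, and the store's asynchronous side keeps using that pointer without taking a reference. The pattern is only safe if the caller blocks in flush_commit() before its shared_ptr, the sole owner, goes away.

#include <atomic>
#include <future>
#include <memory>

struct Sequencer {
  std::atomic<int> applied{0};
  std::future<void> pending;
  void flush_commit() {                 // block until queued work is applied
    if (pending.valid())
      pending.wait();
  }
};

struct Store {
  // Takes a raw pointer and does NOT take a ref on the caller's shared_ptr;
  // the caller must keep the Sequencer alive until the work completes.
  void queue_transactions(Sequencer *osr) {
    osr->pending = std::async(std::launch::async, [osr] {
      /* ... apply the transaction to the backing store ... */
      osr->applied.fetch_add(1);        // dangles if the Sequencer died early
    });
  }
};

void os_flush_like(Store &store) {
  auto osr = std::make_shared<Sequencer>();   // locally allocated, as in os_flush
  store.queue_transactions(osr.get());
  osr->flush_commit();   // safe only because we wait here, before the last
}                        // shared_ptr (and the Sequencer) is destroyed

int main() {
  Store store;
  os_flush_like(store);
  return 0;
}

If anything kept using osr after flush_commit() returned, or if flush_commit() could return before the queued work finished, the locally scoped allocation would be exactly the hazard described above.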
Updated by David Zafman almost 8 years ago
- Assignee changed from David Zafman to Sage Weil
Yaml file I used, which does not include a valgrind variant. It reproduced in some form every time I ran with this. Needs 2 machines locked. You'll need to specify a suite-branch that enables the fuse change again. You can try a valgrind memcheck again, but it didn't yield anything useful (see comment #5). Can tcmalloc debugging be enabled?
description: rados/basic/{rados.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml mon_kv_backend/leveldb.yaml msgr/async.yaml msgr-failures/few.yaml tasks/scrub_test.yaml}
machine_type: smithi
interactive-on-error: true
openstack:
- volumes:
    count: 3
    size: 10
overrides:
  admin_socket:
    branch: master_2016_6_20
  ceph:
    conf:
      global:
        enable experimental unrecoverable data corrupting features: '*'
        ms inject socket failures: 5000
        ms type: async
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon keyvaluedb: leveldb
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 25
        osd op queue: debug_random
        osd op queue cut off: debug_random
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    - '!= best guess digest'
    - '!= best guess data_digest'
    - '!= best guess omap_digest'
    - '!= known digest'
    - '!= known data_digest'
    - '!= known omap_digest'
    - deep-scrub 0 missing, 1 inconsistent objects
    - deep-scrub 1 errors
    - repair 0 missing, 1 inconsistent objects
    - repair 1 errors, 1 fixed
    - shard [0-9]+ missing
    - deep-scrub 1 missing, 0 inconsistent objects
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  ceph-deploy:
    branch:
      dev-commit: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  workunit:
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
priority: 100
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - osd.3
  - osd.4
  - osd.5
  - client.1
sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
suite: rados
suite_branch: master
#suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
tasks:
- install: null
- ceph: null
- scrub_test: null
teuthology_branch: master
tube: smithi
verbose: true
Updated by Samuel Just over 7 years ago
- Assignee changed from Sage Weil to David Zafman
@david Is this still failing?