Bug #16414
scrub_test now failing, probably related to 9e2b455f45cca2f31b19c73d761661a9028c22ff
Status: Closed
Added by Samuel Just almost 8 years ago. Updated over 7 years ago.
Description
sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268758
Updated by Samuel Just almost 8 years ago
Another case; the osd segfaults here (also in the other one?).
sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267260/remote
osd.3.log does not contain a segfault, but teuthology.log does.
2016-06-20T13:43:49.310 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.395112 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 shard 3: soid 0:e8020acf:::benchmark_data_smithi001_8057_object2273:head data_digest 0x7d5db345 != known data_digest 0x7cf03004 from auth shard 2, size 4097 != known size 4096
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447117 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 0 missing, 1 inconsistent objects
2016-06-20T13:43:49.362 INFO:tasks.ceph.osd.3.smithi053.stderr:2016-06-20 20:43:49.447119 7f9640abd700 -1 log_channel(cluster) log [ERR] : 0.17 deep-scrub 1 errors
2016-06-20T13:44:34.039 INFO:tasks.ceph.osd.3.smithi053.stderr:*** Caught signal (Segmentation fault) **
Updated by Samuel Just almost 8 years ago
- Assignee set to David Zafman
Updated by David Zafman almost 8 years ago
I'll look for changes in the logging code. The crash is in the logging code itself, which must be why there was no information in the ceph-osd.3.log.
Thread 1 (Thread 0x7f9658f6f700 (LWP 20969)):
#0  0x00007f965c1f620b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f965de96d55 in ?? ()
#2  <signal handler called>
#3  0x00007f965dedecef in ceph::log::Log::flush() ()
#4  0x00007f965dededbe in ceph::log::Log::entry() ()
#5  0x00007f965c1ee182 in start_thread (arg=0x7f9658f6f700) at pthread_create.c:312
#6  0x00007f965a10147d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

   0x00007f965dedece4 <+564>:   lea    0x10(%rdi),%rdx
   0x00007f965dedece8 <+568>:   je     0x7f965deded28 <_ZN4ceph3log3Log5flushEv+632>
   0x00007f965dedecea <+570>:   mov    $0xffffffff,%eax
=> 0x00007f965dedecef <+575>:   lock xadd %eax,(%rdx)
   0x00007f965dedecf3 <+579>:   test   %eax,%eax
   0x00007f965dedecf5 <+581>:   jg     0x7f965dedebd9 <_ZN4ceph3log3Log5flushEv+297>
   0x00007f965dedecfb <+587>:   mov    %r13,%rsi
   0x00007f965dedecfe <+590>:   callq  0x7f965d863c30 <_ZNSs4_Rep10_M_destroyERKSaIcE@plt>
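To make the point about the missing log concrete: below is a minimal sketch of an asynchronous log (illustrative code only, not the actual ceph::log implementation). Callers only enqueue entries; a single background thread writes them to the file. If that flush thread is the one that segfaults, as in the backtrace above, everything still queued is lost, while the signal handler's message to stderr is still captured in teuthology.log.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLog {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::string> q;
  bool stop = false;
  std::thread flusher;

public:
  explicit AsyncLog(const std::string &path)
    : flusher([this, path] {
        std::ofstream out(path, std::ios::app);
        std::unique_lock<std::mutex> l(m);
        while (!stop || !q.empty()) {
          cv.wait(l, [this] { return stop || !q.empty(); });
          while (!q.empty()) {          // a crash anywhere in here loses
            out << q.front() << '\n';   // every entry still in the queue
            q.pop();
          }
          out.flush();
        }
      }) {}

  void submit(std::string line) {       // callers never touch the file
    std::lock_guard<std::mutex> l(m);
    q.push(std::move(line));
    cv.notify_one();
  }

  ~AsyncLog() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_one();
    flusher.join();
  }
};

int main() {
  AsyncLog log("osd.log");
  log.submit("deep-scrub 1 errors");    // reaches osd.log only if the
  return 0;                             // flusher thread survives to write it
}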
Updated by David Zafman almost 8 years ago
This shows a break at the exact crash point. In this case the value of rdx (0x7f759006e430) holds the __mem value. In the crash the value of rdx is 0xfffffffffffffff8 (-8). There haven't been changes in the flush routine itself, but this could be caused by a double free somehow (a minimal model of the refcount release in question follows the backtrace below).
Breakpoint 3, 0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
49	  { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
(gdb) bt
#0  0x00007f75e6eb994f in __exchange_and_add (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:49
#1  __exchange_and_add_dispatch (__val=-1, __mem=0x7f759006e430) at /usr/include/c++/4.8/ext/atomicity.h:82
#2  _M_dispose (__a=..., this=0x7f759006e420) at /usr/include/c++/4.8/bits/basic_string.h:246
#3 ~basic_string (this=0x7f759006e360, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/basic_string.h:539
#4 ~PrebufferedStreambuf (this=0x7f759006e310, __in_chrg=<optimized out>) at ./common/PrebufferedStreambuf.h:17
#5 ~Entry (this=0x7f759006e2f0, __in_chrg=<optimized out>) at log/Entry.h:16
#6 ceph::log::Log::flush (this=this@entry=0x7f75f16c39c0) at log/Log.cc:286
#7 0x00007f75e6eb9a1e in ceph::log::Log::entry (this=0x7f75f16c39c0) at log/Log.cc:453
#8 0x00007f75e546d182 in start_thread (arg=0x7f75e2c66700) at pthread_create.c:312
#9 0x00007f75e359b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
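For reference, a minimal standalone model of what the faulting frame is doing, assuming the pre-C++11 copy-on-write string layout (illustrative code, not the libstdc++ or Ceph source): the string destructor's _M_dispose atomically decrements the shared buffer's refcount, which is the lock xadd at the crash PC. If the log Entry (and the string owned by its PrebufferedStreambuf) were destroyed twice, the second decrement would run against freed or reused memory, consistent with the garbage __mem/rdx value above.

// Illustrative model only; the field names and the "0 == sole owner"
// convention mirror the old COW string, but this is not the real code.
#include <atomic>
#include <cstddef>

struct Rep {                          // stand-in for std::string::_Rep
  std::atomic<int> refcount{0};       // 0 means exactly one owner
  std::size_t length{0};
  // character data follows this header in the real layout
};

void dispose(Rep *rep) {
  // equivalent of __exchange_and_add_dispatch(&rep->refcount, -1),
  // the "lock xadd" at the crash PC
  int old = rep->refcount.fetch_add(-1, std::memory_order_acq_rel);
  if (old <= 0)
    delete rep;                       // last owner frees the shared buffer
}

int main() {
  Rep *shared = new Rep;
  dispose(shared);      // correct single release
  // dispose(shared);   // a double release would run the atomic decrement on
  //                    // freed memory -- the kind of corruption suspected here
  return 0;
}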
Updated by David Zafman almost 8 years ago
Added valgrind to the same run yaml. Unfortunately, valgrind didn't detect anything; instead the osd running under it simply became another victim of the crash.
valgrind:
  osd:
  - --tool=memcheck

  -768> 2016-06-23 15:52:31.630521 311dc700  5 filestore(/var/lib/ceph/osd/ceph-1) queue_transactions new 0x2fdec580 osr(fuse 0x93bb3b0)
  -767> 2016-06-23 15:52:31.630581 311dc700 10 journal prepare_entry [Transaction(0x312a3dc0)]
  -690> 2016-06-23 15:53:15.616121 311dc700 10 journal len 4349 -> 8192 (head 40 pre_pad 0 bl 4349 post_pad 3763 tail 40) (bl alignment -1)
  -665> 2016-06-23 15:53:15.712363 311dc700 -1 *** Caught signal (Segmentation fault) **
 in thread 311dc700 thread_name:fusestore

 ceph version 10.2.0-2551-g334ca8c (334ca8cb26fe2a75cd1890eef408d25d5b8565ab)
 1: (()+0x949c92) [0xa51c92]
 2: (()+0x10340) [0xb090340]
 3: (_vgr20180ZZ_libcZdsoZa_memcpyZAZAGLIBCZu2Zd14()+0x2a3) [0x9f54953]
 4: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0xb5b8cc]
 5: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x204) [0xb5cee4]
 6: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x85b) [0x8dd0cb]
 7: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x4e9) [0x7e61d9]
 8: (ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*)+0x1b6) [0x767126]
 9: (os_flush(char const*, fuse_file_info*)+0x261) [0x76dd61]
 10: (()+0xdfc7) [0xae52fc7]
 11: (()+0xe241) [0xae53241]
 12: (()+0x14506) [0xae59506]
 13: (()+0x1522b) [0xae5a22b]
 14: (fuse_session_loop()+0xac) [0xae56a6c]
 15: (fuse_loop()+0x1e8) [0xae4f1c8]
 16: (FuseStore::loop()+0x67) [0x768d07]
 17: (FuseStore::FuseThread::entry()+0xd) [0x76eccd]
 18: (()+0x8182) [0xb088182]
 19: (clone()+0x6d) [0xd1b247d]
Updated by Samuel Just almost 8 years ago
- Has duplicate Bug #16454: "tcmalloc: large alloc 4294967296 bytes" in rados-wip-yuri-testing_2016_6_22-distro-basic-smithi added
Updated by David Zafman almost 8 years ago
Seen in the teuthology.log before a ceph-osd crashed.
2016-06-23 09:46:26,185.185 INFO:teuthology.orchestra.run.smithi030:Running: "sudo cp /tmp/tmp8361_Y '/var/lib/ceph/osd/ceph-1/fuse/0.8_head/all/#0:1e142346:::benchmark_data_smithi030_15137_object22:head#/data'"
2016-06-23 09:46:26,585.585 INFO:tasks.ceph.osd.1.smithi030.stderr:tcmalloc: large alloc 4294967296 bytes == 0x3a00e000 @ 0xa16ec4c 0xa171547 0xa18fa02 0xb5c205 0xb5c4b5 0xb5c4e4 0x7696c9 0xae5165c 0xae517e8 0xae5a389 0xae56a6c 0xae4f1c8 0x768d07 0x76eccd 0xb088182 0xd1b247d (nil)
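Not a diagnosis, only an observation about that number: 4294967296 bytes is exactly 2^32, i.e. what a wrapped 32-bit length looks like once it is carried into a 64-bit allocation size. A trivial snippet just to show the arithmetic (nothing below is taken from the buffer code):

#include <cstdint>
#include <cstdio>

int main() {
  // 4294967296 == 1ULL << 32: one more than the largest 32-bit value,
  // i.e. an overflowed 32-bit length widened into a 64-bit size_t.
  uint64_t huge = static_cast<uint64_t>(UINT32_MAX) + 1;
  std::printf("%llu\n", static_cast<unsigned long long>(huge));  // 4294967296
  return 0;
}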
Updated by Samuel Just almost 8 years ago
Both of the instances I checked (2/2) are btrfs. FuseStore::os_flush is a bit suspicious in that it calls apply_transaction and then flush_commit, but osr is locally allocated (it's a shared_ptr, but apply_transaction does not take a ref). I think that's correct, but it might be worth investigating further.
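For illustration, a standalone sketch of the lifetime pattern in question, with made-up names (this is not the FuseStore/FileStore code): the caller owns the sequencer through a local shared_ptr, hands the store a raw pointer, and the store's asynchronous side keeps using that pointer without taking a reference. The pattern is only safe if the caller blocks in flush_commit() before its shared_ptr, the sole owner, goes away.

#include <atomic>
#include <future>
#include <memory>

struct Sequencer {
  std::atomic<int> applied{0};
  std::future<void> pending;
  void flush_commit() {                 // block until queued work is applied
    if (pending.valid())
      pending.wait();
  }
};

struct Store {
  // Takes a raw pointer and does NOT take a ref on the caller's shared_ptr;
  // the caller must keep the Sequencer alive until the work completes.
  void queue_transactions(Sequencer *osr) {
    osr->pending = std::async(std::launch::async, [osr] {
      /* ... apply the transaction to the backing store ... */
      osr->applied.fetch_add(1);        // dangles if the Sequencer died early
    });
  }
};

void os_flush_like(Store &store) {
  auto osr = std::make_shared<Sequencer>();   // locally allocated, as in os_flush
  store.queue_transactions(osr.get());
  osr->flush_commit();   // safe only because we wait here, before the last
}                        // shared_ptr (and the Sequencer) is destroyed

int main() {
  Store store;
  os_flush_like(store);
  return 0;
}

If anything kept using osr after flush_commit() returned, or if flush_commit() could return before the queued work finished, the locally scoped allocation would be exactly the hazard described above.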
Updated by David Zafman almost 8 years ago
- Assignee changed from David Zafman to Sage Weil
Yaml file I used, which does not include a valgrind variant. It reproduced in some form every time I ran with this. Needs 2 machines locked. You'll need to specify a suite-branch that enables the fuse change again. You can try a valgrind memcheck again, but it didn't yield anything useful (see comment #5). Can tcmalloc debugging be enabled?
description: rados/basic/{rados.yaml clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml mon_kv_backend/leveldb.yaml msgr/async.yaml msgr-failures/few.yaml tasks/scrub_test.yaml}
machine_type: smithi
interactive-on-error: true
openstack:
- volumes:
    count: 3
    size: 10
overrides:
  admin_socket:
    branch: master_2016_6_20
  ceph:
    conf:
      global:
        enable experimental unrecoverable data corrupting features: '*'
        ms inject socket failures: 5000
        ms type: async
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon keyvaluedb: leveldb
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 25
        osd op queue: debug_random
        osd op queue cut off: debug_random
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    - '!= best guess digest'
    - '!= best guess data_digest'
    - '!= best guess omap_digest'
    - '!= known digest'
    - '!= known data_digest'
    - '!= known omap_digest'
    - deep-scrub 0 missing, 1 inconsistent objects
    - deep-scrub 1 errors
    - repair 0 missing, 1 inconsistent objects
    - repair 1 errors, 1 fixed
    - shard [0-9]+ missing
    - deep-scrub 1 missing, 0 inconsistent objects
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  ceph-deploy:
    branch:
      dev-commit: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
  workunit:
    sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
priority: 100
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - mon.b
  - osd.3
  - osd.4
  - osd.5
  - client.1
sha1: 334ca8cb26fe2a75cd1890eef408d25d5b8565ab
suite: rados
suite_branch: master
#suite_path: /var/lib/teuthworker/src/ceph-qa-suite_master
tasks:
- install: null
- ceph: null
- scrub_test: null
teuthology_branch: master
tube: smithi
verbose: true
Updated by Samuel Just over 7 years ago
- Assignee changed from Sage Weil to David Zafman
@david Is this still failing?