Project

General

Profile

Backport #16915

Jewel: OSD crash with Hammer to Jewel Upgrade: void FileStore::init_temp_collections()

Added by David Zafman almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Release:
jewel
Crash signature:


Related issues

Copied from Ceph - Bug #16672: OSD crash with Hammer to Jewel Upgrade: void FileStore::init_temp_collections() Resolved 07/13/2016

History

#1 Updated by David Zafman almost 4 years ago

  • Copied from Bug #16672: OSD crash with Hammer to Jewel Upgrade: void FileStore::init_temp_collections() added

#2 Updated by David Zafman almost 4 years ago

https://github.com/ceph/ceph/pull/10561

We should put the master pull request first after review and testing. Then we can merge this Jewel backport.

#3 Updated by Nathan Cutler almost 4 years ago

  • Description updated (diff)

original description

During a upgrade from 0.94.7 to 10.2.2 I'm seeing multiple OSDs crash with this backtrace:

-7> 2016-07-13 10:23:42.811650 7f3f5fd0c800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-476) detect_features: splice is supported
-6> 2016-07-13 10:23:42.843643 7f3f5fd0c800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-476) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
-5> 2016-07-13 10:23:42.844736 7f3f5fd0c800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-476) detect_feature: extsize is disabled by conf
-4> 2016-07-13 10:23:42.847300 7f3f5fd0c800  1 leveldb: Recovering log #876
-3> 2016-07-13 10:23:42.894086 7f3f5fd0c800  1 leveldb: Delete type=0 #876

-2> 2016-07-13 10:23:42.894171 7f3f5fd0c800  1 leveldb: Delete type=3 #875

-1> 2016-07-13 10:23:42.895373 7f3f5fd0c800  0 filestore(/var/lib/ceph/osd/ceph-476) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
 0> 2016-07-13 10:23:42.905145 7f3f5fd0c800 -1 os/filestore/FileStore.cc: In function 'void FileStore::init_temp_collections()' thread 7f3f5fd0c800 time 2016-07-13 10:23:42.902566
os/filestore/FileStore.cc: 1735: FAILED assert(r == 0)

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f3f607375b5]
2: (FileStore::init_temp_collections()+0xa63) [0x7f3f603ff613]
3: (FileStore::mount()+0x3ed0) [0x7f3f604036d0]
4: (OSD::init()+0x27d) [0x7f3f600c704d]
5: (main()+0x2c55) [0x7f3f6002cbe5]
6: (__libc_start_main()+0xf5) [0x7f3f5cc1eb15]
7: (()+0x353009) [0x7f3f60077009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 0 osd
0/ 5 optracker
0/ 5 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
0/ 0 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent     10000
max_new         1000
log_file /var/log/ceph/ceph-osd.476.log
--- end dump of recent events ---
2016-07-13 10:23:42.909150 7f3f5fd0c800 -1 *** Caught signal (Aborted) **
in thread 7f3f5fd0c800 thread_name:ceph-osd

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x91341a) [0x7f3f6063741a]
2: (()+0xf100) [0x7f3f5e66f100]
3: (gsignal()+0x37) [0x7f3f5cc325f7]
4: (abort()+0x148) [0x7f3f5cc33ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3f60737797]
6: (FileStore::init_temp_collections()+0xa63) [0x7f3f603ff613]
7: (FileStore::mount()+0x3ed0) [0x7f3f604036d0]
8: (OSD::init()+0x27d) [0x7f3f600c704d]
9: (main()+0x2c55) [0x7f3f6002cbe5]
10: (__libc_start_main()+0xf5) [0x7f3f5cc1eb15]
11: (()+0x353009) [0x7f3f60077009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 0> 2016-07-13 10:23:42.909150 7f3f5fd0c800 -1 *** Caught signal (Aborted) **
in thread 7f3f5fd0c800 thread_name:ceph-osd

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x91341a) [0x7f3f6063741a]
2: (()+0xf100) [0x7f3f5e66f100]
3: (gsignal()+0x37) [0x7f3f5cc325f7]
4: (abort()+0x148) [0x7f3f5cc33ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f3f60737797]
6: (FileStore::init_temp_collections()+0xa63) [0x7f3f603ff613]
7: (FileStore::mount()+0x3ed0) [0x7f3f604036d0]
8: (OSD::init()+0x27d) [0x7f3f600c704d]
9: (main()+0x2c55) [0x7f3f6002cbe5]
10: (__libc_start_main()+0xf5) [0x7f3f5cc1eb15]
11: (()+0x353009) [0x7f3f60077009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 0 osd
0/ 5 optracker
0/ 5 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
0/ 0 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent     10000
max_new         1000
log_file /var/log/ceph/ceph-osd.476.log
--- end dump of recent events ---

If I try to start them again it works, but sometimes requires two additional tries.

systemctl reset-failed ceph-osd@476.service
systemctl start ceph-osd@476.service

It is a bit hard to diagnose since it doesn't happen on all OSDs and it doesn't always happen a second time on the same OSD.

#4 Updated by Nathan Cutler almost 4 years ago

  • Description updated (diff)

#5 Updated by David Zafman almost 4 years ago

  • Status changed from In Progress to Resolved

#6 Updated by Loic Dachary over 3 years ago

  • Target version set to v10.2.3

Also available in: Atom PDF