Bug #11512

"FAILED assert(0 == "unexpected error")" in upgrade:giant-x-hammer-distro-basic-vps run

Added by Yuri Weinstein almost 9 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/giant-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2015-04-29_17:05:01-upgrade:giant-x-hammer-distro-basic-vps/
Job: ['868461']
Logs: http://pulpito.ceph.com/teuthology-2015-04-29_17:05:01-upgrade:giant-x-hammer-distro-basic-vps/868461

Assertion: os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
ceph version 0.94.1-10-gabc0741 (abc0741d57f30a39a18106bf03576e980ad89177)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x968536]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x9702c4]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x61e) [0x97e6ce]
 4: (FileStore::mount()+0x2d0e) [0x95688e]
 5: (OSD::init()+0x273) [0x688c23]
 6: (main()+0x384f) [0x62e15f]
 7: (__libc_start_main()+0xfd) [0x7f22ddc48d1d]
 8: ceph-osd() [0x629869]

dump.txt (20.2 KB) Jeff Epstein, 09/19/2015 07:09 AM

History

#1 Updated by Daniel Lundqvist almost 9 years ago

I have also encountered this problem running Giant. All but 1 of my 6 OSDs give this error.

starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f097f71f7c0 time 2015-05-05 11:31:28.448952
os/FileStore.cc: 2715: FAILED assert(0 == "unexpected error")
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0xb01b02]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc7b) [0x8d2dfb]
3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x8d7b24]
4: (JournalingObjectStore::journal_replay(unsigned long)+0x8da) [0x8f319a]
5: (FileStore::mount()+0x2fd2) [0x8c0972]
6: (OSD::do_convertfs(ObjectStore*)+0x2d) [0x6023cd]
7: (main()+0x2495) [0x5eaf15]
8: (__libc_start_main()+0xf0) [0x7f097d148800]
9: (_start()+0x29) [0x5f09b9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-05-05 11:31:28.450803 7f097f71f7c0 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f097f71f7c0 time 2015-05-05 11:31:28.448952
os/FileStore.cc: 2715: FAILED assert(0 == "unexpected error")

#2 Updated by Daniel Lundqvist almost 9 years ago

I have also encountered this problem running Giant. All but 1 of my 6 OSDs give this error.

starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f097f71f7c0 time 2015-05-05 11:31:28.448952
os/FileStore.cc: 2715: FAILED assert(0 == "unexpected error")
 ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0xb01b02]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc7b) [0x8d2dfb]
 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x8d7b24]
 4: (JournalingObjectStore::journal_replay(unsigned long)+0x8da) [0x8f319a]
 5: (FileStore::mount()+0x2fd2) [0x8c0972]
 6: (OSD::do_convertfs(ObjectStore*)+0x2d) [0x6023cd]
 7: (main()+0x2495) [0x5eaf15]
 8: (__libc_start_main()+0xf0) [0x7f097d148800]
 9: (_start()+0x29) [0x5f09b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-05-05 11:31:28.450803 7f097f71f7c0 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f097f71f7c0 time 2015-05-05 11:31:28.448952
os/FileStore.cc: 2715: FAILED assert(0 == "unexpected error")

#3 Updated by Samuel Just almost 9 years ago

  • Regression set to No

From the teuthology case:

{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "17.2_head",
            "oid": "2\/\/head\/\/17"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "17.2_head"
        },
        {
            "op_num": 2,
            "op_name": "rmcoll",
            "collection": "17.2_TEMP"
        }
    ]
}

0> 2015-04-30 01:17:33.390509 7f22e0231800 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f22e0231800 time 2015-04-30 01:17:33.362932
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
ceph version 0.94.1-10-gabc0741 (abc0741d57f30a39a18106bf03576e980ad89177)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x968536]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x9702c4]
3: (JournalingObjectStore::journal_replay(unsigned long)+0x61e) [0x97e6ce]
4: (FileStore::mount()+0x2d0e) [0x95688e]
5: (OSD::init()+0x273) [0x688c23]
6: (main()+0x384f) [0x62e15f]
7: (__libc_start_main()+0xfd) [0x7f22ddc48d1d]
8: ceph-osd() [0x629869]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#4 Updated by Sage Weil almost 9 years ago

    -3> 2015-04-30 01:17:33.360082 7f22e0231800  0 filestore(/var/lib/ceph/osd/ceph-4)  error (39) Directory not empty not handled on operation 0x3c993de (17852.0.2, or op 2, counting from 0)
    -2> 2015-04-30 01:17:33.360113 7f22e0231800  0 filestore(/var/lib/ceph/osd/ceph-4) ENOTEMPTY suggests garbage data in osd data dir
    -1> 2015-04-30 01:17:33.360130 7f22e0231800  0 filestore(/var/lib/ceph/osd/ceph-4)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "17.2_head",
            "oid": "2\/\/head\/\/17" 
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "17.2_head" 
        },
        {
            "op_num": 2,
            "op_name": "rmcoll",
            "collection": "17.2_TEMP" 
        }
    ]
}

     0> 2015-04-30 01:17:33.390509 7f22e0231800 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f22e0231800 time 2015-04-30 01:17:33.362932
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

 ceph version 0.94.1-10-gabc0741 (abc0741d57f30a39a18106bf03576e980ad89177)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x968536]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x9702c4]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x61e) [0x97e6ce]
 4: (FileStore::mount()+0x2d0e) [0x95688e]
 5: (OSD::init()+0x273) [0x688c23]
 6: (main()+0x384f) [0x62e15f]
 7: (__libc_start_main()+0xfd) [0x7f22ddc48d1d]
 8: ceph-osd() [0x629869]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
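
For anyone reading the excerpt above: the log reports error 39 (Directory not empty) on "op 2, counting from 0", which in the transaction dump is the rmcoll of 17.2_TEMP. The sketch below is illustrative only and is not Ceph code. It looks up that op in the dump, then shows the same errno class with a plain directory removal. It assumes that removing a collection ends in a directory removal on the backing filesystem, which is what the ENOTEMPTY message suggests. The helper names `failing_op` and `rmdir_nonempty_errno` are made up for this example.

```python
import errno
import json
import os
import tempfile

# Transaction dump copied from the log excerpt in this thread.
TRANSACTION_DUMP = r"""
{"ops": [
  {"op_num": 0, "op_name": "remove", "collection": "17.2_head", "oid": "2\/\/head\/\/17"},
  {"op_num": 1, "op_name": "rmcoll", "collection": "17.2_head"},
  {"op_num": 2, "op_name": "rmcoll", "collection": "17.2_TEMP"}
]}
"""


def failing_op(dump_text, op_index):
    """Return the op whose op_num matches the index reported in the log."""
    for op in json.loads(dump_text)["ops"]:
        if op["op_num"] == op_index:
            return op
    raise KeyError(f"no op with op_num {op_index}")


def rmdir_nonempty_errno():
    """Try to remove a directory that still holds a file; return the errno."""
    with tempfile.TemporaryDirectory() as root:
        coll = os.path.join(root, "coll")  # hypothetical stand-in for a collection dir
        os.mkdir(coll)
        with open(os.path.join(coll, "leftover"), "w") as f:
            f.write("stray data")
        try:
            os.rmdir(coll)
        except OSError as e:
            return e.errno
    return None


if __name__ == "__main__":
    op = failing_op(TRANSACTION_DUMP, 2)
    print(op["op_name"], op["collection"])
    code = rmdir_nonempty_errno()
    print(code, errno.errorcode.get(code))
```

This only shows that a leftover entry in a directory is enough to produce this error on removal. It does not say where the leftover entry in the OSD data dir came from.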

#5 Updated by Samuel Just almost 9 years ago

Daniel: that assert can be triggered by many things. Can you paste the 100 lines above that?

#6 Updated by Samuel Just over 8 years ago

  • Status changed from New to Can't reproduce

#7 Updated by Jeff Epstein over 8 years ago

This bug also affects me. ceph-osd crashes every few hours with this. Enclosed please find the logs.

#8 Updated by Nathan Cutler over 8 years ago

  • Status changed from Can't reproduce to Fix Under Review

#9 Updated by Jeff Epstein over 8 years ago

We believe that this crash is a result of ceph-osd running out of file descriptors, which stems from running it directly from the command line, rather than through the init script which sets ulimit appropriately.
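
A quick way to check the file descriptor hypothesis is to compare the open-files limit a process actually inherited when started by hand with the limit it gets from the init script. The following is a minimal sketch, assuming Linux, where `/proc/<pid>/limits` lists per-process limits. The sample text and its numbers are illustrative and not taken from the affected hosts, and the function names are made up for this example.

```python
import resource


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of None means the limit is reported as "unlimited".
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def current_nofile_limit():
    """Return the (soft, hard) open-files limit of this Python process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Illustrative sample in the /proc/<pid>/limits layout (values are made up).
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""

if __name__ == "__main__":
    print(parse_max_open_files(SAMPLE_LIMITS))
    print(current_nofile_limit())
```

To inspect a running daemon, read `/proc/<pid>/limits` for the ceph-osd pid and pass the text to `parse_max_open_files`. A daemon started from an interactive shell inherits that shell's limit unless it is raised first.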

#10 Updated by Sage Weil over 8 years ago

Jeff Epstein wrote:

We believe that this crash is a result of ceph-osd running out of file descriptors, which stems from running it directly from the command line, rather than through the init script which sets ulimit appropriately.

I would expect a different error if we hit the fd limit, but... does using the init script stop the crashes for you?

#11 Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to Need More Info

#12 Updated by Jeff Epstein over 8 years ago

Sage Weil wrote:

I would expect a different error if we hit the fd limit, but... does using the init script stop the crashes for you?

Yes, we haven't reproduced this error when the daemon is started from the init script. Without the init script, the daemon fails within a few minutes of beginning a large rebalancing.

#13 Updated by Samuel Just over 8 years ago

  • Status changed from Need More Info to Can't reproduce

I don't think this new information is related to the old reports. If you think the daemon is corrupting the store when it runs out of fds, please reproduce and open another bug with debug osd = 20, debug filestore = 20, debug ms = 1.
