Project

General

Profile

Actions

Bug #16688

closed

Crash in "FileStore::queue_transactions" in rados run

Added by Yuri Weinstein almost 8 years ago. Updated over 7 years ago.

Status:
Won't Fix
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/yuriw-2016-07-12_13:07:42-rados-wip-yuri-testing_2016_7_12-distro-basic-smithi/
Jobs: 309938
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2016-07-12_13:07:42-rados-wip-yuri-testing_2016_7_12-distro-basic-smithi/309938/

in remote/*/ceph-osd.0.log.gz

ceph-osd.0.log.gz:2037417389- in thread 7f718bcb6700
ceph-osd.0.log.gz:2037417413-
ceph-osd.0.log.gz:2037417414: ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
ceph-osd.0.log.gz:2037417478- 1: ceph-osd() [0x98207f]
ceph-osd.0.log.gz:2037417504- 2: (()+0x10340) [0x7f71a2b9a340]
ceph-osd.0.log.gz:2037417538- 3: (gsignal()+0x39) [0x7f71a1241cc9]
ceph-osd.0.log.gz:2037417576- 4: (abort()+0x148) [0x7f71a12450d8]
ceph-osd.0.log.gz:2037417613- 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x204) [0x7f71a2dc7114]
ceph-osd.0.log.gz:2037417769- 6: (()+0x14184) [0x7f71a2dbc184]
ceph-osd.0.log.gz:2037417803- 7: (operator delete(void*)+0x1e4) [0x7f71a2ddadf4]
ceph-osd.0.log.gz:2037417855- 8: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x6fa) [0x86eb6a]
ceph-osd.0.log.gz:2037418072- 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x165) [0x665a65]
ceph-osd.0.log.gz:2037418264- 10: (remove_dir(CephContext*, ObjectStore*, SnapMapper*, OSDriver*, ObjectStore::Sequencer*, coll_t, std::tr1::shared_ptr<DeletingState>, ThreadPool::TPHandle&)+0x479) [0x603c69]
ceph-osd.0.log.gz:2037418444- 11: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x1ee) [0x609b0e]
ceph-osd.0.log.gz:2037418590- 12: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0x10a) [0x6662ba]
ceph-osd.0.log.gz:2037418835- 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xae9) [0xa55ad9]
ceph-osd.0.log.gz:2037418903- 14: (ThreadPool::WorkThread::entry()+0x10) [0xa569c0]
ceph-osd.0.log.gz:2037418958- 15: (()+0x8182) [0x7f71a2b92182]
ceph-osd.0.log.gz:2037418992- 16: (clone()+0x6d) [0x7f71a130547d]
ceph-osd.0.log.gz:2037419029- NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #1

Updated by Kefu Chai almost 8 years ago

  • Assignee set to Kefu Chai

will take a look

Actions #2

Updated by David Zafman over 7 years ago

This was seen in another rados run:

loic-2016-07-18_07:08:24-rados-master---basic-smithi/320873

It was the same 11429.yaml which and upgrading still running firefly. In the log the following message came out:

2016-07-18 18:24:08.177793 7f138d7ff700 0 filestore(/var/lib/ceph/osd/ceph-0) queue_transactions filestore_blackhole = TRUE, dropping transaction

int FileStore::queue_transactions(Sequencer *posr, list<Transaction*> &tls,
                                  TrackedOpRef osd_op,
                                  ThreadPool::TPHandle *handle)
{
  Context *onreadable;
  Context *ondisk;
  Context *onreadable_sync;
  ObjectStore::Transaction::collect_contexts(
    tls, &onreadable, &ondisk, &onreadable_sync);
  if (g_conf->filestore_blackhole) {
    dout(0) << "queue_transactions filestore_blackhole = TRUE, dropping transaction" << dendl;
    delete ondisk;
    delete onreadable;
    delete onreadable_sync;
    return 0;
  }
Actions #3

Updated by Kefu Chai over 7 years ago

  • Status changed from New to Need More Info
  • Assignee changed from Kefu Chai to Yuri Weinstein

@Yuri, are we still supporting firefly? 0.80.* is retired, i think.

Actions #4

Updated by Nathan Cutler over 7 years ago

@Kefu Chai: This is rados/singleton-nomsgr/{rados.yaml all/11429.yaml} which might be an important regression test to have around? It installs firefly 0.80.8 here:

https://github.com/ceph/ceph-qa-suite/blob/master/suites/rados/singleton-nomsgr/all/11429.yaml#L41

There's also 13324.yaml which installs dumpling!

https://github.com/ceph/ceph-qa-suite/blob/master/suites/rados/singleton-nomsgr/all/13234.yaml#L40

Actions #5

Updated by Kefu Chai over 7 years ago

@Nathan Weinberg

thanks for looking into the tests involving retired releases. yeah, they are upgraded tests

  • 11429.yaml: from firefly to hammer, then to current branch
  • 13234.yaml: from dumpling to firefly, then to hammer, and then to current branch.

but the funny thing is, the crash happened in a retired release. and the intent of #11429 was trying verify that hammer is able to handle the case where pg is referencing non-existent osdmap.

Actions #6

Updated by Kefu Chai over 7 years ago

i tried to to reproduce this issue manually. but seems osd.0 is not slow enough that it finished handling the pg-remove request before "ceph daemon osd.0 config set filestore_blackhole true" returns.

Actions #7

Updated by Kefu Chai over 7 years ago

  • Status changed from Need More Info to New
  • Assignee deleted (Yuri Weinstein)

resigning from myself for now.

Actions #8

Updated by Samuel Just over 7 years ago

  • Status changed from New to Won't Fix

If this is a bug in firefly which is happening infrequently, I think we're going to ignore it.

Actions #9

Updated by Yuri Weinstein over 7 years ago

Just to keep record, also seen in http://qa-proxy.ceph.com/teuthology/yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/orig.config.yaml

description: rados/singleton-nomsgr/{rados.yaml all/11429.yaml}
http://qa-proxy.ceph.com/teuthology/yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/teuthology.log

teuthology@teuthology:~$ zgrep "^ ceph version"  /a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/*osd* -b20 -a30

6:12.986418 7fa023c33700 15 filestore(/var/lib/ceph/osd/ceph-0) omap_get_values meta/a468ec03/snapmapper/0//-1
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375115770-    -2> 2016-09-20 15:16:12.986432 7fa023c33700 15 filestore(/var/lib/ceph/osd/ceph-0) omap_get_values meta/a468ec03/snapmapper/0//-1
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375115904-    -1> 2016-09-20 15:16:12.986448 7fa023c33700  0 filestore(/var/lib/ceph/osd/ceph-0) queue_transactions filestore_blackhole = TRUE, dropping transaction
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116059-     0> 2016-09-20 15:16:12.988211 7fa023c33700 -1 *** Caught signal (Aborted) **
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116141- in thread 7fa023c33700
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116165-
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116166: ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116230- 1: ceph-osd() [0x98207f]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116256- 2: (()+0x10330) [0x7fa03ab23330]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116290- 3: (gsignal()+0x37) [0x7fa0391c8c37]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116328- 4: (abort()+0x148) [0x7fa0391cc028]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116365- 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x204) [0x7fa03ad50114]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116521- 6: (()+0x14184) [0x7fa03ad45184]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116555- 7: (operator delete(void*)+0x1e4) [0x7fa03ad63df4]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116607- 8: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x6fa) [0x86eb6a]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375116824- 9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x165) [0x665a65]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117016- 10: (remove_dir(CephContext*, ObjectStore*, SnapMapper*, OSDriver*, ObjectStore::Sequencer*, coll_t, std::tr1::shared_ptr<DeletingState>, ThreadPool::TPHandle&)+0x816) [0x604006]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117196- 11: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x1ee) [0x609b0e]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117342- 12: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0x10a) [0x6662ba]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117587- 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xae9) [0xa55ad9]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117655- 14: (ThreadPool::WorkThread::entry()+0x10) [0xa569c0]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117710- 15: (()+0x8184) [0x7fa03ab1b184]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117744- 16: (clone()+0x6d) [0x7fa03928c37d]
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117781- NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117874-
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117875---- logging levels ---
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117898-   0/ 5 none
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117911-   0/ 1 lockdep
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117927-   0/ 1 context
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117943-   1/ 1 crush
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117957-   1/ 5 mds
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117969-   1/ 5 mds_balancer
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375117990-   1/ 5 mds_locker
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375118009-   1/ 5 mds_log
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375118025-   1/ 5 mds_log_expire
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375118048-   1/ 5 mds_migrator
/a//yuriw-2016-09-20_14:50:21-rados-wip-yuri-testing2_2016_09_19-distro-basic-smithi/425869/remote/smithi104/log/ceph-osd.0.log.gz:2375118069-   0/ 1 buffer
Actions

Also available in: Atom PDF