
Bug #9891

"Assertion: os/DBObjectMap.cc: 1214: FAILED assert(0)" in upgrade:firefly-x-giant-distro-basic-multi run

Added by Yuri Weinstein about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 10/26/2014
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: giant,firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-25_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/571584/

(from scrape)

Assertion: os/DBObjectMap.cc: 1214: FAILED assert(0)
ceph version 0.86-104-gb05efdd (b05efddb77290b86eb5c150776c761ab84f66f37)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb81feb]
 2: (DBObjectMap::lookup_parent(std::tr1::shared_ptr<DBObjectMap::_Header>)+0x798) [0xa2da28]
 3: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xf4) [0xa30e74]
 4: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x202) [0xa35962]
 5: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x16b) [0x8ff01b]
 6: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x8ff4ab]
 7: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2cc1) [0x90ff01]
 8: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x913624]
 9: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x18f) [0x9137cf]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb73176]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb74220]
 12: (()+0x8182) [0x7ffbf6ff8182]
 13: (clone()+0x6d) [0x7ffbf5563fbd]

archive_path: /var/lib/teuthworker/archive/teuthology-2014-10-25_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/571584
branch: giant
description: upgrade:firefly-x/parallel/{0-cluster/start.yaml 1-firefly-install/firefly.yaml
  2-workload/{ec-rados-default.yaml rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml
  test_rbd_python.yaml} 3-upgrade-sequence/upgrade-all.yaml 4-final-workload/{rados-snaps-few-objects.yaml
  rados_loadgenmix.yaml rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml
  rgw_swift.yaml} distros/ubuntu_14.04.yaml}
email: ceph-qa@ceph.com
job_id: '571584'
kernel: &id001
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: plana,burnupi,mira
name: teuthology-2014-10-25_18:13:01-upgrade:firefly-x-giant-distro-basic-multi
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: giant
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: b05efddb77290b86eb5c150776c761ab84f66f37
  ceph-deploy:
    branch:
      dev: giant
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: b05efddb77290b86eb5c150776c761ab84f66f37
  s3tests:
    branch: giant
  workunit:
    sha1: b05efddb77290b86eb5c150776c761ab84f66f37
owner: scheduled_teuthology@teuthology
priority: 99
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:firefly-x
suite_branch: giant
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_giant
targets:
  ubuntu@burnupi55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoNkQL+cTE/GCwShmnIodptdNXomyURFXp3UtW2a5/5qqq/4ic1ujJ5KAxPSs+uJCP5u+RVpoX6O7WIT0RC9OjPo4yQRU2CXk0AneE0rS2JK0DBz5KJlpKXwjQ5hkU71uaXMkUDJE9YE1faZ3C92hnlyIixjsmWfNEdxcGWUCFRc1/wXNbJsp5wyTIy+RHqwxuy+IRp3Oa1qvrbM2u0wEx5qCFUvlj4M1F+Oi7c531BWXRMlh9VbDa3scs2qc3Cu5foglcMe4B6Fxoj9J8ZPfdX/lI3qHlgRdnTZzB3fCCNldc9ZOreAbL/8eV+fHRMxG6uuCKp2b9M+scxh4puZ2J
  ubuntu@plana29.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDgHl5M5raEaWWKwH7thuC+Glhc6UlVLHpvIMDGMezlVOELXUB197ZAyurxpWurUteDEaqevUqjUgi08lEL0dHW4G9ulst8B3FqHrAvIvCmhKwGs1znrorZx2Bq+/8mreE4ocMWfT/R8sFzBsM3npgdgSqCAdDBSgI78S92WSlHGAqUz1A0iJoGdwiRjTKhCrI/tuIYyXUU9z/2vIR9bJTp0fwt2S3z4LnpKdsMU8BGpJwm8CPrVj/odRDZ/epgEFBySb4OUq68QDXlg8RHnL1D+zFVsoi/uU5o9rW55EAo7KnPDcdNCQSsZmaurwRaZxVSqhomF0kIU9oL1wlD+s7N
  ubuntu@plana38.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCnqJmM2k0mJ+z0UPorFQERz0n6Dm6kgc51D9b/Mxebr4tNb2JIhI0L23thzhnsWW/q9UzIyVFZriTFmjdrvHjmfqByoihjUiX2QnHF0jDskMuywSUm+hvhwU+P7s+BSC5HZEWf/oQJ/w8WuLXOfsruMJu2cQc6ow3Zj+UzeGwO0wab5knKwJDG70aVfrs3wQjcm1RQWSqv+oVG4/CXQnC+TmAMR1/2G42XBNjjF5vl8M47M5zZKW+czP5r40dC5LmNU8lyvmvikxIaoxQS5n9SPSUtuAF1PM6UnSy+Rq2RZNvI/yRVWyYuV0UBwwFN8EarZ9icstCv2P1rzsMMeNYv
tasks:
- internal.lock_machines:
  - 3
  - plana,burnupi,mira
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.push_inventory: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: firefly
- print: '**** done installing firefly'
- ceph:
    fs: xfs
- print: '**** done ceph'
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- rados:
    clients:
    - client.1
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- workunit:
    clients:
      client.1:
      - rados/load-gen-mix.sh
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- workunit:
    clients:
      client.1:
      - rados/test.sh
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- workunit:
    clients:
      client.1:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- rgw:
  - client.1
- swift:
    client.1:
      rgw_server: client.1
teuthology_branch: master
tube: multi
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - ceph.restart:
    - mon.a
    - mon.b
    - mon.c
    - mds.a
    - osd.0
    - osd.1
    - osd.2
    - osd.3
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.multi.3115
workload:
  sequential:
  - rados:
      clients:
      - client.0
      ec_pool: true
      objects: 50
      op_weights:
        append: 100
        copy_from: 50
        delete: 50
        read: 100
        rmattr: 25
        rollback: 50
        setattr: 25
        snap_create: 50
        snap_remove: 50
        write: 0
      ops: 4000
  - workunit:
      branch: firefly
      clients:
        client.0:
        - cls
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rados/load-gen-big.sh
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd.sh
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd_python.sh
duration: 2967.701498031616
failure_reason: '''wait_until_healthy''reached maximum tries (150) after waiting for
  900 seconds'
flavor: basic
owner: scheduled_teuthology@teuthology
status: fail
success: false

History

#1 Updated by Samuel Just about 4 years ago

2014-10-25 18:56:23.243456 7fefdcc3e700 20 filestore dbobjectmap: seq is 485
2014-10-25 18:56:23.243559 7fefdd43f700 20 filestore dbobjectmap: seq is 486
2014-10-25 18:56:23.243602 7fefdd43f700 20 filestore dbobjectmap: seq is 487
2014-10-25 18:56:23.243628 7fefdd43f700 20 filestore set_header: setting seq 324
2014-10-25 18:56:23.243642 7fefdd43f700 20 filestore set_map_header: setting 485 oid fb2248dc/burnupi5544081-40/45//3/18446744073709551615/2 parent seq 324
2014-10-25 18:56:23.243655 7fefdd43f700 20 filestore set_map_header: setting 486 oid fb2248dc/burnupi5544081-40/45//3/484/2 parent seq 324
...
2014-10-25 18:56:30.801289 7ffbf8345900 0 ceph version 0.86-104-gb05efdd (b05efddb77290b86eb5c150776c761ab84f66f37), process ceph-osd, pid 18817
...
2014-10-25 18:56:31.104055 7ffbf8345900 20 filestore (init)dbobjectmap: seq is 485
...
2014-10-25 18:56:40.629704 7ffbe8e4d700 20 filestore(/var/lib/ceph/osd/ceph-2) objectmap clone
2014-10-25 18:56:40.629834 7ffbe8e4d700 20 filestore dbobjectmap: seq is 487
2014-10-25 18:56:40.629878 7ffbe8e4d700 20 filestore dbobjectmap: seq is 488
2014-10-25 18:56:40.629907 7ffbe8e4d700 20 filestore set_header: setting seq 453
2014-10-25 18:56:40.629926 7ffbe8e4d700 20 filestore set_map_header: setting 486 oid ac9ea7a1/burnupi5544081-15/head//3/18446744073709551615/1 parent seq 453
2014-10-25 18:56:40.629959 7ffbe8e4d700 20 filestore set_map_header: setting 487 oid ac9ea7a1/burnupi5544081-15/64//3/18446744073709551615/1 parent seq 453
...
2014-10-25 18:56:41.920041 7ffbe8e4d700 20 filestore remove_map_header: removing 486 oid fb2248dc/burnupi5544081-40/45//3/484/2
2014-10-25 18:56:41.920056 7ffbe8e4d700 20 filestore clear_header: clearing seq 486
...
2014-10-25 18:56:41.933829 7ffbe8e4d700 20 filestore remove_map_header: removing 489 oid ac9ea7a1/burnupi5544081-15/head//3/360/1
2014-10-25 18:56:41.933842 7ffbe964e700 10 filestore oid: 1113f7/pglog_3.as2/0//-1 not skipping op, *spos 7669.0.4
2014-10-25 18:56:41.933844 7ffbe8e4d700 20 filestore clear_header: clearing seq 489
2014-10-25 18:56:41.933847 7ffbe964e700 10 filestore > header.spos 0.0.0
2014-10-25 18:56:41.933890 7ffbe964e700 15 filestore(/var/lib/ceph/osd/ceph-2) _omap_setkeys meta/1113f7/pglog_3.as2/0//-1
2014-10-25 18:56:41.933932 7ffbe964e700 10 filestore oid: 1113f7/pglog_3.as2/0//-1 not skipping op, *spos 7669.0.5
2014-10-25 18:56:41.933937 7ffbe964e700 10 filestore > header.spos 0.0.0
2014-10-25 18:56:41.934078 7ffbe964e700 10 filestore(/var/lib/ceph/osd/ceph-2) -ERANGE, len is 802
2014-10-25 18:56:41.934107 7ffbe964e700 10 filestore(/var/lib/ceph/osd/ceph-2) -ERANGE, got 401
2014-10-25 18:56:41.934108 7ffbe8e4d700 20 filestore lookup_parent: parent 486 for seq 489

So the transaction updating the dbobjectmap seq didn't apply for some reason. It would have been submitted before the transaction that actually writes out the nodes using it, which is definitely odd since burnupi5544081-40/45//3/484/2 was apparently written out with seq 486. This must be either a filesystem issue or a leveldb issue, though it's not clear to me how it could be a filesystem issue.
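A toy model (illustrative names only, not Ceph code) of the header/parent relationship the log ends on: header seq 489 records parent seq 486, but header 486 has already been cleared, so the parent lookup comes up empty. That is the situation the assert at DBObjectMap.cc:1214 guards against.

```python
class MissingParent(Exception):
    pass

headers = {}  # seq -> {"parent": parent seq or None}

def set_header(seq, parent=None):
    headers[seq] = {"parent": parent}

def clear_header(seq):
    del headers[seq]

def lookup_parent(seq):
    parent = headers[seq]["parent"]
    if parent is not None and parent not in headers:
        # corresponds to "FAILED assert(0)" at os/DBObjectMap.cc:1214
        raise MissingParent(f"parent {parent} for seq {seq} not found")
    return parent

# Replay the shape of the log above: a header for seq 486 exists, a
# child header (489) is created pointing at it, then header 486 is
# cleared (remove_map_header/clear_header 486), leaving 489 with a
# dangling parent reference.
set_header(486, parent=None)
set_header(489, parent=486)
clear_header(486)
try:
    lookup_parent(489)
except MissingParent as e:
    print(e)
```

The model only captures the data-structure invariant; how two live headers came to share seq 486 in the first place is the question the rest of this ticket answers.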

#2 Updated by Sage Weil about 4 years ago

  • Priority changed from Urgent to High

This does not appear to be a Ceph issue; more likely a bad disk, leveldb corruption, or something similar. Lowering priority.

#3 Updated by Samuel Just almost 4 years ago

  • Status changed from New to Rejected

#4 Updated by Loic Dachary almost 4 years ago

http://pulpito.ceph.com/loic-2015-01-10_11:54:25-rados-giant-backports-testing-basic-vps/694324/

    -6> 2015-01-11 09:18:46.785891 7f6d14685700  1 -- 10.214.140.199:6803/19206 <== osd.5 10.214.138.167:0/6293 12 ==== osd_ping(ping e459 stamp 2015-01-11 09:18:46.774003) v2 ==== 47+0+0 (3607071168 0 0) 0x5d42a00 con 0x5abd000
    -5> 2015-01-11 09:18:46.785913 7f6d14685700  1 -- 10.214.140.199:6803/19206 --> 10.214.138.167:0/6293 -- osd_ping(ping_reply e459 stamp 2015-01-11 09:18:46.774003) v2 -- ?+0 0x5a4ec00 con 0x5abd000
    -4> 2015-01-11 09:18:46.785929 7f6d14685700 20 osd.2 459 share_map_peer 0x5a45000 already has epoch 459
    -3> 2015-01-11 09:18:46.785822 7f6d12e82700  1 -- 10.214.140.199:6802/19206 <== osd.5 10.214.138.167:0/6293 12 ==== osd_ping(ping e459 stamp 2015-01-11 09:18:46.774003) v2 ==== 47+0+0 (3607071168 0 0) 0x5dbfa00 con 0x5abd160
    -2> 2015-01-11 09:18:46.785998 7f6d12e82700  1 -- 10.214.140.199:6802/19206 --> 10.214.138.167:0/6293 -- osd_ping(ping_reply e459 stamp 2015-01-11 09:18:46.774003) v2 -- ?+0 0x5499200 con 0x5abd160
    -1> 2015-01-11 09:18:46.786011 7f6d12e82700 20 osd.2 459 share_map_peer 0x5a45000 already has epoch 459
     0> 2015-01-11 09:18:46.804341 7f6d1c7c3700 -1 os/DBObjectMap.cc: In function 'DBObjectMap::Header DBObjectMap::lookup_parent(DBObjectMap::Header)' thread 7f6d1c7c3700 time 2015-01-11 09:18:46.775993
os/DBObjectMap.cc: 1214: FAILED assert(0)

 ceph version 0.87-87-g0ea20e6 (0ea20e6c51208d6710f469454ab3f964bfa7c9d2)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa04bf]
 2: (DBObjectMap::lookup_parent(std::tr1::shared_ptr<DBObjectMap::_Header>)+0x925) [0x9790e5]
 3: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xe5) [0x9792c5]
 4: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x1a5) [0x97d325]
 5: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x250) [0x8ab400]
 6: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x172) [0x8ab872]
 7: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x3bcc) [0x8bbc5c]
 8: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x8be064]
 9: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x297) [0x8be317]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa900ee]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xa92e80]
 12: (()+0x7e9a) [0x7f6d27eabe9a]
 13: (clone()+0x6d) [0x7f6d268762ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#5 Updated by Yuri Weinstein over 3 years ago

  • Status changed from Rejected to New

Seen in release giant v0.87.1 RC
Run: http://pulpito.front.sepia.ceph.com/teuthology-2015-02-12_17:29:32-rados-giant-distro-basic-multi/
Job: ['754710']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-12_17:29:32-rados-giant-distro-basic-multi/754710/

Assertion: os/DBObjectMap.cc: 1214: FAILED assert(0)
ceph version 0.87-159-gccb0914 (ccb0914f76da23acdd7374233cd1939ab80ef3c8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb82c4b]
 2: (DBObjectMap::lookup_parent(std::tr1::shared_ptr<DBObjectMap::_Header>)+0x798) [0xa2e6d8]
 3: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xf4) [0xa31b24]
 4: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x202) [0xa36612]
 5: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x16b) [0x8ff55b]
 6: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x8b) [0x8ff9eb]
 7: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2cc1) [0x910441]
 8: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x913b64]
 9: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x18f) [0x913d0f]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa46) [0xb73dd6]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb74e80]
 12: (()+0x8182) [0x7f5c252de182]
 13: (clone()+0x6d) [0x7f5c23848efd]

#6 Updated by Yuri Weinstein over 3 years ago

  • Backport set to giant

#7 Updated by Yuri Weinstein over 3 years ago

  • Priority changed from High to Urgent

#8 Updated by Yuri Weinstein over 3 years ago

Verified that this test:

rados/thrash-erasure-code/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/default.yaml workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}

passed in run http://pulpito.front.sepia.ceph.com/teuthology-2015-02-17_14:43:58-rados-giant-distro-basic-multi/

Job/logs - http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-17_14:43:58-rados-giant-distro-basic-multi/763604/

#9 Updated by Samuel Just over 3 years ago

  • Status changed from New to Testing
  • Assignee set to Samuel Just
  • Backport changed from giant to hammer,giant,firefly,dumpling

2015-02-13 03:40:47.208535 7f0e9b5ec700 20 filestore dbobjectmap: seq is 271
2015-02-13 03:40:47.208564 7f0e9adeb700 20 filestore dbobjectmap: seq is 271
2015-02-13 03:40:47.208579 7f0e9b5ec700 20 filestore dbobjectmap: seq is 272

Note that there are two threads participating there! DBObjectMap::sync() doesn't take the header_lock!
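A minimal sketch (illustrative names, not Ceph code) of the race those two log lines show: handing out a header seq is a read-increment-write, and if one path (here standing in for DBObjectMap::sync()) runs without the header lock, two threads can observe the same value, exactly the duplicated "seq is 271" above. A barrier forces the bad interleaving deterministically.

```python
import threading

class SeqAllocator:
    def __init__(self):
        self.seq = 270
        self.lock = threading.Lock()

    def next_seq_unlocked(self, barrier=None):
        # Racy: read, (optionally wait so both threads read first), write.
        current = self.seq
        if barrier is not None:
            barrier.wait()  # force both threads to read before either writes
        self.seq = current + 1
        return current + 1

    def next_seq_locked(self):
        # With the lock held, read-increment-write is atomic.
        with self.lock:
            self.seq += 1
            return self.seq

# Without the lock, the forced interleaving hands out seq 271 twice.
alloc = SeqAllocator()
barrier = threading.Barrier(2)
results = []
threads = [
    threading.Thread(target=lambda: results.append(alloc.next_seq_unlocked(barrier)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [271, 271]: both threads got the same seq

# With the lock, the two seqs are unique.
alloc2 = SeqAllocator()
print(sorted(alloc2.next_seq_locked() for _ in range(2)))  # [271, 272]
```

The fix implied by the comment is the locked variant: take header_lock around the seq update in sync() as well.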

#10 Updated by Samuel Just over 3 years ago

This can be ignored for the present giant release though.

#11 Updated by Samuel Just over 3 years ago

  • Status changed from Testing to Pending Backport

#14 Updated by Loic Dachary over 3 years ago

  • Backport changed from hammer,giant,firefly,dumpling to hammer,giant,firefly

dumpling is end of life

#15 Updated by Loic Dachary over 3 years ago

  • Backport changed from hammer,giant,firefly to giant,firefly

already in hammer

#16 Updated by Loic Dachary over 3 years ago

  • Status changed from Pending Backport to Resolved
