Bug #5154

closed

osd/SnapMapper.cc: 270: FAILED assert(check(oid))

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

     0> 2013-05-23 01:52:34.870678 7f8a88dd0700 -1 osd/SnapMapper.cc: In function 'int SnapMapper::remove_oid(const hobject_t&, MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)' thread 7f8a88dd0700 time 2013-05-23 01:52:34.842007
osd/SnapMapper.cc: 270: FAILED assert(check(oid))

 ceph version 0.62-198-gc2e262f (c2e262fc9493b4bb22c2b7b4990aa1ee7846940e)
 1: ceph-osd() [0x692392]
 2: (remove_dir(ObjectStore*, SnapMapper*, OSDriver*, ObjectStore::Sequencer*, coll_t, std::tr1::shared_ptr<DeletingState>)+0x3ac) [0x60a2cc]
 3: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >)+0x2d0) [0x60ace0]
 4: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x52) [0x65a822]
 5: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0xd4) [0x65a954]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83a486]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x83c2b0]
 8: (()+0x7e9a) [0x7f8a9b76be9a]
 9: (clone()+0x6d) [0x7f8a998feccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
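
For orientation: the assert fires inside SnapMapper::remove_oid() while remove_dir() is tearing down a PG collection, and remove_oid() requires that the object actually hashes into the PG this SnapMapper instance is responsible for. The snippet below is a minimal, self-contained sketch of that invariant only, not the real Ceph code; FakeSnapMapper, FakeObject and the hash/mask arithmetic are illustrative stand-ins for SnapMapper, hobject_t and the PG placement check. It shows how an object left in the wrong collection (as in duplicate #5156, split followed by pg resurrection) trips the assert during collection removal.

// Illustrative sketch only -- not the actual Ceph SnapMapper code.
#include <cassert>
#include <cstdint>
#include <iostream>

struct FakeObject {            // hypothetical stand-in for hobject_t
  uint32_t hash;               // placement hash of the object
};

class FakeSnapMapper {         // hypothetical stand-in for SnapMapper
public:
  FakeSnapMapper(uint32_t match, uint32_t mask_bits)
    : match_(match), mask_bits_(mask_bits) {}

  // true iff the low mask_bits of the hash select this mapper's PG
  bool check(const FakeObject &o) const {
    uint32_t mask = (1u << mask_bits_) - 1;
    return (o.hash & mask) == (match_ & mask);
  }

  // analogous to SnapMapper::remove_oid(): the object must belong here
  void remove_oid(const FakeObject &o) {
    assert(check(o));          // the assert that fires in this bug
    // ... erase the oid -> snaps mapping ...
  }

private:
  uint32_t match_;
  uint32_t mask_bits_;
};

int main() {
  FakeSnapMapper pg(/*match=*/0, /*mask_bits=*/3);  // think "pg 3.0" with 3 split bits
  FakeObject belongs_here{0x10};    // 0x10 & 0x7 == 0 -> this PG
  FakeObject wrong_pg{0x16};        // 0x16 & 0x7 == 6 -> some other PG

  pg.remove_oid(belongs_here);                                      // fine
  std::cout << "check(wrong_pg) = " << pg.check(wrong_pg) << "\n";  // prints 0
  // pg.remove_oid(wrong_pg);       // would abort: FAILED assert(check(oid))
  return 0;
}

Removing the in-place object succeeds, while uncommenting the last remove_oid() call aborts with the same FAILED assert(check(oid)) shape seen above.
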

Job was:
ubuntu@teuthology:/a/teuthology-2013-05-23_01:00:08-rados-next-testing-basic/20329$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: cb53a7ee6c6a42a0d8b472eb1d296219def2d877
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: c2e262fc9493b4bb22c2b7b4990aa1ee7846940e
  install:
    ceph:
      sha1: c2e262fc9493b4bb22c2b7b4990aa1ee7846940e
  s3tests:
    branch: next
  workunit:
    sha1: c2e262fc9493b4bb22c2b7b4990aa1ee7846940e
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- radosbench:
    clients:
    - client.0
    time: 1800


Related issues: 3 (0 open, 3 closed)

Has duplicate: Ceph - Bug #5156: OSD: split followed by pg resurrection might leave an object in two collections (Duplicate, Samuel Just, 05/23/2013)

Has duplicate: Ceph - Bug #5630: osd/ReplicatedPG.cc: 7089: FAILED assert(r >= 0) in scan_range, cuttlefish (Duplicate, 07/15/2013)

Has duplicate: Ceph - Bug #5667: osd/ReplicatedPG.cc: 4782: FAILED assert(obc->ssc) (Duplicate, 07/18/2013)

Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Samuel Just almost 11 years ago

  • Status changed from New to Resolved
Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from Resolved to 12

Just hit this on job:

ubuntu@teuthology:/a/teuthology-2013-06-24_01:00:12-rados-master-testing-basic/43959$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 3d740946b3b79d51f07d9a735a5fb77a849f57dd
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
  install:
    ceph:
      sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
  s3tests:
    branch: master
  workunit:
    sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
Actions #4

Updated by Samuel Just almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-07-01_01:00:09-rados-master-testing-basic/51502/remote$ less ceph-osd.4.log

0> 2013-07-01 01:35:20.186234 7fd8e8f78700 -1 osd/SnapMapper.cc: In function 'int SnapMapper::remove_oid(const hobject_t&, MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)' thread 7fd8e8f78700 time 2013-07-01 01:35:20.183963
osd/SnapMapper.cc: 270: FAILED assert(check(oid))

ceph version 0.65-225-ge235958 (e235958fb13e654aa1ad5e1c5a6ccf76a95196ca)
1: ceph-osd() [0x6ddd32]
2: (remove_dir(ObjectStore*, SnapMapper*, OSDriver*, ObjectStore::Sequencer*, coll_t, std::tr1::shared_ptr<DeletingState>)+0x3e1) [0x657431]
3: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >)+0x2d0) [0x657ef0]
4: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_process(std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x52) [0x6a5362]
5: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0xd4) [0x6a5494]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8a5136]
7: (ThreadPool::WorkThread::entry()+0x10) [0x8a6f60]
8: (()+0x7e9a) [0x7fd8fc915e9a]
9: (clone()+0x6d) [0x7fd8faaa8ccd]

Actions #5

Updated by Sage Weil almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-07-15_01:00:16-rados-next-testing-basic/67483

Actions #6

Updated by Samuel Just almost 11 years ago

  • Status changed from 12 to Need More Info
Actions #7

Updated by Samuel Just almost 11 years ago

  • Assignee deleted (Samuel Just)
Actions #8

Updated by Samuel Just almost 11 years ago

2013-07-17 15:14:10.588979 7f87c4c3a700 15 filestore(/var/lib/ceph/osd/ceph-0) _split_collection 3.0_head bits: 3
2013-07-17 15:14:10.588985 7f87c4c3a700 15 filestore(/var/lib/ceph/osd/ceph-0) collection_stat /var/lib/ceph/osd/ceph-0/current/3.0_head
2013-07-17 15:14:10.588993 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) collection_stat /var/lib/ceph/osd/ceph-0/current/3.0_head = 0
2013-07-17 15:14:10.588997 7f87c4c3a700 15 filestore(/var/lib/ceph/osd/ceph-0) collection_stat /var/lib/ceph/osd/ceph-0/current/3.7_head
2013-07-17 15:14:10.589001 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) collection_stat /var/lib/ceph/osd/ceph-0/current/3.7_head = 0
2013-07-17 15:14:10.589003 7f87c4c3a700 15 filestore(/var/lib/ceph/osd/ceph-0) collection_empty 3.7_head
2013-07-17 15:14:10.589010 7f87c4c3a700 20 _collection_list_partial 0//0//-1 1-1 ls.size 0
2013-07-17 15:14:10.589039 7f87c4c3a700 20 prefixes
2013-07-17 15:14:10.589052 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) _set_replay_guard 840.0.43 START
2013-07-17 15:14:10.589057 7f87c4c3a700 20 filestore dbobjectmap: seq is 36
2013-07-17 15:14:10.622281 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) _set_replay_guard 840.0.43 done
2013-07-17 15:14:10.622351 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) _set_replay_guard 840.0.43 START
2013-07-17 15:14:10.622360 7f87c4c3a700 20 filestore dbobjectmap: seq is 36
2013-07-17 15:14:10.648775 7f87bf42f700 1 -- 10.214.133.30:6802/12604 <== osd.4 10.214.132.3:0/26804 13 ==== osd_ping(ping e11 stamp 2013-07-17 15:14:10.649236) v2 ==== 47+0+0 (421875485 0 0) 0x29f6700 con 0x28d5b00
2013-07-17 15:14:10.648798 7f87c0431700 1 -- 10.214.133.30:6803/12604 <== osd.4 10.214.132.3:0/26804 13 ==== osd_ping(ping e11 stamp 2013-07-17 15:14:10.649236) v2 ==== 47+0+0 (421875485 0 0) 0x27c4380 con 0x28282c0
2013-07-17 15:14:10.648812 7f87bf42f700 1 -- 10.214.133.30:6802/12604 --> 10.214.132.3:0/26804 -- osd_ping(ping_reply e11 stamp 2013-07-17 15:14:10.649236) v2 -- ?+0 0x27d31c0 con 0x28d5b00
2013-07-17 15:14:10.648837 7f87bf42f700 20 osd.0 11 _share_map_outgoing 0x2793b00 already has epoch 11
2013-07-17 15:14:10.648857 7f87c0431700 1 -- 10.214.133.30:6803/12604 --> 10.214.132.3:0/26804 -- osd_ping(ping_reply e11 stamp 2013-07-17 15:14:10.649236) v2 -- ?+0 0x27b7e00 con 0x28282c0
2013-07-17 15:14:10.648889 7f87c0431700 20 osd.0 11 _share_map_outgoing 0x2793b00 already has epoch 11
2013-07-17 15:14:10.655607 7f87c4c3a700 10 filestore(/var/lib/ceph/osd/ceph-0) _set_replay_guard 840.0.43 done

...

2013-07-17 15:14:21.408944 7f9595f68780 0 ceph version 0.66-673-g0ebf23c (0ebf23cee84180a0ae8b9fc0d8c2463ca31e6cbc), process ceph-osd, pid 14016

...
2013-07-17 15:14:21.724675 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) do_transaction on 0x16c4180
2013-07-17 15:14:21.724733 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) error opening file /var/lib/ceph/osd/ceph-0/current/3.0_head/plana7527305-271_head_3A673EB6__3 with flags=2: (2) No such file or directory
2013-07-17 15:14:21.724757 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) check_replay_guard 3.0_head 3a673eb6/plana7527305-271/head//3 dne
2013-07-17 15:14:21.724763 7f9595f68780 15 filestore(/var/lib/ceph/osd/ceph-0) write 3.0_head/3a673eb6/plana7527305-271/head//3 1593331~588713
2013-07-17 15:14:21.725360 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) write 3.0_head/3a673eb6/plana7527305-271/head//3 1593331~588713 = 588713
2013-07-17 15:14:21.725437 7f9595f68780 20 filestore(/var/lib/ceph/osd/ceph-0) _check_replay_guard no xattr
2013-07-17 15:14:21.725477 7f9595f68780 15 filestore(/var/lib/ceph/osd/ceph-0) setattrs 3.0_head/3a673eb6/plana7527305-271/head//3
2013-07-17 15:14:21.725531 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) setattrs 3.0_head/3a673eb6/plana7527305-271/head//3 = -2
2013-07-17 15:14:21.725543 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) tolerating ENOENT on replay
2013-07-17 15:14:21.725568 7f9595f68780 20 filestore(/var/lib/ceph/osd/ceph-0) _check_replay_guard no xattr
2013-07-17 15:14:21.725590 7f9595f68780 20 filestore(/var/lib/ceph/osd/ceph-0) fgetattrs 22 getting '
'
2013-07-17 15:14:21.725602 7f9595f68780 15 filestore(/var/lib/ceph/osd/ceph-0) setattrs 3.0_head/3a673eb6/plana7527305-271/head//3
2013-07-17 15:14:21.725637 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) setattrs 3.0_head/3a673eb6/plana7527305-271/head//3 = -2
2013-07-17 15:14:21.725647 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) tolerating ENOENT on replay
2013-07-17 15:14:21.729082 7f9595f68780 10 filestore(/var/lib/ceph/osd/ceph-0) _do_transaction on 0x16c4180
2013-07-17 15:14:21.729088 7f9595f68780 15 filestore(/var/lib/ceph/osd/ceph-0) _omap_setkeys meta/103a0d1e/pglog_3.0/0//-1
2013-07-17 15:14:21.729167 7f9595f68780 10 filestore hoid: 103a0d1e/pglog_3.0/0//-1 not skipping op, *spos 828.0.0
2013-07-17 15:14:21.729178 7f9595f68780 10 filestore > header.spos 0.0.0

ubuntu@teuthology:/a/dzafman-5154/70686

Splits need to do a sync and prevent replay of any operation prior to that point within the affected collections. Or something.
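
To make that concrete: the log above shows _set_replay_guard recording position 840.0.43 for the split, and later a replayed op at 828.0.0 not being skipped because the recorded guard for that object is still 0.0.0 ("not skipping op, *spos 828.0.0 > header.spos 0.0.0"), so the write recreates the object in the split collection. The sketch below illustrates the replay-guard idea only, under assumed names; SeqPos, replay_guard, set_replay_guard and check_replay_guard here are stand-ins, not the FileStore API.

// Illustrative sketch only -- not the Ceph FileStore implementation.
// Idea: when a collection is split, sync and persist a "guard" sequencer
// position with the collection; on journal replay, skip any transaction at
// or below that guard so pre-split operations cannot be re-applied.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

struct SeqPos {                       // hypothetical sequencer position (seq.trans.op)
  uint64_t seq, trans, op;
  bool operator<=(const SeqPos &o) const {
    return std::tie(seq, trans, op) <= std::tie(o.seq, o.trans, o.op);
  }
};

// guard recorded per collection (on disk this would live with the collection)
std::map<std::string, SeqPos> replay_guard;

// called once the split is durable on disk
void set_replay_guard(const std::string &coll, SeqPos spos) {
  // a real implementation would sync the collection before recording this
  replay_guard[coll] = spos;
}

// true if the replayed op at 'spos' must be applied,
// false if it predates the guard and must be skipped
bool check_replay_guard(const std::string &coll, SeqPos spos) {
  auto it = replay_guard.find(coll);
  if (it == replay_guard.end())
    return true;                      // no guard recorded -> op gets applied
  return !(spos <= it->second);       // skip anything at or below the guard
}

int main() {
  set_replay_guard("3.0_head", SeqPos{840, 0, 43});   // split completed here
  SeqPos pre_split{828, 0, 0};                        // older journal entry
  SeqPos post_split{841, 0, 0};
  std::cout << "apply pre-split op?  "
            << check_replay_guard("3.0_head", pre_split)  << "\n";  // 0: skipped
  std::cout << "apply post-split op? "
            << check_replay_guard("3.0_head", post_split) << "\n";  // 1: applied
  return 0;
}

With a guard durably recorded at 840.0.43 for 3.0_head, the pre-split op at 828.0.0 would be skipped on replay instead of resurrecting the object in the affected collection.
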

Actions #9

Updated by Samuel Just almost 11 years ago

  • Status changed from Need More Info to Resolved