Project

General

Profile

Actions

Bug #3698

closed

filestore: ENOENT on clone

Added by Josh Durgin over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Full logs in teuthology:/a/joshd-3631-12-27-12_22.21/29826

   -21> 2012-12-29 01:09:41.660730 7f1e33fc0700  0 filestore(/tmp/cephtest/data/osd.2.data)  error (2) No such file or directory not handled on operation 30 (17262.0.9, or op 9, counting from 0)
   -20> 2012-12-29 01:09:41.660752 7f1e33fc0700  0 filestore(/tmp/cephtest/data/osd.2.data) ENOENT on clone suggests osd bug
   -19> 2012-12-29 01:09:41.660756 7f1e33fc0700  0 filestore(/tmp/cephtest/data/osd.2.data)  transaction dump:
{ "ops": [
        { "op_num": 0,
          "op_name": "remove",
          "collection": "0.f_head",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 1,
          "op_name": "remove",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 2,
          "op_name": "touch",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 3,
          "op_name": "omap_setheader",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0",
          "header_length": "0"},
        { "op_num": 4,
          "op_name": "omap_setkeys",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0",
          "attr_lens": {}},
        { "op_num": 5,
          "op_name": "setattrs",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0",
          "attr_lens": { "_": 231}},
        { "op_num": 6,
          "op_name": "remove",
          "collection": "0.f_head",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 7,
          "op_name": "collection_add",
          "src_collection": "0.f_head",
          "dst_collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 8,
          "op_name": "collection_remove",
          "collection": "0.f_TEMP",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 9,
          "op_name": "clonerange2",
          "collection": "0.f_head",
          "src_oid": "6d8c2d0f\/plana4621311-45\/head\/\/0",
          "dst_oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0",
          "src_offset": 0,
          "len": 2107129,
          "dst_offset": 0},
        { "op_num": 10,
          "op_name": "collection_setattr",
          "collection": "0.f_head",
          "name": "info",
          "length": 5},
        { "op_num": 11,
          "op_name": "truncate",
          "collection": "meta",
          "oid": "dae1b753\/pginfo_0.f\/0\/\/-1",
          "offset": 0},
        { "op_num": 12,
          "op_name": "write",
          "collection": "meta",
          "oid": "dae1b753\/pginfo_0.f\/0\/\/-1",
          "length": 1650,
          "offset": 0,
          "bufferlist length": 1650},
        { "op_num": 13,
          "op_name": "mkcoll",
          "collection": "0.f_3df"},
        { "op_num": 14,
          "op_name": "collection_add",
          "src_collection": "0.f_3df",
          "dst_collection": "0.f_head",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 15,
          "op_name": "collection_add",
          "src_collection": "0.f_3d2",
          "dst_collection": "0.f_head",
          "oid": "6d8c2d0f\/plana4621311-45\/3df\/\/0"},
        { "op_num": 16,
          "op_name": "collection_setattr",
          "collection": "0.f_head",
          "name": "info",
          "length": 5},
        { "op_num": 17,
          "op_name": "truncate",
          "collection": "meta",
          "oid": "dae1b753\/pginfo_0.f\/0\/\/-1",
          "offset": 0},
        { "op_num": 18,
          "op_name": "write",
          "collection": "meta",
          "oid": "dae1b753\/pginfo_0.f\/0\/\/-1",
          "length": 1650,
          "offset": 0,
          "bufferlist length": 1650}]}
   -18> 2012-12-29 01:09:41.674379 7f1e33fc0700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7f1e33fc0700 time 2012-12-29 01:09:41.661118
os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")

 ceph version 0.55.1-369-g942c714 (942c71454b11d3bfb09ee4abe65f1e40936d3bf6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x11acbb1]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x2cf3) [0x104fe3b]
 3: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0xe8) [0x104aee2]
 4: (FileStore::_do_op(FileStore::OpSequencer*)+0x211) [0x1049727]
 5: (FileStore::OpWQ::_process(FileStore::OpSequencer*)+0x27) [0x1069907]
 6: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*)+0x2e) [0x106fab6]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x700) [0x11a1e78]
 8: (ThreadPool::WorkThread::entry()+0x23) [0x11a3ac3]
 9: (Thread::_entry_func(void*)+0x23) [0x119b399]
 10: (()+0x7e9a) [0x7f1e3d950e9a]
 11: (clone()+0x6d) [0x7f1e3badb4bd]

The job that produced this (once in 200 runs):

nuke-on-error: true
overrides:
  ceph:
    branch: next
    conf:
      client:
        debug ms: 1
        debug objecter: 20
        debug rados: 20
      global:
        ms inject socket failures: 5000
    fs: xfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
      write: 100
    ops: 4000
Actions #1

Updated by Josh Durgin over 11 years ago

Happened again in teuthology:/a/joshd-3631-12-28-12_08.53/30681

Actions #2

Updated by Sage Weil over 11 years ago

Can you add 'debug osd = 20' so the job you're running?

Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12
Actions #4

Updated by Ian Colle over 11 years ago

  • Assignee set to Sage Weil
Actions #5

Updated by Tamilarasi muthamizhan over 11 years ago

recent log: ubuntu@teuthology:/a/teuthology-2013-01-01_19:00:03-regression-next-testing-basic/33152

Actions #6

Updated by Sage Weil over 11 years ago

another instance with logs: ubuntu@teuthology:/a/sage-a2/33879

Actions #7

Updated by Sage Weil over 11 years ago

  • Assignee changed from Sage Weil to Samuel Just
Actions #8

Updated by Samuel Just over 11 years ago

Ok. The recovery_qos stuff can allow a client op to reorder past a push. This is a problem since the push might be using the target of the client op to clone from. In this case:

1) Primary sends a push for snap N which clones from HEAD
2) Client deletes HEAD
3) Primary sends sends sub_op for the delete
4) Replica processes the delete
5) Replica processes the push and finds that the head is missing.

Working on solution.

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to In Progress
Actions #10

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF