Project

General

Profile

Bug #5163

filestore: ENOTEMPTY on object removal

Added by Mike Lowe almost 11 years ago. Updated almost 11 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had an osd crash during normal opperation, this could possibly be related to 4927. I was able to restart the osd and continue running. The underlying xfs filesystem appears to be normal and dmesg is clean.

May 24 10:14:37 gwioss1 ceph-osd:    -22> 2013-05-24 10:14:36.868976 7ff0fdbc2700  0 filestore(/data/osd.9)  error (39) Directory not empty not handled on operation 13 (28835393.1.0, or op 0, counting from 0)
May 24 10:14:37 gwioss1 ceph-osd: -21> 2013-05-24 10:14:36.884334 7ff0fdbc2700 0 filestore(/data/osd.9) ENOTEMPTY suggests garbage data in osd data dir
May 24 10:14:37 gwioss1 ceph-osd: -20> 2013-05-24 10:14:36.884340 7ff0fdbc2700 0 filestore(/data/osd.9) transaction dump:#012{ "ops": [#012 { "op_num": 0,#012 "op_name": "remove",#012 "collection": "2.363_head",#012 "oid": "8416ff63\/rb.0.2025.2ae8944a.000000005e78\/head\/\/2"}]}
May 24 10:14:37 gwioss1 ceph-osd: 0> 2013-05-24 10:14:37.095302 7ff0fdbc2700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7ff0fdbc2700 time 2013-05-24 10:14:36.911403#012os/FileStore.cc: 2697: FAILED assert(0 == "unexpected error")#012#012 ceph version 0.61-4-ge99daaf (e99daaf76c63377a03c1078f2a22b31b1f617304)#012 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x94c) [0x804fec]#012 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x69) [0x808479]#012 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x17e) [0x80860e]#012 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8e646a]#012 5: (ThreadPool::WorkThread::entry()+0x10) [0x8e7710]#012 6: (()+0x7e9a) [0x7ff107dc5e9a]#012 7: (clone()+0x6d) [0x7ff105efbccd]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by Mike Lowe almost 11 years ago

The object in question was part of an rbd image that a vm was doing a fstrim on when the crash happened.

#2 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Priority changed from High to Urgent

This happened when running rados_python test on the arm test setup.


2013-05-31 16:13:02.587030 af3ff420 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread af3ff420 time 2013-05-31 16:13:02.582218
os/FileStore.cc: 2698: FAILED assert(0 == "unexpected error")

 ceph version 0.61.2-30-g131dca7 (131dca7d10d5b0b7dca5c5cbe9b0b54938805015)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long long, int)+0x639) [0x337926]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long long, ThreadPool::TPHandle*)+0x43) [0x3396d4]
 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x10f) [0x3397fc]
 4: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*, ThreadPool::TPHandle&)+0x7) [0x33aa44]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x3c3) [0x3bee88]
 6: (ThreadPool::WorkThread::entry()+0x9) [0x3bf9ea]
 7: (Thread::_entry_func(void*)+0x7) [0x3b92f4]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

...
...

2013-05-31 16:13:02.833444 a2cff420  0 -- 10.214.143.10:6801/19670 >> 10.214.143.10:6804/19671 pipe(0xac355b28 sd=34 :57310 s=2 pgs=1 cs=1 l=0).fault with nothing to send, going to standby
2013-05-31 16:13:02.833813 a25ff420  0 -- 10.214.143.10:0/19670 >> 10.214.143.10:6805/19671 pipe(0xb0d104d8 sd=30 :0 s=1 pgs=0 cs=0 l=1).fault
2013-05-31 16:13:02.846528 af3ff420 -1 *** Caught signal (Aborted) **
 in thread af3ff420

 ceph version 0.61.2-30-g131dca7 (131dca7d10d5b0b7dca5c5cbe9b0b54938805015)
 1: ceph-osd() [0x35bcee]
 2: (__default_sa_restorer_v2()+0) [0xb69e49d0]
 3: (()+0x171e6) [0xb69d61e6]
 4: (gsignal()+0x29) [0xb69e3db2]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

tamil@ubuntu:~/arm/teuthology$ cat rados.yaml 
roles:
- [mon.a, mds.a, osd.0, osd.1]
- [mon.b, mon.c, osd.2, osd.3, client.0]

targets:
  ubuntu@tala007.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC77W/bX7Lxw7CYYGH0Qef09nebdo+RKD07QliMGVogjb/UItyn31uMTh3yOLm1VDD5jtAlDRNRNNyDyqYp4MFKyLJAOHIMdBQV+Q8nGd4wjpMkh2/aCPKEO73l64+vVKnNLWLlI3Lrgp1OTrz3r9Fk7pRDJb5K7EmYKacx4UeEENch9pr/f/FE8222e5aoLXJnifz9SrSI2ItqZlgNgGPOXCdvlEaefIaDU7MIlEBP+21jjJHzDlFWoQW0umL9JF6nXbH+dxQlbLNE9mOI+c8YnsolStRAMqziXOhmsK4cn4DrzcY48vuIOFQVjhwsuqE+f8oEGIkXv6OWRFo9JziH
  ubuntu@tala010.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDAPzBE6fYTGlFJJ5fkmwsqmsIB3xs+ukNywGc048bSzrCghBQ5BKjVHh/cWt2AX5RfP5naZCCnOLftWQs/V+oC36J7pBzAt2hwDq88EWXRegEaiTjwjSzgBffI8wFeKJ6BVMMfMk0yXkuQYXiwXUnR5X57Gfar+vw8bIxDqiJE6i9JQ3Qo/exmTZMsPGLfAMUE7xmQqwhBpaUhA5BbSnekI70LshHQETNUtE3C1wDNziZ+C5X6CeiHGUUdTHwmdfFpsHT1/Wfnf0UMJi8QSA00D5Tb4IUXvTGtFZD+1zU9f76Zkb8Ff9cELtr/2KmVn1YogpZWkKYNkHAi41K5Ubu/

tasks:
#- chef:
- install:
     branch: cuttlefish
- ceph:
- workunit:
    clients:
      client.0:
        - rados/test_python.sh

#3 Updated by Tamilarasi muthamizhan almost 11 years ago

The teuthology logs are copied to :/home/ubuntu/bug5163/testing_logs_rados_python

#4 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Need More Info

Tamil- Yours looks like a different (and easier) bug. what was the workload? It appears to just be a bad trucnation value

{ "ops": [
        { "op_num": 0,
          "op_name": "truncate",
          "collection": "17.3_head",
          "oid": "251e4793\/abc\/head\/\/17",
          "offset": 13152137403151941634},
        { "op_num": 1,
          "op_name": "setattr",
          "collection": "17.3_head",
          "oid": "251e4793\/abc\/head\/\/17",
          "name": "_",
          "length": 206},
        { "op_num": 2,
          "op_name": "setattr",
          "collection": "17.3_head",
          "oid": "251e4793\/abc\/head\/\/17",
          "name": "snapset",
          "length": 31}]}

Mike- do you have the full log with more line sleading up to the crash? usually it tells you what the error code was and dumps the transaction.

thanks!

#5 Updated by Sage Weil almost 11 years ago

Tamil- I see, it's the python rados tests. Is this reproducible?

#6 Updated by Sage Weil almost 11 years ago

  • Subject changed from OSD Crash to filestore: ENOTEMPTY on object removal
  • Source changed from other to Community (user)

moved tamil's issue to #5233. and mike, i see the output now, but it doesn't make much sense. a more complete log would be great, thanks!

#7 Updated by Samuel Just almost 11 years ago

Can we get a recursive ls of 2.363_head on that osd?

#8 Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High

#9 Updated by Samuel Just almost 11 years ago

  • Status changed from Need More Info to Can't reproduce

Also available in: Atom PDF