Bug #1185

closed

rados: export caught in loop on 'buck' bucket (1.5M objects)

Added by Sage Weil almost 13 years ago. Updated almost 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: librados
% Done: 0%

Description

Dumped an object list, watched strace, and periodically checked the current file/object name against the list. It did not appear to be making progress; it looks like it's reiterating over the same block somewhere around object 1.1M.

this is on raid3136, bucket buck.
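
The manual check described above (comparing the object name currently seen in strace against the dumped list) could be scripted roughly as follows. This is only a sketch under assumptions: the dumped list is taken to be an ordered list of object names, the observed names are assumed to be sampled by hand or by some other tool, and `progress_positions` and `looks_stuck` are hypothetical helper names, not part of rados or librados.

```python
def progress_positions(object_list, observed_names):
    """Map each observed object name to its index in the dumped listing.

    Names that do not occur in the listing map to None.
    """
    index = {name: i for i, name in enumerate(object_list)}
    return [index.get(name) for name in observed_names]


def looks_stuck(positions, window=3):
    """Heuristic: True if the last `window` known positions never advanced
    past the highest position seen before them.

    Assumption: a healthy export visits objects in roughly the order of the
    dumped listing, so its position should keep growing over time.
    """
    known = [p for p in positions if p is not None]
    if len(known) <= window:
        return False  # not enough samples to judge
    earlier_max = max(known[:-window])
    return max(known[-window:]) <= earlier_max
```

For example, samples that map to positions 1, 5, 6, 4, 5, 6 would be flagged as stuck, because the last three never get past position 6. Samples at 1, 3, 5, 7, 9 would not.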

Actions #1

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.30 to v0.31
Actions #2

Updated by Sage Weil almost 13 years ago

  • Position set to 693
Actions #3

Updated by Sage Weil almost 13 years ago

  • Story points set to 5
  • Position deleted (699)
  • Position set to 699
Actions #4

Updated by Sage Weil almost 13 years ago

  • Assignee deleted (Sage Weil)
Actions #5

Updated by Sage Weil almost 13 years ago

trying to reproduce this (with logs) and having a hard time. :/

cd /mnt/backup/dhobjects
rados -n client.dhobackup01 export --delete-after buck /mnt/backup/dhobjects/b/u/c/buck --log-file buck2.log --debug-ms 1 --debug-objecter 20 --log-to-stderr 0 &

on raid3136

Actions #6

Updated by Colin McCabe almost 13 years ago

This is something where a core file or a backtrace would be really, really helpful. I reviewed the code in librados::ObjectIterator and in rados_sync, and although they could use some optimization, there is nothing obviously wrong there.

Was someone else performing operations on the pool while this happened? One thing that I don't think we've tested very much is one user performing adds and deletes on a rados pool while another user lists the objects in that pool.

Actions #7

Updated by Sage Weil almost 13 years ago

The original process is still running (but suspended). Unfortunately the binary is an old build so there are no debug symbols, making it hard to make much sense of in gdb. I was able to tell from strace -p that it's caught in a loop, but it's difficult to get much more out of it.

I've run it a few more times and still can't reproduce. I think you're right that the thing to do is write a test that tests listing large buckets with concurrent modifications...
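
A rough idea of the shape such a test might take, sketched against an in-memory stand-in rather than a real cluster. Everything here is an assumption for illustration: `ToyPool`, `list_page`, `make_churn` and `list_with_churn` are hypothetical names, and the cursor-based paging is a simplified model, not how librados::ObjectIterator actually works. The test checks three properties: the listing terminates, no object is returned twice, and every object that existed for the whole run is returned.

```python
import random


class ToyPool:
    """In-memory stand-in for a pool. NOT librados; illustration only."""

    def __init__(self, names):
        self.names = set(names)

    def list_page(self, cursor, page_size):
        """Return up to page_size names strictly greater than cursor, sorted."""
        candidates = (n for n in self.names if cursor is None or n > cursor)
        return sorted(candidates)[:page_size]


def make_churn(rng, slots):
    """Build a mutator that toggles one odd-numbered object per call.

    Odd-numbered names interleave with the even-numbered stable objects,
    so adds and deletes land both behind and ahead of the listing cursor.
    """
    def mutate(pool):
        name = "obj-%05d" % (2 * rng.randrange(slots) + 1)
        if name in pool.names:
            pool.names.discard(name)
        else:
            pool.names.add(name)
    return mutate


def list_with_churn(pool, page_size, mutate, max_pages):
    """List the whole pool page by page, mutating it between pages.

    Raises RuntimeError if the listing has not finished after max_pages,
    which is how a listing caught in a loop would show up in this model.
    """
    seen = []
    cursor = None
    for _ in range(max_pages):
        page = pool.list_page(cursor, page_size)
        if not page:
            return seen
        seen.extend(page)
        cursor = page[-1]
        mutate(pool)  # simulated concurrent writer
    raise RuntimeError("listing did not terminate within max_pages")
```

A real version would replace `ToyPool` with a pool on a test cluster and run the modifications from a second client, but the assertions would stay the same.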

Actions #8

Updated by Colin McCabe almost 13 years ago

See #1258

Actions #9

Updated by Sage Weil almost 13 years ago

  • Target version changed from v0.31 to v0.32
Actions #10

Updated by Sage Weil almost 13 years ago

  • Position deleted (708)
  • Position set to 726
Actions #11

Updated by Sage Weil almost 13 years ago

Still having trouble hitting this. Running in a loop without any debugging to see if I can trigger it.

Actions #12

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Can't reproduce

no luck.
