Bug #1185

rados: export caught in loop on 'buck' bucket (1.5M objects)

Added by Sage Weil over 8 years ago. Updated about 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: librados
Target version:
Start date: 06/13/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I dumped an object list, watched strace, and periodically checked the current file/object name against the list; it did not appear to be making progress. It looks like it's reiterating over the same block somewhere around object 1.1M.

this is on raid3136, bucket buck.
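
The progress check described above can be sketched roughly as follows. This is only an illustration, assuming the dumped object list and the names observed via strace are both available as plain lists of strings; the helper names `first_repeat` and `progress_positions` are hypothetical and are not part of rados or librados.

```python
def first_repeat(observed_names):
    """Return (name, first_index, repeat_index) for the first name that is
    observed twice, or None if no name repeats."""
    seen = {}
    for i, name in enumerate(observed_names):
        if name in seen:
            return name, seen[name], i
        seen[name] = i
    return None


def progress_positions(object_list, observed_names):
    """Map each observed name to its position in the dumped object list.
    Names that are not in the list map to None."""
    index = {name: pos for pos, name in enumerate(object_list)}
    return [index.get(name) for name in observed_names]


if __name__ == "__main__":
    # Made-up names purely for illustration.
    object_list = [f"obj{i}" for i in range(10)]
    observed = ["obj3", "obj4", "obj5", "obj3", "obj4"]
    print(first_repeat(observed))                     # ('obj3', 0, 3)
    print(progress_positions(object_list, observed))  # [3, 4, 5, 3, 4]
```

If the positions stop increasing and start cycling through the same range, the export is probably revisiting the same block rather than advancing.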

History

#1 Updated by Sage Weil about 8 years ago

  • Target version changed from v0.30 to v0.31

#2 Updated by Sage Weil about 8 years ago

  • Position set to 693

#3 Updated by Sage Weil about 8 years ago

  • Story points set to 5
  • Position deleted (699)
  • Position set to 699

#4 Updated by Sage Weil about 8 years ago

  • Assignee deleted (Sage Weil)

#5 Updated by Sage Weil about 8 years ago

trying to reproduce this (with logs) and having a hard time. :/

cd /mnt/backup/dhobjects
rados -n client.dhobackup01 export --delete-after buck /mnt/backup/dhobjects/b/u/c/buck --log-file buck2.log --debug-ms 1 --debug-objecter 20 --log-to-stderr 0 &

on raid3136

#6 Updated by Colin McCabe about 8 years ago

This is something where a core file or a backtrace would be really, really helpful. I reviewed the code in librados::ObjectIterator and in rados_sync, and although they could use some optimization, there is nothing obviously wrong there.

Was someone else performing operations on the pool while this happened? One thing that I don't think we've tested very much is one user performing adds and deletes on a rados pool while another user lists the objects in that pool.

#7 Updated by Sage Weil about 8 years ago

The original process is still running (but suspended). Unfortunately the binary is an old build with no debug symbols, so it's hard to make much sense of it in gdb. I was able to tell from strace -p that it's caught in a loop, but it's difficult to get much more out of it.

I've run it a few more times and still can't reproduce. I think you're right that the thing to do is write a test that lists large buckets with concurrent modifications...
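
A rough sketch of what such a test might check, using an in-memory stand-in instead of a real pool. Everything here is an assumption made for illustration: `FakePool` and `list_with_churn` are hypothetical, and the name-ordered cursor does not model how librados actually iterates over objects. The two properties checked are that the listing terminates, and that every object which is never modified is returned exactly once.

```python
import bisect
import random


class FakePool:
    """In-memory stand-in for a pool: sorted names, paged listing by cursor."""

    def __init__(self, names):
        self.names = sorted(set(names))

    def list_page(self, cursor, page_size):
        """Return up to page_size names strictly greater than cursor."""
        start = 0 if cursor is None else bisect.bisect_right(self.names, cursor)
        return self.names[start:start + page_size]

    def add(self, name):
        i = bisect.bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            self.names.insert(i, name)

    def remove(self, name):
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            del self.names[i]


def list_with_churn(pool, page_size, churn_names, rng, max_pages):
    """List the whole pool page by page, adding and removing churn objects
    between pages. Raises RuntimeError if the listing does not finish
    within max_pages pages."""
    seen = []
    cursor = None
    for _ in range(max_pages):
        page = pool.list_page(cursor, page_size)
        if not page:
            return seen
        seen.extend(page)
        cursor = page[-1]
        # Simulate another client modifying the pool while we list it.
        for name in rng.sample(churn_names, k=min(10, len(churn_names))):
            if rng.random() < 0.5:
                pool.add(name)
            else:
                pool.remove(name)
    raise RuntimeError("listing did not terminate")


if __name__ == "__main__":
    stable = [f"{i:05d}-s" for i in range(1000)]    # never modified
    churn = [f"{i * 5:05d}-c" for i in range(200)]  # added/removed during listing
    pool = FakePool(stable + churn[::2])
    seen = list_with_churn(pool, 50, churn, random.Random(0), max_pages=5000)
    assert [n for n in seen if n.endswith("-s")] == stable
    assert len(seen) == len(set(seen))
```

A real test would replace `FakePool` with an actual pool and run the modifications from a second client, but the same two assertions would still apply.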

#8 Updated by Colin McCabe about 8 years ago

See #1258

#9 Updated by Sage Weil about 8 years ago

  • Target version changed from v0.31 to v0.32

#10 Updated by Sage Weil about 8 years ago

  • Position deleted (708)
  • Position set to 726

#11 Updated by Sage Weil about 8 years ago

Still having trouble hitting this. Running in a loop without any debugging to see if I can trigger it.

#12 Updated by Sage Weil about 8 years ago

  • Status changed from New to Can't reproduce

no luck.
