Project

General

Profile

Bug #10474

Testing wip-9780-9781 hung ceph-objectstore-tool on import

Added by David Zafman about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --journal-path /var/lib/ceph/osd/ceph-2/journal --log-file=/var/log/ceph/objectstore_tool.$pid.log --op import --file /home/ubuntu/cephtest/data/exp.2.1.2

Thread 1 (Thread 0x7f7692bce900 (LWP 47823)):
#0 0x00007f769166e66b in pthread_join (threadid=140147075331840, thread_return=0x0) at pthread_join.c:92
#1 0x0000000000b89f52 in Thread::join (this=this@entry=0x46b4228, prval=prval@entry=0x0) at common/Thread.cc:140
#2 0x0000000000a7bf01 in FileJournal::stop_writer (this=this@entry=0x46b3c00) at os/FileJournal.cc:640
#3 0x0000000000a7d588 in FileJournal::close (this=0x46b3c00) at os/FileJournal.cc:557
#4 0x0000000000996a53 in JournalingObjectStore::journal_write_close (this=this@entry=0x475c000) at os/JournalingObjectStore.cc:30
#5 0x0000000000948c08 in FileStore::umount (this=0x475c000) at os/FileStore.cc:1639
#6 0x000000000064b511 in main (argc=<optimized out>, argv=<optimized out>) at tools/ceph_objectstore_tool.cc:3128

Associated revisions

Revision 07eadc45 (diff)
Added by David Zafman about 9 years ago

FileJournal: Fix hang in FileJournal::close()

On FileStore::umount() when stopping and in ENOSPC condition,
don't let write_thread_entry() get stuck in commit_cond.Wait(write_lock).

Fixes: #10474

Signed-off-by: David Zafman <>

History

#1 Updated by David Zafman about 9 years ago

This thread didn't notice that write_stop is set in ENOSPC handling:

Thread 5 (Thread 0x7f7688a43700 (LWP 47834)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x0000000000785a9e in Cond::Wait (this=0x46b41e8, mutex=...) at ./common/Cond.h:55
#2 0x0000000000a8a191 in FileJournal::write_thread_entry (this=0x46b3c00) at os/FileJournal.cc:1182
#3 0x000000000098385d in FileJournal::Writer::entry (this=<optimized out>) at os/FileJournal.h:337
#4 0x00007f769166d182 in start_thread (arg=0x7f7688a43700) at pthread_create.c:312
#5 0x00007f768d58338d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Mutex::Locker locker(write_lock);
uint64_t orig_ops = 0;
uint64_t orig_bytes = 0;
bufferlist bl;
int r = prepare_multi_write(bl, orig_ops, orig_bytes);
if (r == -ENOSPC) {
dout(20) << "write_thread_entry full, going to sleep (waiting for commit)" << dendl;
commit_cond.Wait(write_lock);

#2 Updated by David Zafman about 9 years ago

End of log shows entire umount sequence.

2015-01-06 15:38:59.771186 7f7692bce900 5 filestore(/var/lib/ceph/osd/ceph-2) umount /var/lib/ceph/osd/ceph-2
2015-01-06 15:38:59.771198 7f7692bce900 10 filestore(/var/lib/ceph/osd/ceph-2) do_force_sync
2015-01-06 15:38:59.771210 7f7689244700 20 filestore(/var/lib/ceph/osd/ceph-2) sync_entry force_sync set
2015-01-06 15:38:59.771224 7f7689244700 10 journal commit_start max_applied_seq 19118, open_ops 0
2015-01-06 15:38:59.771226 7f7689244700 10 journal commit_start blocked, all open_ops have completed
2015-01-06 15:38:59.771228 7f7689244700 10 journal commit_start committing 19118, still blocked
2015-01-06 15:38:59.771229 7f7689244700 10 journal commit_start
2015-01-06 15:38:59.771230 7f7689244700 1 journal FULL_FULL -> FULL_WAIT. commit_start on seq 19118 > journaled_seq 19099, moving to FULL_WAIT.
2015-01-06 15:38:59.771239 7f7689244700 15 filestore(/var/lib/ceph/osd/ceph-2) sync_entry committing 19118
2015-01-06 15:38:59.771267 7f7689244700 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) create_checkpoint: 'snap_19118'
2015-01-06 15:39:01.902574 7f7689244700 20 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) create_checkpoint: async snap create 'snap_19118' transid 293
2015-01-06 15:39:01.902930 7f7689244700 10 journal commit_started committing 19118, unblocking
2015-01-06 15:39:01.902940 7f7689244700 20 filestore(/var/lib/ceph/osd/ceph-2) waiting for checkpoint 293 to complete
2015-01-06 15:39:01.902942 7f7689244700 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) sync_checkpoint: transid 293 to complete
2015-01-06 15:39:01.916421 7f7689244700 20 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) sync_checkpoint: done waiting for transid 293
2015-01-06 15:39:01.916443 7f7689244700 20 filestore(/var/lib/ceph/osd/ceph-2) done waiting for checkpoint293 to complete
2015-01-06 15:39:01.916454 7f7689244700 10 filestore(/var/lib/ceph/osd/ceph-2) sync_entry commit took 2.145220, interval was 4.014677
2015-01-06 15:39:01.916458 7f7689244700 10 journal commit_finish thru 19118
2015-01-06 15:39:01.916460 7f7689244700 5 journal committed_thru 19118 (last_committed_seq 19064)
2015-01-06 15:39:01.916462 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19100 0 lat 2.745769
2015-01-06 15:39:01.916465 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19101 0 lat 2.704540
2015-01-06 15:39:01.916466 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19102 0 lat 2.677489
2015-01-06 15:39:01.916467 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19103 0 lat 2.650437
2015-01-06 15:39:01.916469 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19104 0 lat 2.626060
2015-01-06 15:39:01.916470 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19105 0 lat 2.574504
2015-01-06 15:39:01.916471 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19106 0 lat 2.557820
2015-01-06 15:39:01.916472 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19107 0 lat 2.553709
2015-01-06 15:39:01.916474 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19108 0 lat 2.549159
2015-01-06 15:39:01.916475 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19109 0 lat 2.469178
2015-01-06 15:39:01.916476 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19110 0 lat 2.453252
2015-01-06 15:39:01.916477 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19111 0 lat 2.424698
2015-01-06 15:39:01.916479 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19112 0 lat 2.403907
2015-01-06 15:39:01.916480 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19113 0 lat 2.342278
2015-01-06 15:39:01.916481 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19114 0 lat 2.338337
2015-01-06 15:39:01.916482 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19115 0 lat 2.299326
2015-01-06 15:39:01.916483 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19116 0 lat 2.199048
2015-01-06 15:39:01.916484 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19117 0 lat 2.185321
2015-01-06 15:39:01.916485 7f7689244700 10 journal queue_completions_thru seq 19118 queueing seq 19118 0 lat 2.167092
2015-01-06 15:39:01.916490 7f7689244700 10 journal header: block_size 4096 alignment 4096 max_size 104857600
2015-01-06 15:39:01.916492 7f7689244700 10 journal header: start 69275648
2015-01-06 15:39:01.916492 7f7689244700 10 journal write_pos 69275648
2015-01-06 15:39:01.916494 7f7689244700 15 journal dropping committed but unwritten seq 19116 len 3471224
2015-01-06 15:39:01.916496 7f7689244700 5 journal put_throttle finished 1 ops and 3471224 bytes, now 0 ops and 0 bytes
2015-01-06 15:39:01.916506 7f7689244700 15 journal dropping committed but unwritten seq 19117 len 2001867
2015-01-06 15:39:01.916507 7f7689244700 5 journal put_throttle finished 1 ops and 2001867 bytes, now 0 ops and 0 bytes
2015-01-06 15:39:01.916518 7f7689244700 15 journal dropping committed but unwritten seq 19118 len 617015
2015-01-06 15:39:01.916519 7f7689244700 5 journal put_throttle finished 1 ops and 617015 bytes, now 0 ops and 0 bytes
2015-01-06 15:39:01.918380 7f7689244700 10 journal committed_thru done
2015-01-06 15:39:01.918401 7f7688a43700 20 journal write_thread_entry woke up
2015-01-06 15:39:01.918413 7f7688a43700 20 journal write_thread_entry full, going to sleep (waiting for commit)
2015-01-06 15:39:01.918447 7f7689244700 10 filestore(/var/lib/ceph/osd/ceph-2) removing snap 'snap_19063'
2015-01-06 15:39:01.918452 7f7689244700 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) destroy_checkpoint: 'snap_19063'
2015-01-06 15:39:01.918711 7f7689244700 15 filestore(/var/lib/ceph/osd/ceph-2) sync_entry committed to op_seq 19118
2015-01-06 15:39:01.918987 7f7692bce900 10 journal journal_stop
2015-01-06 15:39:01.919047 7f7692bce900 1 journal close /var/lib/ceph/osd/ceph-2/journal

#3 Updated by David Zafman about 9 years ago

This is a caused the journal's write thread waiting for full_state to go to FULL_NOTFULL, at the same time that only the header needs to be written. In this case simply letting the header get written by do_write(), so that the empty queue and need_write_header == false will allow the thread to stop.

#4 Updated by David Zafman about 9 years ago

  • Status changed from 12 to Fix Under Review

#5 Updated by David Zafman about 9 years ago

  • Status changed from Fix Under Review to 7

#6 Updated by David Zafman about 9 years ago

  • Status changed from 7 to Fix Under Review

#7 Updated by Sage Weil about 9 years ago

  • Status changed from Fix Under Review to Resolved

Also available in: Atom PDF