Bug #674

closed

tiobench stress test, OSD timeout

Added by changping Wu over 13 years ago. Updated about 13 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

Hi,
We ran a multi-threaded stress test against ceph 0.23.1, and the ceph client printed OSD timeout messages.

1. test tool: tiobench-0.3.3; test command:
./tiobench-0.3.3/tiotest -t 4 -f 4096 -r 1000 -b 131072 -d /mnt/ceph -T

2. ceph server version: 0.23.1
3. ceph client: cloned from git://ceph.newdream.net/git/ceph-client-standalone.git, unstable-backport branch
4. ceph server host OS: Ubuntu 10.04 Server x86_64, kernel 2.6.32-21-server
5. ceph client host OS: Ubuntu 10.04 Server x86_64, kernel 2.6.32-21-server
6. ceph config: ceph-none; two OSD servers (osd0 and osd1), one MDS server, and one MON server

===========================================
1. tiobench log:

===================================================================================

Run #1: ./tiobench-0.3.3/tiotest -t 4 -f 4096 -r 1000 -b 131072 -d /mnt/ceph -T

Unit information

================

File size = megabytes

Blk Size = bytes

Rate = megabytes per second

CPU% = percentage of CPU used during the test

Latency = milliseconds

Lat% = percent of requests that took longer than X seconds

CPU Eff = Rate divided by CPU% - throughput per cpu load

File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----

Sequential Reads

2.6.32-21-server 16384 4096 4 66.62 18.81% 0.227 251.11 0.00000 0.00000 354

2.6.32-21-server 16384 8192 4 65.77 16.16% 0.452 458.29 0.00000 0.00000 407

2.6.32-21-server 16384 16384 4 63.92 16.11% 0.968 559.74 0.00000 0.00000 397

2.6.32-21-server 16384 32768 4 64.61 14.39% 1.815 1046.43 0.00000 0.00000 449

2.6.32-21-server 16384 65536 4 66.26 15.32% 3.765 837.34 0.00000 0.00000 432

2.6.32-21-server 16384 131072 4 64.53 14.43% 7.566 1349.89 0.00000 0.00000 447

Random Reads

2.6.32-21-server 16384 4096 4 0.75 1.108% 20.405 106.68 0.00000 0.00000 68

2.6.32-21-server 16384 8192 4 1.66 0.797% 17.845 109.55 0.00000 0.00000 208

2.6.32-21-server 16384 16384 4 2.90 1.812% 20.977 333.91 0.00000 0.00000 160

2.6.32-21-server 16384 32768 4 6.25 1.849% 18.955 279.49 0.00000 0.00000 338

2.6.32-21-server 16384 65536 4 11.87 3.702% 20.782 439.90 0.00000 0.00000 321

2.6.32-21-server 16384 131072 4 22.14 6.199% 22.256 367.77 0.00000 0.00000 357

Sequential Writes

2.6.32-21-server 16384 4096 4 17.18 7.038% 0.862 14455.88 0.01473 0.00026 244

2.6.32-21-server 16384 8192 4 16.09 5.656% 1.831 18047.37 0.02990 0.00076 284

2.6.32-21-server 16384 16384 4 16.19 5.111% 3.671 13165.22 0.06485 0.00057 317

2.6.32-21-server 16384 32768 4 13.84 4.163% 8.497 24246.45 0.14687 0.00782 332

2.6.32-21-server 16384 65536 4 13.27 3.808% 17.584 24053.48 0.29259 0.01717 348

2.6.32-21-server 16384 131072 4 13.45 3.814% 35.383 23944.59 0.62485 0.03281 353

Random Writes

2.6.32-21-server 16384 4096 4 0.60 0.880% 0.220 330.88 0.00000 0.00000 68

2.6.32-21-server 16384 8192 4 1.00 0.669% 0.945 3338.01 0.02500 0.00000 149

2.6.32-21-server 16384 16384 4 1.41 0.699% 0.853 2497.44 0.02500 0.00000 202

2.6.32-21-server 16384 32768 4 2.61 0.814% 0.264 493.69 0.00000 0.00000 321

2.6.32-21-server 16384 65536 4 0.77 0.353% 1.176 1520.99 0.00000 0.00000 217

2.6.32-21-server 16384 131072 4 1.44 0.610% 0.838 1001.35 0.00000 0.00000 236

2. dmesg log:

========================================================

..............................................

..............................................

[75004.800179] INFO: task tiotest:2337 blocked for more than 120 seconds.

[75004.800213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[75004.800240] tiotest D 00000000ffffffff 0 2337 2180 0x00000008

[75004.800243] ffff880116257cc8 0000000000000082 0000000000015bc0 0000000000015bc0

[75004.800246] ffff880115825f80 ffff880116257fd8 0000000000015bc0 ffff880115825bc0

[75004.800249] 0000000000015bc0 ffff880116257fd8 0000000000015bc0 ffff880115825f80

[75004.800252] Call Trace:

[75004.800254] [<ffffffff810f3460>] ? sync_page+0x0/0x50

[75004.800257] [<ffffffff815555c7>] io_schedule+0x47/0x70

[75004.800259] [<ffffffff810f349d>] sync_page+0x3d/0x50

[75004.800262] [<ffffffff81555bef>] __wait_on_bit+0x5f/0x90

[75004.800264] [<ffffffff810f3653>] wait_on_page_bit+0x73/0x80

[75004.800267] [<ffffffff81084fe0>] ? wake_bit_function+0x0/0x40

[75004.800269] [<ffffffff810fd9f5>] ? pagevec_lookup_tag+0x25/0x40

[75004.800272] [<ffffffff810f3ae5>] wait_on_page_writeback_range+0xf5/0x190

[75004.800275] [<ffffffff810f3cb8>] filemap_write_and_wait_range+0x78/0x90

[75004.800277] [<ffffffff8116a62e>] vfs_fsync_range+0x7e/0xe0

[75004.800280] [<ffffffff8116a6fd>] vfs_fsync+0x1d/0x20

[75004.800282] [<ffffffff8116a73e>] do_fsync+0x3e/0x60

[75004.800284] [<ffffffff8116a790>] sys_fsync+0x10/0x20

[75004.800286] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

[75058.860051] libceph: tid 232244 timed out on osd0, will reset osd

[75058.861555] libceph: tid 232902 timed out on osd1, will reset osd

[75118.860035] libceph: tid 232596 timed out on osd0, will reset osd

[75118.860925] libceph: tid 233897 timed out on osd1, will reset osd

[75178.860021] libceph: tid 233467 timed out on osd0, will reset osd

[75846.350031] libceph: tid 280625 timed out on osd0, will reset osd

[75911.350030] libceph: tid 280786 timed out on osd0, will reset osd

[75986.350025] libceph: tid 281143 timed out on osd0, will reset osd

[76046.350037] libceph: tid 281253 timed out on osd0, will reset osd

[76106.350039] libceph: tid 281459 timed out on osd0, will reset osd

[76166.350038] libceph: tid 281661 timed out on osd0, will reset osd

[76226.350038] libceph: tid 281859 timed out on osd0, will reset osd

[76286.360026] libceph: tid 282082 timed out on osd0, will reset osd

[76346.360029] libceph: tid 282286 timed out on osd0, will reset osd

[76406.360025] libceph: tid 282478 timed out on osd0, will reset osd

[76466.360033] libceph: tid 282702 timed out on osd0, will reset osd

[76526.360029] libceph: tid 282899 timed out on osd0, will reset osd

[76586.360035] libceph: tid 283090 timed out on osd0, will reset osd

[76646.360033] libceph: tid 283304 timed out on osd0, will reset osd

[76754.770199] libceph: tid 283819 timed out on osd0, will reset osd

[76754.773797] libceph: tid 284035 timed out on osd1, will reset osd

[76814.770080] libceph: tid 284624 timed out on osd0, will reset osd

[76814.772596] libceph: tid 285401 timed out on osd1, will reset osd

[76874.770052] libceph: tid 285191 timed out on osd0, will reset osd

[76874.771749] libceph: tid 286271 timed out on osd1, will reset osd

[76934.770035] libceph: tid 285730 timed out on osd0, will reset osd

[76934.771057] libceph: tid 287155 timed out on osd1, will reset osd

[76994.770023] libceph: tid 286576 timed out on osd0, will reset osd

[77612.270030] libceph: tid 328771 timed out on osd0, will reset osd

==========

The test tool is attached; example test command: ./tiobench.sh /mnt/ceph 4096


Files

tiobench-0.3.3.tgz (827 KB) tiobench-0.3.3.tgz changping Wu, 12/27/2010 01:45 AM
tiotest_osd_timeout.txt (8.62 KB) tiotest_osd_timeout.txt changping Wu, 12/27/2010 01:45 AM
tiotest (32.6 KB) tiotest changping Wu, 01/03/2011 06:09 PM
Actions #1

Updated by changping Wu over 13 years ago

The kernel printed many "timed out" messages, and then the tiobench test failed.

Actions #2

Updated by changping Wu over 13 years ago

Pulled the latest ceph-client-standalone, master-backport branch, and ran:

dd if=/dev/zero of=/mnt/ceph bs=4k count=500000

The client also printed OSD timeout messages. Logs:

=================================
dmesg:

...........................
............................
[118468.643817] PM: resume of drv:usbhid dev:3-1:1.0 complete after 1403.326 msecs
[118468.643825] PM: resume of drv: dev:ep_81 complete after 1108.365 msecs
[118472.630009] ata1: link is slow to respond, please be patient (ready=0)
[118472.690015] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[118472.697432] ata1.00: configured for UDMA/133
[118472.701101] PM: resume of drv:sd dev:0:0:0:0 complete after 5460.546 msecs
[118472.701110] PM: resume of drv:scsi_device dev:0:0:0:0 complete after 5460.546 msecs
[118472.701114] PM: resume of drv:scsi_disk dev:0:0:0:0 complete after 4057.260 msecs
[118472.701157] PM: resume of devices complete after 5464.718 msecs
[118472.701342] PM: resume devices took 5.470 seconds
[118472.701370] PM: Finishing wakeup.
[118472.701371] Restarting tasks ... done.
[118472.812045] e1000e 0000:00:19.0: irq 48 for MSI/MSI-X
[118472.871916] e1000e 0000:00:19.0: irq 48 for MSI/MSI-X
[118472.872179] ADDRCONF: eth0: link is not ready
[118474.441709] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[118474.441713] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[118474.441959] ADDRCONF: eth0: link becomes ready
[118483.876611] EXT4-fs (sda5): re-mounted. Opts: errors=remount-ro,user_xattr,commit=0
[118484.890013] eth0: no IPv6 routers present
[118487.501593] EXT4-fs (sda6): re-mounted. Opts: user_xattr,commit=0
[118489.511361] libceph: mon0 172.16.10.63:6789 socket closed
[118489.557216] libceph: mon0 172.16.10.63:6789 session lost, hunting for new mon
[118489.954784] libceph: mon0 172.16.10.63:6789 session established
[118500.474077] ceph: mds0 reconnect start
[118500.564930] ceph: mds0 reconnect success
[118547.953053] ceph: mds0 recovery completed
[119165.763499] npviewer.bin29917: segfault at 418 ip 00000000f60b4b66 sp 00000000ffe47768 error 6 in libflashplayer.so[f5e66000+b2c000]
[122995.993102] libceph: loaded (mon/osd proto 15/24, osdmap 5/5 5/5)
[123002.540923] ceph: loaded (mds proto 32)
[123061.089543] libceph: client4120 fsid 12d0d091-c0f9-10a3-d4d5-6e858826cdb6
[123061.089662] libceph: mon0 172.16.10.63:6789 session established
[123196.780685] libceph: tid 9 timed out on osd0, will reset osd
[123196.781112] libceph: tid 22 timed out on osd1, will reset osd
[123230.961895] ceph: mds0 caps stale
[123245.961887] ceph: mds0 caps stale
[123256.910053] libceph: tid 75 timed out on osd0, will reset osd
[123256.910409] libceph: tid 81 timed out on osd1, will reset osd
[123317.030666] libceph: tid 206 timed out on osd0, will reset osd
[123317.030955] libceph: tid 81 timed out on osd1, will reset osd
[123377.150044] libceph: tid 206 timed out on osd0, will reset osd
[123377.150331] libceph: tid 81 timed out on osd1, will reset osd
[123437.270668] libceph: tid 206 timed out on osd0, will reset osd
[123437.270957] libceph: tid 81 timed out on osd1, will reset osd
[123484.390044] INFO: task cosd:18013 blocked for more than 120 seconds.
[123484.443872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123484.510054] cosd D 0000000000000002 0 18013 1 0x00000000
[123484.510059] ffff8800780afc18 0000000000000086 ffff8800780afb88 0000000000015980
[123484.510063] ffff8800780affd8 0000000000015980 ffff8800780affd8 ffff880014a12dc0
[123484.510066] 0000000000015980 0000000000015980 ffff8800780affd8 0000000000015980
[123484.510069] Call Trace:
[123484.510076] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123484.510080] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123484.510082] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123484.510084] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123484.510087] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123484.510090] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123484.510093] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123484.510095] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123484.510098] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123484.510102] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123484.510104] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123484.510107] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123484.510110] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123484.510112] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123484.510115] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123484.510117] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123484.510121] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123484.510127] INFO: task cosd:18063 blocked for more than 120 seconds.
[123484.563950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123484.630127] cosd D 0000000100bba75b 0 18063 1 0x00000000
[123484.630130] ffff8800783fbc18 0000000000000082 ffff880000000000 0000000000015980
[123484.630134] ffff8800783fbfd8 0000000000015980 ffff8800783fbfd8 ffff880051b28000
[123484.630137] 0000000000015980 0000000000015980 ffff8800783fbfd8 0000000000015980
[123484.630140] Call Trace:
[123484.630144] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123484.630146] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123484.630149] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123484.630151] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123484.630153] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123484.630156] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123484.630159] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123484.630162] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123484.630164] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123484.630168] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123484.630171] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123484.630173] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123484.630176] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123484.630178] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123484.630181] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123484.630183] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123484.630187] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123497.390665] libceph: tid 206 timed out on osd0, will reset osd
[123497.390954] libceph: tid 81 timed out on osd1, will reset osd
[123557.515370] libceph: tid 206 timed out on osd0, will reset osd
[123557.515626] libceph: tid 81 timed out on osd1, will reset osd
[123604.620049] INFO: task cosd:18013 blocked for more than 120 seconds.
[123604.673879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123604.740066] cosd D 0000000000000002 0 18013 1 0x00000000
[123604.740070] ffff8800780afc18 0000000000000086 ffff8800780afb88 0000000000015980
[123604.740073] ffff8800780affd8 0000000000015980 ffff8800780affd8 ffff880014a12dc0
[123604.740076] 0000000000015980 0000000000015980 ffff8800780affd8 0000000000015980
[123604.740079] Call Trace:
[123604.740087] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123604.740091] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123604.740094] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123604.740096] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123604.740098] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123604.740102] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123604.740105] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123604.740107] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123604.740110] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123604.740114] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123604.740116] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123604.740119] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123604.740121] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123604.740124] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123604.740127] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123604.740129] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123604.740133] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123604.740138] INFO: task cosd:18063 blocked for more than 120 seconds.
[123604.793982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123604.860197] cosd D 0000000100bba75b 0 18063 1 0x00000000
[123604.860200] ffff8800783fbc18 0000000000000082 ffff880000000000 0000000000015980
[123604.860204] ffff8800783fbfd8 0000000000015980 ffff8800783fbfd8 ffff880051b28000
[123604.860207] 0000000000015980 0000000000015980 ffff8800783fbfd8 0000000000015980
[123604.860210] Call Trace:
[123604.860214] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123604.860216] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123604.860218] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123604.860220] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123604.860223] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123604.860225] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123604.860228] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123604.860230] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123604.860233] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123604.860236] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123604.860239] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123604.860242] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123604.860244] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123604.860246] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123604.860248] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123604.860251] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123604.860253] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123617.630046] libceph: tid 206 timed out on osd0, will reset osd
[123617.630333] libceph: tid 81 timed out on osd1, will reset osd
[123677.760665] libceph: tid 206 timed out on osd0, will reset osd
[123677.760953] libceph: tid 81 timed out on osd1, will reset osd
[123724.860048] INFO: task cosd:18013 blocked for more than 120 seconds.
[123724.913869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123724.980045] cosd D 0000000000000002 0 18013 1 0x00000000
[123724.980049] ffff8800780afc18 0000000000000086 ffff8800780afb88 0000000000015980
[123724.980052] ffff8800780affd8 0000000000015980 ffff8800780affd8 ffff880014a12dc0
[123724.980056] 0000000000015980 0000000000015980 ffff8800780affd8 0000000000015980
[123724.980059] Call Trace:
[123724.980065] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123724.980069] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123724.980072] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123724.980074] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123724.980076] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123724.980079] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123724.980082] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123724.980085] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123724.980088] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123724.980091] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123724.980094] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123724.980096] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123724.980099] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123724.980101] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123724.980104] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123724.980106] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123724.980110] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123724.980116] INFO: task cosd:18063 blocked for more than 120 seconds.
[123725.033949] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123725.100154] cosd D 0000000100bba75b 0 18063 1 0x00000000
[123725.100158] ffff8800783fbc18 0000000000000082 ffff880000000000 0000000000015980
[123725.100161] ffff8800783fbfd8 0000000000015980 ffff8800783fbfd8 ffff880051b28000
[123725.100164] 0000000000015980 0000000000015980 ffff8800783fbfd8 0000000000015980
[123725.100167] Call Trace:
[123725.100171] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123725.100173] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123725.100176] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123725.100178] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123725.100180] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123725.100183] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123725.100185] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123725.100188] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123725.100191] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123725.100194] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123725.100197] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123725.100200] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123725.100202] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123725.100204] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123725.100207] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123725.100210] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123725.100213] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123737.870668] libceph: tid 206 timed out on osd0, will reset osd
[123737.870956] libceph: tid 81 timed out on osd1, will reset osd
[123797.990667] libceph: tid 206 timed out on osd0, will reset osd
[123797.990955] libceph: tid 81 timed out on osd1, will reset osd
[123845.100049] INFO: task cosd:18013 blocked for more than 120 seconds.
[123845.153892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123845.220093] cosd D 0000000000000002 0 18013 1 0x00000000
[123845.220098] ffff8800780afc18 0000000000000086 ffff8800780afb88 0000000000015980
[123845.220102] ffff8800780affd8 0000000000015980 ffff8800780affd8 ffff880014a12dc0
[123845.220105] 0000000000015980 0000000000015980 ffff8800780affd8 0000000000015980
[123845.220108] Call Trace:
[123845.220115] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123845.220119] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123845.220121] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123845.220124] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123845.220126] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123845.220130] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123845.220133] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123845.220135] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123845.220138] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123845.220142] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123845.220144] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123845.220147] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123845.220149] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123845.220152] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123845.220155] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123845.220157] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123845.220161] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123845.220168] INFO: task cosd:18063 blocked for more than 120 seconds.
[123845.274010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123845.340222] cosd D 0000000100bba75b 0 18063 1 0x00000000
[123845.340228] ffff8800783fbc18 0000000000000082 ffff880000000000 0000000000015980
[123845.340231] ffff8800783fbfd8 0000000000015980 ffff8800783fbfd8 ffff880051b28000
[123845.340234] 0000000000015980 0000000000015980 ffff8800783fbfd8 0000000000015980
[123845.340238] Call Trace:
[123845.340246] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123845.340251] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123845.340254] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123845.340256] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123845.340259] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123845.340262] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123845.340266] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123845.340268] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123845.340271] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123845.340275] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123845.340278] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123845.340281] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123845.340283] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123845.340285] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123845.340288] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123845.340291] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123845.340295] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123858.110668] libceph: tid 206 timed out on osd0, will reset osd
[123858.110956] libceph: tid 81 timed out on osd1, will reset osd
[123918.230673] libceph: tid 206 timed out on osd0, will reset osd
[123918.230962] libceph: tid 81 timed out on osd1, will reset osd
[123965.340053] INFO: task cosd:18013 blocked for more than 120 seconds.
[123965.393886] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123965.460087] cosd D 0000000000000002 0 18013 1 0x00000000
[123965.460094] ffff8800780afc18 0000000000000086 ffff8800780afb88 0000000000015980
[123965.460098] ffff8800780affd8 0000000000015980 ffff8800780affd8 ffff880014a12dc0
[123965.460101] 0000000000015980 0000000000015980 ffff8800780affd8 0000000000015980
[123965.460104] Call Trace:
[123965.460112] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123965.460116] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123965.460119] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123965.460121] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123965.460123] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123965.460127] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123965.460130] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123965.460133] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123965.460135] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123965.460144] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123965.460147] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123965.460150] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123965.460152] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123965.460154] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123965.460157] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123965.460159] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123965.460163] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123965.460170] INFO: task cosd:18063 blocked for more than 120 seconds.
[123965.513993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[123965.580202] cosd D 0000000100bba75b 0 18063 1 0x00000000
[123965.580207] ffff8800783fbc18 0000000000000082 ffff880000000000 0000000000015980
[123965.580210] ffff8800783fbfd8 0000000000015980 ffff8800783fbfd8 ffff880051b28000
[123965.580213] 0000000000015980 0000000000015980 ffff8800783fbfd8 0000000000015980
[123965.580216] Call Trace:
[123965.580223] [<ffffffff81101600>] ? sync_page+0x0/0x50
[123965.580226] [<ffffffff81587023>] io_schedule+0x73/0xc0
[123965.580228] [<ffffffff8110163d>] sync_page+0x3d/0x50
[123965.580230] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
[123965.580232] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
[123965.580236] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
[123965.580239] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
[123965.580241] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
[123965.580244] [<ffffffff81101d7b>] filemap_fdatawait+0x2b/0x30
[123965.580247] [<ffffffff81175a61>] wait_sb_inodes+0xa1/0x100
[123965.580250] [<ffffffff81175b51>] sync_inodes_sb+0x91/0xc0
[123965.580253] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
[123965.580255] [<ffffffff8117a648>] __sync_filesystem+0x88/0xa0
[123965.580257] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
[123965.580260] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
[123965.580262] [<ffffffff8117a6d5>] sys_sync+0x45/0x70
[123965.580266] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[123978.350667] libceph: tid 206 timed out on osd0, will reset osd
[123978.350961] libceph: tid 81 timed out on osd1, will reset osd
[124038.470668] libceph: tid 206 timed out on osd0, will reset osd
[124038.470956] libceph: tid 81 timed out on osd1, will reset osd
jeff@cephhost:~/source/git-src/gitosis-admin$

Actions #3

Updated by Sage Weil over 13 years ago

Are the cosd instances running? What does 'ceph -s' show? It looks like it's just blocking because the OSDs aren't reachable.

Actions #4

Updated by changping Wu over 13 years ago

Hi,
ceph 0.24, ceph-client-standalone from git, master-backport branch.
It's very easy to reproduce this issue with tiotest.
example.sh:
#!/bin/bash
#while echo `date`; do ./tiotest -f 400 -t 20 -d /dev/$1 -R -T -c -D 20 -r 2000; done
while echo `date`; do ./tiotest -f 20 -t 10 -d "$1" -T -c -D 20 -r 1000; done

------------------
According to the logs, the cosd instances are running, but osd0 is marked down.
The tiotest tool and log are attached.

jeff@cephhost:/usr/local/etc/ceph$ sudo ceph -s
2011-01-04 10:15:46.238974 pg v94: 528 pgs: 528 active+clean; 395 MB data, 44324 MB used, 62595 MB / 110 GB avail
2011-01-04 10:15:46.239897 mds e5: 1/1/1 up {0=up:active}, 1 up:standby
2011-01-04 10:15:46.239923 osd e8: 2 osds: 1 up, 2 in
2011-01-04 10:15:46.239963 log 2011-01-04 10:12:35.032217 mon0 172.16.10.63:6789/0 9 : [INF] osd0 172.16.10.63:6802/9022 failed (by osd1 172.16.10.63:6805/9082)
2011-01-04 10:15:46.240003 mon e1: 1 mons at {0=172.16.10.63:6789/0}
jeff@cephhost:/usr/local/etc/ceph$ ps -ef | grep cosd
root 9023 1 3 10:03 ? 00:00:24 /usr/local/bin/cosd -i 0 -c ceph.conf
root 9083 1 3 10:03 ? 00:00:27 /usr/local/bin/cosd -i 1 -c ceph.conf
jeff 9575 3866 0 10:15 pts/1 00:00:00 grep --color=auto cosd
jeff@cephhost:/usr/local/etc/ceph$

jeff@cephhost:/usr/local/etc/ceph$ sudo ceph -s
2011-01-04 10:18:58.982886 pg v95: 528 pgs: 528 active+clean; 395 MB data, 22164 MB used, 31295 MB / 56320 MB avail
2011-01-04 10:18:58.983853 mds e5: 1/1/1 up {0=up:active}, 1 up:standby
2011-01-04 10:18:58.983879 osd e9: 2 osds: 1 up, 1 in
2011-01-04 10:18:58.983913 log 2011-01-04 10:17:36.663650 mon0 172.16.10.63:6789/0 10 : [INF] osd0 out (down for 301.311964)
2011-01-04 10:18:58.983949 mon e1: 1 mons at {0=172.16.10.63:6789/0}

======================================
jeff@cephhost:/usr/local/etc/ceph$ sudo tailf /var/log/messages
[sudo] password for jeff:
Jan 4 09:35:28 cephhost rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="851" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Jan 4 10:03:54 cephhost kernel: [ 3220.198539] libceph: loaded (mon/osd proto 15/24, osdmap 5/5 5/5)
Jan 4 10:04:00 cephhost kernel: [ 3226.543170] ceph: loaded (mds proto 32)
Jan 4 10:05:25 cephhost kernel: [ 3311.217016] libceph: client4101 fsid 20a2bbea-d2b7-fc70-e375-43c84f817bc9
Jan 4 10:05:25 cephhost kernel: [ 3311.217122] libceph: mon0 172.16.10.63:6789 session established
Jan 4 10:12:35 cephhost kernel: [ 3741.050294] libceph: osd0 down
Jan 4 10:13:36 cephhost kernel: [ 3801.930665] libceph: tid 17654 timed out on osd1, will reset osd
Jan 4 10:14:18 cephhost kernel: [ 3844.374119] cosd D 0000000000000000 0 9042 1 0x00000000
Jan 4 10:14:18 cephhost kernel: [ 3844.374122] ffff88002f355ce8 0000000000000082 ffff88002f355cd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.374126] ffff88002f355fd8 0000000000015980 ffff88002f355fd8 ffff88002f345b80
Jan 4 10:14:18 cephhost kernel: [ 3844.374129] 0000000000015980 0000000000015980 ffff88002f355fd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.374132] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.374139] [<ffffffff8158760d>] schedule_timeout+0x22d/0x310
Jan 4 10:14:18 cephhost kernel: [ 3844.374143] [<ffffffff8104ed12>] ? enqueue_entity+0x132/0x1b0
Jan 4 10:14:18 cephhost kernel: [ 3844.374145] [<ffffffff8104f1d5>] ? enqueue_task_fair+0x45/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.374148] [<ffffffff81052bd7>] ? enqueue_task+0x77/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.374151] [<ffffffff815871eb>] wait_for_common+0xdb/0x180
Jan 4 10:14:18 cephhost kernel: [ 3844.374154] [<ffffffff81056770>] ? default_wake_function+0x0/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.374156] [<ffffffff8158736d>] wait_for_completion+0x1d/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.374160] [<ffffffff81175713>] writeback_inodes_sb+0xb3/0xe0
Jan 4 10:14:18 cephhost kernel: [ 3844.374162] [<ffffffff8117a660>] ? sync_one_sb+0x0/0x30
Jan 4 10:14:18 cephhost kernel: [ 3844.374165] [<ffffffff8117a60e>] __sync_filesystem+0x4e/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.374167] [<ffffffff8117a680>] sync_one_sb+0x20/0x30
Jan 4 10:14:18 cephhost kernel: [ 3844.374170] [<ffffffff81154f5b>] iterate_supers+0x8b/0xd0
Jan 4 10:14:18 cephhost kernel: [ 3844.374172] [<ffffffff8117a6bf>] sys_sync+0x2f/0x70
Jan 4 10:14:18 cephhost kernel: [ 3844.374176] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
Jan 4 10:14:18 cephhost kernel: [ 3844.388497] tiotest D 0000000000000002 0 9483 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.388501] ffff88006175bcd8 0000000000000086 ffff88006175bc58 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.388505] ffff88006175bfd8 0000000000015980 ffff88006175bfd8 ffff880062f096e0
Jan 4 10:14:18 cephhost kernel: [ 3844.388510] 0000000000015980 0000000000015980 ffff88006175bfd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.388514] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.388518] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.388521] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.388523] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.388526] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.388529] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.388532] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.388535] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.388538] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.388541] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.388544] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.388547] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.388550] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.388553] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.388556] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.402880] tiotest D 0000000000000002 0 9484 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.402884] ffff88005433fcd8 0000000000000086 ffff88005433fc58 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.402889] ffff88005433ffd8 0000000000015980 ffff88005433ffd8 ffff880062f0adc0
Jan 4 10:14:18 cephhost kernel: [ 3844.402893] 0000000000015980 0000000000015980 ffff88005433ffd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.402898] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.402901] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.402904] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.402906] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.402909] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.402912] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.402914] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.402917] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.402920] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.402923] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.402926] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.402930] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.402932] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.402935] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.402938] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.417262] tiotest D 000000010005386f 0 9485 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.417266] ffff880047d39cd8 0000000000000086 ffff880047d39c48 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.417273] ffff880047d39fd8 0000000000015980 ffff880047d39fd8 ffff880062f0c4a0
Jan 4 10:14:18 cephhost kernel: [ 3844.417276] 0000000000015980 0000000000015980 ffff880047d39fd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.417279] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.417283] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.417285] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.417287] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.417290] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.417292] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.417295] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.417298] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.417301] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.417304] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.417307] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.417310] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.417313] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.417316] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.417319] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.431643] tiotest D 0000000000000002 0 9486 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.431647] ffff880047d5bcd8 0000000000000086 ffff880047d5bc58 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.431652] ffff880047d5bfd8 0000000000015980 ffff880047d5bfd8 ffff880062f08000
Jan 4 10:14:18 cephhost kernel: [ 3844.431656] 0000000000015980 0000000000015980 ffff880047d5bfd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.431660] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.431665] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.431669] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.431671] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.431674] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.431676] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.431680] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.431683] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.431686] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.431689] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.431692] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.431695] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.431698] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.431701] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.431705] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.446015] tiotest D 0000000000000002 0 9487 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.446019] ffff880047c73cd8 0000000000000086 ffff880047c73c58 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.446024] ffff880047c73fd8 0000000000015980 ffff880047c73fd8 ffff880062f0db80
Jan 4 10:14:18 cephhost kernel: [ 3844.446028] 0000000000015980 0000000000015980 ffff880047c73fd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.446032] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.446035] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.446038] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.446041] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.446043] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.446046] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.446049] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.446052] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.446055] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.446058] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.446061] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.446064] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.446067] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.446070] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.446073] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.460366] tiotest D 0000000000000002 0 9488 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.460369] ffff880047ca9cd8 0000000000000086 ffff880047ca9c58 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.460372] ffff880047ca9fd8 0000000000015980 ffff880047ca9fd8 ffff88006e4f2dc0
Jan 4 10:14:18 cephhost kernel: [ 3844.460375] 0000000000015980 0000000000015980 ffff880047ca9fd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.460378] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.460381] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.460383] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.460385] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.460388] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.460390] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.460392] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.460395] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.460398] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.460401] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.460403] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.460406] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.460408] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.460411] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.460413] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.474782] tiotest D 00000001000533a8 0 9489 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.474788] ffff880047cfdcd8 0000000000000086 ffff880047cfdc48 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.474791] ffff880047cfdfd8 0000000000015980 ffff880047cfdfd8 ffff88006e4f5b80
Jan 4 10:14:18 cephhost kernel: [ 3844.474794] 0000000000015980 0000000000015980 ffff880047cfdfd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.474797] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.474804] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.474808] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.474810] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.474812] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.474814] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.474818] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.474821] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.474823] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.474826] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.474831] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.474834] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.474836] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.474838] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.474844] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.489160] tiotest D 00000001000531f5 0 9490 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.489163] ffff880047c4dcd8 0000000000000086 ffff880000000000 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.489166] ffff880047c4dfd8 0000000000015980 ffff880047c4dfd8 ffff88006e4f44a0
Jan 4 10:14:18 cephhost kernel: [ 3844.489169] 0000000000015980 0000000000015980 ffff880047c4dfd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.489172] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.489177] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.489180] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.489182] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.489184] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.489187] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.489189] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.489192] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.489195] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.489198] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.489201] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.489204] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.489206] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.489208] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.489212] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:18 cephhost kernel: [ 3844.503653] tiotest D 0000000100053418 0 9491 9381 0x00020000
Jan 4 10:14:18 cephhost kernel: [ 3844.503656] ffff8800542c3cd8 0000000000000086 ffff880000000000 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.503660] ffff8800542c3fd8 0000000000015980 ffff8800542c3fd8 ffff8800020216e0
Jan 4 10:14:18 cephhost kernel: [ 3844.503662] 0000000000015980 0000000000015980 ffff8800542c3fd8 0000000000015980
Jan 4 10:14:18 cephhost kernel: [ 3844.503665] Call Trace:
Jan 4 10:14:18 cephhost kernel: [ 3844.503669] [<ffffffff81101600>] ? sync_page+0x0/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.503672] [<ffffffff81587023>] io_schedule+0x73/0xc0
Jan 4 10:14:18 cephhost kernel: [ 3844.503674] [<ffffffff8110163d>] sync_page+0x3d/0x50
Jan 4 10:14:18 cephhost kernel: [ 3844.503676] [<ffffffff815878ff>] __wait_on_bit+0x5f/0x90
Jan 4 10:14:18 cephhost kernel: [ 3844.503679] [<ffffffff811017f3>] wait_on_page_bit+0x73/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.503681] [<ffffffff8107f650>] ? wake_bit_function+0x0/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.503684] [<ffffffff8110bee5>] ? pagevec_lookup_tag+0x25/0x40
Jan 4 10:14:18 cephhost kernel: [ 3844.503686] [<ffffffff81101cbd>] filemap_fdatawait_range+0x10d/0x1a0
Jan 4 10:14:18 cephhost kernel: [ 3844.503689] [<ffffffff81101e50>] filemap_write_and_wait_range+0x70/0x80
Jan 4 10:14:18 cephhost kernel: [ 3844.503692] [<ffffffff8117a39a>] vfs_fsync_range+0x5a/0xa0
Jan 4 10:14:18 cephhost kernel: [ 3844.503695] [<ffffffff8117a44c>] vfs_fsync+0x1c/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.503697] [<ffffffff8117a48a>] do_fsync+0x3a/0x60
Jan 4 10:14:18 cephhost kernel: [ 3844.503699] [<ffffffff8117a4e0>] sys_fsync+0x10/0x20
Jan 4 10:14:18 cephhost kernel: [ 3844.503702] [<ffffffff810461a0>] sysenter_dispatch+0x7/0x2e
Jan 4 10:14:36 cephhost kernel: [ 3862.052078] libceph: tid 17654 timed out on osd1, will reset osd

Actions #5

Updated by Sage Weil over 13 years ago

  • Assignee set to Colin McCabe
Actions #6

Updated by Colin McCabe over 13 years ago

added tiotest to qa/workunits

Actions #7

Updated by changping Wu over 13 years ago

Hi, I tried modifying some of the code, which reduces the timeout issue.

For instance, with the client mounted as:
mount.ceph monip:6789:/ /mnt/ceph

1. Reduce the tiotest I/O stress, for example:
-----------
#!/bin/bash
while echo `date`; do ./tiotest -f 1 -t 32 -d /mnt/ceph -T -D -r 10; done
--------------

2. When tiotest reads and writes random data, performance is poor. After reading/writing and syncing about 3000 MB of data to the OSDs, the 5-second keepalive interval ("#define CEPH_OSD_KEEPALIVE_DEFAULT 5") is not enough to use as the OSD timeout schedule interval; we can change the timeout schedule to the 60-second request timeout ("#define CEPH_OSD_TIMEOUT_DEFAULT 60").

3. handle_timeout (a sketch of this ladder follows below):
1. OSD keepalive: sometimes the kernel prints "socket closed", so I have to keep the OSD session alive before resending timed-out requests.
2. Handle requests flagged with req->r_resend.
3. If the OSD isn't up, or its address has changed, reset the OSD and send the timed-out request again.
4. If that still fails, reset the OSD and resend the request.
5. If it still fails, abort all requests on that OSD, to make sure we don't printk an OOPS.
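To make the proposed ladder concrete, here is a small userspace C sketch of the handle_timeout flow described above. All struct layouts and helper names are hypothetical stand-ins, not the actual libceph code:
-----------
/* Hypothetical userspace model of the proposed retry ladder.
 * None of these names come from libceph; they only model the steps. */
#include <stdbool.h>
#include <stdio.h>

struct osd {
    int id;
    bool up;
    bool addr_changed;
};

struct osd_request {
    long tid;
    bool r_resend;      /* request was marked for resend */
};

static void send_keepalive(struct osd *osd)
{
    printf("osd%d: keepalive (avoid \"socket closed\")\n", osd->id);
}

static void reset_osd(struct osd *osd)
{
    printf("osd%d: reset connection\n", osd->id);
    osd->addr_changed = false;
}

/* Pretend a resend succeeds only when the OSD is up. */
static bool resend(struct osd *osd, struct osd_request *req)
{
    printf("osd%d: resend tid %ld\n", osd->id, req->tid);
    return osd->up;
}

static void abort_requests(struct osd *osd)
{
    printf("osd%d: abort all outstanding requests\n", osd->id);
}

static void handle_timeout(struct osd *osd, struct osd_request *req)
{
    send_keepalive(osd);                   /* step 1 */
    if (!req->r_resend)                    /* step 2 */
        return;
    if (!osd->up || osd->addr_changed)     /* step 3 */
        reset_osd(osd);
    if (resend(osd, req))
        return;
    reset_osd(osd);                        /* step 4 */
    if (resend(osd, req))
        return;
    abort_requests(osd);                   /* step 5: give up, no oops */
}

int main(void)
{
    struct osd osd = { .id = 0, .up = false, .addr_changed = true };
    struct osd_request req = { .tid = 206, .r_resend = true };
    handle_timeout(&osd, &req);
    return 0;
}
-----------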


I also wonder: if we add the timer on the OSD side, how do we make sure each request has been handled completely? It might lose some requests.
Could we add a timer to each request instead?
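On that last question, a minimal userspace sketch of the per-request-timer idea (hypothetical names; kernel code would attach a struct timer_list to each request rather than polling time()):
-----------
/* Hypothetical sketch: each request carries its own deadline, so a
 * periodic sweep expires only the requests that are individually
 * overdue, and no request is "lost" to a single shared per-OSD timer. */
#include <stdio.h>
#include <time.h>

#define REQ_TIMEOUT_SECS 60

struct osd_request {
    long tid;
    time_t deadline;    /* absolute expiry time of this request */
};

static void start_request(struct osd_request *req, long tid)
{
    req->tid = tid;
    req->deadline = time(NULL) + REQ_TIMEOUT_SECS;
}

static int request_expired(const struct osd_request *req)
{
    return time(NULL) >= req->deadline;
}

int main(void)
{
    struct osd_request req;
    start_request(&req, 81);
    printf("tid %ld expired: %d\n", req.tid, request_expired(&req));
    return 0;
}
-----------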

Jeff Wu

Actions #8

Updated by Colin McCabe about 13 years ago

  • Assignee changed from Colin McCabe to Samuel Just
Actions #9

Updated by Sage Weil about 13 years ago

  • Status changed from New to Can't reproduce