Bug #598

osd: journal reset in parallel mode acts weird

Added by Sage Weil over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Start date: 11/22/2010
% Done: 0%
Estimated time: 2.00 h
Regression: No
Severity: 3 - minor

Description

from ML:

From jan.smets@alcatel-lucent.com Thu Nov 18 20:40:00 2010
Date: Fri, 19 Nov 2010 14:33:05 +0100
From: "Smets, Jan (Jan)" <jan.smets@alcatel-lucent.com>
To: Sage Weil <sage@newdream.net>
Subject: Journal reset

Hi

[root@client0 ceph]# dd if=/dev/zero of=bla bs=4k count=2500k
....

While doing this, I killed OSD1; the cluster stayed online.

2010-11-19 13:47:28.220351    pg v651: 1584 pgs: 1584 active+clean; 607 MB data, 3407 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:41.308149   log 2010-11-19 13:56:41.106859 mon0 138.203.10.98:6789/0 40 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd0 138.203.10.98:6801/2855)
2010-11-19 13:56:41.577522   log 2010-11-19 13:56:41.401719 mon0 138.203.10.98:6789/0 41 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd3 138.203.10.99:6803/1690)
2010-11-19 13:56:41.974092    pg v652: 1584 pgs: 1584 active+clean; 607 MB data, 3103 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:42.338220   osd e10: 6 osds: 5 up, 6 in
2010-11-19 13:56:42.514285   log 2010-11-19 13:56:42.104776 mon0 138.203.10.98:6789/0 42 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd2 138.203.10.99:6801/1649)
2010-11-19 13:56:43.006958    pg v653: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.227246    pg v654: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.278054   osd e11: 6 osds: 5 up, 6 in
2010-11-19 13:56:44.081089    pg v655: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:44.081210   log 2010-11-19 13:56:43.000098 mon2 138.203.10.100:6789/0 18 : [WRN] lease_expire from mon0 was sent from future time 2010-11-19 13:56:43.074412 with expected time <=2010-11-19 13:56:43.010509, clocks not synchronized
2010-11-19 13:56:44.240413   osd e12: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.001293    pg v656: 1584 pgs: 1460 active+clean, 124 degraded+peering; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail; 13/336 degraded (3.869%)
2010-11-19 13:56:45.252438   osd e13: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.338111    pg v657: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:45.496352    pg v658: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:48.055595    pg v659: 1584 pgs: 1025 active+clean, 104 active+clean+degraded, 455 degraded+peering; 643 MB data, 2039 MB used, 1366 GB / 1380 GB avail; 60/354 degraded (16.949%)
2010-11-19 13:56:48.541752    pg v660: 1584 pgs: 1025 active+clean, 365 active+clean+degraded, 194 degraded+peering; 783 MB data, 2201 MB used, 1366 GB / 1380 GB avail; 73/424 degraded (17.217%)
2010-11-19 13:56:49.066588    pg v661: 1584 pgs: 1025 active+clean, 459 active+clean+degraded, 100 degraded+peering; 823 MB data, 2267 MB used, 1366 GB / 1380 GB avail; 78/444 degraded (17.568%)
2010-11-19 13:56:50.092107    pg v662: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 863 MB data, 2337 MB used, 1366 GB / 1380 GB avail; 80/464 degraded (17.241%)
2010-11-19 13:56:52.315789    pg v663: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 827 MB data, 2422 MB used, 1366 GB / 1380 GB avail; 84/478 degraded (17.573%)

So then I decided to start OSD1 again

2010-11-19 13:58:53.971343   osd e14: 6 osds: 6 up, 6 in
2010-11-19 13:58:54.540376    pg v734: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5204 MB data, 8670 MB used, 1130 GB / 1148 GB avail; 473/2636 degraded (17.944%)
2010-11-19 13:58:54.971413   log 2010-11-19 13:58:53.685598 mon0 138.203.10.98:6789/0 44 : [INF] osd1 138.203.10.98:6803/3434 boot
2010-11-19 13:58:55.225719   osd e15: 6 osds: 6 up, 6 in
2010-11-19 13:58:55.451934    pg v735: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.036469    pg v736: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.328999   osd e16: 6 osds: 6 up, 6 in
2010-11-19 13:58:57.192782    pg v737: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:57.575131    pg v738: 1584 pgs: 62 active, 1222 active+clean, 300 active+clean+degraded; 5572 MB data, 9413 MB used, 1359 GB / 1380 GB avail; 484/2818 degraded (17.175%)
2010-11-19 13:58:58.241494   osd e17: 6 osds: 6 up, 6 in
2010-11-19 13:58:58.434765    pg v739: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:58:59.594457    pg v740: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:59:00.564642    pg v741: 1584 pgs: 95 active, 1325 active+clean, 164 active+clean+degraded; 5780 MB data, 9881 MB used, 1358 GB / 1380 GB avail; 478/2922 degraded (16.359%)
2010-11-19 13:59:01.149046    pg v742: 1584 pgs: 128 active, 1420 active+clean, 36 active+clean+degraded; 6100 MB data, 10518 MB used, 1358 GB / 1380 GB avail; 468/3082 degraded (15.185%)
2010-11-19 13:59:02.039833   osd e18: 6 osds: 6 up, 6 in
2010-11-19 13:59:02.258308    pg v743: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:02.658114    pg v744: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:03.169934    pg v745: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6224 MB data, 10836 MB used, 1357 GB / 1380 GB avail; 450/3144 degraded (14.313%)
2010-11-19 13:59:03.737821    pg v746: 1584 pgs: 135 active, 1449 active+clean; 6304 MB data, 10988 MB used, 1357 GB / 1380 GB avail; 466/3184 degraded (14.636%)
.......
2010-11-19 14:10:39.656792    pg v913: 1584 pgs: 33 active, 1551 active+clean; 10293 MB data, 19986 MB used, 1348 GB / 1380 GB avail; 142/5180 degraded (2.741%)

You can see it's slowly recovering, right?

But after this it just stayed at the same level (well, going down by about 0.003% every minute?).

Anyway, if I look at the file my dd is writing to, I can see the file growing very slowly (attaching strace to dd shows some writes now and then).

[root@client0 ~]# while true; do du  /mnt/ceph/bla ; sleep 10; done
10197548        /mnt/ceph/bla
10197548        /mnt/ceph/bla
10197548        /mnt/ceph/bla
10197548        /mnt/ceph/bla
10197548        /mnt/ceph/bla
10205520        /mnt/ceph/bla
10205520        /mnt/ceph/bla
10205520        /mnt/ceph/bla

2010-11-19 13:59:49.396893 7f4823165710 osd1 18 pg[0.b6( v 18'62 (9'54,18'62] n=6 ec=2 les=15 14/14/14) [1,5] r=0 lcod 13'59 mlcod 9'54 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.513111 7f482a274710 journal check_for_full at 341794816 : JOURNAL FULL 341794816 >= 901119 (max_size 524288000 start 342695936)
2010-11-19 13:59:49.559325 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas
2010-11-19 13:59:49.630903 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.761690 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas
2010-11-19 13:59:49.952224 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 13:59:57.396808 7f4829272710 journal  seq 136010 >= full_commit_seq 135949, prior journal contents are now fully committed.  resetting journal.
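The "JOURNAL FULL 341794816 >= 901119 (max_size 524288000 start 342695936)" line above is a circular-buffer space check: the write head has wrapped around and may only advance up to the start of the still-uncommitted tail. A minimal sketch of that kind of check, with numbers from the log (the function name and signature are mine, not Ceph's actual FileJournal code):

```python
def journal_free_bytes(max_size, start, write_pos):
    """Free space in a circular (ring-buffer) journal.

    A sketch of the check behind the JOURNAL FULL message above,
    not Ceph's actual implementation.  `start` is where the
    uncommitted journal contents begin; `write_pos` is the write head.
    """
    if write_pos >= start:
        # Write head is ahead of the tail: free space is what remains
        # at the end of the device plus everything before `start`.
        return max_size - (write_pos - start)
    # Write head has wrapped around: it may only advance up to `start`.
    return start - write_pos

# Numbers from the log line above:
#   check_for_full at 341794816 (max_size 524288000 start 342695936)
free = journal_free_bytes(524288000, 342695936, 341794816)
print(free)  # 901120
```

Only about 900 KB remain before the write head would overrun the uncommitted tail, so a larger entry cannot be appended and the journal declares itself full.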

Killing OSD1 again makes it go back to full speed. So I edited ceph.conf to enable OSD logging, but when I started it again the cluster kept running just fine...

So I ran   while true; do ./ceph -c ceph.conf  osd tell 1  bench; sleep 10; done   again, which is how I got it to slow down yesterday.

2010-11-19 14:21:14.798056 7f0c746ba710 osd1 24 pg[0.2c( v 18'3809 (9'3802,18'3809] n=5 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'3802 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:14.844190 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:14.844256 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:15.182259 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:15.182317 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:20.258272 7f0c7a7c7710 journal  seq 138394 >= full_commit_seq 138361, prior journal contents are now fully committed.  resetting journal.
2010-11-19 14:21:20.573888 7f0c74ebb710 journal  seq 138760 >= full_restart_seq 138760, restarting journal
2010-11-19 14:23:43.371102 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.386308 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.401429 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.420130 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.441343 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.467342 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.484976 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.503794 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.522736 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.535576 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.560202 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.574383 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.585959 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.598890 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.610328 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.623691 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.639286 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.660119 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.679829 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.691648 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.702549 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.730357 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.742713 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.754155 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.765106 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.776403 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.789113 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.800176 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.811974 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.823539 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.835310 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.846132 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.857955 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.869687 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.881021 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.892952 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.904863 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.916889 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.929281 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.941637 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.953639 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.965723 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.977537 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:48.989106 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.002811 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.014843 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.026917 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.038535 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.050636 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.191378 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.203257 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.214730 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.225352 7f0c7b7c9710 journal check_for_full at 453378048 : JOURNAL FULL 453378048 >= 3174399 (max_size 524288000 start 456552448)
2010-11-19 14:23:58.242549 7f0c7a7c7710 journal  seq 141240 >= full_commit_seq 141234, prior journal contents are now fully committed.  resetting journal.
2010-11-19 14:24:47.126417 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=0).accept peer addr is really 138.203.10.101:0/2384373810 (socket is 138.203.10.101:32919/0)
2010-11-19 14:24:47.126482 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

So the problem is somewhere before/after the journal reset.
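The full_commit_seq / full_restart_seq messages in the trace suggest a three-step sequence: when the journal fills, journaling stops and the stop point is recorded; once the filestore has committed everything through full_commit_seq, the journal's contents are no longer needed and it is reset; writes from full_restart_seq onward then go back into the journal. A toy model of that sequence (state names and fields are illustrative assumptions, not Ceph's actual FileJournal internals):

```python
class JournalFullLogic:
    """Toy model of the full -> reset -> restart sequence seen in the
    log above.  Names are mine; this is not Ceph's implementation."""

    def __init__(self):
        self.state = "active"
        self.full_commit_seq = None   # last seq that made it into the journal
        self.full_restart_seq = None  # first seq to journal again afterwards

    def on_full(self, last_journaled_seq, next_seq):
        # Journal device is full: stop journaling, remember where we stopped.
        self.state = "full"
        self.full_commit_seq = last_journaled_seq
        self.full_restart_seq = next_seq

    def on_commit(self, committed_seq):
        # Filestore committed through committed_seq.  Once everything in
        # the journal is committed, its contents can be discarded.
        if self.state == "full" and committed_seq >= self.full_commit_seq:
            self.state = "reset"      # "resetting journal."

    def on_apply(self, seq):
        # Writes from full_restart_seq onward re-enter the journal.
        if self.state == "reset" and seq >= self.full_restart_seq:
            self.state = "active"     # "restarting journal"
```

In the trace, seq 138394 >= full_commit_seq 138361 triggers the reset and seq 138760 >= full_restart_seq 138760 restarts journaling; the report suggests the stall sits somewhere in the window around this transition.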

(more debug log, not sure if it's of any use...)

2010-11-19 14:28:42.548880 7fb98f9ae710 osd1 36 update_osd_stat osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.548898 7fb98f9ae710 osd1 36 heartbeat: stat(2010-11-19 14:28:42.548829 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.548912 7fb98f9ae710 osd1 36 heartbeat: osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.658113 7fb992ab5710 osd1 36 heartbeat_dispatch 0x34eea80
2010-11-19 14:28:42.658136 7fb992ab5710 osd1 36 handle_osd_ping from osd5 got stat stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.658154 7fb992ab5710 osd1 36 _share_map_incoming osd5 138.203.10.100:6804/1698 36
2010-11-19 14:28:42.658165 7fb992ab5710 osd1 36 take_peer_stat peer osd5 stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.665266 7fb996abd710 journal commit_finish thru 145188
2010-11-19 14:28:42.665280 7fb996abd710 journal committed_thru 145188 (last_committed_seq 145097)
2010-11-19 14:28:42.665285 7fb996abd710 journal  seq 145188 >= full_commit_seq 145149, prior journal contents are now fully committed.  resetting journal.
2010-11-19 14:28:42.665291 7fb996abd710 journal header: block_size 4096 alignment 4096 max_size 524288000
2010-11-19 14:28:42.665303 7fb996abd710 journal header: start 42086400
2010-11-19 14:28:42.665308 7fb996abd710 journal  write_pos 42086400
2010-11-19 14:28:42.665312 7fb996abd710 journal committed_thru done
2010-11-19 14:28:42.730313 7fb992ab5710 osd1 36 heartbeat_dispatch 0x3521540

ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)

[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /data/osd$id

        ; Ideally, make this a separate disk or partition.  A few
        ; hundred MB should be enough; more if you have fast or many
        ; disks.  You can use a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.
        osd journal = /data/osd$id/journal

        journal dio = false
        osd journal = /dev/shm/journal$id
        osd journal size = 500
        filestore journal parallel = 1
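Note that the config sets "osd journal" twice; with ini-style parsing the later value presumably wins, so the journal actually in use here would be /dev/shm/journal$id. The "osd journal size = 500" value is in megabytes, which lines up with the "max_size 524288000" reported in the journal header log lines above:

```python
# osd journal size is given in MB (here MiB); the journal header in the
# log reports max_size in bytes.
osd_journal_size_mb = 500
max_size_bytes = osd_journal_size_mb * 1024 * 1024
print(max_size_bytes)  # 524288000, matching "max_size 524288000" in the log
```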

You can probably reproduce this by applying load and running the osd bench to fill up the journal.

Thanks!


Related issues

Related to Ceph - Bug #479: ceph/mount crash badly when writing Can't reproduce 10/13/2010

History

#1 Updated by Sage Weil over 8 years ago

  • Estimated time set to 2.00 h

#2 Updated by Sage Weil over 8 years ago

  • Target version set to v0.24

#3 Updated by Sage Weil over 8 years ago

  • Source changed from 0 to 3

#4 Updated by Sage Weil over 8 years ago

  • Assignee set to Sage Weil

#5 Updated by Sage Weil over 8 years ago

There are a bunch of problems here, not all of them related to a full journal.

#6 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved

Fixed as of 132f74c56064fdb3c47943679c48aa2a6b98f4eb, along with a number of other related issues in the I/O queuing and journaling code.
