Bug #598 (closed)
osd: journal reset in parallel mode acts weird
% Done: 0%
Description
from ML:
From jan.smets@alcatel-lucent.com Thu Nov 18 20:40:00 2010
Date: Fri, 19 Nov 2010 14:33:05 +0100
From: "Smets, Jan (Jan)" <jan.smets@alcatel-lucent.com>
To: Sage Weil <sage@newdream.net>
Subject: Journal reset

Hi

[root@client0 ceph]# dd if=/dev/zero of=bla bs=4k count=2500k
....

While doing this, I killed OSD1; the cluster was still online.

2010-11-19 13:47:28.220351 pg v651: 1584 pgs: 1584 active+clean; 607 MB data, 3407 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:41.308149 log 2010-11-19 13:56:41.106859 mon0 138.203.10.98:6789/0 40 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd0 138.203.10.98:6801/2855)
2010-11-19 13:56:41.577522 log 2010-11-19 13:56:41.401719 mon0 138.203.10.98:6789/0 41 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd3 138.203.10.99:6803/1690)
2010-11-19 13:56:41.974092 pg v652: 1584 pgs: 1584 active+clean; 607 MB data, 3103 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:42.338220 osd e10: 6 osds: 5 up, 6 in
2010-11-19 13:56:42.514285 log 2010-11-19 13:56:42.104776 mon0 138.203.10.98:6789/0 42 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd2 138.203.10.99:6801/1649)
2010-11-19 13:56:43.006958 pg v653: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.227246 pg v654: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.278054 osd e11: 6 osds: 5 up, 6 in
2010-11-19 13:56:44.081089 pg v655: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:44.081210 log 2010-11-19 13:56:43.000098 mon2 138.203.10.100:6789/0 18 : [WRN] lease_expire from mon0 was sent from future time 2010-11-19 13:56:43.074412 with expected time <=2010-11-19 13:56:43.010509, clocks not synchronized
2010-11-19 13:56:44.240413 osd e12: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.001293 pg v656: 1584 pgs: 1460 active+clean, 124 degraded+peering; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail; 13/336 degraded (3.869%)
2010-11-19 13:56:45.252438 osd e13: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.338111 pg v657: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:45.496352 pg v658: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:48.055595 pg v659: 1584 pgs: 1025 active+clean, 104 active+clean+degraded, 455 degraded+peering; 643 MB data, 2039 MB used, 1366 GB / 1380 GB avail; 60/354 degraded (16.949%)
2010-11-19 13:56:48.541752 pg v660: 1584 pgs: 1025 active+clean, 365 active+clean+degraded, 194 degraded+peering; 783 MB data, 2201 MB used, 1366 GB / 1380 GB avail; 73/424 degraded (17.217%)
2010-11-19 13:56:49.066588 pg v661: 1584 pgs: 1025 active+clean, 459 active+clean+degraded, 100 degraded+peering; 823 MB data, 2267 MB used, 1366 GB / 1380 GB avail; 78/444 degraded (17.568%)
2010-11-19 13:56:50.092107 pg v662: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 863 MB data, 2337 MB used, 1366 GB / 1380 GB avail; 80/464 degraded (17.241%)
2010-11-19 13:56:52.315789 pg v663: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 827 MB data, 2422 MB used, 1366 GB / 1380 GB avail; 84/478 degraded (17.573%)

So then I decided to start OSD1 again:

2010-11-19 13:58:53.971343 osd e14: 6 osds: 6 up, 6 in
2010-11-19 13:58:54.540376 pg v734: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5204 MB data, 8670 MB used, 1130 GB / 1148 GB avail; 473/2636 degraded (17.944%)
2010-11-19 13:58:54.971413 log 2010-11-19 13:58:53.685598 mon0 138.203.10.98:6789/0 44 : [INF] osd1 138.203.10.98:6803/3434 boot
2010-11-19 13:58:55.225719 osd e15: 6 osds: 6 up, 6 in
2010-11-19 13:58:55.451934 pg v735: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.036469 pg v736: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.328999 osd e16: 6 osds: 6 up, 6 in
2010-11-19 13:58:57.192782 pg v737: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:57.575131 pg v738: 1584 pgs: 62 active, 1222 active+clean, 300 active+clean+degraded; 5572 MB data, 9413 MB used, 1359 GB / 1380 GB avail; 484/2818 degraded (17.175%)
2010-11-19 13:58:58.241494 osd e17: 6 osds: 6 up, 6 in
2010-11-19 13:58:58.434765 pg v739: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:58:59.594457 pg v740: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:59:00.564642 pg v741: 1584 pgs: 95 active, 1325 active+clean, 164 active+clean+degraded; 5780 MB data, 9881 MB used, 1358 GB / 1380 GB avail; 478/2922 degraded (16.359%)
2010-11-19 13:59:01.149046 pg v742: 1584 pgs: 128 active, 1420 active+clean, 36 active+clean+degraded; 6100 MB data, 10518 MB used, 1358 GB / 1380 GB avail; 468/3082 degraded (15.185%)
2010-11-19 13:59:02.039833 osd e18: 6 osds: 6 up, 6 in
2010-11-19 13:59:02.258308 pg v743: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:02.658114 pg v744: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:03.169934 pg v745: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6224 MB data, 10836 MB used, 1357 GB / 1380 GB avail; 450/3144 degraded (14.313%)
2010-11-19 13:59:03.737821 pg v746: 1584 pgs: 135 active, 1449 active+clean; 6304 MB data, 10988 MB used, 1357 GB / 1380 GB avail; 466/3184 degraded (14.636%)
.......
2010-11-19 14:10:39.656792 pg v913: 1584 pgs: 33 active, 1551 active+clean; 10293 MB data, 19986 MB used, 1348 GB / 1380 GB avail; 142/5180 degraded (2.741%)

You can see it's slowly recovering, right? But after this, it just stayed at the same level (well, going down 0.003% every minute?). Anyway, if I look at the file which my dd is writing to, I can see the file growing very slowly. (Attaching strace to dd shows some writes now and then.)

[root@client0 ~]# while true; do du /mnt/ceph/bla ; sleep 10; done
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla

2010-11-19 13:59:49.396893 7f4823165710 osd1 18 pg[0.b6( v 18'62 (9'54,18'62] n=6 ec=2 les=15 14/14/14) [1,5] r=0 lcod 13'59 mlcod 9'54 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.513111 7f482a274710 journal check_for_full at 341794816 : JOURNAL FULL 341794816 >= 901119 (max_size 524288000 start 342695936)
2010-11-19 13:59:49.559325 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas
2010-11-19 13:59:49.630903 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.761690 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas
2010-11-19 13:59:49.952224 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 13:59:57.396808 7f4829272710 journal seq 136010 >= full_commit_seq 135949, prior journal contents are now fully committed. resetting journal.

Killing OSD1 again makes it go back to full speed.
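The "JOURNAL FULL" and "prior journal contents are now fully committed. resetting journal." messages above describe a ring-buffer write-ahead journal that stops accepting entries when it runs out of space, waits for the backing filestore to commit everything already journaled, and then resets to an empty ring. The following is a minimal illustrative sketch of that state machine, not Ceph's actual FileJournal code; the names (full_commit_seq, write_pos, start) are borrowed from the log vocabulary, and the byte accounting is simplified:

```python
class RingJournal:
    """Toy model of a fixed-size ring journal with a full -> commit -> reset cycle."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.write_pos = 0            # where the next entry would be appended
        self.start = 0                # oldest not-yet-committed entry (ring head)
        self.seq = 0                  # sequence number of the last journaled entry
        self.full = False
        self.full_commit_seq = None   # filestore must commit up to this seq

    def free_space(self):
        # Bytes available in the ring between write_pos and start.
        if self.write_pos >= self.start:
            return self.max_size - (self.write_pos - self.start)
        return self.start - self.write_pos

    def submit(self, size):
        """Try to append an entry; returns False while the journal is full."""
        if self.full:
            return False
        if self.free_space() < size:
            # "JOURNAL FULL": stop journaling until the filestore has
            # committed everything written so far.
            self.full = True
            self.full_commit_seq = self.seq
            return False
        self.seq += 1
        self.write_pos = (self.write_pos + size) % self.max_size
        return True

    def committed_thru(self, seq):
        """Filestore commit progress; may clear the full state and reset."""
        if self.full and seq >= self.full_commit_seq:
            # "prior journal contents are now fully committed. resetting journal."
            self.start = self.write_pos
            self.full = False
            self.full_commit_seq = None
```

In this model, a bug anywhere in the full/reset handshake (e.g. the reset firing while writers still depend on journaled-but-uncommitted entries, or writers never being woken after the reset) would show up exactly as the stall reported here: the journal keeps resetting while client I/O barely progresses.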
So I edited ceph.conf to enable OSD logging, but when starting again the cluster kept running just fine... So I did the "while true; do ./ceph -c ceph.conf osd tell 1 bench; sleep 10; done" with which I got it going down yesterday.

2010-11-19 14:21:14.798056 7f0c746ba710 osd1 24 pg[0.2c( v 18'3809 (9'3802,18'3809] n=5 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'3802 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:14.844190 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:14.844256 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:15.182259 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:15.182317 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:20.258272 7f0c7a7c7710 journal seq 138394 >= full_commit_seq 138361, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:21:20.573888 7f0c74ebb710 journal seq 138760 >= full_restart_seq 138760, restarting journal
2010-11-19 14:23:43.371102 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.386308 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.401429 7f0c76fc0710 journal throttle: waited for bytes
[... ~46 more identical "journal throttle: waited for bytes" lines between 14:23:43 and 14:23:49 elided ...]
2010-11-19 14:23:49.203257 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.214730 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.225352 7f0c7b7c9710 journal check_for_full at 453378048 : JOURNAL FULL 453378048 >= 3174399 (max_size 524288000 start 456552448)
2010-11-19 14:23:58.242549 7f0c7a7c7710 journal seq 141240 >= full_commit_seq 141234, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:24:47.126417 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=0).accept peer addr is really 138.203.10.101:0/2384373810 (socket is 138.203.10.101:32919/0)
2010-11-19 14:24:47.126482 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

So the problem is somewhere before/after the journal reset.

(More debug log, not sure if it's of any use...)

2010-11-19 14:28:42.548880 7fb98f9ae710 osd1 36 update_osd_stat osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.548898 7fb98f9ae710 osd1 36 heartbeat: stat(2010-11-19 14:28:42.548829 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.548912 7fb98f9ae710 osd1 36 heartbeat: osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.658113 7fb992ab5710 osd1 36 heartbeat_dispatch 0x34eea80
2010-11-19 14:28:42.658136 7fb992ab5710 osd1 36 handle_osd_ping from osd5 got stat stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.658154 7fb992ab5710 osd1 36 _share_map_incoming osd5 138.203.10.100:6804/1698 36
2010-11-19 14:28:42.658165 7fb992ab5710 osd1 36 take_peer_stat peer osd5 stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.665266 7fb996abd710 journal commit_finish thru 145188
2010-11-19 14:28:42.665280 7fb996abd710 journal committed_thru 145188 (last_committed_seq 145097)
2010-11-19 14:28:42.665285 7fb996abd710 journal seq 145188 >= full_commit_seq 145149, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:28:42.665291 7fb996abd710 journal header: block_size 4096 alignment 4096 max_size 524288000
2010-11-19 14:28:42.665303 7fb996abd710 journal header: start 42086400
2010-11-19 14:28:42.665308 7fb996abd710 journal write_pos 42086400
2010-11-19 14:28:42.665312 7fb996abd710 journal committed_thru done
2010-11-19 14:28:42.730313 7fb992ab5710 osd1 36 heartbeat_dispatch 0x3521540

ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)

[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /data/osd$id
        ; Ideally, make this a separate disk or partition. A few
        ; hundred MB should be enough; more if you have fast or many
        ; disks. You can use a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.
        osd journal = /data/osd$id/journal
        journal dio = false
        osd journal = /dev/shm/journal$id
        osd journal size = 500
        filestore journal parallel = 1

You can probably reproduce by applying load and doing the osd bench to fill up the journal. Thanks!
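The bursts of "journal throttle: waited for bytes" above come from a byte-budget throttle in front of the journal: a writer charges its entry's size against a fixed budget before queueing it, and blocks until completed journal writes return bytes to the budget. The sketch below is a deterministic single-threaded simulation of that idea (assumed behavior, not Ceph's actual Throttle class; in the real OSD the writer thread sleeps on a condition variable instead of draining the queue itself):

```python
class JournalThrottleSim:
    """Toy byte-budget throttle: submits block until in-flight bytes fit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = []   # sizes of queued-but-unfinished journal entries
        self.waits = 0        # each wait would log "journal throttle: waited for bytes"

    def submit(self, size):
        """Queue an entry; returns True if the caller had to wait for bytes."""
        waited = False
        while sum(self.in_flight) + size > self.max_bytes and self.in_flight:
            # Over budget: simulate waiting until the oldest in-flight
            # entry completes and returns its bytes to the budget.
            waited = True
            self.waits += 1
            self.in_flight.pop(0)
        self.in_flight.append(size)
        return waited
```

Under this model the symptom in the report is easy to state: if the journal stops completing writes around a full/reset cycle, the budget is never refilled, every submitter waits, and client I/O stalls even though the cluster otherwise looks healthy.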