Bug #598 (closed)
osd: journal reset in parallel mode acts weird
% Done: 0%
Description
from ML:
From jan.smets@alcatel-lucent.com Thu Nov 18 20:40:00 2010
Date: Fri, 19 Nov 2010 14:33:05 +0100
From: "Smets, Jan (Jan)" <jan.smets@alcatel-lucent.com>
To: Sage Weil <sage@newdream.net>
Subject: Journal reset

Hi

[root@client0 ceph]# dd if=/dev/zero of=bla bs=4k count=2500k
....

While doing this, I killed OSD1; the cluster was still online.

2010-11-19 13:47:28.220351 pg v651: 1584 pgs: 1584 active+clean; 607 MB data, 3407 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:41.308149 log 2010-11-19 13:56:41.106859 mon0 138.203.10.98:6789/0 40 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd0 138.203.10.98:6801/2855)
2010-11-19 13:56:41.577522 log 2010-11-19 13:56:41.401719 mon0 138.203.10.98:6789/0 41 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd3 138.203.10.99:6803/1690)
2010-11-19 13:56:41.974092 pg v652: 1584 pgs: 1584 active+clean; 607 MB data, 3103 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:42.338220 osd e10: 6 osds: 5 up, 6 in
2010-11-19 13:56:42.514285 log 2010-11-19 13:56:42.104776 mon0 138.203.10.98:6789/0 42 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd2 138.203.10.99:6801/1649)
2010-11-19 13:56:43.006958 pg v653: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.227246 pg v654: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.278054 osd e11: 6 osds: 5 up, 6 in
2010-11-19 13:56:44.081089 pg v655: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:44.081210 log 2010-11-19 13:56:43.000098 mon2 138.203.10.100:6789/0 18 : [WRN] lease_expire from mon0 was sent from future time 2010-11-19 13:56:43.074412 with expected time <=2010-11-19 13:56:43.010509, clocks not synchronized
2010-11-19 13:56:44.240413 osd e12: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.001293 pg v656: 1584 pgs: 1460 active+clean, 124 degraded+peering; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail; 13/336 degraded (3.869%)
2010-11-19 13:56:45.252438 osd e13: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.338111 pg v657: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:45.496352 pg v658: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:48.055595 pg v659: 1584 pgs: 1025 active+clean, 104 active+clean+degraded, 455 degraded+peering; 643 MB data, 2039 MB used, 1366 GB / 1380 GB avail; 60/354 degraded (16.949%)
2010-11-19 13:56:48.541752 pg v660: 1584 pgs: 1025 active+clean, 365 active+clean+degraded, 194 degraded+peering; 783 MB data, 2201 MB used, 1366 GB / 1380 GB avail; 73/424 degraded (17.217%)
2010-11-19 13:56:49.066588 pg v661: 1584 pgs: 1025 active+clean, 459 active+clean+degraded, 100 degraded+peering; 823 MB data, 2267 MB used, 1366 GB / 1380 GB avail; 78/444 degraded (17.568%)
2010-11-19 13:56:50.092107 pg v662: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 863 MB data, 2337 MB used, 1366 GB / 1380 GB avail; 80/464 degraded (17.241%)
2010-11-19 13:56:52.315789 pg v663: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 827 MB data, 2422 MB used, 1366 GB / 1380 GB avail; 84/478 degraded (17.573%)

So then I decided to start OSD1 again:

2010-11-19 13:58:53.971343 osd e14: 6 osds: 6 up, 6 in
2010-11-19 13:58:54.540376 pg v734: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5204 MB data, 8670 MB used, 1130 GB / 1148 GB avail; 473/2636 degraded (17.944%)
2010-11-19 13:58:54.971413 log 2010-11-19 13:58:53.685598 mon0 138.203.10.98:6789/0 44 : [INF] osd1 138.203.10.98:6803/3434 boot
2010-11-19 13:58:55.225719 osd e15: 6 osds: 6 up, 6 in
2010-11-19 13:58:55.451934 pg v735: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.036469 pg v736: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.328999 osd e16: 6 osds: 6 up, 6 in
2010-11-19 13:58:57.192782 pg v737: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:57.575131 pg v738: 1584 pgs: 62 active, 1222 active+clean, 300 active+clean+degraded; 5572 MB data, 9413 MB used, 1359 GB / 1380 GB avail; 484/2818 degraded (17.175%)
2010-11-19 13:58:58.241494 osd e17: 6 osds: 6 up, 6 in
2010-11-19 13:58:58.434765 pg v739: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:58:59.594457 pg v740: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:59:00.564642 pg v741: 1584 pgs: 95 active, 1325 active+clean, 164 active+clean+degraded; 5780 MB data, 9881 MB used, 1358 GB / 1380 GB avail; 478/2922 degraded (16.359%)
2010-11-19 13:59:01.149046 pg v742: 1584 pgs: 128 active, 1420 active+clean, 36 active+clean+degraded; 6100 MB data, 10518 MB used, 1358 GB / 1380 GB avail; 468/3082 degraded (15.185%)
2010-11-19 13:59:02.039833 osd e18: 6 osds: 6 up, 6 in
2010-11-19 13:59:02.258308 pg v743: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:02.658114 pg v744: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:03.169934 pg v745: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6224 MB data, 10836 MB used, 1357 GB / 1380 GB avail; 450/3144 degraded (14.313%)
2010-11-19 13:59:03.737821 pg v746: 1584 pgs: 135 active, 1449 active+clean; 6304 MB data, 10988 MB used, 1357 GB / 1380 GB avail; 466/3184 degraded (14.636%)
.......
2010-11-19 14:10:39.656792 pg v913: 1584 pgs: 33 active, 1551 active+clean; 10293 MB data, 19986 MB used, 1348 GB / 1380 GB avail; 142/5180 degraded (2.741%)

You can see it's slowly recovering, right? But after this, it just stayed at the same level (well, going down 0.003% every minute?). Anyway, if I look at the file which my dd is writing to, I can see the file growing very slowly. (Attaching strace to dd shows some writes now and then.)

[root@client0 ~]# while true; do du /mnt/ceph/bla ; sleep 10; done
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla

2010-11-19 13:59:49.396893 7f4823165710 osd1 18 pg[0.b6( v 18'62 (9'54,18'62] n=6 ec=2 les=15 14/14/14) [1,5] r=0 lcod 13'59 mlcod 9'54 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.513111 7f482a274710 journal check_for_full at 341794816 : JOURNAL FULL 341794816 >= 901119 (max_size 524288000 start 342695936)
2010-11-19 13:59:49.559325 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas
2010-11-19 13:59:49.630903 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.761690 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas
2010-11-19 13:59:49.952224 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 13:59:57.396808 7f4829272710 journal seq 136010 >= full_commit_seq 135949, prior journal contents are now fully committed. resetting journal.

Killing OSD1 again makes it go back to full speed.
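The "JOURNAL FULL" and "prior journal contents are now fully committed. resetting journal." messages above describe a ring-buffer write-ahead journal that stops accepting entries when it runs out of space, waits for the backing filestore to commit everything already journaled, and then resets to an empty ring. The following is a minimal illustrative sketch of that state machine, not Ceph's actual FileJournal code; the names (full_commit_seq, write_pos, start) are borrowed from the log vocabulary, and the byte accounting is simplified:

```python
class RingJournal:
    """Toy model of a fixed-size ring journal with a full -> commit -> reset cycle."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.write_pos = 0            # where the next entry would be appended
        self.start = 0                # oldest not-yet-committed entry (ring head)
        self.seq = 0                  # sequence number of the last journaled entry
        self.full = False
        self.full_commit_seq = None   # filestore must commit up to this seq

    def free_space(self):
        # Bytes available in the ring between write_pos and start.
        if self.write_pos >= self.start:
            return self.max_size - (self.write_pos - self.start)
        return self.start - self.write_pos

    def submit(self, size):
        """Try to append an entry; returns False while the journal is full."""
        if self.full:
            return False
        if self.free_space() < size:
            # "JOURNAL FULL": stop journaling until the filestore has
            # committed everything written so far.
            self.full = True
            self.full_commit_seq = self.seq
            return False
        self.seq += 1
        self.write_pos = (self.write_pos + size) % self.max_size
        return True

    def committed_thru(self, seq):
        """Filestore commit progress; may clear the full state and reset."""
        if self.full and seq >= self.full_commit_seq:
            # "prior journal contents are now fully committed. resetting journal."
            self.start = self.write_pos
            self.full = False
            self.full_commit_seq = None
```

In this model, a bug anywhere in the full/reset handshake (e.g. the reset firing while writers still depend on journaled-but-uncommitted entries, or writers never being woken after the reset) would show up exactly as the stall reported here: the journal keeps resetting while client I/O barely progresses.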
So I edited ceph.conf to enable OSD logging, but when starting again the cluster kept running just fine... So I did the "while true; do ./ceph -c ceph.conf osd tell 1 bench; sleep 10; done" with which I got it going down yesterday.

2010-11-19 14:21:14.798056 7f0c746ba710 osd1 24 pg[0.2c( v 18'3809 (9'3802,18'3809] n=5 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'3802 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:14.844190 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:14.844256 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:15.182259 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:15.182317 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:20.258272 7f0c7a7c7710 journal seq 138394 >= full_commit_seq 138361, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:21:20.573888 7f0c74ebb710 journal seq 138760 >= full_restart_seq 138760, restarting journal
2010-11-19 14:23:43.371102 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.386308 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:43.401429 7f0c76fc0710 journal throttle: waited for bytes
[... ~46 more identical "journal throttle: waited for bytes" lines between 14:23:43 and 14:23:49 elided ...]
2010-11-19 14:23:49.203257 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.214730 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.225352 7f0c7b7c9710 journal check_for_full at 453378048 : JOURNAL FULL 453378048 >= 3174399 (max_size 524288000 start 456552448)
2010-11-19 14:23:58.242549 7f0c7a7c7710 journal seq 141240 >= full_commit_seq 141234, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:24:47.126417 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=0).accept peer addr is really 138.203.10.101:0/2384373810 (socket is 138.203.10.101:32919/0)
2010-11-19 14:24:47.126482 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

So the problem is somewhere before/after the journal reset.

(More debug log, not sure if it's of any use...)

2010-11-19 14:28:42.548880 7fb98f9ae710 osd1 36 update_osd_stat osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.548898 7fb98f9ae710 osd1 36 heartbeat: stat(2010-11-19 14:28:42.548829 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.548912 7fb98f9ae710 osd1 36 heartbeat: osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.658113 7fb992ab5710 osd1 36 heartbeat_dispatch 0x34eea80
2010-11-19 14:28:42.658136 7fb992ab5710 osd1 36 handle_osd_ping from osd5 got stat stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.658154 7fb992ab5710 osd1 36 _share_map_incoming osd5 138.203.10.100:6804/1698 36
2010-11-19 14:28:42.658165 7fb992ab5710 osd1 36 take_peer_stat peer osd5 stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.665266 7fb996abd710 journal commit_finish thru 145188
2010-11-19 14:28:42.665280 7fb996abd710 journal committed_thru 145188 (last_committed_seq 145097)
2010-11-19 14:28:42.665285 7fb996abd710 journal seq 145188 >= full_commit_seq 145149, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:28:42.665291 7fb996abd710 journal header: block_size 4096 alignment 4096 max_size 524288000
2010-11-19 14:28:42.665303 7fb996abd710 journal header: start 42086400
2010-11-19 14:28:42.665308 7fb996abd710 journal write_pos 42086400
2010-11-19 14:28:42.665312 7fb996abd710 journal committed_thru done
2010-11-19 14:28:42.730313 7fb992ab5710 osd1 36 heartbeat_dispatch 0x3521540

ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)

[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /data/osd$id
        ; Ideally, make this a separate disk or partition. A few
        ; hundred MB should be enough; more if you have fast or many
        ; disks. You can use a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.
        osd journal = /data/osd$id/journal
        journal dio = false
        osd journal = /dev/shm/journal$id
        osd journal size = 500
        filestore journal parallel = 1

You can probably reproduce by applying load and doing the osd bench to fill up the journal. Thanks!
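The bursts of "journal throttle: waited for bytes" above come from a byte-budget throttle in front of the journal: a writer charges its entry's size against a fixed budget before queueing it, and blocks until completed journal writes return bytes to the budget. The sketch below is a deterministic single-threaded simulation of that idea (assumed behavior, not Ceph's actual Throttle class; in the real OSD the writer thread sleeps on a condition variable instead of draining the queue itself):

```python
class JournalThrottleSim:
    """Toy byte-budget throttle: submits block until in-flight bytes fit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = []   # sizes of queued-but-unfinished journal entries
        self.waits = 0        # each wait would log "journal throttle: waited for bytes"

    def submit(self, size):
        """Queue an entry; returns True if the caller had to wait for bytes."""
        waited = False
        while sum(self.in_flight) + size > self.max_bytes and self.in_flight:
            # Over budget: simulate waiting until the oldest in-flight
            # entry completes and returns its bytes to the budget.
            waited = True
            self.waits += 1
            self.in_flight.pop(0)
        self.in_flight.append(size)
        return waited
```

Under this model the symptom in the report is easy to state: if the journal stops completing writes around a full/reset cycle, the budget is never refilled, every submitter waits, and client I/O stalls even though the cluster otherwise looks healthy.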