Bug #598
osd: journal reset in parallel mode acts weird
Status: Closed
% Done: 0%
Spent time:
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
from ML:
From jan.smets@alcatel-lucent.com Thu Nov 18 20:40:00 2010
Date: Fri, 19 Nov 2010 14:33:05 +0100
From: "Smets, Jan (Jan)" <jan.smets@alcatel-lucent.com>
To: Sage Weil <sage@newdream.net>
Subject: Journal reset

Hi

[root@client0 ceph]# dd if=/dev/zero of=bla bs=4k count=2500k
....

While doing this, I killed OSD1, the cluster was still online.

2010-11-19 13:47:28.220351 pg v651: 1584 pgs: 1584 active+clean; 607 MB data, 3407 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:41.308149 log 2010-11-19 13:56:41.106859 mon0 138.203.10.98:6789/0 40 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd0 138.203.10.98:6801/2855)
2010-11-19 13:56:41.577522 log 2010-11-19 13:56:41.401719 mon0 138.203.10.98:6789/0 41 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd3 138.203.10.99:6803/1690)
2010-11-19 13:56:41.974092 pg v652: 1584 pgs: 1584 active+clean; 607 MB data, 3103 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:42.338220 osd e10: 6 osds: 5 up, 6 in
2010-11-19 13:56:42.514285 log 2010-11-19 13:56:42.104776 mon0 138.203.10.98:6789/0 42 : [INF] osd1 138.203.10.98:6803/2906 failed (by osd2 138.203.10.99:6801/1649)
2010-11-19 13:56:43.006958 pg v653: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.227246 pg v654: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:43.278054 osd e11: 6 osds: 5 up, 6 in
2010-11-19 13:56:44.081089 pg v655: 1584 pgs: 1584 active+clean; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail
2010-11-19 13:56:44.081210 log 2010-11-19 13:56:43.000098 mon2 138.203.10.100:6789/0 18 : [WRN] lease_expire from mon0 was sent from future time 2010-11-19 13:56:43.074412 with expected time <=2010-11-19 13:56:43.010509, clocks not synchronized
2010-11-19 13:56:44.240413 osd e12: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.001293 pg v656: 1584 pgs: 1460 active+clean, 124 degraded+peering; 607 MB data, 2827 MB used, 1365 GB / 1380 GB avail; 13/336 degraded (3.869%)
2010-11-19 13:56:45.252438 osd e13: 6 osds: 5 up, 6 in
2010-11-19 13:56:45.338111 pg v657: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:45.496352 pg v658: 1584 pgs: 1129 active+clean, 455 degraded+peering; 607 MB data, 2299 MB used, 1366 GB / 1380 GB avail; 49/336 degraded (14.583%)
2010-11-19 13:56:48.055595 pg v659: 1584 pgs: 1025 active+clean, 104 active+clean+degraded, 455 degraded+peering; 643 MB data, 2039 MB used, 1366 GB / 1380 GB avail; 60/354 degraded (16.949%)
2010-11-19 13:56:48.541752 pg v660: 1584 pgs: 1025 active+clean, 365 active+clean+degraded, 194 degraded+peering; 783 MB data, 2201 MB used, 1366 GB / 1380 GB avail; 73/424 degraded (17.217%)
2010-11-19 13:56:49.066588 pg v661: 1584 pgs: 1025 active+clean, 459 active+clean+degraded, 100 degraded+peering; 823 MB data, 2267 MB used, 1366 GB / 1380 GB avail; 78/444 degraded (17.568%)
2010-11-19 13:56:50.092107 pg v662: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 863 MB data, 2337 MB used, 1366 GB / 1380 GB avail; 80/464 degraded (17.241%)
2010-11-19 13:56:52.315789 pg v663: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 827 MB data, 2422 MB used, 1366 GB / 1380 GB avail; 84/478 degraded (17.573%)

So then I decided to start OSD1 again

2010-11-19 13:58:53.971343 osd e14: 6 osds: 6 up, 6 in
2010-11-19 13:58:54.540376 pg v734: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5204 MB data, 8670 MB used, 1130 GB / 1148 GB avail; 473/2636 degraded (17.944%)
2010-11-19 13:58:54.971413 log 2010-11-19 13:58:53.685598 mon0 138.203.10.98:6789/0 44 : [INF] osd1 138.203.10.98:6803/3434 boot
2010-11-19 13:58:55.225719 osd e15: 6 osds: 6 up, 6 in
2010-11-19 13:58:55.451934 pg v735: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.036469 pg v736: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:56.328999 osd e16: 6 osds: 6 up, 6 in
2010-11-19 13:58:57.192782 pg v737: 1584 pgs: 1025 active+clean, 559 active+clean+degraded; 5468 MB data, 9073 MB used, 1129 GB / 1148 GB avail; 502/2766 degraded (18.149%)
2010-11-19 13:58:57.575131 pg v738: 1584 pgs: 62 active, 1222 active+clean, 300 active+clean+degraded; 5572 MB data, 9413 MB used, 1359 GB / 1380 GB avail; 484/2818 degraded (17.175%)
2010-11-19 13:58:58.241494 osd e17: 6 osds: 6 up, 6 in
2010-11-19 13:58:58.434765 pg v739: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:58:59.594457 pg v740: 1584 pgs: 76 active, 1264 active+clean, 244 active+clean+degraded; 5628 MB data, 9539 MB used, 1359 GB / 1380 GB avail; 481/2846 degraded (16.901%)
2010-11-19 13:59:00.564642 pg v741: 1584 pgs: 95 active, 1325 active+clean, 164 active+clean+degraded; 5780 MB data, 9881 MB used, 1358 GB / 1380 GB avail; 478/2922 degraded (16.359%)
2010-11-19 13:59:01.149046 pg v742: 1584 pgs: 128 active, 1420 active+clean, 36 active+clean+degraded; 6100 MB data, 10518 MB used, 1358 GB / 1380 GB avail; 468/3082 degraded (15.185%)
2010-11-19 13:59:02.039833 osd e18: 6 osds: 6 up, 6 in
2010-11-19 13:59:02.258308 pg v743: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:02.658114 pg v744: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6144 MB data, 10607 MB used, 1358 GB / 1380 GB avail; 450/3104 degraded (14.497%)
2010-11-19 13:59:03.169934 pg v745: 1584 pgs: 128 active, 1420 active+clean, 36 peering; 6224 MB data, 10836 MB used, 1357 GB / 1380 GB avail; 450/3144 degraded (14.313%)
2010-11-19 13:59:03.737821 pg v746: 1584 pgs: 135 active, 1449 active+clean; 6304 MB data, 10988 MB used, 1357 GB / 1380 GB avail; 466/3184 degraded (14.636%)
.......
2010-11-19 14:10:39.656792 pg v913: 1584 pgs: 33 active, 1551 active+clean; 10293 MB data, 19986 MB used, 1348 GB / 1380 GB avail; 142/5180 degraded (2.741%)

You can see it's slowly recovering, right? But after this, it just stayed at the same level (well, going down 0.003% every minute?)

Anyway, if I look at the file which my DD is writing to, I can see the file growing very slowly. (attaching strace to dd show some writes now and then)

[root@client0 ~]# while true; do du /mnt/ceph/bla ; sleep 10; done
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10197548 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla
10205520 /mnt/ceph/bla

2010-11-19 13:59:49.396893 7f4823165710 osd1 18 pg[0.b6( v 18'62 (9'54,18'62] n=6 ec=2 les=15 14/14/14) [1,5] r=0 lcod 13'59 mlcod 9'54 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.513111 7f482a274710 journal check_for_full at 341794816 : JOURNAL FULL 341794816 >= 901119 (max_size 524288000 start 342695936)
2010-11-19 13:59:49.559325 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas
2010-11-19 13:59:49.630903 7f4823165710 osd1 18 pg[0.c2( v 18'1678 (9'1658,18'1678] n=7 ec=2 les=15 14/14/14) [1,0] r=0 lcod 13'1675 mlcod 9'1658 active] recover_replicas - nothing to do!
2010-11-19 13:59:49.761690 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas
2010-11-19 13:59:49.952224 7f4823165710 osd1 18 pg[0.b7( v 13'2906 (9'2888,13'2906] n=6 ec=2 les=15 14/14/14) [1,2] r=0 lcod 13'2904 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 13:59:57.396808 7f4829272710 journal seq 136010 >= full_commit_seq 135949, prior journal contents are now fully committed. resetting journal.

Killing OSD1 again makes it going back to full speed. So I edited ceph.conf to enable OSD logging, but when starting again the cluster kept running just fine... So did the while true; do ./ceph -c ceph.conf osd tell 1 bench; sleep 10; done with what I got it going down yesterday.

2010-11-19 14:21:14.798056 7f0c746ba710 osd1 24 pg[0.2c( v 18'3809 (9'3802,18'3809] n=5 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'3802 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:14.844190 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:14.844256 7f0c746ba710 osd1 24 pg[0.4e( v 20'20 (9'13,20'20] n=4 ec=2 les=24 23/23/23) [1,4] r=0 lcod 13'16 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:15.182259 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas
2010-11-19 14:21:15.182317 7f0c746ba710 osd1 24 pg[0.38( v 13'907 (9'900,13'907] n=3 ec=2 les=24 23/23/23) [1,4] r=0 lcod 9'902 mlcod 0'0 active] recover_replicas - nothing to do!
2010-11-19 14:21:20.258272 7f0c7a7c7710 journal seq 138394 >= full_commit_seq 138361, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:21:20.573888 7f0c74ebb710 journal seq 138760 >= full_restart_seq 138760, restarting journal 2010-11-19 14:23:43.371102 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.386308 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.401429 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.420130 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.441343 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.467342 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.484976 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.503794 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.522736 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.535576 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.560202 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.574383 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.585959 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.598890 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.610328 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.623691 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.639286 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.660119 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.679829 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.691648 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:43.702549 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.730357 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.742713 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.754155 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.765106 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.776403 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.789113 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.800176 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.811974 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.823539 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.835310 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.846132 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.857955 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.869687 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.881021 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.892952 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.904863 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.916889 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.929281 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.941637 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.953639 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.965723 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.977537 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:48.989106 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:49.002811 7f0c76fc0710 journal throttle: waited for bytes 2010-11-19 14:23:49.014843 7f0c76fc0710 journal throttle: waited for 
bytes
2010-11-19 14:23:49.026917 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.038535 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.050636 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.191378 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.203257 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.214730 7f0c76fc0710 journal throttle: waited for bytes
2010-11-19 14:23:49.225352 7f0c7b7c9710 journal check_for_full at 453378048 : JOURNAL FULL 453378048 >= 3174399 (max_size 524288000 start 456552448)
2010-11-19 14:23:58.242549 7f0c7a7c7710 journal seq 141240 >= full_commit_seq 141234, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:24:47.126417 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=0).accept peer addr is really 138.203.10.101:0/2384373810 (socket is 138.203.10.101:32919/0)
2010-11-19 14:24:47.126482 7f0c70e9e710 -- 138.203.10.98:6803/3608 >> 138.203.10.101:0/2384373810 pipe(0x34d4280 sd=25 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

So the problem is somewhere before/after the journal reset.

(more debug log, not sure if it's of any use...)

2010-11-19 14:28:42.548880 7fb98f9ae710 osd1 36 update_osd_stat osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.548898 7fb98f9ae710 osd1 36 heartbeat: stat(2010-11-19 14:28:42.548829 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.548912 7fb98f9ae710 osd1 36 heartbeat: osd_stat(6459 MB used, 223 GB avail, 231 GB total, peers [0,2,3,4,5]/[0,2,3,4,5])
2010-11-19 14:28:42.658113 7fb992ab5710 osd1 36 heartbeat_dispatch 0x34eea80
2010-11-19 14:28:42.658136 7fb992ab5710 osd1 36 handle_osd_ping from osd5 got stat stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.658154 7fb992ab5710 osd1 36 _share_map_incoming osd5 138.203.10.100:6804/1698 36
2010-11-19 14:28:42.658165 7fb992ab5710 osd1 36 take_peer_stat peer osd5 stat(2010-11-19 14:28:42.611097 oprate=1.31855 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-19 14:28:42.665266 7fb996abd710 journal commit_finish thru 145188
2010-11-19 14:28:42.665280 7fb996abd710 journal committed_thru 145188 (last_committed_seq 145097)
2010-11-19 14:28:42.665285 7fb996abd710 journal seq 145188 >= full_commit_seq 145149, prior journal contents are now fully committed. resetting journal.
2010-11-19 14:28:42.665291 7fb996abd710 journal header: block_size 4096 alignment 4096 max_size 524288000
2010-11-19 14:28:42.665303 7fb996abd710 journal header: start 42086400
2010-11-19 14:28:42.665308 7fb996abd710 journal write_pos 42086400
2010-11-19 14:28:42.665312 7fb996abd710 journal committed_thru done
2010-11-19 14:28:42.730313 7fb992ab5710 osd1 36 heartbeat_dispatch 0x3521540

ceph version 0.23 (commit:5d1d8d0c4602be9819cc9f7aea562fccbb005a56)

[osd]
; This is where the btrfs volume will be mounted.
osd data = /data/osd$id
; Ideally, make this a separate disk or partition. A few
; hundred MB should be enough; more if you have fast or many
; disks. You can use a file under the osd data dir if need be
; (e.g. /data/osd$id/journal), but it will be slower than a
; separate disk or partition.
osd journal = /data/osd$id/journal
journal dio = false
osd journal = /dev/shm/journal$id
osd journal size = 500
filestore journal parallel = 1

You can probably reproduce by applying load and doing the osd bench to fill up the journal.

Thanks!
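Editor's note: to make the cycle in those journal messages easier to follow, below is a minimal toy model of the full-journal handling they describe: check_for_full declares JOURNAL FULL, the filestore keeps committing until it passes full_commit_seq (at which point the journal is reset), and journaling resumes once operations reach full_restart_seq. This is only an illustration assembled from the log text above, not the actual Ceph FileJournal code; the class, its members, and the small simulation in main() are invented for the example, and only the seq / full_commit_seq / full_restart_seq terminology comes from the log.

// Illustrative model only -- NOT the Ceph FileJournal implementation.
// The full_commit_seq / full_restart_seq transitions are inferred from the
// log messages in this report; sizes and the driver loop are made up.
#include <cstdint>
#include <iostream>

class ToyJournal {
public:
  explicit ToyJournal(uint64_t max_size) : max_size_(max_size) {}

  // Called before journaling an entry of `len` bytes with sequence `seq`.
  // Returns false while the journal is "full" and the entry is not journaled.
  bool submit_entry(uint64_t seq, uint64_t len) {
    if (full_) return false;                   // journaling suspended
    if (used_ + len > max_size_) {             // analogous to check_for_full
      std::cout << "JOURNAL FULL at seq " << seq << "\n";
      full_ = true;
      full_commit_seq_ = last_journaled_seq_;  // last seq that made it in
      full_restart_seq_ = seq;                 // first seq not journaled
      return false;
    }
    used_ += len;
    last_journaled_seq_ = seq;
    return true;
  }

  // Called as the backing filestore reports commits (committed_thru).
  void committed_thru(uint64_t seq) {
    if (full_ && !reset_pending_ && seq >= full_commit_seq_) {
      std::cout << "seq " << seq << " >= full_commit_seq " << full_commit_seq_
                << ", prior journal contents are now fully committed."
                   " resetting journal.\n";
      used_ = 0;              // journal space can be reclaimed
      reset_pending_ = true;  // wait for ops to reach full_restart_seq
    }
  }

  // Called as operations are applied; once we reach full_restart_seq the
  // journal starts accepting entries again ("restarting journal").
  void applied_seq(uint64_t seq) {
    if (full_ && reset_pending_ && seq >= full_restart_seq_) {
      std::cout << "seq " << seq << " >= full_restart_seq "
                << full_restart_seq_ << ", restarting journal\n";
      full_ = false;
      reset_pending_ = false;
    }
  }

private:
  uint64_t max_size_;
  uint64_t used_ = 0;
  uint64_t last_journaled_seq_ = 0;
  uint64_t full_commit_seq_ = 0;
  uint64_t full_restart_seq_ = 0;
  bool full_ = false;
  bool reset_pending_ = false;
};

int main() {
  ToyJournal j(10);  // tiny journal so it fills quickly
  for (uint64_t seq = 1; seq <= 20; ++seq) {
    j.submit_entry(seq, 1);                        // queue work
    if (seq % 7 == 0) j.committed_thru(seq - 1);   // periodic filestore commit
    j.applied_seq(seq);
  }
}

Whatever the real implementation details, the report suggests the transitions in and out of this full state were not behaving correctly in parallel journaling mode, which would match the stalls the reporter sees around each reset.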
Updated by Sage Weil over 13 years ago
bunch of problems here, not all related to a full journal.
Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
fixed as of 132f74c56064fdb3c47943679c48aa2a6b98f4eb, along with a number of other related issues in the io queuing and journaling code.