Bug #906
closed
clustered mds: lchown not setting uid/gid
Added by Anonymous about 13 years ago.
Updated almost 13 years ago.
Description
This is from autotest ceph_pjd_fstest, job 257.
Saw a failure on the client node:
http://autotest.ceph.newdream.net/results/257-tv/group0/sepia89.ceph.dreamhost.com/status
Test Summary Report
-------------------
/usr/local/autotest/tests/pjd_fstest/src/tests/chmod/00.t (Wstat: 0 Tests: 58 Failed: 2)
Failed tests: 27, 31
/usr/local/autotest/tests/pjd_fstest/src/tests/chown/00.t (Wstat: 0 Tests: 171 Failed: 7)
Failed tests: 97, 102, 112, 135-137, 153
Files=184, Tests=1957, 199 wallclock secs ( 1.05 usr 0.27 sys + 0.40 cusr 0.86 csys = 2.58 CPU)
Result: FAIL
- debugging chmod/00.t test 27
- test code
expect 0 create ${n0} 0644
ctime1=`${fstest} stat ${n0} ctime`
sleep 1
expect 0 chmod ${n0} 0111
ctime2=`${fstest} stat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion: chmod does not update ctime
- debugging chmod/00.t test 31
- test code
expect 0 mkdir ${n0} 0755
ctime1=`${fstest} stat ${n0} ctime`
sleep 1
expect 0 chmod ${n0} 0753
ctime2=`${fstest} stat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion: same as test 27, for directories
- debugging chown/00.t test 97, line 209
- test code
expect 0 create ${n0} 0644
ctime1=`${fstest} stat ${n0} ctime`
sleep 1
expect 0 chown ${n0} 65534 65533
expect 65534,65533 lstat ${n0} uid,gid
ctime2=`${fstest} stat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion: chown does not update ctime
- debugging chown/00.t test 102, line 218
- test code
expect 0 mkdir ${n0} 0755
ctime1=`${fstest} stat ${n0} ctime`
sleep 1
expect 0 chown ${n0} 65534 65533
expect 65534,65533 lstat ${n0} uid,gid
ctime2=`${fstest} stat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion: same as test 97, for directories
- debugging chown/00.t test 112, line 236
- test code
expect 0 symlink ${n1} ${n0}
ctime1=`${fstest} lstat ${n0} ctime`
sleep 1
expect 0 lchown ${n0} 65534 65533
expect 65534,65533 lstat ${n0} uid,gid
ctime2=`${fstest} lstat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion: lchown does not update ctime of the symlink
- debugging chown/00.t test 135-137, line 274-
- test code
expect 0 symlink ${n1} ${n0}
expect 0 lchown ${n0} 65534 65533
ctime1=`${fstest} lstat ${n0} ctime`
sleep 1
expect 0 -u 65534 -g 65532 lchown ${n0} 65534 65532
expect 65534,65532 lstat ${n0} uid,gid
ctime2=`${fstest} lstat ${n0} ctime`
test_check $ctime1 -lt $ctime2
- TODO conclusion 1: lchown does not change user/group of the symlink?
- TODO conclusion 2: lchown does not update ctime of the symlink
- debugging chown/00.t test 153, line 330-
- test code
expect 0 symlink ${n1} ${n0}
ctime1=`${fstest} lstat ${n0} ctime`
sleep 1
expect 0 -- lchown ${n0} -1 -1
ctime2=`${fstest} lstat ${n0} ctime`
case "${os}:${fs}" in
Linux:ext3)
test_check $ctime1 -lt $ctime2
;;
*)
test_check $ctime1 -eq $ctime2
;;
esac
- TODO conclusion: lchown does not update ctime of the symlink?
It's possible this is correct, but have we checked that client and server times match? Otherwise this is probably a duplicate of #854.
Hmm. Looks like that explains the ctimes.
[0 tv@dreamer ~]$ dsh -m ubuntu@sepia14.ceph.dreamhost.com,ubuntu@sepia84.ceph.dreamhost.com,ubuntu@sepia85.ceph.dreamhost.com,ubuntu@sepia86.ceph.dreamhost.com,ubuntu@sepia88.ceph.dreamhost.com,ubuntu@sepia89.ceph.dreamhost.com date
ubuntu@sepia14.ceph.dreamhost.com: Mon Mar 21 14:16:09 PDT 2011
ubuntu@sepia84.ceph.dreamhost.com: Mon Mar 21 14:15:41 PDT 2011
ubuntu@sepia85.ceph.dreamhost.com: Mon Mar 21 14:15:39 PDT 2011
ubuntu@sepia86.ceph.dreamhost.com: Mon Mar 21 14:15:39 PDT 2011
ubuntu@sepia88.ceph.dreamhost.com: Mon Mar 21 14:15:21 PDT 2011
ubuntu@sepia89.ceph.dreamhost.com: Mon Mar 21 14:15:20 PDT 2011
[0 tv@dreamer ~]$
We need to get NTP or something similar running on all the test machines.
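The date output above puts sepia14 and sepia89 roughly 49 seconds apart. As a rough sketch of how that spread could be checked before a test run, the helper below reads one epoch timestamp per line and prints the difference between the largest and the smallest. The function name max_skew and the commented usage loop (including the HOSTS variable and ssh access) are illustrative assumptions, not part of autotest.

```shell
#!/bin/sh
# max_skew: read epoch seconds (one per line) on stdin and print
# the difference between the largest and smallest value.
max_skew() {
    awk 'NR == 1 { min = $1 + 0; max = $1 + 0 }
         ($1 + 0) < min { min = $1 + 0 }
         ($1 + 0) > max { max = $1 + 0 }
         END { if (NR > 0) print max - min }'
}

# Hypothetical usage; HOSTS and passwordless ssh are assumptions:
# for h in $HOSTS; do ssh "$h" date +%s; done | max_skew
```

Because the hosts are sampled one after another, the result includes ssh latency, so it is only a coarse indicator. A spread of more than a second or so would already be enough to upset the ctime comparisons, since the tests sleep for just one second between stat calls.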
That still doesn't explain this failing:
expect 65534,65532 lstat ${n0} uid,gid
The tests use the kclient by default now. I can rerun with cfuse if that helps.
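To look at the uid/gid failure outside the pjd_fstest harness, something along these lines might help. It is a minimal sketch that creates a symlink, changes its ownership without following it, and prints the resulting uid,gid of the link itself. The function name lchown_check and the link name are made up for illustration; it assumes GNU coreutils (stat -c), and it does not switch credentials the way the "-u 65534 -g 65532" part of the failing test does.

```shell
#!/bin/sh
# lchown_check DIR UID GID
# Creates a dangling symlink in DIR, changes its owner and group without
# following the link (chown -h is the command-line counterpart of lchown),
# then prints the link's own "uid,gid".
# Assumes GNU coreutils stat (for -c); the link name is arbitrary.
lchown_check() {
    dir=$1 uid=$2 gid=$3
    link="$dir/lchown_check_link.$$"
    ln -s "nonexistent-target" "$link" || return 1
    chown -h "$uid:$gid" "$link" || { rm -f "$link"; return 1; }
    stat -c '%u,%g' "$link"
    rm -f "$link"
}
```

Run as root against a directory on the Ceph mount with 65534 and 65532 as arguments, the expected output would be 65534,65532, which is the value the failing lstat check compares against.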
- Target version set to v0.27
- Subject changed from ctime not updated, lchown not working to lchown not setting uid/gid
- Story points set to 2
Re-running as job 409, clocks are in decent sync:
$ dsh -m ubuntu@sepia17.ceph.dreamhost.com,ubuntu@sepia20.ceph.dreamhost.com,ubuntu@sepia21.ceph.dreamhost.com,ubuntu@sepia23.ceph.dreamhost.com,ubuntu@sepia13.ceph.dreamhost.com,ubuntu@sepia14.ceph.dreamhost.com date
ubuntu@sepia17.ceph.dreamhost.com: Mon Apr 11 13:18:59 PDT 2011
ubuntu@sepia20.ceph.dreamhost.com: Mon Apr 11 13:19:00 PDT 2011
ubuntu@sepia21.ceph.dreamhost.com: Mon Apr 11 13:19:00 PDT 2011
ubuntu@sepia23.ceph.dreamhost.com: Mon Apr 11 13:19:00 PDT 2011
ubuntu@sepia13.ceph.dreamhost.com: Mon Apr 11 13:19:00 PDT 2011
ubuntu@sepia14.ceph.dreamhost.com: Mon Apr 11 13:19:00 PDT 2011
Yet the logs still show this (http://autotest.ceph.newdream.net/results/409-tv/group0/sepia23.ceph.dreamhost.com/debug/client.0.log):
13:11:20 DEBUG| [stdout] /usr/local/autotest/tests/pjd_fstest/src/tests/chmod/11.t ..... ok
13:11:43 DEBUG| [stdout] /usr/local/autotest/tests/pjd_fstest/src/tests/chown/00.t .....
13:11:43 DEBUG| [stdout] not ok 135
13:11:43 DEBUG| [stdout] not ok 136
13:11:43 DEBUG| [stdout] not ok 137
13:11:43 DEBUG| [stdout] Failed 3/171 subtests
- Target version changed from v0.27 to v0.28
- Subject changed from lchown not setting uid/gid to clustered mds: lchown not setting uid/gid
This isn't popping up with single mds... probably a clustering thing.
Still unable to reproduce this locally, and running it again on the autotest cluster it didn't fail.
It's possible we fixed this as part of the rename stuff? Or else it was just an odd symptom of the generic kclient issues with multi-MDS clusters.
Here's an idea: run the autotest, say, 10 times (after each run, ssh to the sepia machines and ensure they've rebooted and are not hanging on sync). If none of the runs fail, we'll call it resolved, and come back to it if it pops up again.
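The repeat-and-stop idea could be scripted roughly as follows. This is only a sketch: run_n_times is a hypothetical helper, and the command that actually submits the autotest job and checks the sepia machines is not shown in this thread, so it is left as a placeholder in the usage comment.

```shell
#!/bin/sh
# run_n_times N CMD [ARGS...]
# Runs CMD up to N times; stops at the first failure and reports which
# iteration failed. Returns 0 only if every run succeeded.
run_n_times() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        if ! "$@"; then
            echo "run $i of $n failed" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $n runs passed"
}

# Hypothetical usage; ./run_pjd_job.sh is a placeholder, not a real script:
# run_n_times 10 ./run_pjd_job.sh
```

Stopping at the first failure keeps the failing run's logs and machine state as the most recent ones, which matters here because the problem has been hard to reproduce.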
- Target version changed from v0.28 to v0.29
An audit of the uclient vs. kclient code turned up one difference, but it was a bug fix in the kclient that was missing from the uclient, a4bd854f86fe641207f83ab26a0f1b7fdd3ec4f0. Does the uclient still pass now, I wonder?
Also, are there logs of this happening with the kclient?
Greg, what did you do before to reproduce this?
- Target version changed from v0.29 to v0.30
I don't think that I ever did manage to reproduce it.
I haven't thought it through much, but it's also possible this got fixed with some of the other caps changes that got made in the last month or so. I'm thinking specifically of a few bugs we had that occasionally directed cap updates to non-auth MDSes, though I don't remember the circumstances of those bugs off-hand.
- Status changed from New to Can't reproduce
- Target version deleted (v0.30)