Bug #1666
closed
hadoop: time-related meta-data problems
Added by Noah Watkins over 12 years ago.
Updated about 5 years ago.
Component(FS):
Hadoop/Java
Description
The following exceptions are being thrown. It looks like something related to lstat?
pre>
java.io.IOException: The distributed cache object ceph://null/in-dir/_partition.lst#_partition.lst changed during the job from 11/4/11 3:11 AM to 11/3/11 1:06 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:389)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:463)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:190)
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2360)
at java.lang.Thread.run(Thread.java:662)
2011-10-31 16:51:10,617 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201110311641_0003_m_000037_2 when no start time is set, stackTrace is : java.lang.Exception
at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:145)
at org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:670)
at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2942)
at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1159)
at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4739)
at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3683)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3378)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
Files
Neither of these errors are in code that's remotely familiar to me. So my first favorite question is:
Are your clocks synchronized? That can be a problem if you didn't set up NTP, or in a lot of virtual machine suspend/resume situations (dunno if you're on real hardware or not).
Assuming your clocks are good:
Can you tell me anything else about the code that this is being thrown in? (And are they some sort of fatal errors, or recoverable?)
Actually, unless you really had a job that went that long, I have to assume your clocks are off. But if that still doesn't fix it, I wonder if this is a result of our possibly not honoring Hadoop consistency semantics. So I'd be inclined to add sync and flush commands gratuitously, and if that fixes it we can put more effort into figuring out what the correct semantics are.
All of the local clocks on the nodes look good. The code is comparing timestamps (I assume since epoch), so maybe there is some rounding/conversion fuzz. I'll see if there is anything glaring along these lines, and if not I'll add some sync/flushes.
It looks like the check is equality of timestamps. So, I think Hadoop is setting an explicit timestamp, and sometime in the future expecting the exact same timestamp--since it looks like ceph_lstat wrapper does some conversion, that might be the issue?
So, I think I've got this nailed down. The good news is that the error was a clock sync issue. The bad news is that it doesn't appear Hadoop (or rather HDFS) assumes clock sync, and rather, Hadoop relies on the MDS to keep the official time on file's st_mtime. So the culprit seems to be in Ceph is in Client::write where ceph_clock_now updates st_mtime with the local (potentially non-syncd) time. I am assuming that the MDS doesn't overwrite mtime.
This may be a real issue for larger installations where keeping thousands of cluster nodes in sync with each other, in addition to needing to be synced with an out-of-cluster client node (who initializes the global cache and sets mtime).
Would it be possible to introduce a per-file flag that updates mtime at the MDS?
I'd have to look at the specifics again -- but it probably can't be done. If the client buffers a write and then flushes it out ten minutes later, and loses caps on the inode 2 minutes after that, what time is the MDS supposed to take as the mtime?
I'm a little confused about how an exact comparison could be breaking anyway, though -- presumably Hadoop is getting the reference time from a stat and then checking again later. The mtime shouldn't be updated unless somebody actually does something to the file in the intervening time, and mtime should be properly propagated, so it can't be a race in setting versus lookup. :/
You're right about that last point Greg, it doesn't quite add up--not thinking straight today.
Here is what happens in Hadoop. The * line is the hypothesis of what's going on, and if this
could be happening, it definitely points to your original sync/flush solution. Does this seem
plausible?
Time 0: Client-1 begins writing file F
Time 1: Client-1 finishes writing F at the app level
Time 2: Client-1 stat's F and records F's mtime as mtime.orig
- Time 3: Buffered data continues to be written by Client-1, updating F's mtime
Time 4: Client-1 broadcasts mtime.orig to other clients
Time 5: Client-X stat's F and compare's F's mtime with mtime.orig (boom)
Formatting oops:
Time 0: Client-1 begins writing file F
Time 1: Client-1 finishes writing F at the app level
Time 2: Client-1 stat's F and records F's mtime as mtime.orig
* Time 3: Buffered data continues to be written by Client-1, updating F's mtime
Time 4: Client-1 broadcasts mtime.orig to other clients
Time 5: Client-X stat's F and compare's F's mtime with mtime.orig (boom)
Something like this would make the most sense to me. (I'd have to check the specifics of mtime updating to see exactly.)
Just ran a little experiment that may shed some light on this.
Setup
- Clients: C1 and C2
- Clocks: C1 is ahead of C2 by 15 minutes
1. C1 writes a file, syncs, closes.
2. C1 stat's the file and records mtime (recorded as C1 time), closes file
3. C2 stat's the file and sees C2 time.
These are Hadoop clients, so there may be a lot of stuff going on. My best guess at this point is that prior to step (3) above some operation causes mtime to be updated. It looks like there may be a lot of vectors in Client.cc for this to occur. If this seems plausible, is there a good way to hunt it down?
If you can generate client logs for C1 and C2 (debug ms = 1, debug client = 10) that should tell us everything.
Sage Weil wrote:
If you can generate client logs for C1 and C2 (debug ms = 1, debug client = 10) that should tell us everything.
The relevant file in these traces is: _partition.lst
These seem to be relevant events:
Client-1.log: line 32259: client1 stats file and sees mtime = local time
Client-1.log: client1 does various things
Client-1.log: line 33910: client1 stats file and sees mtime = 15 minutes in the future
Client-2.log: line 2722: client2's first interaction with the file (stat)
Grepping for the inode number got me this:
2011-11-03 11:01:51.312498 7f69beffd700 client.5560 add_update_cap issued pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb from mds.0 on 10000000981.head(ref=1
cap_refs={1024=0,4096=0,8192=0} open={1=0,2=0} mode=100777 size=129 mtime=2011-11-03 11:14:01.196477 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb)
objectset[10000000981 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7f69b81b9a80 0x41347950)
It's definitely the MDS setting it to that for some reason. Probably because the MDS and client are out of sync, and the MDS does some operation that updates the time but doesn't immediately flush it back to the client, and the client's later mtime update doesn't go through because it's an earlier mtime? (I think there are guards to prevent mtimes from moving backward, although I could be mistaken.)
If Client-1 is seeing a cached copy of the inode's mtime, then the following server-side scenario may explain what's going on (note that the server has the same time as client 2):
- Server.cc:handle_client_openc --- is called to create _partition.lst (in the MDS logs)
- mdr->now = ceph_clock_now(g_ceph_context)
- prepare_new_inode(mdr, ...)
- (line 1750) in->inode.ctime = in->inode.mtime = in->inode.atime = mdr->now
This seems to be happening when I look at the logs.
So the "bad" mtime is the same time the inode was created on the MDS server?
Greg Farnum wrote:
So the "bad" mtime is the same time the inode was created on the MDS server?
I think so. Here is the more explicit version of what I think is happening now:
Client-1: open f via handle_client_openc, write
MDS: records inode's mtime as local time (bad time in this scenario)
Client 1: sync, close, stat --> sees cached mtime (good time in this scenario)
Client 2: stat f --> bad mtime
If that's the case then I'm surprised the mtime didn't get updated at an earlier time. If nothing else we can probably fix it by encoding more metadata and shoving it out on the create response, although it makes me a little sad.
Do you mean that you are surprised that client-1's inode didn't get updated from the server's change before the stat occurred? If so, then it might be relevant to know that the entire scenario plays out in just a couple seconds.
Yeah, it's not impossible, I just would have thought that one of the other updates would have prompted the server to send its mtime back.
This is basically an annoying problem with how you handle mtimes in distributed systems. We could just always take the mtime encoded by the last person to modify the file, and that's what we try to do -- but going backwards in time is also really stupid so I believe we refuse to do so, on the theory that if you care your clocks are probably pretty close. sigh
Bummer. Well... for the time being it may be sufficient to force FileStatus.getModificationTime() to go directly to the server with getattr, or at least somehow invalidate the cache prior to the stat. This would I think validate the hypothesis. The former is a little more involved than the latter, so if you know of any trick off hand that might refresh the client's view then that'd be helpful for a stop-gap.
- Project changed from Ceph to CephFS
- Category deleted (
20)
- Priority changed from Normal to Low
Also see #7564
But low priority, for this is Hadoop
- Status changed from New to Resolved
We now take client timestamps for almost everything, so this should no longer be a problem and I'm closing it unless we hear of problems again.
- Component(FS) Hadoop/Java added
- Category deleted (
48)
- Labels (FS) Java/Hadoop added
Also available in: Atom
PDF