Bug #1666

closed

hadoop: time-related meta-data problems

Added by Noah Watkins over 12 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Hadoop/Java
Labels (FS):
Java/Hadoop
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The following exceptions are being thrown. It looks like something related to lstat?

java.io.IOException: The distributed cache object ceph://null/in-dir/_partition.lst#_partition.lst changed during the job from 11/4/11 3:11 AM to 11/3/11 1:06 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:389)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:463)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:190)
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2360)
at java.lang.Thread.run(Thread.java:662)

2011-10-31 16:51:10,617 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201110311641_0003_m_000037_2 when no start time is set, stackTrace is : java.lang.Exception
at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:145)
at org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:670)
at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2942)
at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1159)
at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4739)
at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3683)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3378)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

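For context, the first exception comes from a freshness check in the distributed cache: the mtime recorded when the job was submitted must exactly equal the mtime seen when the cache file is localized on a TaskTracker. A simplified sketch of that logic (hedged; not the actual Hadoop source) looks like this:

import java.io.IOException;
import java.net.URI;
import java.text.DateFormat;
import java.util.Date;

// Simplified sketch, not the actual Hadoop code: any drift between the mtime
// recorded at job submission and the mtime observed at localization time
// raises the IOException shown above.
class CacheFreshnessCheck {
    static void verify(URI cacheFile, long submitTimeMtime, long localizeTimeMtime)
            throws IOException {
        if (localizeTimeMtime != submitTimeMtime) {
            DateFormat df =
                DateFormat.getDateTimeInstance(DateFormat.SHORT, DateFormat.SHORT);
            throw new IOException("The distributed cache object " + cacheFile
                + " changed during the job from "
                + df.format(new Date(localizeTimeMtime)) + " to "
                + df.format(new Date(submitTimeMtime)));
        }
    }
}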

Files

client-1.log (5.73 MB) client-1.log Noah Watkins, 11/03/2011 01:20 PM
client-2.log (969 KB) client-2.log Noah Watkins, 11/03/2011 01:20 PM
Actions #1

Updated by Greg Farnum over 12 years ago

Neither of these errors is in code that's remotely familiar to me. So my first favorite question is:
Are your clocks synchronized? That can be a problem if you didn't set up NTP, or in a lot of virtual machine suspend/resume situations (dunno if you're on real hardware or not).

Assuming your clocks are good:
Can you tell me anything else about the code that this is being thrown in? (And are they some sort of fatal errors, or recoverable?)

Actually, unless you really had a job that went that long, I have to assume your clocks are off. But if that still doesn't fix it, I wonder if this is a result of our possibly not honoring Hadoop consistency semantics. So I'd be inclined to add sync and flush commands gratuitously, and if that fixes it we can put more effort into figuring out what the correct semantics are.

Actions #2

Updated by Noah Watkins over 12 years ago

All of the local clocks on the nodes look good. The code is comparing timestamps (I assume since epoch), so maybe there is some rounding/conversion fuzz. I'll see if there is anything glaring along these lines, and if not I'll add some sync/flushes.

Actions #3

Updated by Noah Watkins over 12 years ago

It looks like the check is an equality comparison of timestamps. So I think Hadoop is setting an explicit timestamp and expecting to see the exact same timestamp again later--since the ceph_lstat wrapper appears to do some conversion, that might be the issue?
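To make the rounding concern concrete, here is an illustrative example (hypothetical; this is not the actual ceph_lstat wrapper) of how a POSIX seconds/nanoseconds mtime might be folded into the single millisecond value Java expects. Two code paths that round differently would break an exact equality check even for the same underlying timestamp:

// Hypothetical illustration only; method and parameter names are made up.
class MtimeConversion {
    // Keeps millisecond precision but truncates anything finer.
    static long toMillisTruncating(long mtimeSec, long mtimeNsec) {
        return mtimeSec * 1000L + mtimeNsec / 1000000L;
    }

    // Ignores the nanosecond field entirely.
    static long toMillisSecondsOnly(long mtimeSec) {
        return mtimeSec * 1000L;
    }
}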

Actions #4

Updated by Noah Watkins over 12 years ago

So, I think I've got this nailed down. The good news is that the error was a clock sync issue. The bad news is that Hadoop (or rather HDFS) doesn't appear to assume synchronized clocks; rather, it relies on the MDS to keep the official time in a file's st_mtime. So the culprit in Ceph seems to be Client::write, where ceph_clock_now updates st_mtime with the local (potentially unsynced) time. I am assuming that the MDS doesn't overwrite mtime.

This may be a real issue for larger installations, where thousands of cluster nodes have to be kept in sync with each other, in addition to being synced with an out-of-cluster client node (which initializes the global cache and sets mtime).

Would it be possible to introduce a per-file flag that updates mtime at the MDS?

Actions #5

Updated by Greg Farnum over 12 years ago

I'd have to look at the specifics again -- but it probably can't be done. If the client buffers a write and then flushes it out ten minutes later, and loses caps on the inode 2 minutes after that, what time is the MDS supposed to take as the mtime?

I'm a little confused about how an exact comparison could be breaking anyway, though -- presumably Hadoop is getting the reference time from a stat and then checking again later. The mtime shouldn't be updated unless somebody actually does something to the file in the intervening time, and mtime should be properly propagated, so it can't be a race in setting versus lookup. :/

Actions #6

Updated by Noah Watkins over 12 years ago

You're right about that last point, Greg; it doesn't quite add up--not thinking straight today.

Here is what happens in Hadoop. The * line is the hypothesis of what's going on, and if this could be happening, it definitely points to your original sync/flush solution. Does this seem plausible?

Time 0: Client-1 begins writing file F
Time 1: Client-1 finishes writing F at the app level
Time 2: Client-1 stat's F and records F's mtime as mtime.orig
* Time 3: Buffered data continues to be written by Client-1, updating F's mtime
Time 4: Client-1 broadcasts mtime.orig to other clients
Time 5: Client-X stat's F and compares F's mtime with mtime.orig (boom)

Actions #8

Updated by Greg Farnum over 12 years ago

Something like this would make the most sense to me. (I'd have to check the specifics of mtime updating to see exactly.)

Actions #9

Updated by Noah Watkins over 12 years ago

Just ran a little experiment that may shed some light on this.

Setup
  - Clients: C1 and C2
  -  Clocks: C1 is ahead of C2 by 15 minutes

1. C1 writes a file, syncs, closes.
2. C1 stat's the file and records mtime (recorded as C1 time), closes file
3. C2 stat's the file and sees C2 time.

These are Hadoop clients, so there may be a lot of stuff going on. My best guess at this point is that prior to step (3) above some operation causes mtime to be updated. It looks like there may be a lot of vectors in Client.cc for this to occur. If this seems plausible, is there a good way to hunt it down?
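For reference, a minimal sketch of the experiment using the stock Hadoop FileSystem API (the ceph://null/ URI and file name are placeholders, and this assumes the ceph:// scheme is wired up in the Configuration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MtimeExperiment {
    static final Path FILE = new Path("/mtime-test");

    // Steps 1 and 2, run on C1: write, sync, close, then stat and record mtime.
    public static void writeAndStat(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("ceph://null/"), conf);
        FSDataOutputStream out = fs.create(FILE);
        out.write("hello".getBytes("UTF-8"));
        out.sync();
        out.close();
        System.out.println("C1 mtime: "
            + fs.getFileStatus(FILE).getModificationTime());
    }

    // Step 3, run on C2: stat the same file and compare the printed mtimes.
    public static void statOnly(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("ceph://null/"), conf);
        System.out.println("C2 mtime: "
            + fs.getFileStatus(FILE).getModificationTime());
    }
}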

Actions #10

Updated by Sage Weil over 12 years ago

If you can generate client logs for C1 and C2 (debug ms = 1, debug client = 10) that should tell us everything.

Actions #11

Updated by Noah Watkins over 12 years ago

Sage Weil wrote:

If you can generate client logs for C1 and C2 (debug ms = 1, debug client = 10) that should tell us everything.

The relevant file in these traces is: _partition.lst

These seem to be relevant events:

Client-1.log: line 32259: client1 stats file and sees mtime = local time
Client-1.log: client1 does various things
Client-1.log: line 33910: client1 stats file and sees mtime = 15 minutes in the future

Client-2.log: line 2722: client2's first interaction with the file (stat)
Actions #12

Updated by Greg Farnum over 12 years ago

Grepping for the inode number got me this:

2011-11-03 11:01:51.312498 7f69beffd700 client.5560 add_update_cap issued pAsxLsXsxFsxcrwb -> pAsxLsXsxFsxcrwb from mds.0 on 10000000981.head(ref=1
cap_refs={1024=0,4096=0,8192=0} open={1=0,2=0} mode=100777 size=129 mtime=2011-11-03 11:14:01.196477 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb)
objectset[10000000981 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7f69b81b9a80 0x41347950)

It's definitely the MDS setting it to that for some reason. Probably because the MDS and client are out of sync, and the MDS does some operation that updates the time but doesn't immediately flush it back to the client, and the client's later mtime update doesn't go through because it's an earlier mtime? (I think there are guards to prevent mtimes from moving backward, although I could be mistaken.)

Actions #13

Updated by Noah Watkins over 12 years ago

If Client-1 is seeing a cached copy of the inode's mtime, then the following server-side scenario may explain what's going on (note that the server has the same time as client 2):

- Server.cc:handle_client_openc --- is called to create _partition.lst (in the MDS logs)
   - mdr->now = ceph_clock_now(g_ceph_context)
   - prepare_new_inode(mdr, ...)
       - (line 1750) in->inode.ctime = in->inode.mtime = in->inode.atime = mdr->now

This seems to be happening when I look at the logs.
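A toy model (not the actual Ceph code) of the path traced above, just to make the mechanism explicit: whoever handles the create stamps ctime/mtime/atime from its own clock, so a skewed server clock gets baked into the new inode's metadata:

// Toy model only; names are hypothetical and this is not Ceph source.
class ToyInode {
    long ctimeMs, mtimeMs, atimeMs;
}

class ToyCreateHandler {
    ToyInode handleCreate() {
        long now = System.currentTimeMillis(); // the server's own idea of "now"
        ToyInode in = new ToyInode();
        in.ctimeMs = in.mtimeMs = in.atimeMs = now;
        return in;
    }
}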

Actions #14

Updated by Greg Farnum over 12 years ago

So the "bad" mtime is the same time the inode was created on the MDS server?

Actions #15

Updated by Noah Watkins over 12 years ago

Greg Farnum wrote:

So the "bad" mtime is the same time the inode was created on the MDS server?

I think so. Here is the more explicit version of what I think is happening now:

Client-1: open f via handle_client_openc, write
MDS: records inode's mtime as local time (bad time in this scenario)
Client-1: sync, close, stat --> sees cached mtime (good time in this scenario)
Client-2: stat f --> bad mtime
Actions #16

Updated by Greg Farnum over 12 years ago

If that's the case then I'm surprised the mtime didn't get updated at an earlier time. If nothing else we can probably fix it by encoding more metadata and shoving it out on the create response, although it makes me a little sad.

Actions #17

Updated by Noah Watkins over 12 years ago

Do you mean that you are surprised that client-1's inode didn't get updated from the server's change before the stat occurred? If so, then it might be relevant to know that the entire scenario plays out in just a couple seconds.

Actions #18

Updated by Greg Farnum over 12 years ago

Yeah, it's not impossible, I just would have thought that one of the other updates would have prompted the server to send its mtime back.

This is basically an annoying problem with how you handle mtimes in distributed systems. We could just always take the mtime encoded by the last person to modify the file, and that's what we try to do -- but going backwards in time is also really stupid so I believe we refuse to do so, on the theory that if you care your clocks are probably pretty close. sigh
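A hedged illustration of the policy described here (not the actual Ceph code): when merging a client-supplied mtime with the value the server already holds, refuse to move backwards. With a metadata server whose clock is 15 minutes fast, the client's correct (earlier) mtime loses this comparison and the future timestamp sticks:

// Illustration only, not the actual Ceph code.
class MtimeMergePolicy {
    static long merge(long serverMtimeMs, long clientMtimeMs) {
        // Take the later of the two so mtime never appears to go backwards.
        return Math.max(serverMtimeMs, clientMtimeMs);
    }
}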

Actions #19

Updated by Noah Watkins over 12 years ago

Bummer. Well... for the time being it may be sufficient to force FileStatus.getModificationTime() to go directly to the server with getattr, or at least somehow invalidate the cache prior to the stat. This would, I think, validate the hypothesis. The former is a little more involved than the latter, so if you know of any trick offhand that might refresh the client's view, that would be helpful as a stop-gap.
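As a sketch of that stop-gap (the interface below is hypothetical, not part of the real Ceph Hadoop bindings), the idea is simply to route modification-time queries through a call that always performs a fresh getattr rather than returning a client-cached value:

import java.io.IOException;

// Hypothetical interface; stands in for whatever call reaches the MDS directly.
interface FreshAttrSource {
    long freshMtimeMillis(String path) throws IOException;
}

class FreshModificationTime {
    private final FreshAttrSource attrs;

    FreshModificationTime(FreshAttrSource attrs) {
        this.attrs = attrs;
    }

    // Always hit the server rather than trusting a cached FileStatus.
    long getModificationTime(String path) throws IOException {
        return attrs.freshMtimeMillis(path);
    }
}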

Actions #20

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (20)
Actions #21

Updated by Sage Weil over 11 years ago

  • Category set to 48
Actions #23

Updated by Greg Farnum about 10 years ago

  • Priority changed from Normal to Low

Also see #7564
But low priority, for this is Hadoop

Actions #24

Updated by Greg Farnum over 9 years ago

  • Status changed from New to Resolved

We now take client timestamps for almost everything, so this should no longer be a problem and I'm closing it unless we hear of problems again.

Actions #25

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Hadoop/Java added
Actions #26

Updated by Patrick Donnelly about 5 years ago

  • Category deleted (48)
  • Labels (FS) Java/Hadoop added