hadoop-common-dev mailing list archives

From Eli Collins <...@cloudera.com>
Subject Re: symlink support in Hadoop 2 GA
Date Tue, 17 Sep 2013 22:05:59 GMT
(Looping in Arun since this impacts 2.x releases)

I updated the versions on HADOOP-8040 and sub-tasks to reflect where
the changes have landed. All of these changes (modulo HADOOP-9417)
were merged to branch-2.1 and are in the 2.1.0 release.

While symlinks are in 2.1.0 I don't think we can really claim they're
ready until issues like HADOOP-9912 are resolved, and they are
supported in the shell, distcp and WebHDFS/HttpFS/Hftp (these are not
esoteric!). Someone can create a symlink through the FileSystem API and
cause someone else's distcp job to fail. That is unlikely, given symlinks
are not exposed outside the Java API, but still not great. Ideally this work would
have been done on a feature branch and then merged when complete, but
that's water under the bridge.
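
To illustrate the failure mode described above (code that assumes every
path is either a file or a directory, then meets a dangling symlink),
here is a minimal sketch. It deliberately uses the JDK's java.nio.file
API on a local filesystem rather than Hadoop's FileSystem, so it runs
without a cluster; the class and method names (SymlinkKinds, naiveKind,
classify) are hypothetical and not part of any Hadoop API. It assumes a
platform where the process is allowed to create symbolic links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Hypothetical demo class; not part of Hadoop.
class SymlinkKinds {
    enum Kind { FILE, DIRECTORY, SYMLINK, DANGLING_SYMLINK, OTHER }

    // Pre-symlink style logic: anything that is not a directory is
    // assumed to be a readable file. A dangling symlink falls into the
    // "file" branch, so a later open/read on it would fail.
    static String naiveKind(Path p) {
        return Files.isDirectory(p) ? "DIRECTORY" : "FILE";
    }

    // Symlink-aware logic: check for a link first, without following it,
    // then decide whether its target actually exists.
    static Kind classify(Path p) {
        if (Files.isSymbolicLink(p)) {
            // Files.exists follows links by default, so it is false
            // when the link's target is missing.
            return Files.exists(p) ? Kind.SYMLINK : Kind.DANGLING_SYMLINK;
        }
        if (Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS)) {
            return Kind.DIRECTORY;
        }
        if (Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS)) {
            return Kind.FILE;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("symlink-demo");
        Path file = Files.createFile(dir.resolve("data.txt"));
        // The target "missing" is never created, so this link dangles.
        Path dangling = Files.createSymbolicLink(
                dir.resolve("dangling"), dir.resolve("missing"));

        System.out.println("naive:    " + naiveKind(dangling));
        System.out.println("classify: " + classify(dangling));
        System.out.println("classify: " + classify(file));
        System.out.println("classify: " + classify(dir));
    }
}
```

The naive check reports the dangling link as a file, which is the kind
of assumption that tools written before symlinks existed tend to make.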

I see the following options:

1. Fix up the current symlink support so that symlinks are ready for
2.2 (GA), or at least the public APIs. This means the APIs will be in
GA from the get-go, so while the functionality might not be fully baked
we don't have to worry about incompatible changes like FileStatus#isDir()
changing behavior in 2.3 or a later update. The downside is this will
take at least a couple of weeks (to resolve HADOOP-9912 and potentially
implement the remaining pieces) and so may impact the 2.2 release
timing. This option means 2.2 won't remove the new APIs introduced in
2.1. We'd want to spin a 2.1.2 beta with the new API changes so we
don't introduce new APIs in the beta to GA transition.

2. Revert symlinks from branch-2.1-beta and branch-2. Finish up the
work in trunk (or a feature branch) and merge for a subsequent 2.x
update. While this helps get us to GA faster, it would be preferable
to get an API change like this in for 2.2 GA since it may be
disruptive to introduce in an update (e.g. see the example in #1). And of
course our users would like symlink functionality in the GA release.
This option would mean 2.2 is incompatible with 2.1 because it's
dropping the new APIs, which is not ideal for a beta to GA transition.

3. Revert and punt symlinks to 3.x. IMO this should be the last resort.

If we have sufficient time I think option #1 would be best.  What do
others think?


On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang <andrew.wang@cloudera.com> wrote:
> Hi all,
> I wanted to broadcast plans for putting the FileSystem symlinks work
> (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
> it's pretty important we get it in since it's not a compatible change; if
> it misses the GA train, we're not going to have symlinks until the next
> major release.
> However, we're still dealing with ongoing issues revealed via testing.
> There's user-code out there that only handles files and directories and
> will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
> for a nice example where globStatus returning symlinks broke Pig; some of
> us had a conference call to talk it through, and one definite conclusion
> was that this wasn't solvable in a generally compatible manner.
> There are also still some gaps in symlink support right now. For example,
> the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
> resolution, and tooling like the FsShell and Distcp still need to be
> updated as well.
> So, there's definitely work to be done, but there are a lot of users
> interested in the feature, and symlinks really should be in GA. Would
> appreciate any thoughts/input on the matter.
> Thanks,
> Andrew
