hadoop-general mailing list archives
From Steve Loughran <steve.lough...@gmail.com>
Subject Re: where do side-projects go in trunk now that contrib/ is gone?
Date Fri, 08 Mar 2013 16:57:19 GMT
On 8 March 2013 16:15, Alejandro Abdelnur <tucu@cloudera.com> wrote:

> jumping a bit late into the discussion.
>
yes. I started it in common-dev first, in the "where does contrib stuff go
now" thread, then moved to general, where the conclusion was "except for
special cases like FS clients, it isn't".

Now I'm trying to lay down the location for FS stuff, both for openstack,
and to handle some proposed dependency changes for s3n://


> I'd argue that unless those filesystems are part of hadoop, their clients
> should not be distributed/build by hadoop.
>
> an analogy to this is not wanting Yarn to be the home for AM
> implementations.
>
> a key concern is testability and maintainability.
>

We are already there with the S3 and Azure blobstores, as well as the FTP
filesystem.

The testability is straightforward for blobstores precisely because all you
need is some credentials and cluster time; there's no requirement to have
a specific filesystem to hand. Any filesystem that does need dedicated
infrastructure is very much in the vendor's hands to test, especially if
the "it's a replacement for HDFS" assertion is made.

If you look at HADOOP-9361 you can see that I've been defining more
rigorously than before what our FS expectations are, with HADOOP-9371
spelling out questions like: what happens when you try to readFully() past
the end of a file, or call getBlockLocations("/")? HDFS has answers here,
and downstream code depends on some of them (e.g. getBlockLocations()
behaviour on directories):
https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf
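
To make that concrete, here's a minimal sketch of the kind of
per-operation assertion I mean -the class name and setup are illustrative,
not the actual test code- pinning down that a positioned readFully() past
the end of a file raises an EOFException rather than returning quietly:

    import static org.junit.Assert.fail;

    import java.io.EOFException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class TestReadPastEOF {
      @Test
      public void testReadFullyPastEOF() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/test/small.dat");
        // create a 4-byte file
        FSDataOutputStream out = fs.create(file, true);
        out.write(new byte[4]);
        out.close();
        // attempt a positioned read starting well past the last byte
        FSDataInputStream in = fs.open(file);
        try {
          in.readFully(1024, new byte[16]);
          fail("expected EOFException reading past the end of the file");
        } catch (EOFException expected) {
          // this is what HDFS does; a compliant FS should match it
        } finally {
          in.close();
        }
      }
    }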

So far my initially blobstore-specific tests for the functional parts of
the specification (not the consistency, concurrency, or atomicity parts)
are on github:
https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift


I've also added more tests to the existing FS contract test, and in doing
so showed that s3 and s3n have some data-loss risks which need to be fixed
-that's an argument in favour of having the (testable, low-maintenance-cost)
filesystems somewhere where any of us is free to fix them.
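
To give a flavour of the kind of check involved (a hedged sketch, not the
actual failing case), one invariant worth asserting is that a rejected
rename never destroys the source data:

    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class TestRenameSafety {
      private void touch(FileSystem fs, Path p) throws Exception {
        FSDataOutputStream out = fs.create(p, true);
        out.write(1);
        out.close();
      }

      @Test
      public void testFailedRenameKeepsSource() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path src = new Path("/test/rename-src");
        Path dest = new Path("/test/rename-dest");
        touch(fs, src);
        touch(fs, dest);
        // many filesystems reject a rename onto an existing file; whatever
        // the outcome, the source bytes must not vanish in the attempt
        boolean renamed = fs.rename(src, dest);
        assertTrue("source lost after rename returned " + renamed,
            renamed || fs.exists(src));
      }
    }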

While we refine that spec, I want to take those per-operation tests
from the SwiftFS support, make them retargetable at other filesystems, and
gradually apply them to all the distributed filesystems. Your colleague
Andrew Wang is helping there by abstracting FileSystem and FileContext
away, so we can test both.
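
The shape I have in mind is roughly this -class names are illustrative,
not what would actually land in hadoop-common- with one abstract base
class holding the assertions and one thin subclass per filesystem:

    import static org.junit.Assert.assertTrue;

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Before;
    import org.junit.Test;

    public abstract class AbstractFSContractTest {
      protected FileSystem fs;

      // each filesystem module binds the suite to its client under test
      protected abstract FileSystem createFileSystem() throws IOException;

      @Before
      public void setup() throws IOException {
        fs = createFileSystem();
      }

      @Test
      public void testMkdirsThenExists() throws IOException {
        Path dir = new Path("/test/contract-dir");
        assertTrue(fs.mkdirs(dir));
        assertTrue(fs.exists(dir));
      }
    }

    // retargeting at, say, the swift:// client is then one small subclass
    class TestSwiftContract extends AbstractFSContractTest {
      @Override
      protected FileSystem createFileSystem() throws IOException {
        return FileSystem.get(URI.create("swift://container.example/"),
            new Configuration());
      }
    }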

> still, I see bigtop as the integration point and the means of making
> those jars available to a setup.
>
Bigtop is where I plan to put the integration tests -the ones that try to
run Pig with arbitrary source and dest filesystems, same for Hive- plus
some scale tests: can we upload an 8GB file? What do you get back? Can I
create more than 65536 entries in a single directory, and what happens to
"ls /" performance?
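
As a sketch of what one of those scale tests might look like (sizes, paths
and the class name are all illustrative):

    import static org.junit.Assert.assertEquals;

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class TestWideDirectory {
      @Test
      public void testManyEntriesInOneDirectory() throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/test/wide-dir");
        fs.mkdirs(dir);
        int entries = 65536 + 1;   // one past the suspected limit
        for (int i = 0; i < entries; i++) {
          fs.create(new Path(dir, "entry-" + i), true).close();
        }
        long start = System.currentTimeMillis();
        FileStatus[] listing = fs.listStatus(dir);
        long elapsed = System.currentTimeMillis() - start;
        assertEquals(entries, listing.length);
        // no hard time assertion; the interesting output is how listing
        // time degrades as the directory gets wider
        System.out.println("listStatus of " + entries
            + " entries took " + elapsed + " ms");
      }
    }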

To summarise then

   1. Blobstores, the FTP filesystem &c. could gradually move to a
   hadoop-common/hadoop-filesystem-clients module.
   2. A stricter specification of compliance, for the benefit of everyone
   -us, other FS implementors, and users of the FS APIs.
   3. Lots of new functional tests for compliance -abstract in
   hadoop-common; FS-specific in hadoop-filesystem-clients.
   4. Integration & scale tests in bigtop.
   5. Anyone writing a "hadoop compatible FS" can grab the functional and
   integration tests, see what breaks, and fix their code.
   6. The combination of (Java API files, specification doc, functional
   tests, HDFS implementation) defines the expected behaviour of a
   filesystem.
-Steve


