crunch-user mailing list archives

From "Dougan,Brian" <>
Subject FileTargetImpl and nested outputs...
Date Thu, 25 Jul 2013 17:23:28 GMT
Quick question.  In my attempt at an HFile PathTarget implementation that extends FileTargetImpl,
I ran into an issue with nested files and wanted to see what everyone's thoughts were.

First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase versions (though
this shouldn't matter for this question).

So, HFiles require a structure something like this

  *   Destination path
     *   {columnFamily}
        *   {randomGUID1} (corresponding to an HBase region)
        *   {randomGUID2} (corresponding to an HBase region)
        *   Etc…

So you get files in the working path like this

  *   workingpath/columnFamily1/guid1
  *   workingpath/columnFamily1/guid2
  *   workingpath/columnFamily1/…

FileTargetImpl allows consumers to override getSourcePattern and getDestFile to help with
this, so the source pattern is something like this

  *   Path(workingPath, "[^_]*/*")

And the destination file is something like

  *   Path(destination / src.getParent.getName, src.getName)
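To make the mapping concrete, here is a minimal sketch of that destination-file calculation. It uses java.nio.file.Path purely as a local stand-in for Hadoop's Path so it runs without a cluster; the class name DestFileSketch and the method name destFile are hypothetical and are not the actual getDestFile signature in Crunch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DestFileSketch {
    // Keeps the last two components of src ({columnFamily}/{guid})
    // and re-roots them under the destination path.
    // Assumes src has a parent component (i.e. it is at least two levels deep).
    static Path destFile(Path destination, Path src) {
        Path family = src.getParent().getFileName();
        return destination.resolve(family).resolve(src.getFileName());
    }

    public static void main(String[] args) {
        Path src = Paths.get("workingpath", "columnFamily1", "guid1");
        // Prints the destination path with the column family folder preserved.
        System.out.println(destFile(Paths.get("destination"), src));
    }
}
```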

The issue is that FileTargetImpl doesn't create any nested folders before trying to do the
file rename (it only creates the top-level destination folder).  So for instance, it may try to do something
like renaming

  *   workingPath/columnFamily1/guid1

to

  *   destinationPath/columnFamily1/guid1

But only destinationPath exists, not the nested columnFamily1 folder.  This makes the rename
fail silently and results in missing data in the destination path (the rename method actually
returns a boolean, which should probably also be checked so that failures are reported).
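A rough sketch of the behaviour I have in mind is below. It is not the FileTargetImpl code: it uses java.io.File as a local stand-in, since File.renameTo also reports failure only through its boolean return value. With Hadoop the corresponding calls would be FileSystem.mkdirs and FileSystem.rename. The class name RenameSketch and the method name moveWithParents are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class RenameSketch {
    // Creates any missing parent folders of dst, then renames src to dst.
    // Both boolean results are checked so a failure cannot pass silently.
    static void moveWithParents(File src, File dst) throws IOException {
        File parent = dst.getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create " + parent);
        }
        if (!src.renameTo(dst)) {
            throw new IOException("Rename failed: " + src + " -> " + dst);
        }
    }

    public static void main(String[] args) throws IOException {
        File root = Files.createTempDirectory("rename-sketch").toFile();

        // Simulate one output file in the working path.
        File src = new File(root, "workingPath/columnFamily1/guid1");
        src.getParentFile().mkdirs();
        Files.write(src.toPath(), new byte[] {1, 2, 3});

        // Only the top-level destination folder exists up front.
        new File(root, "destinationPath").mkdirs();
        File dst = new File(root, "destinationPath/columnFamily1/guid1");

        moveWithParents(src, dst);
        System.out.println("moved: " + dst.isFile());
    }
}
```

Throwing on a false return is just one option; logging and failing the job at the end would work as well.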

So, my question is: should we look at enhancing FileTargetImpl so that it builds any
required parent directories (it might also make sense to verify that they fall under the
destination path)? Or is the expectation that FileTargetImpl is only supposed to be used
by internal Crunch targets, so that copying its functionality (and adding this enhancement)
would be a task for anyone wanting to develop a new PathTarget?
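For the "make sure it's a folder under destination path" part, a check along these lines is what I was picturing. Again this is only a sketch using java.nio.file.Path as a stand-in for Hadoop's Path, and the names ContainmentSketch and isUnder are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ContainmentSketch {
    // Returns true if candidate resolves to a location at or below destination.
    // Both paths are normalized first so that ".." segments cannot escape the root.
    static boolean isUnder(Path destination, Path candidate) {
        Path root = destination.toAbsolutePath().normalize();
        return candidate.toAbsolutePath().normalize().startsWith(root);
    }

    public static void main(String[] args) {
        Path dest = Paths.get("destinationPath");
        System.out.println(isUnder(dest, dest.resolve("columnFamily1").resolve("guid1")));
        System.out.println(isUnder(dest, dest.resolve("..").resolve("elsewhere")));
    }
}
```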

