crunch-user mailing list archives

From Chao Shi <stepi...@live.com>
Subject Re: FileTargetImpl and nested outputs...
Date Fri, 26 Jul 2013 07:27:38 GMT
Hi Dougan,

I've been working on this for a while
(CRUNCH-212<https://issues.apache.org/jira/browse/CRUNCH-212>,
not finished yet). It seems that LoadIncrementalHFiles does not require the
input HFiles to be named with random GUIDs, as long as their parent
directory is named after the column family. I only tried this in a unit
test, not on a real cluster, so I may be wrong.


On Fri, Jul 26, 2013 at 1:23 AM, Dougan,Brian <Brian.Dougan@cerner.com> wrote:

>  Quick question.  In my attempt at an Hfile PathTarget implementation
> that extended from FileTargetImpl, I ran into an issue with nested files
> and wanted to see what everyone's thoughts were.
>
>  First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1
> Hadoop/HBase versions (though this shouldn't matter for this question).
>
>  So, Hfiles require a structure something like this
>
>    - Destination path
>       - {columnFamily}
>          - {randomGUID1} (corresponding to an HBase region)
>          - {randomGUID2} (corresponding to an HBase region)
>          - Etc…
>
> So you get files in the working path like this
>
>    - workingpath/columnFamily1/guid1
>    - workingpath/columnFamily1/guid2
>    - workingpath/columnFamily1/…
>
> FileTargetImpl allows consumers to override the getSourcePattern and
> getDestFile to help with this, so the source pattern is something like this
>
>    - Path(workingPath, "[^_]*/*")
>
> And the destination file is something like
>
>    - Path(destination / src.getParent.getName, src.getName)
>
> The issue is that FileTargetImpl doesn't create any nested folders before
> trying to do the file rename (except for the top-level root folder).  So
> for instance, it may try to do something like copying from
>
>    - workingPath/columnFamily1/guid1
>
> To
>
>    - destinationPath/columnFamily1/guid1
>
> But only destination path exists, not the nested columnFamily folder.
>  This makes the rename silently fail and results in missing data in the
> destination path (the rename method actually returns a boolean that should
> probably also be validated to alert on failures).
>
>  So my question is: should we enhance FileTargetImpl to create any required
> parent directories (it might also make sense to verify that each one is a
> folder under the destination path)? Or is FileTargetImpl only supposed to be
> used by internal Crunch targets, so that copying its functionality (and
> adding this enhancement) is a task for anyone developing a new PathTarget?
>
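
For reference, the fix discussed in the quoted message (create the nested
column family folder before the rename, and check that the target stays
under the destination path) might look roughly like the sketch below. This is
not Crunch or FileTargetImpl code: it uses java.nio.file on a local
filesystem as a stand-in, and the class name NestedRename and the method
moveIntoFamilyDir are made up for illustration. With Hadoop the analogous
calls would be FileSystem.mkdirs and FileSystem.rename, both of which return
a boolean that should be checked, whereas Files.move throws on failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NestedRename {

    /**
     * Moves src (e.g. workingPath/columnFamily1/guid1) to
     * destRoot/columnFamily1/guid1, creating the family folder first.
     * Hypothetical helper, local-filesystem stand-in only.
     */
    public static Path moveIntoFamilyDir(Path src, Path destRoot) throws IOException {
        Path familyDir = src.getParent();
        if (familyDir == null || familyDir.getFileName() == null) {
            throw new IOException("Source has no parent folder: " + src);
        }

        // Build destRoot/<family>/<file>, mirroring the getDestFile idea.
        Path root = destRoot.toAbsolutePath().normalize();
        Path dest = root.resolve(familyDir.getFileName().toString())
                        .resolve(src.getFileName().toString())
                        .normalize();

        // Make sure the target really is under the destination path.
        if (!dest.startsWith(root)) {
            throw new IOException("Refusing to write outside " + root + ": " + dest);
        }

        // Create any missing parent folders before the move.
        Files.createDirectories(dest.getParent());

        // Throws if the move fails, so a failure cannot go unnoticed.
        Files.move(src, dest);
        return dest;
    }
}
```

Calling moveIntoFamilyDir(workingPath/columnFamily1/guid1, destinationPath)
on a destination that has no columnFamily1 folder yet should leave the file
at destinationPath/columnFamily1/guid1 instead of silently dropping it.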
