crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: FileTargetImpl and nested outputs...
Date Thu, 25 Jul 2013 18:03:41 GMT
I think that having Crunch handle it makes sense-- what do we do for Trevni
targets right now? Don't they also create nested subdirectories?

J


On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian <Brian.Dougan@cerner.com> wrote:

>  Quick question.  In my attempt at an HFile PathTarget implementation
> that extends FileTargetImpl, I ran into an issue with nested files
> and wanted to see what everyone's thoughts were.
>
>  First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1
> Hadoop/HBase versions (though this shouldn't matter for this question).
>
>  So, HFiles require a structure something like this:
>
>    - Destination path
>       - {columnFamily}
>          - {randomGUID1} (corresponding to an HBase region)
>          - {randomGUID2} (corresponding to an HBase region)
>          - Etc…
>
> So you get files in the working path like this
>
>    - workingpath/columnFamily1/guid1
>    - workingpath/columnFamily1/guid2
>    - workingpath/columnFamily1/…
>
> FileTargetImpl allows consumers to override the getSourcePattern and
> getDestFile to help with this, so the source pattern is something like this
>
>    - new Path(workingPath, "[^_]*/*")
>
> And the destination file is something like
>
>    - new Path(new Path(destination, src.getParent().getName()), src.getName())
>
> The issue is that FileTargetImpl doesn't create any nested folders before
> trying to do the file rename (except for the top-level root directory).  So
> for instance, it may try to do something like copying from
>
>    - workingPath/columnFamily1/guid1
>
> To
>
>    - destinationPath/columnFamily1/guid1
>
> But only destination path exists, not the nested columnFamily folder.
>  This makes the rename silently fail and results in missing data in the
> destination path (the rename method actually returns a boolean that should
> probably also be validated to alert on failures).
>
>  So, my question is: should we look at enhancing FileTargetImpl so that
> it builds any required parent directories (it might also make sense to
> verify that each one is a folder under the destination path)? Or is the
> expectation that FileTargetImpl is only supposed to be used by internal
> Crunch targets, so that copying its functionality (and adding this
> enhancement) would be a task for anyone wanting to develop a new PathTarget?
>
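
For readers following the thread, the failure mode and the proposed fix in the quoted message can be sketched roughly as follows. This is only a local-filesystem analogy: java.io.File stands in for Hadoop's FileSystem (its renameTo also reports failure through a boolean rather than an exception), and the class and method names (NestedRenameSketch, destFile, moveNested) are hypothetical, not part of Crunch or FileTargetImpl.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Local-filesystem analogy only: java.io.File stands in for Hadoop's FileSystem.
// All names here are hypothetical and are not part of the Crunch API.
final class NestedRenameSketch {

    // Maps workingPath/{columnFamily}/{guid} to destination/{columnFamily}/{guid}.
    static File destFile(File src, File destination) {
        File family = new File(destination, src.getParentFile().getName());
        return new File(family, src.getName());
    }

    // Creates the missing parent directory first, then renames and checks the
    // boolean result instead of ignoring it.
    static void moveNested(File src, File destination) throws IOException {
        File dst = destFile(src, destination);
        File parent = dst.getParentFile();
        if (!parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create " + parent);
        }
        if (!src.renameTo(dst)) {
            throw new IOException("Rename failed: " + src + " -> " + dst);
        }
    }

    public static void main(String[] args) throws IOException {
        File working = Files.createTempDirectory("working").toFile();
        File destination = Files.createTempDirectory("destination").toFile();

        // Build workingPath/columnFamily1/guid1 with some placeholder content.
        File src = new File(new File(working, "columnFamily1"), "guid1");
        src.getParentFile().mkdirs();
        Files.write(src.toPath(), "data".getBytes());

        moveNested(src, destination);
        System.out.println(new File(destination, "columnFamily1/guid1").exists());
    }
}
```

Without the mkdirs step, the rename in this sketch would be attempted into a directory that does not exist, which is the situation the quoted message describes; checking the return value turns the silent failure into an error.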



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
