crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: FileTargetImpl and nested outputs...
Date Fri, 26 Jul 2013 03:49:40 GMT
Agreed -- this is actually something that I had meant to do quite a while back when the FileNamingScheme
interface was introduced, but I never got around to it.

The idea around the FileNamingScheme is that a custom output structure can be given, and the
FileTargetImpl should respect the structure, creating sub-directories where needed. Going
further, the idea was that it would be possible to link partitioning information with a FileNamingScheme
to create a fanout based on some information in the partitions (which is probably exactly
what is needed for doing the HBase file writing as well).

- Gabriel

On 25 Jul 2013, at 20:03, Josh Wills <jwills@cloudera.com> wrote:

> I think that having Crunch handle it makes sense -- what do we do for Trevni targets right now? Don't they also create nested subdirectories?
> 
> J
> 
> 
> On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian <Brian.Dougan@cerner.com> wrote:
> Quick question.  In my attempt at an Hfile PathTarget implementation that extended from
FileTargetImpl, I ran into an issue with nested files and wanted to see what everyone's thoughts
were.  
> 
> First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase versions (though
this shouldn't matter for this question).
> 
> So, Hfiles require a structure something like this:
>   Destination path
>     {columnFamily}
>       {randomGUID1} (corresponding to an HBase region)
>       {randomGUID2} (corresponding to an HBase region)
>       Etc…
> So you get files in the working path like this:
>   workingpath/columnFamily1/guid1
>   workingpath/columnFamily1/guid2
>   workingpath/columnFamily1/…
> FileTargetImpl allows consumers to override the getSourcePattern and getDestFile to help
with this, so the source pattern is something like this
> Path(workingPath, "[^_]*/*")
> And the destination file is something like
> Path(destination / src.getParent.getName, src.getName)
> The issue is that FileTargetImpl doesn't create any nested folders before trying to do
the file rename (except for the top-level root server).  So for instance, it may try to do
something like copying from
> workingPath/columnFamily1/guid1
> To 
> destinationPath/columnFamily1/guid1
> But only destination path exists, not the nested columnFamily folder.  This makes the
rename silently fail and results in missing data in the destination path (the rename method
actually returns a boolean that should probably also be validated to alert on failures). 
> 
> So, my question is, should we look at getting an enhancement to FileTargetImpl that would
build any parent directories required (might also make sense to make sure it's a folder under
destination path) or is the expectation for FileTargetImpl that it's only supposed to be used
by internal Crunch targets, so copying functionality (and adding this enhancement) would be
a task for anyone wanting to develop a new PathTarget?
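
The enhancement proposed above (create any missing parent directories, then check the boolean returned by the rename) might look roughly like the sketch below. This is not FileTargetImpl's code. It uses java.io.File on the local filesystem as a stand-in so that it runs without Hadoop, and the class and method names (NestedRename, destFile, moveCreatingParents) are hypothetical. The same shape should carry over to Hadoop, where FileSystem.mkdirs and FileSystem.rename also report failure through a boolean return value.

```java
import java.io.File;

/** Local-filesystem sketch: create missing parent directories, then rename and check the result. */
final class NestedRename {

    /** Maps workingPath/{family}/{name} to destination/{family}/{name}. */
    static File destFile(File destination, File src) {
        File familyDir = new File(destination, src.getParentFile().getName());
        return new File(familyDir, src.getName());
    }

    /** Moves src to dst, creating dst's parent directories first and failing loudly. */
    static void moveCreatingParents(File src, File dst) {
        File parent = dst.getParentFile();
        // mkdirs() also returns false when the directory already exists,
        // so fall back to isDirectory() before treating it as an error.
        if (!parent.mkdirs() && !parent.isDirectory()) {
            throw new IllegalStateException("Could not create " + parent);
        }
        // renameTo() signals failure through its boolean result, not an exception,
        // which is how a missing parent directory can go unnoticed.
        if (!src.renameTo(dst)) {
            throw new IllegalStateException("Rename failed: " + src + " -> " + dst);
        }
    }
}
```

With this shape, a source of workingPath/columnFamily1/guid1 is moved to destinationPath/columnFamily1/guid1 even when only destinationPath exists beforehand, and a failed rename raises an exception instead of silently dropping data.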
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera
> Twitter: @josh_wills

