crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: FileTargetImpl and nested outputs...
Date Thu, 25 Jul 2013 18:07:47 GMT
They create multiple directories; however, the folder structure does not
need to be preserved, so in that case flattening the files is acceptable.


On Thu, Jul 25, 2013 at 1:03 PM, Josh Wills <jwills@cloudera.com> wrote:

> I think that having Crunch handle it makes sense-- what do we do for
> Trevni targets right now? Don't they also create nested subdirectories?
>
> J
>
>
> On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian <Brian.Dougan@cerner.com> wrote:
>
>>  Quick question.  In my attempt at an HFile PathTarget implementation
>> that extended from FileTargetImpl, I ran into an issue with nested files
>> and wanted to see what everyone's thoughts were.
>>
>>  First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1
>> Hadoop/HBase versions (though this shouldn't matter for this question).
>>
>>  So, HFiles require a structure something like this
>>
>>    - Destination path
>>       - {columnFamily}
>>          - {randomGUID1} (corresponding to an HBase region)
>>          - {randomGUID2} (corresponding to an HBase region)
>>          - Etc…
>>
>> So you get files in the working path like this
>>
>>    - workingpath/columnFamily1/guid1
>>    - workingpath/columnFamily1/guid2
>>    - workingpath/columnFamily1/…
>>
>> FileTargetImpl allows consumers to override the getSourcePattern and
>> getDestFile to help with this, so the source pattern is something like this
>>
>>    - Path(workingPath, "[^_]*/*")
>>
>> And the destination file is something like
>>
>>    - Path(destination / src.getParent.getName, src.getName)
>>
>> The issue is that FileTargetImpl doesn't create any nested folders before
>> trying to do the file rename (except for the top-level root server).  So
>> for instance, it may try to do something like copying from
>>
>>    - workingPath/columnFamily1/guid1
>>
>> To
>>
>>    - destinationPath/columnFamily1/guid1
>>
>> But only destination path exists, not the nested columnFamily folder.
>>  This makes the rename silently fail and results in missing data in the
>> destination path (the rename method actually returns a boolean that should
>> probably also be validated to alert on failures).
>>
>>  So, my question is: should we look at getting an enhancement to
>> FileTargetImpl that would build any parent directories required (it might
>> also make sense to make sure it's a folder under the destination path)? Or
>> is the expectation that FileTargetImpl is only supposed to be used by
>> internal Crunch targets, so copying the functionality (and adding this
>> enhancement) would be a task for anyone wanting to develop a new PathTarget?
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
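
The enhancement discussed in the quoted message (create the missing parent directories, then check the result of the rename) could look roughly like the sketch below. It is not Crunch code and not the FileTargetImpl implementation. It uses java.io.File on the local filesystem purely as a stand-in, because its mkdirs() and renameTo() report failure through a boolean return value, much like the mkdirs and rename methods on Hadoop's FileSystem. The names NestedRename, destFor and moveChecked are hypothetical and exist only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Local-filesystem stand-in for the move step described in the thread.
// Assumption: java.io.File is used only because its mkdirs() and renameTo()
// signal failure with a boolean, not because Crunch works this way.
class NestedRename {

    // Mirrors the mapping from the thread:
    // destination/<name of src's parent directory>/<src file name>.
    static File destFor(File destinationRoot, File src) {
        File family = new File(destinationRoot, src.getParentFile().getName());
        return new File(family, src.getName());
    }

    // Creates any missing parent directories, then renames.
    // Both boolean results are checked so a failure is never silent.
    static void moveChecked(File src, File dest) throws IOException {
        File parent = dest.getParentFile();
        if (!parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create " + parent);
        }
        if (!src.renameTo(dest)) {
            throw new IOException("Rename failed: " + src + " -> " + dest);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build workingPath/columnFamily1/guid1 in temporary directories.
        File working = Files.createTempDirectory("workingPath").toFile();
        File destination = Files.createTempDirectory("destinationPath").toFile();
        File family = new File(working, "columnFamily1");
        if (!family.mkdirs()) {
            throw new IOException("Could not create " + family);
        }
        File src = new File(family, "guid1");
        if (!src.createNewFile()) {
            throw new IOException("Could not create " + src);
        }

        // destinationPath/columnFamily1 does not exist yet; moveChecked creates it.
        File dest = destFor(destination, src);
        moveChecked(src, dest);
        System.out.println("moved to " + dest + ", exists: " + dest.isFile());
    }
}
```

In a real PathTarget the same two checks would apply to the Hadoop FileSystem calls instead, and a further check that the computed destination stays under the destination path could be added, as the original message suggests.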
