crunch-dev mailing list archives

From "Adric Eckstein (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-543) AvroPathPerKeyTarget copy nested subdirectories
Date Mon, 19 Oct 2015 14:57:05 GMT


Adric Eckstein commented on CRUNCH-543:

I see what you mean about having all those writers open, and grouping by key is certainly
the safest way.  However, avoiding the grouping can save a lot of time, especially if you have
a large amount of data for a single key (which would kill all the parallelism).  This led
me to try using it on an ungrouped PCollection.  However, because my keys were not necessarily
sorted, it was constantly opening and closing writers, which I think was leading to some bad

When I made these changes, it seemed to fix the problem, so you could write out without grouping
(making it substantially faster for the case mentioned above).  It seems to work well with a couple
hundred files open simultaneously, though how many files are open at once obviously depends on the
input data.
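
To illustrate the idea of keeping one writer open per key rather than reopening a writer whenever the key changes, here is a minimal sketch in plain Java. It is not the code from the attached patches: the class PerKeyWriters and the WriterFactory interface are hypothetical names made up for this example, and it uses java.io.Writer as a stand-in for the Avro writers.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the AvroPathPerKeyTarget implementation. */
final class PerKeyWriters implements AutoCloseable {

    /** Hypothetical factory that opens the output for a given key. */
    interface WriterFactory {
        Writer open(String key) throws IOException;
    }

    private final WriterFactory factory;
    // One open writer per key seen so far; memory and file handles grow with the key count.
    private final Map<String, Writer> openWriters = new HashMap<>();
    private int opens = 0;

    PerKeyWriters(WriterFactory factory) {
        this.factory = factory;
    }

    /** Writes a value under its key, opening a writer only the first time the key is seen. */
    void write(String key, String value) throws IOException {
        Writer writer = openWriters.get(key);
        if (writer == null) {
            writer = factory.open(key);
            openWriters.put(key, writer);
            opens++;
        }
        writer.write(value);
        writer.write('\n');
    }

    /** Number of writers opened so far (one per distinct key). */
    int openCount() {
        return opens;
    }

    /** Closes every writer once, at the end of the input. */
    @Override
    public void close() throws IOException {
        for (Writer writer : openWriters.values()) {
            writer.close();
        }
        openWriters.clear();
    }
}
```

With unsorted keys such as a, b, a, this opens two writers in total instead of three. The cost is that every distinct key holds a writer open until close() is called.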

> AvroPathPerKeyTarget copy nested subdirectories
> -----------------------------------------------
>                 Key: CRUNCH-543
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: IO
>            Reporter: Adric Eckstein
>            Assignee: Josh Wills
>             Fix For: 0.13.0
>         Attachments: CRUNCH-543.patch, CRUNCH-543b.patch, CRUNCH-543c.patch
> When using AvroPathPerKeyTarget to write out a subpath in the output directory using
> a String key, the key might indicate multiple subfolders:
> Pair<String, String> kv = new Pair<String, String>("foo/bar", "value");
> PTable<String, String> kvs = pipeline.create(Arrays.asList(kv),Avros.tableOf(Avros.strings(),
> PTables.asPTable(kvs).write(new AvroPathPerKeyTarget("output"));
> This throws the error:
> java.lang.IllegalArgumentException: Reducer output name 'bar' cannot be parsed
> 	at$CompletionHook.handleMultiPaths(
> ...
> In AvroPathPerKeyTarget the handleOutputs method would need to recursively copy subfolders
> (currently only checks first level in output directory) to enable keys that define multiple
> subfolders.
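
The recursive copy described above could look roughly like the following sketch. It uses java.nio.file on a local file system purely as a stand-in, whereas Crunch itself works against the Hadoop FileSystem API. The class NestedCopy and the method copyTree are hypothetical names for this example and are not part of Crunch or of the attached patches.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration only; local-filesystem stand-in for the output handling. */
final class NestedCopy {

    /**
     * Copies every regular file under srcRoot to the same relative path under dstRoot,
     * at any depth, so a key such as "foo/bar" keeps its nested directory.
     */
    static void copyTree(Path srcRoot, Path dstRoot) throws IOException {
        List<Path> files;
        // Files.walk descends into all subdirectories, not just the first level.
        try (Stream<Path> walk = Files.walk(srcRoot)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            Path target = dstRoot.resolve(srcRoot.relativize(file));
            // Create intermediate directories such as dstRoot/foo/bar before copying.
            Files.createDirectories(target.getParent());
            Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The point of the sketch is only that the relative path of each file is preserved in full, rather than assuming every output file sits exactly one directory below the working directory.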

This message was sent by Atlassian JIRA
