crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Binning operation for the generation of Hive partitioned data
Date Tue, 22 Apr 2014 11:31:14 GMT
Hi Elliot,

On Tue, Apr 22, 2014 at 1:11 PM, Elliot West <teabot@gmail.com> wrote:
> Hello,
>
> I'm evaluating Apache Crunch as a possible replacement for some of our data
> processing frameworks that run on Hadoop. I can find crunch constructs that
> map to most types of operation that we perform in our processes. However, we
> frequently bin data by a date field for the purpose of generating
> partitioned Hive tables - a fairly common operation I believe. I can't find
> a similar binning operation in the crunch user manual and was wondering
> if/how this would be achieved with Apache Crunch?

Crunch currently has some support for this, provided that you're using
Avro for your output files.

The AvroPathPerKeyTarget[1] takes a PTable<String,T>, where T is a
type that can be serialized by Avro, and writes the Avro values in a
subdirectory whose name is given by the String value for that record
in the PTable. As pointed out in the javadoc for AvroPathPerKeyTarget,
it's a good idea to ensure that all values for the same key are
together (i.e. that the elements in the PTable are sorted by key)
before using the AvroPathPerKeyTarget.
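
As a rough illustration of the shape of the data involved, here is a
sketch in plain Java with no Crunch dependency, so none of it is Crunch
API. It derives a date key from each record, keeps all values for the
same key together by sorting on the key, and computes the subdirectory
each key would map to. The Event class, the timestamp format, and the
helpers binKey, binByDate and outputDir are all hypothetical names made
up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DateBinningSketch {

    // Hypothetical record type: a timestamp string plus some payload.
    public static final class Event {
        public final String timestamp; // assumed format: yyyy-MM-dd'T'HH:mm:ss'Z'
        public final String payload;

        public Event(String timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Derive the bin key from the date part of the timestamp, e.g. "2014-04-22".
    public static String binKey(Event e) {
        return e.timestamp.substring(0, 10);
    }

    // Group events by key. The TreeMap keeps keys sorted, so all values
    // for the same key end up together, in key order.
    public static Map<String, List<Event>> binByDate(List<Event> events) {
        Map<String, List<Event>> bins = new TreeMap<>();
        for (Event e : events) {
            bins.computeIfAbsent(binKey(e), k -> new ArrayList<>()).add(e);
        }
        return bins;
    }

    // The subdirectory a key maps to: the key is used as the directory name.
    public static String outputDir(String base, String key) {
        return base + "/" + key;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("2014-04-22T11:31:14Z", "a"),
                new Event("2014-04-21T09:00:00Z", "b"),
                new Event("2014-04-22T13:11:00Z", "c"));

        // "/data/events" is an assumed base path for illustration only.
        for (Map.Entry<String, List<Event>> bin : binByDate(events).entrySet()) {
            System.out.println(outputDir("/data/events", bin.getKey())
                    + " <- " + bin.getValue().size() + " record(s)");
        }
    }
}
```

In a real pipeline, the equivalent keyed and sorted data would be the
PTable<String,T> that is handed to the AvroPathPerKeyTarget.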

- Gabriel

1. http://crunch.apache.org/apidocs/0.9.0/org/apache/crunch/io/avro/AvroPathPerKeyTarget.html
