incubator-crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Confusion regarding SeqFileSource
Date Tue, 04 Dec 2012 21:12:40 GMT
You're welcome, and thanks for the feedback. As part of documenting the
From.java class, I should add functions that do the Writables.writables
part for you (i.e., you just pass in the Class<K extends Writable> and
Class<V extends Writable> arguments) to make that easier to get rolling
with. I'll add a JIRA for it.

J


On Tue, Dec 4, 2012 at 12:53 PM, Mike Barretta <mike.barretta@gmail.com> wrote:

> Josh, thank you, that did help.  I'd found the From class, but not the
> Writables.writables.
>
>
> On Mon, Dec 3, 2012 at 5:42 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Mike,
>>
>> Sorry about that, it's mainly b/c they're tedious to write and I've been
>> lazy about it. Here's the skinny.
>>
>> For the SeqFileSource, we assume that you're only interested in the
>> "value" portion of the key-value pair for each record in the SequenceFile.
>> The PType<T> should be for whatever data type you expect to read from that
>> value, which is probably a class that implements Writable. The easy way to
>> do it is to do:
>>
>> import static org.apache.crunch.types.writable.Writables.writables;
>>
>> import org.apache.crunch.io.From;
>>
>> // This reads the value and ignores the key in each record
>> PCollection<MyWritable> in = pipeline.read(From.sequenceFile(<path>,
>> writables(MyWritable.class)));
>>
>> If you want both the key and the value, you need to read the SequenceFile
>> as a PTable<K, V>, like this:
>>
>> PTable<MyKey, MyValue> in = pipeline.read(From.sequenceFile(<path>,
>> writables(MyKey.class), writables(MyValue.class)));
>>
>> After you read in the values, you're free to convert them to whatever
>> types you like using parallelDo and friends. I especially recommend using
>> the Avro-based PTypeFamily, since it will significantly outperform the
>> Writable family on jobs that involve complex joins or aggregations.
>>
>> Hope that helps, feel free to send follow-ups.
>>
>> Josh
>>
>>
>>
>> On Mon, Dec 3, 2012 at 2:25 PM, Mike Barretta <mike.barretta@gmail.com> wrote:
>>
>>> As there are no examples on using non-text files as input, I'm trying to
>>> piece together the steps involved in reading in sequence data.
>>>
>>> The main piece looks to be the SeqFileSource (as of 0.5 snapshot) which
>>> takes a path and a PType.  The PType is where my confusion begins.
>>>
>>> How does PType relate to InputFormat and OutputFormat? Do I need to
>>> implement my own PTypes and the associated in/out MapFns?
>>>
>>> Thanks,
>>> Mike
>>>
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>
>
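
The steps described in the quoted reply (read only the values of a SequenceFile, then convert them with parallelDo) can be sketched end to end roughly as follows. This is a minimal sketch, not code from the thread. It assumes the Crunch and Hadoop jars are on the classpath and that the SequenceFile values are org.apache.hadoop.io.Text. The class name SeqFileValuesExample, the helper valuesOf, the TextToString function, and the use of args[0] and args[1] as input and output paths are all hypothetical names chosen for illustration. The helper only approximates the kind of Class-based convenience function mentioned above; it is not an existing method on From as far as this thread shows.

```java
import static org.apache.crunch.types.writable.Writables.strings;
import static org.apache.crunch.types.writable.Writables.writables;

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Source;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class SeqFileValuesExample {

  // Hypothetical convenience wrapper: takes the Writable class directly and
  // does the Writables.writables(...) step for the caller.
  static <V extends Writable> Source<V> valuesOf(String path, Class<V> valueClass) {
    return From.sequenceFile(path, writables(valueClass));
  }

  // Converts each Text value read from the SequenceFile into a plain String.
  static class TextToString extends MapFn<Text, String> {
    @Override
    public String map(Text value) {
      return value.toString();
    }
  }

  public static void main(String[] args) throws Exception {
    String inputPath = args[0];   // assumed: path to an existing SequenceFile
    String outputPath = args[1];  // assumed: output directory for text files

    Pipeline pipeline = new MRPipeline(SeqFileValuesExample.class);

    // Reads only the value of each record; the key is ignored.
    PCollection<Text> values = pipeline.read(valuesOf(inputPath, Text.class));

    // Converts the Writable values to Strings using the Writable type family.
    PCollection<String> lines = values.parallelDo(new TextToString(), strings());

    pipeline.writeTextFile(lines, outputPath);
    pipeline.done();
  }
}
```

The conversion step here stays in the Writable type family for brevity. Swapping strings() for an Avro-based PType would follow the recommendation in the quoted reply.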
