crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Byte Offset for Records
Date Tue, 05 May 2015 06:53:05 GMT
Any Source<T> can be used as the input to an MR/Spark job via
Pipeline.read, but a ReadableSource<T> can also read its data into the local
client. I'm assuming you're hitting an error trying to use your
formattedFile source with a MemPipeline job? MemPipeline requires
ReadableSources since everything it does runs client-side, while MRPipeline
and SparkPipeline are happy to use regular Sources, like the one returned
by formattedFile.
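
To make the distinction concrete, here is a rough toy model in plain Java. It is only a sketch: the interfaces and the clusterRead/memRead helpers are hypothetical stand-ins written for illustration, not Crunch's actual classes, and the real MemPipeline may report the problem differently. It shows why a client-side pipeline has to reject a source that does not know how to read itself locally, while a cluster-backed pipeline can accept either kind.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: every type here is a hypothetical stand-in, not a Crunch class.
final class SourceKindsSketch {

    /** Stand-in for a source that only a cluster job knows how to consume. */
    interface Source<T> {
        String describe();
    }

    /** Stand-in for a source that can also be read inside the client JVM. */
    interface ReadableSource<T> extends Source<T> {
        Iterable<T> read();
    }

    /** Mimics a cluster-backed pipeline: any Source is acceptable as input. */
    static <T> String clusterRead(Source<T> source) {
        return "planned job over " + source.describe();
    }

    /** Mimics an in-memory pipeline: it has to pull the records locally. */
    static <T> Iterable<T> memRead(Source<T> source) {
        if (source instanceof ReadableSource) {
            return ((ReadableSource<T>) source).read();
        }
        throw new IllegalStateException("not readable client-side: " + source.describe());
    }

    /** A source with no local read path (the role formattedFile's result plays here). */
    static Source<String> plainSource(String name) {
        return () -> name;
    }

    /** A source that can hand its records straight to the client. */
    static ReadableSource<String> readableSource(String name, List<String> records) {
        return new ReadableSource<String>() {
            @Override public String describe() { return name; }
            @Override public Iterable<String> read() { return records; }
        };
    }

    public static void main(String[] args) {
        Source<String> plain = plainSource("formatted-file");
        ReadableSource<String> readable = readableSource("text-file", Arrays.asList("a", "b"));

        System.out.println(clusterRead(plain));    // accepted: cluster jobs take any Source
        System.out.println(memRead(readable));     // accepted: records are read locally
        try {
            memRead(plain);                        // rejected: nothing to read client-side
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the only way to get a plain source through the in-memory path is to give it a local read method, which is the same question raised below about formattedFile.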

The next question you would ask is "why doesn't formattedFile return a
ReadableSource<T>?" -- and it's a good one. I don't remember if there's a
good reason for it or if I was just being lazy. Will take a look and report
back.

J

On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <jeff@nunahealth.com> wrote:

> Hello,
>
> I would like to know the byte offset (absolute offset, not relative to
> split) for each record inside of my crunch pipeline.
>
> My planned approach is to use a custom `InputFormat` class.
>
> I have tried using `From#formattedFile` to apply a custom
> `InputFormat` class, however the returned class does not implement
> `ReadableSource`, and thus cannot be used as a parameter for
> `Pipeline#read`.
>
> What is the purpose of the `From#formattedFile` method if the Source it
> returns cannot actually be read? Is using a custom `InputFormat` class
> possible or recommended?
>
> Thanks,
>
> Jeff Quinn
> Data Engineer
> Nuna




-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
