crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Byte Offset for Records
Date Tue, 05 May 2015 07:14:51 GMT
On Tue, May 5, 2015 at 9:08 AM, Jeff Quinn <jeff@nunahealth.com> wrote:

> Josh,
>
> Thanks so much for your response. You’re correct: I hit the error while
> using the MemPipeline. The difference between Source and ReadableSource
> makes much more sense to me now.
>
> It sounds like I just need to implement ReadableSource and override the
> #read and #asReadable methods with behavior that is equivalent to how my
> `InputFormat` would act. Then I should be able to use my `InputFormat` in
> my test suite with MemPipeline, and in my real pipeline I can rest assured
> those methods will never be called.
>

That will work, but I still think the right thing to do is to make those
formattedFile impls support ReadableSource. And there are definitely places
in the MRPipeline and MemPipeline where ReadableSources would be useful
w/formattedFiles (e.g., mapside joins) that we don't support right now.
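
For the client-side read path discussed above, here is a minimal sketch of the offset bookkeeping that a `#read` implementation would need to reproduce. It is plain Java with no Crunch or Hadoop dependency, and every name in it (`OffsetReader`, `OffsetRecord`, `readWithOffsets`) is hypothetical, not part of any library. It assumes records are terminated by a single `\n` byte and that the stream starts at byte 0 of the file, so the offsets it reports are absolute rather than relative to a split.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Crunch API): pairs each newline-delimited
// record with the absolute byte offset at which it starts.
public class OffsetReader {

    /** A record and the absolute byte offset of its first byte. */
    public static final class OffsetRecord {
        public final long offset;
        public final String line;

        OffsetRecord(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /**
     * Reads the whole stream and returns one OffsetRecord per line.
     * Assumes '\n' terminators and UTF-8 content; a trailing record
     * without a terminator is still emitted.
     */
    public static List<OffsetRecord> readWithOffsets(InputStream in) throws IOException {
        List<OffsetRecord> records = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        long position = 0;    // bytes consumed so far
        long recordStart = 0; // offset of the record being accumulated
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                records.add(new OffsetRecord(recordStart,
                        new String(current.toByteArray(), StandardCharsets.UTF_8)));
                current.reset();
                recordStart = position; // next record starts after the newline
            } else {
                current.write(b);
            }
        }
        if (current.size() > 0) {
            records.add(new OffsetRecord(recordStart,
                    new String(current.toByteArray(), StandardCharsets.UTF_8)));
        }
        return records;
    }
}
```

A test-only source could wrap something like this around the file's input stream, so that the records seen under MemPipeline carry the same offsets the custom `InputFormat` would produce on the cluster. Reading byte by byte is slow and is only meant to keep the sketch short.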


>
> Best,
>
> Jeff
>
> On May 4, 2015, at 11:53 PM, Josh Wills <jwills@cloudera.com> wrote:
>
> Any Source<T> can be used as the input to an MR/Spark job via
> Pipeline.read, but a ReadableSource<T> can read data into the local client
> as well-- I'm assuming you're hitting an error trying to use your
> formattedFile source w/a MemPipeline job? MemPipeline requires
> ReadableSources since everything it does runs client-side, while MRPipeline
> and SparkPipeline are happy to use regular Sources, like the one returned
> by formattedFile.
>
> The next question you would ask is "why doesn't formattedFile return a
> ReadableSource<T>?" -- and it's a good one. I don't remember if there's a
> good reason for it or if I was just being lazy. Will take a look and report
> back.
>
> J
>
> On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <jeff@nunahealth.com> wrote:
>
>> Hello,
>>
>> I would like to know the byte offset (absolute offset, not relative to
>> split) for each record inside of my crunch pipeline.
>>
>> My planned approach is to use a custom `InputFormat` class.
>>
>> I have tried using `From#formattedFile` to apply a custom
>> `InputFormat` class, however the returned class does not implement
>> `ReadableSource`, and thus cannot be used as a parameter for
>> `Pipeline#read`.
>>
>> What is the purpose of the `From#formattedFile` method if the Source
>> it returns cannot actually be read? Is using a custom
>> `InputFormat` class possible or recommended?
>>
>> Thanks,
>>
>> Jeff Quinn
>> Data Engineer
>> Nuna
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com/>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
