gobblin-user mailing list archives

From Vicky Kak <vicky....@gmail.com>
Subject Re: Partition meta data not present.
Date Wed, 06 Sep 2017 12:20:38 GMT
Hey guys,

I have checked in sample code demonstrating the pattern explained above.

I will add documentation for it soon. Please note that it is just a quick
hack to demonstrate the pattern explained in the email chain.


On Tue, Sep 5, 2017 at 6:48 PM, Vicky Kak <vicky.kak@gmail.com> wrote:

> I cannot see this email yet in the email archive here:
> https://lists.apache.org/list.html?user@gobblin.incubator.apache.org
> Could someone take note of this and get it working?
> Thanks,
> Vicky
> On Wed, Aug 30, 2017 at 4:08 PM, Vicky Kak <vicky.kak@gmail.com> wrote:
>> Hi Guys,
>> We have a use case where there is no metadata about the data to be
>> processed in Gobblin. We need to read the whole data chunk and only then
>> create partitions. I would be interested to know how others are
>> addressing this. Let me explain with generic sample data: assume that we
>> have data D with N records in it. We do the following:
>> 1) In the Source implementation we pull all of the data D using a REST
>> API. With the N records in hand, we create n work units of M records
>> each, so that n * M = N.
>> 2) We pass the starting id to each WorkUnit via the SourceState.
>> 3) Each WorkUnit makes a redundant REST call to fetch its subset of D,
>> starting from the id passed to it by the Source.
>> So there is 1 REST call in the Source and n REST calls to get the data,
>> a total of n+1 calls, although the data could be fetched by the single
>> call in the Source.
>> What I am thinking of is keeping the data D in memory (it would have to
>> be distributed memory in the YARN case) and passing a reference to it to
>> the WorkUnits for processing. However, I would like to know how others
>> are addressing this. This could be one of the data patterns that Gobblin
>> processes.
>> Maybe we could have a document explaining the various data patterns,
>> how to partition them, and how to use them in Gobblin.
>> Thanks,
>> Vicky
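
The partitioning arithmetic in the quoted message (n work units of M records each, covering N records fetched once) can be sketched roughly as below. This is only a minimal illustration in plain Java and does not use any Gobblin API: the class `PartitionSketch`, the `Range` type, and the `partition` and `slice` methods are hypothetical names made up for this example, and the in-memory list stands in for the data D returned by the single REST call. It works only inside one JVM; the YARN case would need the distributed memory mentioned in the message.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    /** Half-open range [startId, endId) of record positions for one work unit. */
    public static final class Range {
        public final int startId;
        public final int endId;

        Range(int startId, int endId) {
            this.startId = startId;
            this.endId = endId;
        }
    }

    /** Splits totalRecords (N) into ranges of at most recordsPerUnit (M) each. */
    public static List<Range> partition(int totalRecords, int recordsPerUnit) {
        if (totalRecords < 0 || recordsPerUnit <= 0) {
            throw new IllegalArgumentException(
                "need totalRecords >= 0 and recordsPerUnit > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (int start = 0; start < totalRecords; start += recordsPerUnit) {
            // The last range may be shorter when N is not a multiple of M.
            int end = Math.min(start + recordsPerUnit, totalRecords);
            ranges.add(new Range(start, end));
        }
        return ranges;
    }

    /** Copies one range out of data that was already fetched; no second fetch. */
    public static <T> List<T> slice(List<T> fetchedOnce, Range range) {
        return new ArrayList<>(fetchedOnce.subList(range.startId, range.endId));
    }

    public static void main(String[] args) {
        // Stand-in for the data D pulled once in the Source (assumption).
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        for (Range r : partition(data.size(), 4)) {
            System.out.println(r.startId + ".." + r.endId + " -> " + slice(data, r));
        }
    }
}
```

With N = 10 and M = 4 this yields three ranges, the last holding only two records. Each work unit reads its slice from the data fetched once, so the n extra REST calls described in step 3 are not needed in this single-process setting.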
