apex-users mailing list archives

From Thomas Weise <thomas.we...@gmail.com>
Subject Re: Kafka 0.9 operator to start consuming from a particular offset
Date Sat, 11 Jun 2016 04:14:29 GMT
Ananth,

If your goal is to merge the Parquet files, then why not use those files as
the source rather than going back to Kafka?

Thomas

On Fri, Jun 10, 2016 at 4:42 PM, Ananth Gundabattula <
agundabattula@gmail.com> wrote:

> Thanks for the thoughts, Siyuan.
>
> Yes, I agree that the problem is inherently batch oriented. We are hoping
> to build on the window concepts to simulate a batch design. (The primary
> reason is that we do not want two different ETL processing platforms
> within our ecosystem.)
>
> We are using Kafka as the source of data that multiple data processing
> frameworks (ETL, M/L frameworks, etc.) run through. Hence Kafka is used
> both for streaming use cases (primarily ETL, the Apex system) and batch
> use cases (primarily M/L).
>
> I shall create a ticket.
>
> Regards,
> Ananth
>
>
>
> On Sat, Jun 11, 2016 at 7:15 AM, hsy541@gmail.com <hsy541@gmail.com>
> wrote:
>
>> Hi Ananth,
>> Unlike files, Kafka is usually used for streaming cases. Correct me if
>> I'm wrong, but your use case seems like batch processing. We didn't
>> consider an end offset in our Kafka input operator design, but it could
>> be a useful feature. Unfortunately, as far as I know, there is no easy
>> way to extend the existing operator to achieve that.
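>>
>> For illustration, a minimal sketch (not part of the operator API) of the
>> kind of bounded read an end-offset feature would have to do, using the
>> raw kafka-clients 0.9 consumer. The broker address, topic, partition and
>> offset range below are all hypothetical:
>>
>> import java.util.Collections;
>> import java.util.Properties;
>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>> import org.apache.kafka.clients.consumer.ConsumerRecords;
>> import org.apache.kafka.clients.consumer.KafkaConsumer;
>> import org.apache.kafka.common.TopicPartition;
>>
>> public class BoundedReplay
>> {
>>   public static void main(String[] args)
>>   {
>>     Properties props = new Properties();
>>     props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
>>     props.put("key.deserializer",
>>         "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>>     props.put("value.deserializer",
>>         "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>>
>>     try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>>       TopicPartition tp = new TopicPartition("events", 0);  // hypothetical
>>       long startOffset = 1000L;  // hypothetical start of the range
>>       long endOffset = 2000L;    // hypothetical end (exclusive)
>>
>>       // Manual assignment: no consumer group rebalancing involved.
>>       consumer.assign(Collections.singletonList(tp));
>>       // Begin exactly at the requested start offset.
>>       consumer.seek(tp, startOffset);
>>
>>       long next = startOffset;
>>       while (next < endOffset) {
>>         ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
>>         for (ConsumerRecord<byte[], byte[]> record : records) {
>>           if (record.offset() >= endOffset) {
>>             next = endOffset;  // reached the end of the range; stop
>>             break;
>>           }
>>           // Real processing (e.g. writing to Parquet) would go here.
>>           System.out.println("offset " + record.offset());
>>           next = record.offset() + 1;
>>         }
>>       }
>>     }
>>   }
>> }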
>>
>> OffsetManager is not designed for an end offset. It is only a
>> customizable callback to update the committed offsets, and the start
>> offsets it loads are intended for stateful application restart.
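>>
>> To make that shape concrete, a rough sketch of a custom OffsetManager,
>> assuming the two-method interface (loadInitialOffsets / updateOffsets)
>> of the 0.8 operator in Malhar. Note there is no hook for an end offset:
>>
>> import java.util.HashMap;
>> import java.util.Map;
>> import com.datatorrent.contrib.kafka.KafkaPartition;
>> import com.datatorrent.contrib.kafka.OffsetManager;
>>
>> public class FixedStartOffsetManager implements OffsetManager
>> {
>>   private final Map<KafkaPartition, Long> startOffsets;
>>
>>   public FixedStartOffsetManager(Map<KafkaPartition, Long> startOffsets)
>>   {
>>     this.startOffsets = new HashMap<>(startOffsets);
>>   }
>>
>>   @Override
>>   public Map<KafkaPartition, Long> loadInitialOffsets()
>>   {
>>     // Consulted when the operator starts or restarts.
>>     return startOffsets;
>>   }
>>
>>   @Override
>>   public void updateOffsets(Map<KafkaPartition, Long> offsetsOfPartitions)
>>   {
>>     // Callback with the offsets the operator has processed so far;
>>     // persist them here if the application must survive a restart.
>>   }
>> }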
>>
>> Can you create a ticket and elaborate on your use case there? Thanks!
>>
>> Regards,
>> Siyuan
>>
>> On Friday, June 10, 2016, Ananth Gundabattula <agundabattula@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I was wondering what the community's thoughts would be on the
>>> following?
>>>
>>> We are using the Kafka 0.9 input operator to read from a few topics, and
>>> we are using this stream to generate Parquet files. This approach is fine
>>> for a beginner's use case. At a later point in time, we would like to
>>> "merge" all of the Parquet files previously generated, and for this I
>>> would like to reprocess data from exactly a particular offset inside
>>> each of the partitions. Each partition will have its own starting and
>>> ending offsets that I need to process.
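>>>
>>> For concreteness, the bookkeeping this needs is just a per-partition
>>> offset range recorded when each Parquet file is written; a minimal
>>> sketch (all names and values hypothetical):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>> import org.apache.kafka.common.TopicPartition;
>>>
>>> public class ReplayRanges
>>> {
>>>   // Start (inclusive) and end (exclusive) offsets per partition,
>>>   // captured at the time each Parquet file was generated.
>>>   public static Map<TopicPartition, long[]> ranges()
>>>   {
>>>     Map<TopicPartition, long[]> ranges = new HashMap<>();
>>>     ranges.put(new TopicPartition("events", 0), new long[] {1000L, 2000L});
>>>     ranges.put(new TopicPartition("events", 1), new long[] {1500L, 2500L});
>>>     return ranges;
>>>   }
>>> }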
>>>
>>> I was wondering if there is an easy way to extend the Kafka 0.9 operator
>>> (perhaps along the lines of the OffsetManager in the 0.8 version of the
>>> Kafka operator). Thoughts, please?
>>>
>>> Regards,
>>> Ananth
>>>
>>
>
