incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: new queue capability
Date Mon, 03 Mar 2014 14:34:25 GMT
Not yet.  Tim and I have discussed the need for a higher-level API: one
that would be similar to the existing one, but where the implementor
wouldn't have to know about the partitioning in the underlying table.  It
would reside in the Blur controller and deal with all the partitioning,
failures, etc.  If you have any feedback on what should change in the API
(QueueReader) for ease of use, that would be greatly appreciated.  Since I
am getting blasted with snow right now, I plan on working on this later
today.  Thanks!
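[A rough sketch of what such a higher-level reader could look like. All names here (SimpleQueueReader, take/success/failure) are assumptions for illustration, not the actual Blur API; the point is that the implementor only hands over batches of mutates, while the controller would own partitioning, shard routing, and failure handling.]

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class HighLevelQueueSketch {

    // What a client might implement instead of the shard-level QueueReader.
    public interface SimpleQueueReader<T> {
        List<T> take(int max);   // next batch of mutates, possibly empty
        void success();          // the last batch was fully indexed
        void failure();          // the last batch should be replayed
    }

    // Minimal in-memory implementation, for illustration only.
    public static class InMemoryReader implements SimpleQueueReader<String> {
        private final Queue<String> pending = new LinkedList<String>();
        private List<String> lastBatch = new ArrayList<String>();

        public InMemoryReader(List<String> messages) {
            pending.addAll(messages);
        }

        public List<String> take(int max) {
            lastBatch = new ArrayList<String>();
            while (lastBatch.size() < max && !pending.isEmpty()) {
                lastBatch.add(pending.poll());
            }
            return lastBatch;
        }

        public void success() {
            // Drop the acknowledged batch.
            lastBatch = new ArrayList<String>();
        }

        public void failure() {
            // Put the failed batch back so it gets replayed.
            pending.addAll(lastBatch);
            lastBatch = new ArrayList<String>();
        }
    }
}
```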

Aaron


On Mon, Mar 3, 2014 at 6:26 AM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> Hi,
>
> Is there a simpler client API coming for indexing data from Kafka into
> Blur? As I mentioned below, I played with the Queue API. What I missed
> there was a partitioner to channel the Kafka stream to the correct shard.
> Shard failover also needs to be handled if I implement the queue logic at
> the shard level.
>
> Regards,
> Dibyendu
>
>
> On Fri, Feb 28, 2014 at 5:32 PM, Dibyendu Bhattacharya <
> dibyendu.bhattachary@gmail.com> wrote:
>
>> Hi,
>>
>> I was just playing with the new QueueReader API, and as Tim pointed out,
>> it's very low level. I still went ahead and implemented a KafkaConsumer.
>>
>> Here is my use case. Let me know if I have approached correctly.
>>
>> I have a topic in Kafka with 3 partitions, and in Blur I have a table
>> with 2 shards. I need to index all messages from the Kafka topic into
>> the Blur table.
>>
>> I have used the Kafka ConsumerGroup API to consume in parallel in 2
>> streams (from the 3 partitions) for indexing into the 2 Blur shards.
>> Since the ConsumerGroup API lets me split any Kafka topic into N
>> streams, I can choose N to match my target shard count, here 2.
>>
>> For the two shards I created two ShardContexts and two
>> BlurIndexSimpleWriters. (Is this okay?)
>>
>> Now, I modified BlurIndexSimpleWriter to get a handle to the
>> _queueReader object. I used this _queueReader to populate each shard's
>> queue, taking messages from the KafkaStreams.
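[The wiring described above can be sketched roughly as below. This is an illustrative stand-in only: BlockingQueue plays the role of the per-shard queue behind each BlurIndexSimpleWriter, the routing hash stands in for the real partitioner, and the message source is a plain list rather than actual Kafka consumer streams.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShardQueueWiring {
    // Create one queue per shard and route each message to its shard's
    // queue by a stable hash of the message key.
    public static List<BlockingQueue<String>> wire(List<String> messages,
                                                   int numShards) {
        List<BlockingQueue<String>> shardQueues =
                new ArrayList<BlockingQueue<String>>();
        for (int i = 0; i < numShards; i++) {
            shardQueues.add(new ArrayBlockingQueue<String>(1024));
        }
        for (String msg : messages) {
            int shard = (msg.hashCode() & Integer.MAX_VALUE) % numShards;
            shardQueues.get(shard).add(msg);
        }
        return shardQueues;
    }
}
```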
>>
>> Here is the test case (KafkaReaderTest), the KafkaStreamReader (which
>> reads the Kafka stream), and the KafkaQueueReader (the queue interface
>> for Blur).
>>
>> Also attached the modified BlurIndexSimpleWriter. Just added
>>
>>   public QueueReader getQueueReader() {
>>     return _queueReader;
>>   }
>>
>> With these changes, I am able to read Kafka messages in parallel streams
>> and index them into the 2 shards. All documents from Kafka get indexed
>> properly, and after the test cases run I can see two index directories
>> under the two paths I created.
>>
>> Let me know if this approach is correct. In this code I have not handled
>> the shard-failure logic, and as Tim pointed out, it would be great if
>> that could be abstracted from the client.
>>
>> Regards,
>> Dibyendu
>>
>>
>>
>>
>> On Thu, Feb 27, 2014 at 9:40 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>>
>>> What if we provide an implementation of the QueueReader concept that
>>> does what you are discussing?  That way, in the more extreme cases where
>>> the user is forced into implementing the lower-level API (perhaps for
>>> performance), they can still do so, but in the normal case the
>>> partitioning (and other difficult issues) are handled by the
>>> controllers.
>>>
>>> I could see adding an enqueueMutate call to the controllers that pushes
>>> the mutates to the correct buckets for the user.  At the same time we
>>> could allow each of the controllers to pull from an external queue and
>>> push the mutates to the correct buckets for the shards.  I could see a
>>> couple of different ways of handling this.
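[One reading of the enqueueMutate idea, sketched in memory. The class and method names are hypothetical and the buckets are plain lists standing in for per-shard queues; the point is that the controller, not the client, picks the bucket.]

```java
import java.util.ArrayList;
import java.util.List;

public class ControllerEnqueueSketch {
    // One bucket per shard, standing in for the shard queues.
    private final List<List<String>> buckets;

    public ControllerEnqueueSketch(int numShards) {
        buckets = new ArrayList<List<String>>();
        for (int i = 0; i < numShards; i++) {
            buckets.add(new ArrayList<String>());
        }
    }

    // Client-facing call: the controller routes the mutate to the right
    // bucket by hashing the row id, so the client never sees partitioning.
    public void enqueueMutate(String rowId, String mutate) {
        int shard = (rowId.hashCode() & Integer.MAX_VALUE) % buckets.size();
        buckets.get(shard).add(mutate);
    }

    public int bucketSize(int shard) {
        return buckets.get(shard).size();
    }
}
```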
>>>
>>> However, I do agree that right now there is too much burden on the user
>>> for the 95% case.  We should make this simpler.
>>>
>>> Aaron
>>>
>>>
>>> On Thu, Feb 27, 2014 at 10:07 AM, Tim Williams <williamstw@gmail.com>
>>> wrote:
>>>
>>> > I've been playing around with the new QueueReader stuff and I'm
>>> > starting to believe it's at the wrong level of abstraction - in the
>>> > shard context - for a user.
>>> >
>>> > Between having to know about the BlurPartitioner and handling all the
>>> > failure nuances, I'm thinking a much friendlier approach would be to
>>> > have the client implement a single message pump that Blur takes from
>>> > and handles.
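[One way to read the "single message pump" suggestion, as an illustrative sketch rather than Blur code: the client implements one pull interface, and the system drains it, hiding shard routing entirely. All names here are hypothetical.]

```java
public class MessagePumpSketch {
    // The only thing a client would implement.
    public interface MessagePump<T> {
        T next();  // null when nothing is currently available
    }

    // The system side: drain the pump until it runs dry and report how
    // many messages were taken (routing to shards would happen here).
    public static <T> int drain(MessagePump<T> pump) {
        int taken = 0;
        for (T m = pump.next(); m != null; m = pump.next()) {
            taken++;
        }
        return taken;
    }
}
```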
>>> >
>>> > Maybe on startup the Controllers compete for the lead QueueReader
>>> > position, create it from the TableContext, and run with it?  The user
>>> > would still need to deal with Controller failures, but that seems
>>> > easier to reason about than shard failures.
>>> >
>>> > The way it's crafted right now, the user seems burdened with a lot of
>>> > the hard problems that Blur otherwise solves.  Obviously, it trades
>>> > off a high burden for one of the controllers.
>>> >
>>> > Thoughts?
>>> > --tim
>>> >
>>>
>>
>>
>
