incubator-blur-dev mailing list archives

From Dibyendu Bhattacharya <>
Subject Re: new queue capability
Date Fri, 28 Feb 2014 12:02:47 GMT

I was just playing with the new QueueReader API, and as Tim pointed out,
it's very low level. I still tried to implement a KafkaConsumer.

Here is my use case. Let me know if I have approached it correctly.

I have a given topic in Kafka, which has 3 partitions, and in Blur I have a
table with 2 shards. I need to index all messages from the Kafka topic into
Blur.

I have used the Kafka ConsumerGroup API to consume in parallel in 2 streams
(from the 3 partitions) for indexing into the 2 Blur shards. As the
ConsumerGroup API allows me to split any Kafka topic into N streams, I can
choose N to match my target shard count, here 2.
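To make the N-streams-over-M-partitions idea concrete, here is a small
standalone sketch of the kind of assignment a consumer group ends up with.
This is only an illustration of the concept in plain Java (Kafka's actual
rebalancing algorithm is range-based and more involved); the class and
method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: spread partition ids 0..numPartitions-1 across
// numStreams consumer streams, round-robin. Shows why choosing N streams
// can cover any partition count, e.g. 3 partitions over 2 streams.
public class PartitionAssigner {

    public static List<List<Integer>> assign(int numPartitions, int numStreams) {
        List<List<Integer>> streams = new ArrayList<List<Integer>>();
        for (int s = 0; s < numStreams; s++) {
            streams.add(new ArrayList<Integer>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition is owned by exactly one stream.
            streams.get(p % numStreams).add(p);
        }
        return streams;
    }

    public static void main(String[] args) {
        // 3 partitions onto 2 streams: stream 0 reads [0, 2], stream 1 reads [1]
        System.out.println(assign(3, 2));
    }
}
```

With this split, every partition is consumed exactly once, and each stream
can feed one shard's writer.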

For the two shards I created two ShardContexts and two
BlurIndexSimpleWriters. (Is this okay?)

Now, I modified BlurIndexSimpleWriter to get a handle to the _queueReader
object. I used this _queueReader to populate each shard's queue, taking
messages from the KafkaStreams.
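The shape of the pump I put between a Kafka stream and a shard's queue is
roughly the following. This is a simplified stand-in in plain Java: the
class name ShardPump and its methods are invented for illustration, and the
real code subclasses Blur's QueueReader, whose exact callback signature I
am still feeling out.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One pump per shard: the Kafka consumer thread offers messages in,
// and the shard's indexing side drains them in batches.
public class ShardPump {

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(1024);

    // Called from the Kafka consumer thread for every message in the stream.
    // offer() drops when the queue is full; a real pump would block (put())
    // to apply backpressure to the consumer.
    public boolean enqueue(String message) {
        return queue.offer(message);
    }

    // Called from the indexing side: drain up to max pending messages
    // into out, returning how many were taken.
    public int drainTo(List<String> out, int max) {
        return queue.drainTo(out, max);
    }

    public static void main(String[] args) {
        ShardPump pump = new ShardPump();
        pump.enqueue("row1");
        pump.enqueue("row2");
        List<String> batch = new java.util.ArrayList<String>();
        pump.drainTo(batch, 10);
        System.out.println(batch);
    }
}
```

One instance of this exists per shard, which is why I needed a handle to
each writer's _queueReader.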

Attached are the test case (KafkaReaderTest), KafkaStreamReader (which
reads the Kafka stream), and KafkaQueueReader (the queue interface for
Blur).

Also attached is the modified BlurIndexSimpleWriter. I just added an
accessor:

  public QueueReader getQueueReader() {
    return _queueReader;
  }

With these changes, I am able to read Kafka messages in parallel streams
and index them into the 2 shards. All documents from Kafka are getting
indexed properly, and after the test cases run I can see two index
directories, one for each path I created.

Let me know if this approach is correct. In this code I have not taken
care of the shard failure logic, and as Tim pointed out, it would be great
if that could be abstracted away from the client.
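Part of that client-side burden is the routing itself: every mutation has
to be sent to the shard that owns its row. Below is a hypothetical sketch
of that routing in plain Java. I am mimicking the hash-mod scheme a
partitioner like BlurPartitioner typically uses; the exact hash Blur
applies may differ, so treat this as the shape of the problem rather than
Blur's implementation.

```java
// Hypothetical stand-in for the routing a client currently does by hand.
public class ShardRouter {

    // Map a row id onto one of numShards shards.
    public static int shardFor(String rowId, int numShards) {
        // Mask the sign bit so negative hashCodes still land in [0, numShards).
        return (rowId.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    public static void main(String[] args) {
        // The invariant the failure handling must preserve: every message
        // for the same row id always reaches the same shard.
        System.out.println(ShardRouter.shardFor("row-123", 2));
    }
}
```

Keeping this invariant correct across shard failures and rebalances is
exactly the part that would be nice to have handled inside Blur.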


On Thu, Feb 27, 2014 at 9:40 PM, Aaron McCurry <> wrote:

> What if we provide an implementation of the QueueReader concept that does
> what you are discussing.  That way in more extreme cases when the user is
> forced into implementing the lower level api (perhaps for performance) they
> can still do it, but for the normal case the partitioning (and other
> difficult issues) are handled by the controllers.
>
> I could see adding an enqueueMutate call to the controllers that pushes the
> mutates to the correct buckets for the user.  At the same time we could
> allow each of the controllers to pull from an external queue and push the
> mutates to the correct buckets for the shards.  I could see a couple of
> different ways of handling this.
>
> However I do agree that right now there is too much burden on the user for
> the 95% case.  We should make this simpler.
> Aaron
> On Thu, Feb 27, 2014 at 10:07 AM, Tim Williams <>
> wrote:
> > I've been playing around with the new QueueReader stuff and I'm
> > starting to believe it's at the wrong level of abstraction - in the
> > shard context - for a user.
> >
> > Between having to know about the BlurPartitioner and handling all the
> > failure nuances, I'm thinking a much friendlier approach would be to
> > have the client implement a single message pump that Blur takes from
> > and handles.
> >
> > Maybe on startup the Controllers compete for the lead QueueReader
> > position, create it from the TableContext and run with it?  The user
> > would still need to deal with Controller failures but that seems
> > easier to reason about than shard failures.
> >
> > The way it's crafted right now, the user seems burdened with a lot of
> > the hard problems that Blur otherwise solves.  Obviously, it trades
> > off a high burden for one of the controllers.
> >
> > Thoughts?
> > --tim
> >
