incubator-blur-dev mailing list archives

From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: new queue capability
Date Fri, 28 Feb 2014 12:02:47 GMT
Hi,

I was just playing with the new QueueReader API, and as Tim pointed out,
it is very low-level. I still went ahead and implemented a KafkaConsumer.

Here is my use case. Let me know if I have approached it correctly.

I have a topic in Kafka with 3 partitions, and in Blur I have a table
with 2 shards. I need to index all messages from the Kafka topic into
the Blur table.

I used the Kafka ConsumerGroup API to consume in parallel in 2 streams
(from the 3 partitions) for indexing into the 2 Blur shards. Since the
ConsumerGroup API lets me split any Kafka topic into N streams, I can
choose N to match my target shard count, here 2.
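Roughly, the way N streams divide up a topic's partitions can be sketched
like this. This is a simplified round-robin model, not Kafka's actual
rebalancing code, and `PartitionAssigner`/`assign` are names I made up for
illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how N consumer streams split a topic's partitions.
// Partition p goes to stream (p % numStreams), so 3 partitions over
// 2 streams gives one stream 2 partitions and the other stream 1.
public class PartitionAssigner {

  public static List<List<Integer>> assign(int numPartitions, int numStreams) {
    List<List<Integer>> streams = new ArrayList<>();
    for (int s = 0; s < numStreams; s++) {
      streams.add(new ArrayList<Integer>());
    }
    for (int p = 0; p < numPartitions; p++) {
      streams.get(p % numStreams).add(p);
    }
    return streams;
  }

  public static void main(String[] args) {
    // 3 Kafka partitions, 2 streams (one per Blur shard).
    System.out.println(assign(3, 2)); // prints [[0, 2], [1]]
  }
}
```

The point is just that choosing N = shard count gives each shard its own
stream, even when the partition count does not divide evenly.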

For the two shards I created two ShardContexts and two
BlurIndexSimpleWriters. (Is this okay?)

Then I modified BlurIndexSimpleWriter to expose a handle to its
_queueReader object, and used that _queueReader to populate each
shard's queue with messages taken from the KafkaStreams.
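The routing from a message's row to a per-shard queue can be sketched with
a plain hash-mod scheme, in the spirit of what BlurPartitioner does. This
is a standalone sketch: `ShardRouter` and its methods are hypothetical
names, and Blur's real hashing may differ:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: route each message to one of the per-shard queues by hashing
// its row id, so the same row always lands in the same shard.
public class ShardRouter {
  private final List<BlockingQueue<String>> shardQueues;

  public ShardRouter(int numShards) {
    shardQueues = new ArrayList<>();
    for (int i = 0; i < numShards; i++) {
      shardQueues.add(new LinkedBlockingQueue<String>());
    }
  }

  // Hash-mod routing; mask off the sign bit so negative hash codes
  // still map to a valid shard index.
  int shardFor(String rowId) {
    return (rowId.hashCode() & Integer.MAX_VALUE) % shardQueues.size();
  }

  public void enqueue(String rowId, String message) {
    shardQueues.get(shardFor(rowId)).offer(message);
  }

  public BlockingQueue<String> queueForShard(int shard) {
    return shardQueues.get(shard);
  }
}
```

In my code the Kafka stream threads play the producer role here, each
feeding the queue for its own shard.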

Attached are the test case (KafkaReaderTest), KafkaStreamReader (which
reads the Kafka stream), and KafkaQueueReader (the queue interface for
Blur).

Also attached is the modified BlurIndexSimpleWriter. I just added:

  public QueueReader getQueueReader() {
    return _queueReader;
  }
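For anyone following along, the shape of what my KafkaQueueReader does can
be approximated with a plain BlockingQueue. This is a standalone sketch,
not Blur's actual QueueReader contract; the take/success/failure method
names are just how I think about the flow:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of a queue-backed reader: the Kafka side put()s messages in,
// the indexing side drains them in batches, and success()/failure()
// are where offset commits or retries would hook in.
public class SimpleQueueReader {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

  public void put(String message) {
    queue.offer(message);
  }

  // Drain up to max messages, waiting briefly if the queue is empty.
  public List<String> take(int max) {
    List<String> batch = new ArrayList<>();
    try {
      String first = queue.poll(100, TimeUnit.MILLISECONDS);
      if (first != null) {
        batch.add(first);
        queue.drainTo(batch, max - 1);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return batch;
  }

  public void success() { /* commit Kafka offsets here */ }

  public void failure(List<String> batch) {
    // Re-queue the failed batch so it is retried.
    for (String m : batch) queue.offer(m);
  }
}
```

One such reader per shard, fed by that shard's Kafka stream, is
essentially my setup.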

With these changes, I am able to read Kafka messages in parallel
streams and index them into the 2 shards. All documents from Kafka get
indexed properly. After the test cases run, I can see two index
directories, one for each path I created.

Let me know if this approach is correct. In this code I have not
handled the shard-failure logic, and as Tim pointed out, it would be
great if that could be abstracted away from the client.

Regards,
Dibyendu




On Thu, Feb 27, 2014 at 9:40 PM, Aaron McCurry <amccurry@gmail.com> wrote:

> What if we provide an implementation of the QueueReader concept that does
> what you are discussing.  That way in more extreme cases when the user is
> forced into implementing the lower level api (perhaps for performance) they
> can still do it, but for the normal case the partitioning (and other
> difficult issues) are handled by the controllers.
>
> I could see adding an enqueueMutate call to the controllers that pushes the
> mutates to the correct buckets for the user.  At the same time we could
> allow each of the controllers to pull from an external source and push the
> mutates
> to the correct buckets for the shards.  I could see a couple of different
> ways of handling this.
>
> However I do agree that right now there is too much burden on the user for
> the 95% case.  We should make this simpler.
>
> Aaron
>
>
> On Thu, Feb 27, 2014 at 10:07 AM, Tim Williams <williamstw@gmail.com>
> wrote:
>
> > I've been playing around with the new QueueReader stuff and I'm
> > starting to believe it's at the wrong level of abstraction - in the
> > shard context - for a user.
> >
> > Between having to know about the BlurPartitioner and handling all the
> > failure nuances, I'm thinking a much friendlier approach would be to
> > have the client implement a single message pump that Blur takes from
> > and handles.
> >
> > Maybe on startup the Controllers compete for the lead QueueReader
> > position, create it from the TableContext and run with it?  The user
> > would still need to deal with Controller failures but that seems
> > easier to reason about than shard failures.
> >
> > The way it's crafted right now, the user seems burdened with a lot of
> > the hard problems that Blur otherwise solves.  Obviously, it trades
> > off a high burden for one of the controllers.
> >
> > Thoughts?
> > --tim
> >
>
