incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Spark - Apache Blur Connector : Index Kafka Messages into Blur using Spark Streaming
Date Tue, 30 Sep 2014 12:26:25 GMT
Hi Dibyendu,

If you still want to contribute the Spark code to Blur, that would be
awesome!  I think that we would need an ICLA from you.

http://www.apache.org/licenses/icla.txt

It might also be a good idea to get a CCLA from your company.

http://www.apache.org/licenses/cla-corporate.txt

Maybe someone else can help out here: will we need a Software Grant as well?

Thanks!

Aaron

On Tue, Sep 23, 2014 at 8:00 AM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> Thanks Aaron. I would love to do that.
>
> Dibyendu
> On Sep 23, 2014 5:17 PM, "Aaron McCurry" <amccurry@gmail.com> wrote:
>
> > On Mon, Sep 22, 2014 at 4:21 AM, Dibyendu Bhattacharya <
> > dibyendu.bhattachary@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > For the last few days I have been working on a Spark - Apache Blur
> > > Connector to index Kafka messages into Apache Blur using Spark
> > > Streaming. We have been building a distributed search platform for
> > > our NRT use cases, and we have been experimenting with Spark
> > > Streaming and Apache Blur for it. We are presently working with
> > > Apache Blur, and here is a Spark connector I would like to share
> > > with the community to get feedback.
> > >
> > > This connector uses the low-level Kafka consumer which I wrote a few
> > > weeks back (https://github.com/dibbhatt/kafka-spark-consumer). There
> > > was a separate thread on this Kafka consumer in the Spark group.
> > >
> > > Even though I was able to index Kafka messages using this low-level
> > > consumer via the Apache Blur Queuing API, I wanted to try out the
> > > Spark saveAsHadoop* API, which can bulk load an RDD into Apache Blur.
> > >
> > > For that I have written this Blur Connector for Spark (
> > > https://github.com/dibbhatt/spark-blur-connector).
> > >
> > > This connector uses the same low-level Kafka consumer mentioned
> > > above and partitions the RDD into as many partitions as there are
> > > shards in the target Blur table. For this I had to use custom
> > > Partitioner logic so that keys are assigned to RDD partitions the
> > > same way they are assigned to shards of the target Blur table.
> > >
> > > I also implemented a custom BlurOutputFormat that returns
> > > the BlurOutputCommitter, which uses the new Hadoop API
> > > (org.apache.hadoop.mapreduce).
> > >
> > > I made a few minor changes to the existing GenericBlurRecordWriter
> > > and BlurOutputCommitter, and this Spark Blur connector uses the
> > > modified RecordWriter and OutputCommitter. If those minor issues are
> > > fixed in Apache Blur, this custom code will no longer be needed.
> > >
> > >
> > > I have tested this connector by indexing activity streams coming
> > > into a Kafka cluster, and it nicely indexes the Kafka messages into
> > > the target Apache Blur tables.
> > >
> > > Would love to hear what you think. I have copied both the Apache
> > > Blur and Spark communities.
> > >
> >
> > I think this is awesome!  I have read through the code but I still need
> > to get it running to put it through its paces.  :-) Do you think that
> > you would want to contribute this code into Blur?
> >
> > Aaron
> >
> >
> > >
> > >
> > > Regards,
> > > Dibyendu
> > >
> >
>
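
The custom partitioning idea in the quoted message (one RDD partition per
Blur shard, with each key routed to the partition that matches its target
shard) can be sketched roughly as below. This is a minimal, self-contained
Java illustration and not the connector's code: the class name
ShardPartitionSketch, the methods shardFor and partition, and the hash
function are all assumptions made for the example. A real connector would
have to reuse whatever hashing Blur itself applies to row ids, otherwise
records would land in the wrong shard.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: this is NOT the Blur or Spark API.
final class ShardPartitionSketch {

    // Hypothetical routing rule: hash the row id and map it to [0, numShards).
    // The hash here is an assumption; Blur's own hashing may differ.
    static int shardFor(String rowId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        int hash = 0;
        for (byte b : rowId.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numShards;
    }

    // Groups row ids into one bucket per shard, mirroring "number of
    // partitions == number of shards" from the message above.
    static List<List<String>> partition(List<String> rowIds, int numShards) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String rowId : rowIds) {
            partitions.get(shardFor(rowId, numShards)).add(rowId);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rowIds = Arrays.asList("user-1", "user-2", "user-3", "user-4");
        List<List<String>> partitions = partition(rowIds, 3);
        for (int shard = 0; shard < partitions.size(); shard++) {
            System.out.println("shard " + shard + " -> " + partitions.get(shard));
        }
    }
}
```

The point of the sketch is only that the same deterministic function decides
both the partition index and the shard index, so each partition can be
written out as the bulk-load input for exactly one shard.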
