incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Spark - Apache Blur Connector : Index Kafka Messages into Blur using Spark Streaming
Date Tue, 23 Sep 2014 11:47:18 GMT
On Mon, Sep 22, 2014 at 4:21 AM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> Hi,
>
> For the last few days I have been working on a Spark - Apache Blur
> connector to index Kafka messages into Apache Blur using Spark Streaming.
> We have been building a distributed search platform for our NRT use cases,
> and we have been experimenting with Spark Streaming and Apache Blur for it.
> We are presently working with Apache Blur, and here is a Spark connector I
> would like to share with the community to get feedback on it.
>
> This connector uses the low-level Kafka consumer that I wrote a few
> weeks back (https://github.com/dibbhatt/kafka-spark-consumer). There was a
> separate thread on this Kafka consumer in the Spark group.
>
> Even though I was able to index Kafka messages using this low-level
> consumer via the Apache Blur Queuing API, I wanted to try out the Spark
> saveAsHadoop* API, which can bulk load an RDD into Apache Blur.
>
> For that I have written this Blur Connector for Spark (
> https://github.com/dibbhatt/spark-blur-connector).
>
> This connector uses the same low-level Kafka consumer mentioned above and
> partitions the RDD into as many partitions as there are shards in the
> target Blur table. For this I had to use custom Partitioner logic so that
> keys are partitioned in the RDD the same way they are partitioned across
> the target Blur shards.
>
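A minimal sketch of the shard-aligned partitioning idea described above, not
the connector's actual code. It assumes a row is assigned to a shard by taking
a Java-style string hash of its row ID modulo the shard count; the thread does
not state which hash Blur uses, and the names java_string_hash, shard_for_key
and partition_by_shard are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Signed 32-bit hash in the style of Java's String.hashCode().

    Matches Java for characters in the Basic Multilingual Plane; the choice
    of this hash is an assumption made for illustration.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the value in 32 bits
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_for_key(row_id: str, num_shards: int) -> int:
    """Map a row ID to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Mask off the sign bit so the modulo result is never negative.
    return (java_string_hash(row_id) & 0x7FFFFFFF) % num_shards


def partition_by_shard(records, num_shards):
    """Group (row_id, value) pairs into one list per target shard."""
    partitions = [[] for _ in range(num_shards)]
    for row_id, value in records:
        partitions[shard_for_key(row_id, num_shards)].append((row_id, value))
    return partitions
```

With an RDD of (row_id, value) pairs in PySpark, a function like this could
be supplied as the partitionFunc argument of RDD.partitionBy(num_shards, ...),
so that each RDD partition holds only the keys that belong to one shard.
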
> I also implemented a custom BlurOutputFormat to return
> the BlurOutputCommitter, which uses the new Hadoop API
> (org.apache.hadoop.mapreduce).
>
> I made a few minor changes to the existing GenericBlurRecordWriter
> and BlurOutputCommitter, and this Spark Blur connector uses the modified
> RecordWriter and OutputCommitter. If those minor issues are fixed in Apache
> Blur, this custom code will no longer be needed.
>
>
> I have tested this connector by indexing activity streams coming into a
> Kafka cluster, and it indexes the Kafka messages nicely into the target
> Apache Blur tables.
>
> Would love to hear what you think. I have copied both the Apache Blur and
> Spark communities.
>

I think this is awesome!  I have read through the code but I still need to
get it running to put it through its paces.  :-) Do you think that you
would want to contribute this code into Blur?

Aaron


>
>
> Regards,
> Dibyendu
>
