incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: BlurRDD for Spark
Date Tue, 29 Sep 2015 11:54:44 GMT
Sure,

It's experimental at best right now.  Take a look at:

https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-spark/src/main/java/org/apache/blur/spark/UsingBlurRDD.java;h=a09db32fdcfc096ff00656d538039598ef60d42c;hb=9ae5bf35b5eea0015e456873e8288961da70a9aa

This shows how you can write an anonymous inner function that is deployed
at Spark job startup (so there is no need to deploy new jars or restart the
shard servers) and executed inside the shard servers, directly against the
Lucene index.  The StreamWriter is used to collect whatever data you want
(as long as the type is serializable), so you can create your own POJOs to
pull back exactly the data you need.
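
To make the idea above concrete, here is a minimal, self-contained sketch of
the general pattern: an anonymous inner function is serialized, "shipped",
run next to the data, and its serializable POJO results are sent back.  It
does not use the real Blur or Spark APIs.  ShardFunction, ResultWriter,
TermHit and the in-memory list of documents are all hypothetical stand-ins
(the real code runs against a Lucene index and collects through
StreamWriter), and both "sides" run in one JVM, so no class bytes are
actually transferred.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShippedFunctionSketch {

  /** Hypothetical stand-in for a writer that collects serializable results. */
  interface ResultWriter<T extends Serializable> {
    void write(T value);
  }

  /** Hypothetical stand-in for a function that is shipped to a shard server. */
  interface ShardFunction<T extends Serializable> extends Serializable {
    void call(List<String> shardDocs, ResultWriter<T> writer);
  }

  /** A user-defined POJO; the only requirement is that it is serializable. */
  static class TermHit implements Serializable {
    private static final long serialVersionUID = 1L;
    final String doc;
    final int length;

    TermHit(String doc, int length) {
      this.doc = doc;
      this.length = length;
    }

    @Override
    public String toString() {
      return doc + ":" + length;
    }
  }

  static byte[] serialize(Object o) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(o);
    }
    return bytes.toByteArray();
  }

  static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    }
  }

  @SuppressWarnings("unchecked")
  public static List<String> run() throws IOException, ClassNotFoundException {
    // "Client" side: define the function inline and serialize it.
    ShardFunction<TermHit> function = new ShardFunction<TermHit>() {
      private static final long serialVersionUID = 1L;

      @Override
      public void call(List<String> shardDocs, ResultWriter<TermHit> writer) {
        for (String doc : shardDocs) {
          if (doc.contains("blur")) {
            writer.write(new TermHit(doc, doc.length()));
          }
        }
      }
    };
    byte[] functionBytes = serialize(function);

    // "Server" side: rebuild the function and run it next to the data.
    ShardFunction<TermHit> received = (ShardFunction<TermHit>) deserialize(functionBytes);
    List<String> shardDocs = Arrays.asList("apache blur", "apache spark", "blur rdd");
    ArrayList<TermHit> collected = new ArrayList<>();
    received.call(shardDocs, collected::add);
    byte[] resultBytes = serialize(collected);

    // Back on the "client": read the POJOs that came over the wire.
    List<TermHit> hits = (List<TermHit>) deserialize(resultBytes);
    List<String> out = new ArrayList<>();
    for (TermHit hit : hits) {
      out.add(hit.toString());
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run());
  }
}
```

The anonymous class is declared in a static method so that it does not
capture an enclosing instance, which would otherwise also have to be
serializable.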

Blur commands (also experimental) were built to allow this type of
functionality, but I found them clunky to write.  I also found that I was
going to have to build something like Spark's RDD to get the answers I
wanted.  So instead of reinventing the wheel, I built a way to integrate
the two.  Overall performance is pretty good if you leave the Spark
application running.

Also, if you want to execute code against Blur without Spark, you can use:

https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-core/src/main/java/org/apache/blur/command/stream/StreamClient.java;h=20d1bcd5d0438e314e4f3528170d3bdd73230e17;hb=9ae5bf35b5eea0015e456873e8288961da70a9aa

It's not as nice to use but that's what the BlurRDD uses to communicate.

I will likely be making a lot of changes to the internals of the stream
server and client soon, but for now it seems to work.  :-)

Aaron


On Tue, Sep 29, 2015 at 7:40 AM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> Hi Aaron,
>
> I see you started some work on Streaming execution on Spark using BlurRDD.
>
> Could you explain a little about this? As you know, I have been able to
> integrate Spark Streaming with Blur for high-throughput indexing of a Kafka
> stream, and I will be presenting the same idea tomorrow at Apache Big Data
> Conference Europe.
>
> BlurRDD is definitely an excellent addition, and it could lead us to build
> Spark Data Frames on top of Blur Tables for query execution.
>
> If you can explain a bit about this Streaming Server and Client and the usage
> of BlurRDD, that would be great. If you wish, I can even mention this feature
> in my talk as an example of the strong integration of Blur with
> Spark.
>
> Regards,
> Dibyendu
>
