spark-user mailing list archives

From "Stadin, Benjamin" <>
Subject Re: Is Spark the right tool for me?
Date Mon, 01 Dec 2014 14:43:36 GMT
… Sorry, I forgot to mention why I’m basically bound to SQLite. The workflow involves more
data processing than I mentioned. There are several tools in the chain which either rely
on SQLite as an exchange format, or perform processing such as data cleaning that runs orders of magnitude
faster, or with fewer resources, than with a heavyweight db for these specialized (and temporary)

From: andy petrella
Date: Monday, 1 December 2014 15:07
To: Benjamin Stadin
Subject: Re: Is Spark the right tool for me?

Not quite sure which geo processing you're doing: is the data raster or vector? More info would
help me to help you further.

Meanwhile I can try to give some hints. For instance, did you consider GeoMesa?
Since you need a WMS (or similar), did you consider GeoTrellis
(see its batch processing)?

When you say SQLite, do you mean that you're using SpatiaLite? Or is your db not a geo one,
just plain SQLite? In case you need an r-tree (or related) index, your headaches will come
from congestion within your database transactions... unless you go to a dedicated database
like Vertica (just mentioning)


On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin wrote:
Hi all,

I need some advice on whether Spark is the right tool for my zoo. My requirements share commonalities
with “big data”, workflow coordination and “reactive” event-driven data processing
(as in, for example, Haskell Arrows), which doesn’t make it any easier to decide on a tool

NB: I have asked a similar question on the Storm mailing list, but was referred to Spark.
I previously thought Storm was closer to my needs – but maybe neither is.

To explain my needs it’s probably best to give an example scenario:

 *   A user uploads small files (typically 1-200 files, file size typically 2-10MB per file)
 *   Files should be converted in parallel and on available nodes. The conversion is actually
done via native tools, so not much big data processing is required, but dynamic parallelization is
(for example, splitting the conversion step into as many conversion tasks as there are files).
The conversion typically takes between several minutes and a few hours.
 *   The converted files are gathered and stored in a single database (containing geometries
for rendering)
 *   Once the db is ready, a web map server is (re-)configured and the user can make small
updates to the data set via a web UI.
 *   … Some other data processing steps which I leave out for brevity …
 *   There will be initially only a few concurrent users, but the system shall be able to
scale if needed
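
The dynamic parallelization described in the second bullet can be sketched as follows. This is a minimal illustration in plain Python on a single machine, not Spark code: it starts one conversion task per uploaded file and runs a native tool for each. The names `convert_one` and `convert_all` and the `command` argument are hypothetical placeholders, since the thread does not name the actual converter.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_one(command, path):
    """Run the native converter on one file and return (path, exit code)."""
    # `command` is a hypothetical argv prefix; the file path is appended.
    result = subprocess.run(command + [path], capture_output=True)
    return path, result.returncode


def convert_all(command, paths, max_workers=None):
    """Start one conversion task per file; return {path: exit code}."""
    if not paths:
        return {}
    # One worker per file unless capped. Threads are enough here because
    # the real work happens in the child process, not in Python.
    workers = max_workers or len(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: convert_one(command, p), paths))
```

On a cluster, the same shape (a list of file paths mapped to one task each) would be handed to whichever scheduler is chosen; the number of tasks follows the number of files rather than being fixed in advance.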

My current thoughts:

 *   I should avoid uploading files into the distributed storage during conversion, and
should probably rather have each conversion filter download the file it is actually converting from
a shared place. Otherwise it’s bad for scalability (too many redundant copies of the
same temporary files if there are many concurrent users and many cluster nodes).
 *   Apache Oozie seems an option to chain together my pipes into a workflow. But is it a
good fit with Spark? What options do I have with Spark to chain a workflow from pipes?
 *   Apache Crunch seems to make it easy to dynamically parallelize tasks (Oozie itself can’t
do this). But I may not need Crunch after all if I have Spark, and it also doesn’t seem
to fit my last problem below.
 *   The part that causes me the most headache is the user-interactive db update: I am considering
using Kafka as a message bus to broker between the web UI and a custom db handler (NB: the
db is a SQLite file). But what about update responsiveness? Won’t Spark cause
some lag (as opposed to Storm)?
 *   The db handler probably has to be implemented as a long-running, continuous task, so that when
a user sends some changes the handler writes them to the db file. However, I want this to
be decoupled from the job. So these file updates should be done locally, only on the machine
that started the job, for the whole lifetime of the user interaction. Does Spark allow
creating such long-running tasks dynamically, so that when another (web) user starts a new task
a new long-running task is created and run on the same node, which eventually ends and triggers
the next task? Also, is it possible to identify a running task, so that a long-running task
can be bound to a session (db handler working on local db updates until the task is done), and eventually
restarted / recreated on failure?
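
The session-bound db handler from the last two bullets can be sketched like this. It is a single-process illustration in plain Python and makes no claim about how Spark or Storm would host such a task: an in-memory `queue.Queue` stands in for the Kafka topic, and `SessionDbHandler`, `handler_for` and the `handlers` registry are hypothetical names introduced only for this example.

```python
import queue
import sqlite3
import threading


class SessionDbHandler:
    """Long-running handler that owns one SQLite file for one session."""

    _STOP = object()  # sentinel that ends the worker loop

    def __init__(self, session_id, db_path):
        self.session_id = session_id
        self.db_path = db_path
        self._updates = queue.Queue()  # stand-in for a Kafka topic
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue one update; updates are applied in submission order."""
        self._updates.put((sql, params))

    def close(self):
        """Apply everything queued so far, then stop the worker."""
        self._updates.put(self._STOP)
        self._thread.join()

    def _run(self):
        # The connection is created and used only in the worker thread.
        conn = sqlite3.connect(self.db_path)
        try:
            while True:
                item = self._updates.get()
                if item is self._STOP:
                    break
                sql, params = item
                with conn:  # commit on success, roll back on error
                    conn.execute(sql, params)
        finally:
            conn.close()


# Registry mapping a session id to its handler (hypothetical helper).
handlers = {}


def handler_for(session_id, db_path):
    """Return the session's handler, creating it on first use."""
    if session_id not in handlers:
        handlers[session_id] = SessionDbHandler(session_id, db_path)
    return handlers[session_id]
```

Keeping a single writer per db file avoids concurrent writes to SQLite, and the registry is what lets a later request find the running handler for its session. Restarting a handler after a failure is not covered by this sketch.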
