spark-user mailing list archives

From andy petrella <>
Subject Re: Is Spark the right tool for me?
Date Tue, 02 Dec 2014 09:00:43 GMT
Point 4 looks weird to me, I mean if you intend to have such a workflow
run in a single session (maybe consider a sessionless arch).
Is such a process run for each user? If that's the case, maybe finding a way to do
it for all at once would be better (more data but less scheduling).

For the micro updates, considering something like a queue (kestrel? or even
kafk... whatever, something that works) would be great. That way you take the
load off the instances, and the updates can be done at their own pace. Also,
you can reuse it to notify the WMS.
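
To make the queue idea concrete, here is a minimal sketch, not a definitive implementation. It assumes an in-process `queue.Queue` as a stand-in for Kestrel or Kafka, a hypothetical `features` table, and a hypothetical `on_commit` callback standing in for "notify the WMS". The point it illustrates is that a single writer drains the queue, so the SQLite file only ever has one process writing to it.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the writer to shut down


def run_writer(db_path, updates, on_commit):
    """Drain `updates` and apply each item to the SQLite file, one at a time."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.commit()
    while True:
        item = updates.get()  # blocks until the next update arrives
        if item is STOP:
            break
        feature_id, name = item
        with conn:  # one transaction per micro-update
            conn.execute(
                "INSERT OR REPLACE INTO features (id, name) VALUES (?, ?)",
                (feature_id, name),
            )
        on_commit(feature_id)  # hypothetical hook, e.g. tell the map server to refresh
    conn.close()


def demo():
    """Push a few updates through the queue and read the result back."""
    updates = queue.Queue()
    notified = []
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "map.sqlite")
        writer = threading.Thread(
            target=run_writer, args=(db_path, updates, notified.append)
        )
        writer.start()
        for update in [(1, "road"), (2, "river"), (1, "highway")]:
            updates.put(update)  # producers (web UI) never touch the db file
        updates.put(STOP)
        writer.join()
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT id, name FROM features ORDER BY id").fetchall()
        conn.close()
    return rows, notified
```

With a real broker the producer and the writer would live in different processes, but the shape stays the same: the web UI only enqueues, and the writer applies updates and sends notifications at its own pace.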
Isn't there a way to do tiling directly? Also, do you need indexes? I mean,
do you need the full OGIS power, or are a few classical operators
enough (using BBox only, for instance)?
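
As an illustration of what "BBox only" could mean, here is a small sketch under stated assumptions: features are plain lists of `(x, y)` points, and the names `BBox`, `bbox_of` and `features_in_view` are made up for this example. It shows that selecting what falls into a map view needs nothing more than a bounding-box intersection test, with no spatial library involved.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other):
        """True if the two axis-aligned boxes overlap or touch."""
        return (
            self.min_x <= other.max_x
            and other.min_x <= self.max_x
            and self.min_y <= other.max_y
            and other.min_y <= self.max_y
        )


def bbox_of(points):
    """Smallest axis-aligned box containing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return BBox(min(xs), min(ys), max(xs), max(ys))


def features_in_view(features, view):
    """Ids of features whose bounding box intersects the view box."""
    return [fid for fid, pts in features.items() if bbox_of(pts).intersects(view)]


features = {
    "a": [(0, 0), (2, 2)],
    "b": [(5, 5), (6, 7)],
    "c": [(1.5, -1), (3, 0.5)],
}
visible = features_in_view(features, BBox(1, 0, 3, 3))
```

A linear scan like this is only reasonable for small data sets. An r-tree answers the same question faster, but the operator itself stays this simple.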

The more you can simplify the better :-D.

These are only my 2c, it's hard to think or react appropriately without
knowing the whole context.
BTW, to answer your very first question: yes, it looks like Spark will help


On Mon Dec 01 2014 at 4:36:44 PM Stadin, Benjamin <> wrote:

> Yes, the processing causes the most stress. But this is parallelizable by
> splitting the input source. My problem is that once the heavy preprocessing
> is done, I’m in a „micro-update“ mode so to say (user-interactive part of
> the whole workflow). Then the map is rendered directly from the SQLite file
> by the map server instance on that machine – this is actually a favorable
> setup for me for resource consumption and implementation costs (I just need
> to tell the web ui to refresh after something was written to the db, and
> the map server will render the updates without me changing / coding
> anything). So my workflow requires breaking out of parallel processing for
> some time.
> Do you think my generalized workflow and tool chain could look like this?
>    1. Pre-Process many files in a parallel way. Gather all results,
>    deploy them on one single machine. => Spark coalesce() + Crunch (for
>    splitting input files into separate tasks)
>    2. On the machine that holds the preprocessed results, configure a map
>    server to connect to the local SQLite source. Do user-interactive
>    micro-updates on that file (web UI gets updated).
>    3. Post-process the files in parallel. => Spark + Crunch
>    4. Design all of the above as a workflow, runnable (or assignable) as
>    part of a user session. => Oozie
> Do you think this is ok?
> ~Ben
> From: andy petrella <>
> Date: Monday, 1 December 2014 15:48
> To: Benjamin Stadin <>, "" <>
> Subject: Re: Is Spark the right tool for me?
> Indeed. However, I guess the important load and stress is in the
> processing of the 3D data (DEM or alike) into geometries/shades/whatever.
> Hence you can use spark (geotrellis can be tricky for 3D, poke @lossyrob
> for more info) to perform these operations then keep an RDD of only the
> resulting geometries.
> Those geometries probably won't be that heavy, hence it might be possible to
> coalesce(1, true) to have the whole thing on one node (or if your driver is
> more beefy, do a collect/foreach) to create the index.
> You could also create a GeoJSON of the geometries and create the r-tree on
> it (not sure about this one).
> On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin <> wrote:
>> Thank you for mentioning GeoTrellis. I hadn’t heard of it before. We
>> have many custom tools and steps, I’ll check whether our tools fit in. The end
>> result is actually a 3D map for native OpenGL based rendering on iOS
>> / Android [1].
>> I’m using GeoPackage which is basically SQLite with R-Tree and a small
>> library around it (more lightweight than SpatiaLite). I want to avoid
>> accessing the SQLite db from any other machine or task, which is why I
>> thought I could use a long running task as the only process responsible
>> for updating a locally stored SQLite db file. As you also said, SQLite (or
>> almost any other file based db) won’t work well over the network. This isn’t
>> limited to R-Tree; it is an expected limitation caused by file locking
>> issues, as SQLite itself documents.
>> I also thought of doing the same thing when rendering the (web) maps. In
>> combination with the db handler which does the actual changes, I thought I would
>> run a map server instance on each node and configure it to add the database
>> location as map source once the task starts.
>> Cheers
>> Ben
>> [1]
>> From: andy petrella <>
>> Date: Monday, 1 December 2014 15:07
>> To: Benjamin Stadin <>, "" <>
>> Subject: Re: Is Spark the right tool for me?
>> Not quite sure which kind of geo processing you're doing: is it raster or vector? More
>> info would be appreciated so I can help you further.
>> Meanwhile I can try to give some hints, for instance, did you consider
>> GeoMesa <>?
>> Since you need a WMS (or alike), did you consider GeoTrellis
>> <> (go to the batch processing)?
>> When you say SQLite, do you mean that you're using SpatiaLite? Or is your db
>> not a geo one, just plain SQLite? In case you need an r-tree (or
>> related) index, your headaches will come from congestion within your
>> database transactions... unless you go to a dedicated database like Vertica
>> (just mentioning)
>> kr,
>> andy
>> On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin <> wrote:
>>> Hi all,
>>> I need some advice on whether Spark is the right tool for my zoo. My
>>> requirements share commonalities with „big data“, workflow coordination and
>>> „reactive“ event driven data processing (as in for example Haskell Arrows),
>>> which doesn’t make it any easier to decide on a tool set.
>>> NB: I have asked a similar question on the Storm mailing list, but have
>>> been referred to Spark. I previously thought Storm was closer to my needs –
>>> but maybe neither is.
>>> To explain my needs it’s probably best to give an example scenario:
>>>    - A user uploads small files (typically 1-200 files, file size
>>>    typically 2-10MB per file)
>>>    - Files should be converted in parallel and on available nodes. The
>>>    conversion is actually done via native tools, so there is not so much big
>>>    data processing required, but dynamic parallelization (so for example to
>>>    split the conversion step into as many conversion tasks as files are
>>>    available). The conversion typically takes between several minutes and a
>>>    few hours.
>>>    - The converted files are gathered and stored in a single database
>>>    (containing geometries for rendering)
>>>    - Once the db is ready, a web map server is (re-)configured and the
>>>    user can make small updates to the data set via a web UI.
>>>    - … Some other data processing steps which I leave away for brevity …
>>>    - There will be initially only a few concurrent users, but the
>>>    system shall be able to scale if needed
>>> My current thoughts:
>>>    - I should avoid uploading files into the distributed storage during
>>>    conversion, and should probably rather have each conversion filter download
>>>    the file it is actually converting from a shared place. Otherwise
>>>    scalability suffers (too many redundant copies of the same temporary files
>>>    if there are many concurrent users and many cluster nodes).
>>>    - Apache Oozie seems an option to chain together my pipes into a
>>>    workflow. But is it a good fit with Spark? What options do I have with
>>>    Spark to chain a workflow from pipes?
>>>    - Apache Crunch seems to make it easy to dynamically parallelize
>>>    tasks (Oozie itself can’t do this). But I may not need Crunch after all
>>>    if I have Spark, and it also doesn’t seem to fit my last problem, described next.
>>>    - The part that causes me the most headache is the user interactive
>>>    db update: I am considering using Kafka as a message bus to broker between the web
>>>    UI and a custom db handler (nb, the db is a SQLite file). But what
>>>    about update responsiveness? Won’t Spark cause some lag (as
>>>    opposed to Storm)?
>>>    - The db handler probably has to be implemented as a long running,
>>>    continuous task, so when a user sends some changes the handler writes these
>>>    to the db file. However, I want this to be decoupled from the job. So
>>>    these file updates should be done locally only, on the machine that started the
>>>    job, for the whole lifetime of this user interaction. Does Spark allow
>>>    creating such long running tasks dynamically, so that when another (web) user
>>>    starts a new task a new long-running task is created and run on the same
>>>    node, which eventually ends and triggers the next task? Also, is it
>>>    possible to identify a running task, so that a long running task can be
>>>    bound to a session (db handler working on local db updates, until task
>>>    done), and eventually restarted / recreated on failure?
>>> ~Ben
