spark-user mailing list archives

From andy petrella <>
Subject Re: Is Spark the right tool for me?
Date Mon, 01 Dec 2014 14:51:01 GMT
not applicable to your problem, but interesting enough to share on this

On Mon Dec 01 2014 at 3:48:14 PM andy petrella <>

> Indeed. However, I guess the important load and stress is in the
> processing of the 3D data (DEM or alike) into geometries/shades/whatever.
> Hence you can use spark (geotrellis can be tricky for 3D, poke @lossyrob
> for more info) to perform these operations then keep an RDD of only the
> resulting geometries.
> Those geometries probably won't be that heavy, hence it might be possible to
> coalesce(1, true) to have the whole thing on one node (or, if your driver is
> beefier, do a collect/foreach) to create the index.
> You could also create a GeoJSON of the geometries and create the r-tree on
> it (not sure about this one).
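A minimal sketch of the last two suggestions, assuming the geometries have already been brought into one process (in PySpark that could be done with `rdd.collect()` or `rdd.toLocalIterator()`, which is not shown here). The input shape `(id, ring)`, the table name `geom_index`, and all function names are hypothetical. It builds a GeoJSON FeatureCollection and an SQLite R*Tree over the bounding boxes, falling back to a plain table if the local SQLite build lacks the R*Tree module.

```python
import json
import sqlite3


def to_geojson(geometries):
    """Build a FeatureCollection from (id, ring) pairs.

    ring is assumed to be a closed list of (x, y) tuples (first == last).
    """
    features = [
        {
            "type": "Feature",
            "id": gid,
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[list(point) for point in ring]],
            },
        }
        for gid, ring in geometries
    ]
    return {"type": "FeatureCollection", "features": features}


def build_index(conn, geometries):
    """Store one bounding box per geometry in an R*Tree (or a plain table)."""
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE geom_index "
            "USING rtree(id, minx, maxx, miny, maxy)"
        )
    except sqlite3.OperationalError:
        # Assumption: this SQLite build has no R*Tree module, so use a
        # plain table with the same columns (no spatial index).
        conn.execute(
            "CREATE TABLE geom_index (id INTEGER PRIMARY KEY, "
            "minx REAL, maxx REAL, miny REAL, maxy REAL)"
        )
    rows = []
    for gid, ring in geometries:
        xs = [x for x, _ in ring]
        ys = [y for _, y in ring]
        rows.append((gid, min(xs), max(xs), min(ys), max(ys)))
    conn.executemany("INSERT INTO geom_index VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def query_point(conn, x, y):
    """Return ids of all geometries whose bounding box contains (x, y)."""
    cursor = conn.execute(
        "SELECT id FROM geom_index "
        "WHERE minx <= ? AND maxx >= ? AND miny <= ? AND maxy >= ? "
        "ORDER BY id",
        (x, x, y, y),
    )
    return [row[0] for row in cursor]
```

The bounding-box query only narrows down candidates; an exact point-in-polygon test would still be needed on the returned ids.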
> On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin <
>> wrote:
>> Thank you for mentioning GeoTrellis. I haven’t heard of this before. We
>> have many custom tools and steps; I’ll check whether our tools fit in. The end
>> result is actually a 3D map for native OpenGL-based rendering on iOS
>> / Android [1].
>> I’m using GeoPackage, which is basically SQLite with R-Tree and a small
>> library around it (more lightweight than SpatiaLite). I want to avoid
>> accessing the SQLite db from any other machine or task; that’s why I
>> thought I could use a long-running task as the only process responsible
>> for updating a SQLite db file that is stored locally only. As you also said, SQLite (or
>> almost any other file-based db) won’t work well over a network. This isn’t
>> limited to R-Tree; it is an expected limitation caused by file locking
>> issues, as SQLite itself documents.
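As a small illustration of why a single writer is attractive, the sketch below shows that even on one machine SQLite lets only one connection hold the write lock at a time. It does not reproduce the network file system problem itself, where the underlying file locks may be unreliable. The function name and table are hypothetical.

```python
import sqlite3


def second_writer_is_blocked(db_path):
    """Return the error message a second writer gets, or None if not blocked."""
    # isolation_level=None puts the connections in autocommit mode so that
    # transactions are controlled explicitly with BEGIN / ROLLBACK.
    first = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0: fail immediately instead of waiting for the lock.
    second = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT INTO t VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
        except sqlite3.OperationalError as exc:
            return str(exc)
        second.execute("ROLLBACK")
        return None
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()
```

Routing all writes through one process avoids this contention and keeps the db file on a local disk.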
>> I also thought of doing the same thing when rendering the (web) maps. In
>> combination with the db handler, which does the actual changes, I thought of
>> running a map server instance on each node and configuring it to add the database
>> location as a map source once the task starts.
>> Cheers
>> Ben
>> [1]
>> From: andy petrella <>
>> Date: Monday, 1 December 2014 15:07
>> To: Benjamin Stadin <>, "
>>" <>
>> Subject: Re: Is Spark the right tool for me?
>> Not quite sure which geo processing you're doing: is the data raster or vector? More
>> info would be appreciated so I can help you further.
>> Meanwhile I can try to give some hints. For instance, did you consider
>> GeoMesa <>?
>> Since you need a WMS (or alike), did you consider GeoTrellis
>> <> (go to the batch processing)?
>> When you say SQLite, do you mean that you're using SpatiaLite? Or is your db
>> not a geo one, just plain SQLite? In case you need an r-tree (or
>> related) index, your headaches will come from congestion within your
>> database transactions... unless you go to a dedicated database like Vertica
>> (just mentioning)
>> kr,
>> andy
>> On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin <
>>> wrote:
>>> Hi all,
>>> I need some advice on whether Spark is the right tool for my zoo. My
>>> requirements share commonalities with „big data“, workflow coordination and
>>> „reactive“ event driven data processing (as in for example Haskell Arrows),
>>> which doesn’t make it any easier to decide on a tool set.
>>> NB: I have asked a similar question on the Storm mailing list, but have
>>> been referred to Spark. I previously thought Storm was closer to my needs –
>>> but maybe neither is.
>>> To explain my needs it’s probably best to give an example scenario:
>>>    - A user uploads small files (typically 1-200 files, file size
>>>    typically 2-10MB per file)
>>>    - Files should be converted in parallel and on available nodes. The
>>>    conversion is actually done via native tools, so there is not so much big
>>>    data processing required, but dynamic parallelization (so for example to
>>>    split the conversion step into as many conversion tasks as files are
>>>    available). The conversion typically takes between several minutes and a
>>>    few hours.
>>>    - The converted files are gathered and stored in a single database
>>>    (containing geometries for rendering)
>>>    - Once the db is ready, a web map server is (re-)configured and the
>>>    user can make small updates to the data set via a web UI.
>>>    - … Some other data processing steps which I leave out for brevity …
>>>    - There will be initially only a few concurrent users, but the
>>>    system shall be able to scale if needed
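The "one conversion task per file" step in the scenario above could look roughly like this on a single machine. This is only a sketch: `convert_one`, `convert_all`, and the `command` argument are hypothetical names, and the native converter is assumed to take the input file path as its last argument. On a cluster, the same shape could be expressed in PySpark by distributing the path list (for example with `sc.parallelize(paths, len(paths))`) and mapping `convert_one` over it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_one(path, command):
    """Run the native converter on one file and return (path, exit code)."""
    # The file path is appended as the last argument (an assumption about
    # how the native tool is invoked).
    result = subprocess.run(command + [path], capture_output=True, text=True)
    return path, result.returncode


def convert_all(paths, command, max_workers=4):
    """Start one conversion per file, at most max_workers at a time."""
    # Threads are sufficient here because the real work happens in the
    # child processes, not in Python.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: convert_one(p, command), paths))
```

Results come back in the same order as `paths`, so a failed conversion (non-zero exit code) can be matched to its file and retried.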
>>> My current thoughts:
>>>    - I should avoid uploading files into the distributed storage during
>>>    conversion, and should probably rather have each conversion filter download
>>>    the file it is actually converting from a shared place. This is
>>>    for scalability reasons (otherwise there would be too many redundant copies of the same temporary files
>>>    if there are many concurrent users and many cluster nodes).
>>>    - Apache Oozie seems an option to chain together my pipes into a
>>>    workflow. But is it a good fit with Spark? What options do I have with
>>>    Spark to chain a workflow from pipes?
>>>    - Apache Crunch seems to make it easy to dynamically parallelize
>>>    tasks (Oozie itself can’t do this). But I may not need Crunch after all
>>>    if I have Spark, and it also doesn’t seem to fit my last problem below.
>>>    - The part that causes me the most headache is the user-interactive
>>>    db update: I am considering Kafka as a message bus to broker between the web
>>>    UI and a custom db handler (nb, the db is a SQLite file). But what
>>>    about update responsiveness? Won’t Spark introduce some lag (as
>>>    opposed to Storm)?
>>>    - The db handler probably has to be implemented as a long-running,
>>>    continuous task, so when a user sends some changes the handler writes these
>>>    to the db file. However, I want this to be decoupled from the job. So
>>>    these file updates should be done locally only, on the machine that started the
>>>    job, for the whole lifetime of this user interaction. Does Spark allow me to
>>>    create such long running tasks dynamically, so that when another (web) user
>>>    starts a new task a new long–running task is created and run on the same
>>>    node, which eventually ends and triggers the next task? Also, is it
>>>    possible to identify a running task, so that a long running task can be
>>>    bound to a session (db handler working on local db updates, until task
>>>    done), and eventually restarted / recreated on failure?
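Independent of the framework, the session-bound db handler described in the last two points can be sketched as one writer thread per session. Everything here is an assumption for illustration: the class names, the `features` table, and the in-process `queue.Queue`, which stands in for a Kafka topic. A handler that has died is recreated on the next lookup; updates still queued on the dead handler are lost in this sketch.

```python
import queue
import sqlite3
import threading


class DbHandler(threading.Thread):
    """Sole writer for one session's SQLite file; consumes queued updates."""

    def __init__(self, session_id, db_path):
        super().__init__(daemon=True)
        self.session_id = session_id
        self.db_path = db_path
        self.updates = queue.Queue()  # stand-in for a message bus such as Kafka

    def run(self):
        # The connection is created and used only inside this thread.
        conn = sqlite3.connect(self.db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS features "
            "(id INTEGER PRIMARY KEY, name TEXT)"
        )
        while True:
            item = self.updates.get()
            if item is None:  # sentinel: the session is finished
                break
            conn.execute("INSERT OR REPLACE INTO features VALUES (?, ?)", item)
            conn.commit()
        conn.close()


class SessionRegistry:
    """Maps session ids to running handlers and recreates dead ones."""

    def __init__(self):
        self._handlers = {}
        self._lock = threading.Lock()

    def handler_for(self, session_id, db_path):
        with self._lock:
            handler = self._handlers.get(session_id)
            if handler is None or not handler.is_alive():
                handler = DbHandler(session_id, db_path)
                handler.start()
                self._handlers[session_id] = handler
            return handler

    def close(self, session_id):
        with self._lock:
            handler = self._handlers.pop(session_id, None)
        if handler is not None:
            handler.updates.put(None)
            handler.join()
```

A web request for a given session would call `registry.handler_for(session_id, db_path).updates.put((feature_id, name))`, and `registry.close(session_id)` would mark the point where the next task can be triggered.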
>>> ~Ben
