crunch-user mailing list archives

From "Stadin, Benjamin" <>
Subject Crunch, workflow management and user interaction
Date Mon, 01 Dec 2014 17:33:01 GMT
I have a mixed bag of requirements, ranging from parallel data processing to local file updates
(on a single, fixed node) and "reactive" filter interaction. I’m undecided which frameworks
I should settle on.

It’s probably best explained by an example usage scenario:

 *   A web site user uploads small files (typically 1-200 files, each typically 2-10 MB
in size)
 *   Files should be converted in parallel on whatever nodes are available. The conversion itself
is done by native tools, but I am considering Crunch to parallelize the conversion dynamically
according to the number of uploaded files. The conversion will likely take between several
minutes and a few hours.
 *   The converted files are gathered and stored in a single *SQLite* (!) database (containing
geometries for rendering). This needs to be done on one node only (file locking etc.). You
may say I should not use SQLite, but believe me, I really do need it =).
 *   Once the SQLite db is ready, a web map server is (re-)configured on the very same server
where the db job was started, and the user can interact with a web application
and make small updates to the data set via a web map editing UI. This is a temporary service.
After a few minutes, when user interaction is done, the server is "shut down" (it isn’t
really; the data source is just removed from it and the server is reconfigured).
 *   When the user is done and hits the save button, the workflow triggers another parallelizable
job which does some post-processing on the data.
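
To make the shape of the scenario concrete, here is a minimal plain-Python sketch of the fan-out and gather steps. It is not Crunch code: the thread pool stands in for cluster-wide parallelism, `convert` is a hypothetical placeholder for the native conversion tool, and the `geometries` table layout is an assumption made only for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def convert(upload):
    # Hypothetical stand-in for the native conversion tool.
    return upload.upper()


def run_workflow(uploads, db_path):
    """uploads maps file name -> raw bytes; returns the number of stored rows."""
    # Fan-out: one task per uploaded file, so parallelism follows the upload count.
    workers = max(1, min(len(uploads), 8))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        converted = dict(zip(uploads, pool.map(convert, uploads.values())))

    # Gather: a single writer on a single node owns the SQLite file.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS geometries "
                "(name TEXT PRIMARY KEY, data BLOB)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO geometries (name, data) VALUES (?, ?)",
                converted.items(),
            )
        return conn.execute("SELECT COUNT(*) FROM geometries").fetchone()[0]
    finally:
        conn.close()
```

For example, `run_workflow({"a.shp": b"abc", "b.shp": b"def"}, ":memory:")` converts both files in parallel and then writes them from one place. The open question is how to express the same "many workers, one writer" split in Crunch.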

The two main things giving me a headache:

 *   I’m not sure how to implement "reactivity", as it’s called in Haskell Arrows, with
my filters. How should I design a Crunch job as a long-running job which accepts input and,
in addition, runs only on a single node? In Spark one could call coalesce(1, true), but in
either case I’m not sure how to cleanly implement a reactive filter in Crunch or Spark.
 *   Workflow management: In my scenario, there are n user sessions and each can start
different workflows in parallel (the above outlines just one of the workflows). What should I use
to chain my pipes into workflows? Oozie? Crunch-Jobs? Could you point me to an example of how
to do this?
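
As a rough illustration of what I mean by a long-running, single-node "reactive" step, here is a plain-Python sketch (again neither Crunch nor Spark code). One thread owns the SQLite connection and applies edits as they arrive on a queue. The names `edit_loop`, `start_session`, and `STOP`, and the table layout, are hypothetical.

```python
import queue
import sqlite3
import threading

STOP = object()  # sentinel standing in for the "save" button


def edit_loop(db_path, edits, done):
    # The connection is created and used inside this one thread only.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geometries (name TEXT PRIMARY KEY, data BLOB)"
    )
    applied = 0
    while True:
        item = edits.get()  # blocks until the user sends an edit
        if item is STOP:
            break
        name, data = item
        with conn:  # one transaction per edit
            conn.execute(
                "INSERT OR REPLACE INTO geometries (name, data) VALUES (?, ?)",
                (name, data),
            )
        applied += 1
    conn.close()
    done.put(applied)  # report how many edits were applied


def start_session(db_path):
    edits, done = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=edit_loop, args=(db_path, edits, done), daemon=True
    )
    worker.start()
    return edits, done
```

A session would call `start_session(path)`, push `(name, data)` tuples onto `edits` while the user works, and push `STOP` on save. After that, the post-processing job could be triggered.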

