From: andrea zonca <andrea.zonca@gmail.com>
Date: Fri, 12 Jul 2013 23:43:19 +0200
Message-ID: 
 <CAN0a5oc=-iBWD2h6bdhmnxWDHwnco59sf6S_PrkQhdcv6Y2e0A@mail.gmail.com>
Subject: Running hadoop for processing sources in full sky maps
To: user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Hi,

I have a few tens of full-sky maps in a binary format (FITS), about 600 MB each.

For each sky map I already have a catalog with the positions of a few
thousand sources, i.e. stars, galaxies, and radio sources.

For each source I would like to:

1. open the full sky map
2. extract the relevant section, typically 20 MB or less
3. run some statistics on that section
4. aggregate the outputs into a catalog

I would like to use Hadoop, possibly with Python via the streaming
interface, to process the sources in parallel.

I think the input to the mapper should be the individual records of the
catalogs; the Python mapper can then open the full sky map, do the
processing, and print the output to stdout.
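
To make the idea concrete, here is a rough sketch of the mapper I have in
mind. It is only an illustration under assumptions: I assume each input line
is "<map path><TAB><latitude><TAB><longitude>", the statistics are just a
count, mean, and standard deviation, and extract_section is a hypothetical
placeholder for whatever code reads the FITS file and cuts out the region
around the source.

```python
import math
import sys


def parse_record(line):
    """Parse one catalog record: '<map path>\\t<latitude>\\t<longitude>' (assumed layout)."""
    map_path, lat, lon = line.rstrip("\n").split("\t")
    return map_path, float(lat), float(lon)


def summarize(values):
    """Return (count, mean, standard deviation) of the extracted pixel values."""
    n = len(values)
    if n == 0:
        return 0, float("nan"), float("nan")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return n, mean, math.sqrt(variance)


def extract_section(map_path, lat, lon):
    """Hypothetical placeholder: open the sky map and return the pixel values
    near (lat, lon). The real version would use a FITS reader."""
    raise NotImplementedError("replace with real FITS extraction")


def run_mapper(lines, extract, out=sys.stdout):
    """Process one source per input line and emit one tab-separated result line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        map_path, lat, lon = parse_record(line)
        n, mean, std = summarize(extract(map_path, lat, lon))
        out.write("%s\t%.6f\t%.6f\t%d\t%.6g\t%.6g\n"
                  % (map_path, lat, lon, n, mean, std))


# In the actual streaming script the entry point would be:
#     run_mapper(sys.stdin, extract_section)
```

The output lines could then be concatenated, or passed through a trivial
reducer, to build the final catalog.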

Is this a reasonable approach?
If so, I need to configure Hadoop so that a full sky map is copied
locally to each node that is processing one of its sources.
How can I achieve that?
Also, what is the best way to feed the input data to Hadoop? For each
source I have a reference to the full sky map, a latitude, and a longitude.
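
For example, one option I am considering is a plain text file with one
source per line, written roughly as in the sketch below. The tab-separated
layout, the function name write_records, and the file names are my own
placeholders, not anything Hadoop requires.

```python
import io


def write_records(sources, out):
    """Write one source per line as '<map path>\\t<latitude>\\t<longitude>'.

    Records are sorted so that sources belonging to the same sky map
    end up on adjacent lines.
    """
    for map_path, lat, lon in sorted(sources):
        out.write("%s\t%r\t%r\n" % (map_path, lat, lon))


# Hypothetical catalog entries: (sky map file, latitude, longitude) in degrees.
sources = [
    ("maps/sky_b.fits", -12.5, 83.25),
    ("maps/sky_a.fits", 45.0, 10.0),
    ("maps/sky_a.fits", 30.0, 200.5),
]

buffer = io.StringIO()
write_records(sources, buffer)
```

In practice the output would go to a file that is then uploaded to HDFS as the
job input, instead of an in-memory buffer.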

Thanks. I have also posted this question on Stack Overflow:
http://stackoverflow.com/questions/17617654/running-hadoop-for-processing-sources-in-full-sky-maps

Regards,
Andrea Zonca