hadoop-common-user mailing list archives

From tim robertson <timrobertson...@gmail.com>
Subject Re: Generating many small PNGs to Amazon S3 with MapReduce
Date Tue, 14 Apr 2009 14:10:24 GMT
Sorry Brian, can I just ask please...

I have the PNGs in the Sequence file for my sample set.  If I use a
second MR job and push to S3 in the map, surely I run into the
scenario where multiple tasks are running on the same section of the
sequence file and thus pushing the same data to S3.  Am I missing
something obvious (e.g. can I disable this behavior)?
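
Editorial sketch on the duplicate-push concern above. Duplicate reads of the same split usually come from speculative execution or from a retried task attempt; in Hadoop releases of this era, speculative map attempts are governed by the `mapred.map.tasks.speculative.execution` job property. Independently of that setting, the upload can be made safe to repeat by deriving the object key only from the tile's identity, so a second attempt overwrites the same object rather than creating a new one. The key layout, the function names, and the in-memory `store` standing in for an S3 client are all hypothetical, and the meaning of the fields in `tile=0_0_0_13839800` is assumed here to be zoom, x, y, taxon id.

```python
import hashlib


def tile_key(zoom, x, y, taxon_id):
    """Build a deterministic object key for one tile (hypothetical layout)."""
    return f"tiles/{taxon_id}/{zoom}/{x}_{y}.png"


def push_tile(store, key, png_bytes):
    """Upload one tile; 'store' is a dict standing in for an S3 client.

    Returns True if the bytes were written, False if an identical object
    was already present (for example from a duplicate task attempt).
    """
    digest = hashlib.md5(png_bytes).hexdigest()
    existing = store.get(key)
    if existing is not None and existing[0] == digest:
        return False
    store[key] = (digest, png_bytes)
    return True


# Simulate two attempts of the same map task processing the same record.
store = {}
key = tile_key(0, 0, 0, 13839800)
first = push_tile(store, key, b"\x89PNG fake tile bytes")
second = push_tile(store, key, b"\x89PNG fake tile bytes")
```

With keys built this way, the worst case of a duplicate attempt is a redundant transfer of identical bytes, not a conflicting or extra object.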



On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
<timrobertson100@gmail.com> wrote:
> Thanks Brian,
> This is pretty much what I was looking for.
> Your calculations are correct but based on the assumption that at all
> zoom levels we will need all tiles generated.  Given the sparsity of
> data, it actually results in only a few hundred GB.  I'll then run a
> second MR job with the map pushing to S3, to make use of parallel loading.
> Cheers,
> Tim
> On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>> Hey Tim,
>> Why don't you put the PNGs in a SequenceFile in the output of your reduce
>> task?  You could then have a post-processing step that unpacks the PNG and
>> places it onto S3.  (If my numbers are correct, you're looking at around 3TB
>> of data; is this right?  With that much, you might want another separate Map
>> task to unpack all the files in parallel ... really depends on the
>> throughput you get to Amazon)
>> Brian
>> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>>> Hi all,
>>> I am currently processing a lot of raw CSV data and producing a
>>> summary text file which I load into MySQL.  On top of this I have a
>>> PHP application to generate tiles for Google Maps (sample tile:
>>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>>> Here is a (dev server) example of the final map client:
>>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>>> dynamic grids as you zoom are all pre-calculated.
>>> I am considering (for better throughput as maps generate huge request
>>> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
>>> CloudFront.  There will be billions of PNGs produced, at 1-3KB
>>> each.
>>> Could someone please recommend the best place to generate the PNGs and
>>> when to push them to S3 in a MR system?
>>> If I did the PNG generation and upload to S3 in the reduce, the same
>>> task on multiple machines will compete with each other, right?  Should
>>> I generate the PNGs to a local directory and then on Task success push
>>> the lot up?  I am assuming billions of 1-3KB files on HDFS is not a
>>> good idea.
>>> I will use EC2 for the MR for the time being, but this will be moved
>>> to a local cluster still pushing to S3...
>>> Cheers,
>>> Tim
