hadoop-common-user mailing list archives

From Sky USC <sky...@hotmail.com>
Subject Help me with architecture of a somewhat non-trivial mapreduce implementation
Date Wed, 18 Apr 2012 21:56:09 GMT


Please help me architect my first significant MR task beyond "word count". My
program works, but I am trying to optimize performance and make full use of the available computing
resources. I have 3 questions at the bottom.

Project description in an abstract sense (written in Java):
* I have MM MANIFEST files available at storage:/root/1.manif.txt through 4000.manif.txt
     * Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS (the range could
be 10,000 - 50,000 EBOOK URLs per MANIFEST), stored at paths such as storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks.

My task is to:
1. Fetch each ebook and obtain a set of 3 attributes per ebook (example: publisher, year,
ebook-version).
2. Update each EBOOK record in the manifest with the 3 attributes (e.g. ebook
1334 -> publisher=aaa, year=bbb, ebook-version=2.01).
3. Create output files such that the file named "<publisher>_<year>_<ebook-version>"
 contains a list of all ebook URLs that match those three attribute values.
Example:
File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. Finally, I also want to output statistics of the form:
<publisher>_<year>_<ebook-version>  <COUNT_OF_URLs>
PENGUIN_2001_3.12     250,111
RANDOMHOUSE_1999_2.01  11,322
etc
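
To make the target output of steps 3 and 4 concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It only imitates what the shuffle and reduce phases would produce: URLs grouped under a "<publisher>_<year>_<ebook-version>" key, plus a count per key. The class name SummarySketch and its methods are hypothetical names chosen for illustration, not part of the program described here, and the three sample URLs are taken from the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only: mimics the grouping a reducer would see. */
final class SummarySketch {

    /** Builds the "<publisher>_<year>_<ebook-version>" key used for file names and statistics. */
    static String compositeKey(String publisher, String year, String version) {
        return publisher + "_" + year + "_" + version;
    }

    /** Groups (key, url) pairs by key; a TreeMap keeps the keys in sorted order. */
    static Map<String, List<String>> groupByKey(List<String[]> keyUrlPairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : keyUrlPairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** Derives the per-key URL count that goes into the statistics output. */
    static Map<String, Integer> countByKey(Map<String, List<String>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((key, urls) -> counts.put(key, urls.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        String randomHouse = compositeKey("RANDOMHOUSE", "1999", "2.01");
        String penguin = compositeKey("PENGUIN", "2001", "3.12");
        pairs.add(new String[] {randomHouse, "storage:/root/1.manif/1223.folder/2143.Ebook.ebk"});
        pairs.add(new String[] {penguin, "storage:/root/19.manif/2223.folder/4343.Ebook.ebk"});
        pairs.add(new String[] {randomHouse, "storage:/root/2.manif/2133.folder/5449.Ebook.ebk"});

        Map<String, List<String>> grouped = groupByKey(pairs);
        // One line per key: key, tab, number of URLs collected under that key.
        countByKey(grouped).forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```

In the real job the grouping is done by the framework, so only the key construction and the per-key count would correspond to code in the Mapper and Reducer.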

Here is how I implemented it:
* My launcher gets the list of MM manifests.
* My Mapper gets one manifest.
 --- It reads the manifest and, within a WHILE loop,
    --- fetches each EBOOK and obtains its attributes,
    --- updates the manifest entry for that ebook,
    --- calls context.write(new Text("RANDOMHOUSE_1999_2.01"), new Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
 --- Once all ebooks in the manifest are read, it saves the updated manifest and exits.
* My Reducer gets the key "RANDOMHOUSE_1999_2.01" and a list of ebook URLs.
 --- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" containing all the storage
URLs for those ebooks.
 --- It also calls context.write(new Text("RANDOMHOUSE_1999_2.01"), new IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, it's working. I launch it on 15 elastic instances. I have three questions:
1. Is this the best way to implement the MR logic?
2. I don't know whether each instance gets one task or multiple simultaneous tasks
for the MAP portion. If it does not get multiple MAP tasks, should I go the route of
"multithreaded" reading of ebooks from each manifest? It's not efficient to read just one ebook
at a time per machine. Is "Context.write()" thread-safe?
3. I can see log4j logs for the main program, but I have no visibility into the logs for the Mapper or Reducer.
Any idea?
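
For question 2, one possible shape of the "multithreaded" route is sketched below in plain Java. It is an illustration under stated assumptions, not a tested answer: worker threads only fetch, and the calling thread alone emits results, so the sketch does not depend on whether Context.write() is thread-safe. The names ParallelFetchSketch, processManifest, fetchKey and emit are hypothetical; fetchKey stands in for the real ebook fetch that returns the "<publisher>_<year>_<ebook-version>" key, and emit stands in for the context.write call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical illustration only: parallel fetches, single-threaded emit. */
final class ParallelFetchSketch {

    /**
     * Fetches the key for every URL on a fixed-size thread pool, then emits
     * (key, url) pairs from the calling thread in the original URL order.
     */
    static void processManifest(List<String> urls, int threads,
                                Function<String, String> fetchKey,
                                BiConsumer<String, String> emit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetchKey.apply(url); // runs on a worker thread
                pending.add(pool.submit(task));
            }
            // Only this thread calls emit, so the sink needs no thread-safety guarantee.
            for (int i = 0; i < urls.size(); i++) {
                emit.accept(pending.get(i).get(), urls.get(i));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Inside a Mapper, emit would wrap the context.write call and fetchKey would wrap the existing per-ebook fetch. This sketch submits every URL in a manifest up front, so a manifest with tens of thousands of entries holds that many pending futures in memory. Whether this helps at all depends on how many map tasks each instance already runs at once, which the cluster configuration determines.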


 		 	   		  