hadoop-general mailing list archives

From rahul rai <rairahul7...@yahoo.in>
Subject map reduce optimization
Date Sun, 03 May 2015 18:56:22 GMT
Can somebody help with preparing MapReduce settings?
We recently set up a 10-node Hadoop cluster and have 10 TB of data, mostly small XML
files in zipped form. This is our initial POC.

I would like to know a few things about the steps to be followed:

1. As per my understanding, we upload these 10 TB of data to the NameNode.
2. Since the files are very small (about 100 KB each, XML), how should we combine them? Can we combine roughly 1 TB each, so that it comes to 10 files?
3. We set the block size to 128 MB.
4. Should we put each 1 TB file in its own folder (10 folders), or all 10 files in just one folder?
5. Following 4, in the 10-folder case, is it better to run 10 MapReduce jobs, one per folder, or just one MapReduce job over a single folder?
6. In case we run one MapReduce job on one folder holding 10 files of 1 TB each, the number of map tasks I calculate is 10 * 1024 * 1024 / 128 = 81920 mappers. Can the system sustain that many mappers?
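For reference, the arithmetic in point 6: 10 TB is 10 * 1024 * 1024 MB, and dividing by the 128 MB block size gives the default split (and hence mapper) count. A quick check (the class and method names here are just for illustration):

```java
// Sanity check of the mapper-count arithmetic in point 6:
// mappers = total data in MB / HDFS block size in MB.
public class MapperCount {
    static long mappers(long terabytes, long blockSizeMb) {
        // 1 TB = 1024 * 1024 MB
        return terabytes * 1024L * 1024L / blockSizeMb;
    }

    public static void main(String[] args) {
        System.out.println(mappers(10, 128)); // prints 81920
    }
}
```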
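On point 2, an alternative to pre-concatenating the XML into 1 TB files by hand is to let the job pack many small files into larger input splits with `CombineTextInputFormat` (available in the Hadoop 2.x `mapreduce` API). A minimal job-configuration sketch, assuming that API; the class name and paths are placeholders, the mapper/reducer wiring is omitted, and note that a plain text input format will not respect XML record boundaries, and gzip-compressed inputs are not splittable:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallXmlPoc {  // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-xml-poc");
        job.setJarByClass(SmallXmlPoc.class);

        // Pack many ~100 KB files into larger splits instead of
        // launching one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at the 128 MB block size.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper, reducer, and output key/value classes set as usual.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With splits capped at 128 MB, the same 10 TB would still yield on the order of 81920 map tasks, but they would no longer be tied to one task per tiny file; raising the max split size reduces the task count further.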
