hadoop-common-user mailing list archives

From "Black, Michael (IS)" <Michael.Bla...@ngc.com>
Subject dictionary.csv
Date Thu, 23 Dec 2010 14:28:38 GMT
Using hadoop-0.20.2+737 on Red Hat's distribution.

I'm trying to use a dictionary.csv file from a Lucene index inside a map 
function, plus another comma-delimited file.

It's just a simple loop: read a line, split it on commas, and add the 
dictionary entry to a hash map.

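The loading loop is essentially the following sketch (the class/method names 
and the first-column-key, second-column-value layout are placeholders for 
this post, not necessarily the real file layout):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DictLoad {
    // Read the local dictionary.csv line by line, split each line on
    // commas, and add the entry to a HashMap.  Pre-sized for ~1.5M
    // entries to avoid rehashing during the load.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> dict = new HashMap<String, String>(3000000);
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length >= 2) {
                    dict.put(cols[0], cols[1]);  // placeholder layout: key,value
                }
            }
        } finally {
            in.close();
        }
        return dict;
    }
}
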
It's about an 8M file with 1.5M lines.  I'm using an absolute path so the file 
read is local (not HDFS).  I've verified from the job status that no HDFS 
reads are occurring.

When I run this outside of Hadoop it executes in 6 seconds.

Inside Hadoop it takes 13 seconds, and the java process is at 100% CPU the 
whole time.

This makes absolutely no sense to me... I would've thought it would execute in 
the same time frame, seeing as it's just reading a local file (I'm only 
running one task at the moment).

I'm also reading another file in a similar fashion and see 3.4 seconds vs 0.3 
seconds (longer lines that are also getting split).  This one is only 45 lines, 
with many more columns per line.

It appears that perhaps the split function is running slower, since the smaller 
file with more columns runs 10X slower than the large file, which is "only" 2X 
slower.

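If the split function is in fact the culprit, one workaround I may try is a 
hand-rolled, regex-free split, since as far as I know String.split() compiles 
a Pattern on every call.  Untested sketch:

import java.util.ArrayList;
import java.util.List;

public class FastSplit {
    // Regex-free comma split.  String.split(",") goes through
    // java.util.regex; a plain indexOf() loop skips all of that.
    // (Minor difference: unlike split(), this keeps trailing empty fields.)
    public static List<String> splitOnComma(String line) {
        List<String> cols = new ArrayList<String>();
        int start = 0;
        for (int idx = line.indexOf(','); idx >= 0; idx = line.indexOf(',', start)) {
            cols.add(line.substring(start, idx));
            start = idx + 1;
        }
        cols.add(line.substring(start));  // last column (after the final comma)
        return cols;
    }
}
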
Anybody have any idea why file input is slower under Hadoop?
