hadoop-common-user mailing list archives

From "James R. Leek" <le...@llnl.gov>
Subject Identifying lines in map()
Date Mon, 30 Nov 2009 01:00:40 GMT
I want to use Hadoop to discover whether any token appears in 
every line of a file.  I thought this would be pretty 
straightforward, but I'm having a heck of a time with it.  (I'm pretty 
new to Hadoop; I've been using it for about two weeks.)

My original idea was to have the mapper produce every token as the key, 
with the line number as the value.  But I couldn't find any InputFormat 
that would give me line numbers.

However, it seemed that FileInputFormat would give me the position in 
the file as the key, and the line as the value.  I assume that the key 
would be the position in the file of the beginning of the line.  With 
that I could have the token be the key, and the line position as the 
value, and use a hash table in the reducer to determine if the token 
appeared in every line.  However, I found that the key actually seems to be 
the position of the input split.  I figured this out because, rather 
than getting 50,000 unique keys in the mapper (the number of lines in 
the file), I was getting 220 unique keys (the number of mappers/input 
splits).
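
To make the token/position idea concrete, here is a rough sketch in plain Java, with no Hadoop classes involved. It is only an illustration under stated assumptions: the class and method names are hypothetical, the offsets are character offsets that assume one-byte characters and a single newline per line, tokens are split on whitespace, and the total number of lines is taken from the in-memory list, which a real job would have to obtain some other way.

```java
import java.util.*;

// Hypothetical, Hadoop-free simulation of the approach described above.
final class TokenInEveryLine {

    // Returns the tokens that occur on every line of the input.
    static Set<String> tokensInEveryLine(List<String> lines) {
        // "Map" step: record (token -> set of line start offsets).
        Map<String, Set<Long>> offsetsByToken = new TreeMap<>();
        long offset = 0; // start position of the current line (assumed 1 byte per char)
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    offsetsByToken
                        .computeIfAbsent(token, k -> new HashSet<>())
                        .add(offset);
                }
            }
            offset += line.length() + 1; // +1 for the assumed '\n' terminator
        }

        // "Reduce" step: a token is in every line if it was seen at as many
        // distinct line offsets as there are lines.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Long>> entry : offsetsByToken.entrySet()) {
            if (entry.getValue().size() == lines.size()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The per-token hash set stands in for the hash table in the reducer. The sketch only works if each line gets its own distinct key, which is exactly the part I can't get out of the mapper input.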

So, what should I do?

Thanks,
Jim
