hadoop-common-user mailing list archives

From "Mahadev Konar" <maha...@yahoo-inc.com>
Subject RE: Is Hadoop Really the right framework for me?
Date Thu, 10 Jul 2008 22:15:15 GMT
I think 
src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java
is what you want.

Mahadev
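
For readers unfamiliar with that class: NLineInputFormat gives each map task a fixed number of input lines (N) rather than a byte-sized chunk. To my knowledge, the old `mapred` API sets N through the `mapred.line.input.format.linespermap` property, and the key each mapper sees is the line's byte offset in the file, not a line number. The plain-Java sketch below only simulates that grouping so the idea can be run without a cluster. The class `NLineSplitSketch`, its `Split` type, and the `splitByLines` method are hypothetical names made up for this illustration and are not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use or reproduce the Hadoop classes.
final class NLineSplitSketch {

    /** One simulated split: the byte offset where it starts, plus its lines. */
    static final class Split {
        final long startOffset;
        final List<String> lines;

        Split(long startOffset, List<String> lines) {
            this.startOffset = startOffset;
            this.lines = lines;
        }
    }

    /** Groups newline-separated text into splits of at most linesPerSplit lines. */
    static List<Split> splitByLines(String text, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<Split> splits = new ArrayList<>();
        if (text.isEmpty()) {
            return splits;
        }
        List<String> current = new ArrayList<>();
        long offset = 0;      // byte offset of the line being read
        long splitStart = 0;  // byte offset of the first line in the current split
        for (String line : text.split("\n")) {
            if (current.isEmpty()) {
                splitStart = offset;
            }
            current.add(line);
            // +1 accounts for the '\n' that terminated the line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (current.size() == linesPerSplit) {
                splits.add(new Split(splitStart, current));
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(new Split(splitStart, current)); // final, possibly short, split
        }
        return splits;
    }

    public static void main(String[] args) {
        String file = "Hello World\nHello Hadoop\nGoodbye Hadoop\n";
        // With one line per split, each simulated mapper gets exactly one line.
        for (Split s : splitByLines(file, 1)) {
            System.out.println(s.startOffset + " -> " + s.lines);
        }
    }
}
```

With one line per split, the three-line sample file from the original question yields three splits. Recovering a line number from the byte offset would still need extra bookkeeping.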

> -----Original Message-----
> From: Michael Bieniosek [mailto:michael@powerset.com]
> Sent: Thursday, July 10, 2008 3:09 PM
> To: core-user@hadoop.apache.org; Sandy
> Subject: Re: Is Hadoop Really the right framework for me?
> 
> My understanding is that Hadoop doesn't know where the line breaks are
> when it divides up your file, so each mapper will get some equally-sized
> chunk of the file containing some number of lines.  It then does some
> patching so that each mapper gets only whole lines, but this does mean
> that 1) you can't guarantee that each map task will contain exactly one
> line (though you can set the number of mappers high enough so that all
> mappers get zero or one lines), and 2) you can't get the line numbers back.
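
The "patching" described above can be illustrated with a small plain-Java sketch. It assumes a simplified rule: a line belongs to the byte range that contains its first byte, so a reader skips a partial leading line and reads past the end of its range to finish its last line. This is a simplification for illustration, not Hadoop's actual record reader. `SplitBoundarySketch` and `linesForRange` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of turning fixed byte ranges into whole lines.
final class SplitBoundarySketch {

    /** Returns the whole lines owned by the byte range [start, end). */
    static List<String> linesForRange(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // If the range starts mid-line, the leading bytes are the tail of a
        // line owned by the previous range, so skip to the next line start.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }
        // Read every line that starts inside the range, even if it ends beyond it.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "Hello World\nHello Hadoop\nGoodbye Hadoop\n"
                .getBytes(StandardCharsets.UTF_8);
        int chunk = 10; // equally sized byte ranges, chosen without regard to line breaks
        for (int start = 0; start < data.length; start += chunk) {
            int end = Math.min(start + chunk, data.length);
            System.out.println("[" + start + "," + end + ") -> "
                    + linesForRange(data, start, end));
        }
    }
}
```

With larger ranges some readers get several lines. With ranges this small, every reader gets zero or one lines. In neither case does a reader learn the line number, only the byte position.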
> 
> -Michael
> 
> On 7/10/08 2:47 PM, "Sandy" <snickerdoodle08@gmail.com> wrote:
> 
> Hello,
> 
> I have been posting on the forums for a couple of weeks now, and I really
> appreciate all the help that I've been receiving. I am fairly new to Java,
> and even newer to the Hadoop framework. While I am sufficiently impressed
> with Hadoop, quite a bit of the underlying functionality is masked from
> the user (which, while I understand it is the point of a Map Reduce
> framework, can be a touch frustrating for someone who is still trying to
> learn their way around), and the documentation is sometimes difficult to
> navigate. I have thus far been unable to find a sufficient answer to this
> question on my own.
> 
> My goal is to implement a fairly simple map reduce algorithm. My question
> is, "Is Hadoop really the right framework to use for this algorithm?"
> 
> I have one very large file containing multiple lines of text. I want to
> assign a mapper job to each line. Furthermore, the mapper needs to be able
> to know which line it is processing. If we were thinking about this in
> terms of the Word Count example, let's say we have a modification where we
> want to see where the words came from, rather than just the count of the
> words.
> 
> 
> For this example, we have the file:
> 
> Hello World
> Hello Hadoop
> Goodbye Hadoop
> 
> 
> I want to assign a mapper to each line. Each mapper will emit a word and
> its corresponding line number. For this example, we would have three
> mappers (call them m1, m2, and m3). Each mapper will emit the following:
> 
> m1 emits:
> <"Hello", 1> <"World", 1>
> 
> m2 emits:
> <"Hello", 2> <"Hadoop", 2>
> 
> m3 emits:
> <"Goodbye", 3> <"Hadoop", 3>
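
The per-line emit step in the example above could be sketched in plain Java as follows. This assumes the line number is somehow handed to the mapper, which is exactly the part the thread says Hadoop does not provide directly. `LineMapperSketch` and its `map` method are hypothetical names, not a Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the mapper described in the question.
final class LineMapperSketch {

    /** Emits one (word, lineNumber) pair per whitespace-separated word. */
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line yields no pairs
                out.add(new SimpleEntry<>(word, lineNumber));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"Hello World", "Hello Hadoop", "Goodbye Hadoop"};
        for (int i = 0; i < lines.length; i++) {
            // Line numbers are 1-based to match the m1, m2, m3 example.
            System.out.println("m" + (i + 1) + " emits: " + map(i + 1, lines[i]));
        }
    }
}
```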
> 
> 
> My reduce function will count the number of words based on the
> -instances- of line numbers they have, which is necessary because I wish
> to use the line numbers for another purpose.
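
One way to read the reduce step described above: collect the line numbers for each word, then take the size of that collection as the count. The plain-Java sketch below follows that reading. `LineReducerSketch`, `groupByWord`, and `count` are hypothetical names, and the in-memory grouping merely stands in for the shuffle a MapReduce framework would perform.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle and reduce steps.
final class LineReducerSketch {

    /** Groups (word, lineNumber) pairs into word -> list of line numbers. */
    static Map<String, List<Integer>> groupByWord(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** The word count is the number of line-number instances kept for the word. */
    static int count(List<Integer> lineNumbers) {
        return lineNumbers.size();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("Hello", 1));
        pairs.add(new SimpleEntry<>("World", 1));
        pairs.add(new SimpleEntry<>("Hello", 2));
        pairs.add(new SimpleEntry<>("Hadoop", 2));
        pairs.add(new SimpleEntry<>("Goodbye", 3));
        pairs.add(new SimpleEntry<>("Hadoop", 3));
        for (Map.Entry<String, List<Integer>> e : groupByWord(pairs).entrySet()) {
            System.out.println(e.getKey() + ": count=" + count(e.getValue())
                    + " lines=" + e.getValue());
        }
    }
}
```

Keeping the list of line numbers, rather than only a running total, leaves them available for whatever later purpose they are needed for.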
> 
> 
> I have tried Hadoop Pipes and the Hadoop Python interface. I am now
> looking at the Java interface, and am still puzzled about quite how to
> implement this, mainly because I don't see how to assign mappers to lines
> of files, rather than to files themselves. From what I can see from the
> documentation, Hadoop seems to be more suitable for applications that deal
> with multiple files rather than multiple lines. I want it to be able to
> spawn, for any input file, a number of mappers corresponding to the number
> of lines. There can be a cap on the number of mappers spawned (e.g. 128)
> so that if the number of lines exceeds the number of mappers, then the
> mappers can concurrently process lines until all lines are exhausted. I
> can't see a straightforward way to do this using the Hadoop framework.
> 
> Please keep in mind that I cannot put each line in its own separate file;
> the number of lines in my file is sufficiently large that this is really
> not a good idea.
> 
> 
> Given this information, is Hadoop really the right framework to use? If
> not, could you please suggest alternative frameworks? I am currently
> looking at Skynet and Erlang, though I am not too familiar with either.
> 
> I would appreciate any feedback. Thank you for your time.
> 
> Sincerely,
> 
> -SM

