hadoop-common-user mailing list archives

From "Mahadev Konar" <maha...@yahoo-inc.com>
Subject RE: Is Hadoop Really the right framework for me?
Date Thu, 10 Jul 2008 22:15:15 GMT
I think 
is what you want.


> -----Original Message-----
> From: Michael Bieniosek [mailto:michael@powerset.com]
> Sent: Thursday, July 10, 2008 3:09 PM
> To: core-user@hadoop.apache.org; Sandy
> Subject: Re: Is Hadoop Really the right framework for me?
> My understanding is that Hadoop doesn't know where the line breaks are
> when it divides up your file, so each mapper will get some
> chunk of the file containing some number of lines.  It then does some
> adjustment so that you get only whole lines for each mapper, but this does
> mean 1) you can't guarantee that each map task will contain exactly one line
> (though you can set the number of mappers high enough so that all mappers
> get zero or one lines), and 2) you can't get the line numbers back.
> -Michael
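
The splitting behavior Michael describes can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's actual record reader: the function name `read_split`, the 16-byte split size, and the rule "a split owns every line whose first byte falls inside it" are assumptions made for the example. Each record is keyed by its byte offset rather than a line number, which is why line numbers cannot be recovered inside a single mapper.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield (byte_offset, line) for every line that begins in [start, end).

    Hypothetical helper: a split that does not start at byte 0 skips ahead
    to the first line boundary, and the last line of a split may run past
    `end`, so every mapper sees only whole lines.
    """
    pos = start
    if start != 0:
        # Find the newline that ends the line in progress at `start`.
        # Searching from start - 1 keeps a line that begins exactly at `start`.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return  # no line begins inside this split
        pos = nl + 1
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1


data = b"Hello World\nHello Hadoop\nGoodbye Hadoop\n"
split_size = 16  # assumed, deliberately tiny so that splits cut through lines
splits = [(s, min(s + split_size, len(data)))
          for s in range(0, len(data), split_size)]
records = [list(read_split(data, s, e)) for s, e in splits]

for (s, e), recs in zip(splits, records):
    print(f"split [{s}, {e}): {recs}")
```

With these inputs the first split receives two lines, the second receives one, and the third receives none, so the number of lines per mapper is not under the user's direct control.
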
> On 7/10/08 2:47 PM, "Sandy" <snickerdoodle08@gmail.com> wrote:
> Hello,
> I have been posting on the forums for a couple of weeks now, and I
> appreciate all the help that I've been receiving. I am fairly new to
> and even newer to the Hadoop framework. While I am sufficiently
> with the Hadoop, quite a bit of the underlying functionality is masked from
> the user (which, while I understand is the point of a Map Reduce Framework,
> can be a touch frustrating for someone who is still trying to learn their
> way around), and the documentation is sometimes difficult to navigate. I
> have thus far been unable to find a sufficient answer to this question on
> my own.
> My goal is to implement a fairly simple map reduce algorithm. My question
> is, "Is Hadoop really the right framework to use for this algorithm?"
> I have one very large file containing multiple lines of text. I want to
> assign a mapper job to each line. Furthermore, the mapper needs to be able
> to know what line it is processing. If we were thinking about this in terms
> of the Word Count Example, let's say we have a modification where we want to
> just see where the words came from, rather than just the count of the
> words.
> For this example, we have the file:
> Hello World
> Hello Hadoop
> Goodbye Hadoop
> I want to assign a mapper to each line. Each mapper will emit a word and its
> corresponding line number. For this example, we would have three mappers
> (call them m1, m2, and m3). Each mapper will emit the following:
> m1 emits:
> <"Hello", 1> <"World", 1>
> m2 emits:
> <"Hello", 2> <"Hadoop", 2>
> m3 emits:
> <"Goodbye",3> <"Hadoop", 3>
> My reduce function will count the number of words based on the number
> of line numbers they have, which is necessary, because I wish to use the
> line numbers for another purpose.
> I have tried Hadoop Pipes, and the Hadoop Python interface. I am now looking
> at the Java interface, and am still puzzled quite how to implement this,
> mainly because I don't see how to assign mappers to lines of files, rather
> than to files themselves. From what I can see from the documentation, Hadoop
> seems to be more suitable for applications that deal with multiple files
> rather than multiple lines. I want it to be able to spawn, for any input
> file, a number of mappers corresponding to the number of lines. There can be
> a limit on the number of mappers spawned (e.g. 128) so that if the number of
> lines exceeds the number of mappers, then the mappers can concurrently process
> lines until all lines are exhausted. I can't see a straightforward way to do
> this using the Hadoop framework.
> Please keep in mind that I cannot put each line in its own separate file;
> the number of lines in my file is sufficiently large that this is not
> a good idea.
> Given this information, is Hadoop really the right framework to use? If not,
> could you please suggest alternative frameworks? I am currently looking at
> Skynet and Erlang, though I am not too familiar with either.
> I would appreciate any feedback. Thank you for your time.
> Sincerely,
> -SM
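
The map and reduce steps Sandy describes can be sketched outside Hadoop as a small, single-process Python simulation. The names `map_line`, `reduce_word`, and `run` are hypothetical, and the line numbers here come from one sequential pass over the input, which is exactly the piece a distributed job does not provide for free. One possible workaround, offered as an assumption rather than a recommendation from the thread, is to write the line number into each line in a preprocessing step so that every mapper can read it from the record itself.

```python
from collections import defaultdict


def map_line(line_number, line):
    """Emit a (word, line_number) pair for each word on the line."""
    for word in line.split():
        yield word, line_number


def reduce_word(line_numbers):
    """Return the word's count together with the line numbers it came from."""
    return len(line_numbers), sorted(line_numbers)


def run(lines):
    """Simulate map, shuffle, and reduce in one process."""
    grouped = defaultdict(list)  # the shuffle step: group values by key
    for number, line in enumerate(lines, start=1):
        for word, line_number in map_line(number, line):
            grouped[word].append(line_number)
    return {word: reduce_word(numbers) for word, numbers in grouped.items()}


result = run(["Hello World", "Hello Hadoop", "Goodbye Hadoop"])
for word, (count, line_numbers) in sorted(result.items()):
    print(word, count, line_numbers)
```

The count for each word is the length of its list of line numbers, so the reducer keeps the line numbers available for later use instead of collapsing them into a total.
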
