From: "John Howland" <john.d.howland@gmail.com>
To: core-user@hadoop.apache.org
Date: Sun, 7 Sep 2008 19:22:06 -0400
Subject: Getting started questions

I've been reading up on Hadoop for a while now, and I'm excited to finally be getting my feet wet with the examples plus my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections that range from 10,000 to 10,000,000 documents. What is the best way to store this data for effective processing?

- The bodies of the documents usually range from 1 KB to 100 KB in size, but some outliers can be as big as 4-5 GB.
- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense, or should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once? Am I right in thinking that I should treat each document as a record and map across documents, or do I need to be more creative about what I'm mapping across?
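To make that concrete, here's roughly what I imagine rolling my own would look like: the input records carry document paths (one per line, alongside the metadata), and the mapper streams each body from HDFS instead of loading it whole. This is an untested sketch against the 0.18 mapred API; DocBodyMapper and the path-per-line input layout are just my own invention:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Input: text records, one HDFS path to a document body per line.
// The body is read as a stream, so even a 4-5 GB outlier never has
// to fit in the mapper's heap all at once.
public class DocBodyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    Path doc = new Path(value.toString().trim());
    FSDataInputStream in = fs.open(doc);
    try {
      byte[] buf = new byte[64 * 1024];
      long bytes = 0;
      for (int n; (n = in.read(buf)) > 0; ) {
        bytes += n;            // real per-document work would happen here
        reporter.progress();   // keep the task alive during long reads
      }
      out.collect(new Text(doc.getName()), new LongWritable(bytes));
    } finally {
      in.close();
    }
  }
}

Is that the right shape, or is there a canned way to get the same effect?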
2. Some of the tasks I want to run are pure map operations (no reduce), where I'm calculating new metadata fields for each document. To end up with a good result set, I'll need to copy the entire input record plus the new fields into another set of output files. Is there a better way? (A sketch of the map-only driver I have in mind is in the P.S. below.)

I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything written about it: I'll need to run several MR jobs as a pipeline. Is there any way for the map tasks of a later stage to begin processing data from the previous stage's reduce tasks before those reducers have fully finished? (The sequential chaining I'm doing today is also sketched in the P.S.)

Whatever insight folks could lend me would be a big help in crossing the chasm from Word Count and the associated examples to something more "real".

A whole heap of thanks in advance,

John
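P.S. To pin down question 2: this is the map-only driver I have in mind, with setNumReduceTasks(0) so the mappers' output is written straight to the output files with no sort/shuffle. EnrichMapper is a hypothetical stand-in that appends one new field to each record:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class EnrichJob {

  // Hypothetical: pass each record through with a new field appended.
  public static class EnrichMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(value, new Text("lang=en"));  // placeholder new field
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(EnrichJob.class);
    conf.setJobName("enrich-metadata");
    conf.setMapperClass(EnrichMapper.class);
    conf.setNumReduceTasks(0);  // map-only: mapper output is the job output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}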
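And for question 3, this is all I know how to do today: run the stages back-to-back, with each runJob() blocking until the previous stage's reducers have committed. buildStage() is a hypothetical stand-in for the real per-stage configuration; what I'm asking is whether stage 2's maps can overlap stage 1's reduces instead of waiting like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class Pipeline {

  // Hypothetical helper: configure one stage reading `in`, writing `out`.
  static JobConf buildStage(String name, String in, String out) {
    JobConf conf = new JobConf(Pipeline.class);
    conf.setJobName(name);
    FileInputFormat.setInputPaths(conf, new Path(in));
    FileOutputFormat.setOutputPath(conf, new Path(out));
    // per-stage mapper/reducer setup would go here
    return conf;
  }

  public static void main(String[] args) throws Exception {
    String tmp = args[1] + "-intermediate";  // handoff directory between stages
    JobClient.runJob(buildStage("stage1", args[0], tmp));  // blocks until stage 1 is done
    JobClient.runJob(buildStage("stage2", tmp, args[1]));  // only then do stage 2's maps start
  }
}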