hadoop-common-user mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: Getting started questions
Date Mon, 08 Sep 2008 04:02:46 GMT

John Howland wrote:
> I've been reading up on Hadoop for a while now and I'm excited that I'm
> finally getting my feet wet with the examples + my own variations. If anyone
> could answer any of the following questions, I'd greatly appreciate it.
> 1. I'm processing document collections, with the number of documents ranging
> from 10,000 - 10,000,000. What is the best way to store this data for
> effective processing?

AFAIK Hadoop can handle a large number of small files, but it doesn't 
perform well with them.  So it would be better to read in the documents 
and store them in SequenceFile or MapFile format.  This would be similar 
to the way the Fetcher works in Nutch.  10M documents in a sequence/map 
file on DFS is comparatively small and can be handled efficiently.
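Something like this untested sketch, using the classic SequenceFile API; the 
input directory argument and the idea of using the base file name as the 
document id are assumptions for illustration:

```java
// Pack many small local documents into one SequenceFile for DFS.
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackDocs {

  // Derive a document id from a path (here: just the base name).
  static String keyFor(String path) {
    int slash = path.lastIndexOf('/');
    return slash < 0 ? path : path.substring(slash + 1);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("docs.seq"), Text.class, Text.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        String body = new String(Files.readAllBytes(f.toPath()), "UTF-8");
        // key = document id, value = document body
        writer.append(new Text(keyFor(f.getPath())), new Text(body));
      }
    } finally {
      writer.close();
    }
  }
}
```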

>  - The bodies of the documents usually range from 1K-100KB in size, but some
> outliers can be as big as 4-5GB.

I would say store your document bodies as Text objects.  Text is backed 
by a Java byte array, so it tops out around 2GB; the same limit applies 
to BytesWritable, which is just an array of bytes.  Either way, you are 
going to have memory issues reading in and writing out records that large.

>  - I will also need to store some metadata for each document which I figure
> could be stored as JSON or XML.
>  - I'll typically filter on the metadata and then doing standard operations
> on the bodies, like word frequency and searching.

It is possible to create an OutputFormat that writes out multiple files. 
You could also use a MapWritable as the value to store the document 
and associated metadata.
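For the MapWritable route, something like this sketch; the field names 
("body", "author", "date") are made up for the example:

```java
// Bundle a document body and its metadata fields into one MapWritable
// value, keyed by Text field names.
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class DocValue {

  static MapWritable docValue(String body, String author, String date) {
    MapWritable value = new MapWritable();
    value.put(new Text("body"), new Text(body));     // the document itself
    value.put(new Text("author"), new Text(author)); // metadata fields
    value.put(new Text("date"), new Text(date));
    return value;
  }
}
```

In the map task you can then filter on the metadata entries and only touch 
the body entry when a record passes the filter.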

> Is there a canned FileInputFormat that makes sense? Should I roll my own?
> How can I access the bodies as streams so I don't have to read them into RAM

A Writable is read fully into RAM, so even treating it like a stream 
doesn't get around that.

One thing you might want to consider is to tar up, say, X documents at a 
time and store each tarball as a file in DFS.  You would have many of these 
files.  Then keep an index of the keys (document ids) and their offsets 
within those files.  That index can be passed as input into a MR job 
that can then go to DFS and stream out each document as it is needed.  The 
job will be slower because you are doing it this way, but it is a solution 
to handling such large documents as streams.
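The seek-and-stream part could look like this sketch; the tab-separated 
index line format ("docId \t dfsFile \t offset") is an assumption:

```java
// Given one index line, seek into the DFS file and stream the document
// from its recorded offset instead of loading it into RAM.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamDoc {

  // Split one index line into {docId, dfsFile, offset}.
  static String[] parseIndexLine(String line) {
    return line.split("\t", 3);
  }

  static void streamDoc(Configuration conf, String indexLine)
      throws IOException {
    String[] parts = parseIndexLine(indexLine);
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(parts[1]));
    try {
      in.seek(Long.parseLong(parts[2])); // jump to this document's entry
      // ... read the tar entry from the stream here, chunk by chunk
    } finally {
      in.close();
    }
  }
}
```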

> all at once? Am I right in thinking that I should treat each document as a
> record and map across them, or do I need to be more creative in what I'm
> mapping across?
> 2. Some of the tasks I want to run are pure map operations (no reduction),
> where I'm calculating new metadata fields on each document. To end up with a
> good result set, I'll need to copy the entire input record + new fields into
> another set of output files. Is there a better way? I haven't wanted to go
> down the HBase road because it can't handle very large values (for the
> bodies) and it seems to make the most sense to keep the document bodies
> together with the metadata, to allow for the greatest locality of reference
> on the datanodes.

If you don't specify a reducer, the IdentityReducer is run, which simply 
passes the map output through.
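For a truly map-only job you can also set the number of reduce tasks to 
zero, which skips the sort/shuffle entirely and writes map output straight 
to the output files.  A configuration sketch on the classic JobConf API 
(MyDriver and MyMapper are hypothetical classes):

```java
// Map-only job: zero reduce tasks means no sort/shuffle phase at all.
JobConf job = new JobConf(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);  // map output goes directly to the output dir
FileInputFormat.addInputPath(job, new Path("docs.seq"));
FileOutputFormat.setOutputPath(job, new Path("docs-out"));
JobClient.runJob(job);
```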

> 3. I'm sure this is not a new idea, but I haven't seen anything regarding
> it... I'll need to run several MR jobs as a pipeline... is there any way for
> the map tasks in a subsequent stage to begin processing data from previous
> stage's reduce task before that reducer has fully finished?

Yup, just chain the jobs: use FileOutputFormat.getOutputPath(previousJobConf) 
as the input path of the next job.
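A sketch of the chaining, assuming both JobConfs are configured elsewhere; 
note that with the classic JobClient each job runs to completion before the 
next one starts:

```java
// Feed stage one's output directory in as stage two's input.
Path stage1Out = FileOutputFormat.getOutputPath(stage1Conf);
FileInputFormat.addInputPath(stage2Conf, stage1Out);
JobClient.runJob(stage1Conf);  // blocks until stage one finishes
JobClient.runJob(stage2Conf);
```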

> Whatever insight folks could lend me would be a big help in crossing the
> chasm from the Word Count and associated examples to something more "real".
> A whole heap of thanks in advance,
> John
