hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: How to use MapFile?
Date Mon, 13 Nov 2006 18:59:26 GMT
A good way to update a very large MapFile-based dataset is to:

1. Add new entries to SequenceFiles in a dataset.add directory.
2. Run a MapReduce job specifying both dataset and dataset.add as input 
directories.  If you need to update existing entries, specify a reduce 
function that merges existing entries with new entries.  Specify 
MapFileOutputFormat.  Specify dataset.new as the output directory.
3. Rename dataset.new to dataset.
4. Use MapFileOutputFormat.getReaders() and 
MapFileOutputFormat.getEntry() to randomly access entries in the dataset 
with a single read (the indexes are read into memory).  Or, for batch 
operations, use MapReduce directly on the dataset (as an input 
directory) to generate derivative datasets.

This is the way that, e.g., Nutch updates its crawl DB.
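
The update cycle in the steps above can be sketched in plain Java. This is
only an illustration of the merge semantics, not Hadoop code: a TreeMap
stands in for the sorted MapFile, a list of entries stands in for the
SequenceFiles of new records, and the class name, method name, sample keys,
and the "newest value wins" merge policy are all assumptions made for the
example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** Plain-Java stand-in for the update cycle; no Hadoop classes are used. */
class MapFileUpdateSketch {

    /**
     * Plays the role of the MapReduce job in step 2: combine the base dataset
     * with the new entries, apply the reduce function where a key appears in
     * both, and return a fresh result sorted by key. The base map is not
     * modified, mirroring the job writing to a separate output directory.
     */
    static TreeMap<String, String> mergeDatasets(
            Map<String, String> base,
            List<Map.Entry<String, String>> additions,
            BinaryOperator<String> reduce) {
        TreeMap<String, String> merged = new TreeMap<>(base);
        for (Map.Entry<String, String> e : additions) {
            // reduce(existing, incoming) runs only when the key already exists;
            // otherwise the incoming value is inserted as-is.
            merged.merge(e.getKey(), e.getValue(), reduce);
        }
        return merged;
    }

    public static void main(String[] args) {
        // The existing dataset (hypothetical sample records).
        TreeMap<String, String> dataset = new TreeMap<>();
        dataset.put("a.example", "fetched");
        dataset.put("b.example", "unfetched");

        // Step 1: new entries accumulate separately from the base dataset.
        List<Map.Entry<String, String>> added = new ArrayList<>();
        added.add(new AbstractMap.SimpleEntry<>("b.example", "fetched"));
        added.add(new AbstractMap.SimpleEntry<>("c.example", "unfetched"));

        // Step 2: merge both inputs; assumed policy: the newer value wins.
        TreeMap<String, String> datasetNew =
                mergeDatasets(dataset, added, (oldValue, newValue) -> newValue);

        // Step 3: the merged output replaces the old dataset.
        dataset = datasetNew;

        // Step 4: keyed lookup against the rebuilt dataset.
        System.out.println(dataset);
        // prints {a.example=fetched, b.example=fetched, c.example=unfetched}
        System.out.println(dataset.get("b.example")); // prints fetched
    }
}
```

The point of the sketch is that the base data is never appended to in place:
each update produces a complete, sorted replacement, which is why the job
writes to a separate output directory that is renamed afterwards.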


张茂森 wrote:
>  Hi all: 
> Now I want to do some operations like ‘update’ or ‘insert’, which can
> describe like this:
> 1. I have a base dataset
> 2. Everyday I will get more data from other places, and then I want to
> update or insert these new data into my base dataset. 
> 3. After I’ve read API Doc, I think MapFile is a good way to solve this
> problem. As far as I know, I only need to append my new data at the end of
> base dataset, and update the index file of MapFile. I understand right?
> 4.  If I am right, I want to know how to do these operations using MapFile. 
> Firstly, I could only find MapFileOutputFormat and couldn’t find
> MapFileInputFormat, so how to read the MapFile?
> Secondly, how to update the index and append the data? Do you have some
> experience or samples?
> Any suggestion would be appreciated.
> Thank you!
