hadoop-common-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2951) A contrib package to update an index using Map/Reduce
Date Wed, 12 Mar 2008 09:20:46 GMT
In-Reply-To: <1732673116.1204817458160.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577771#action_12577771
] 

Enis Soztutar commented on HADOOP-2951:
---------------------------------------

I have not examined the patch in sufficient detail, but it seems good. I think we can include
this in the contrib directory unless anyone objects. 

> A contrib package to update an index using Map/Reduce
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2951
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2951
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Ning Li
>         Attachments: contrib_index.tar.gz
>
>
> This contrib package provides a utility to build or update an index
> using Map/Reduce.
> A distributed "index" is partitioned into "shards". Each shard corresponds
> to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
> contains the main() method which uses a Map/Reduce job to analyze documents
> and update Lucene instances in parallel.
> The Map phase of the Map/Reduce job formats, analyzes and parses the input
> (in parallel), while the Reduce phase collects and applies the updates to
> each Lucene instance (again in parallel). The updates are applied using the
> local file system where a Reduce task runs and then copied back to HDFS.
> For example, if the updates caused a new Lucene segment to be created, the
> new segment would be created on the local file system first, and then
> copied back to HDFS.
> When the Map/Reduce job completes, a "new version" of the index is ready
> to be queried. It is important to note that the new version of the index
> is not derived from scratch. By leveraging Lucene's update algorithm, the
> new version of each Lucene instance will share as many files as possible
> with the previous version.
> The main() method in UpdateIndex requires the following information for
> updating the shards:
>   - Input formatter. This specifies how to format the input documents.
>   - Analysis. This defines the analyzer to use on the input. The analyzer
>     determines whether a document is being inserted, updated, or deleted.
>     For inserts or updates, the analyzer also converts each input document
>     into a Lucene document.
>   - Input paths. This provides the location(s) of updated documents,
>     e.g., HDFS files or directories, or HBase tables.
>   - Shard paths, or index path with the number of shards. Either specify
>     the path for each shard, or specify an index path, in which case the
>     shards are the sub-directories of the index directory.
>   - Output path. When the update to a shard is done, a message is put here.
>   - Number of map tasks.
> All of the information can be specified in a configuration file. All but
> the first two can also be specified as command line options. Check out
> conf/index-config.xml.template for other configurable parameters.
> Note: Because of the parallel nature of Map/Reduce, the behaviour of
> multiple inserts, deletes or updates to the same document is undefined.
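
To make the shard layout in the description above concrete, here is a minimal sketch of how shard sub-directories might be derived from an index path and a shard count, and how a document id might be mapped to one shard. Everything in it is an assumption for illustration: the class ShardSketch, the zero-padded directory names, and the hash-based routing are hypothetical and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of the contrib package. */
final class ShardSketch {
    private ShardSketch() {}

    /** Derives one sub-directory per shard under an index path (naming scheme is assumed). */
    public static List<String> shardPaths(String indexPath, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // Drop a trailing slash so the joined paths do not contain "//".
        String base = indexPath.endsWith("/")
                ? indexPath.substring(0, indexPath.length() - 1)
                : indexPath;
        List<String> paths = new ArrayList<>(numShards);
        for (int i = 0; i < numShards; i++) {
            paths.add(String.format(Locale.ROOT, "%s/%05d", base, i));
        }
        return paths;
    }

    /** Picks a shard for a document id by hashing it (assumed routing policy). */
    public static int shardFor(String docId, int numShards) {
        // floorMod keeps the result in [0, numShards) even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        List<String> shards = shardPaths("/user/demo/index", 3);
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> " + shards.get(shardFor(id, shards.size())));
        }
    }
}
```

Under this assumed policy, every operation on a given document id lands in the same shard. That alone does not fix the order in which those operations are applied, which is consistent with the note that multiple updates to the same document are undefined.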

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

