hadoop-common-dev mailing list archives

From "Ning Li (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1913) [HBase] Build a Lucene index on an HBase table
Date Tue, 18 Sep 2007 02:31:43 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li updated HADOOP-1913:

    Attachment: build_table_index.take2.patch

Thanks for the comments!

> Shouldn't IndexConf extend HBaseConfiguration? Otherwise you won't have the hbase settings
in the mix. (Would IndexConfiguration be a better name than IndexConf?)

The content of an index configuration is actually a property value in an hbase configuration.
You can see an example in BuildTableIndex.java.

> You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME.  You should fix.
 Otherwise it won't apply when hudson tries to apply it.

Done in take2.

> The way you add the per-column config into a hadoop configuration is very cute.  I'm
unclear how multiple columns are done... Should there be a columns element to hold multiple
column elements?   I'd suggest you add javadoc with an example config ('cos trying to conjure
the xml produced by the code takes a little effort).

There is an example index configuration in BuildTableIndex.java. Configurations for a column
are in a "column" element. I'll add the example to javadoc once we agree on the best way to
do index configuration.
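
For illustration only (this is not code from the patch): a minimal Java sketch of the idea described above, in which per-column index settings are serialized to XML, with one "column" element per column, and stored as a single property value. It assumes java.util.Properties as a stand-in for a Hadoop/HBase configuration. The property key, the root element, the child element names and every class and method name here are hypothetical; see BuildTableIndex.java in the patch for the real layout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class IndexConfSketch {
    // Hypothetical property key; the key used by the patch may differ.
    static final String KEY = "example.index.conf";

    /** Hypothetical per-column settings: store, index, tokenize. */
    static final class ColumnSettings {
        final boolean store;
        final boolean index;
        final boolean tokenize;

        ColumnSettings(boolean store, boolean index, boolean tokenize) {
            this.store = store;
            this.index = index;
            this.tokenize = tokenize;
        }
    }

    /** Escapes the characters that would break XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Serializes all columns into one XML string, one "column" element each. */
    static String toXml(Map<String, ColumnSettings> columns) {
        StringBuilder sb = new StringBuilder("<configuration>");
        for (Map.Entry<String, ColumnSettings> e : columns.entrySet()) {
            ColumnSettings c = e.getValue();
            sb.append("<column>")
              .append("<name>").append(escape(e.getKey())).append("</name>")
              .append("<store>").append(c.store).append("</store>")
              .append("<index>").append(c.index).append("</index>")
              .append("<tokenize>").append(c.tokenize).append("</tokenize>")
              .append("</column>");
        }
        return sb.append("</configuration>").toString();
    }

    /** Returns the text of the first child element with the given tag. */
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    /** Parses the XML produced by toXml back into per-column settings. */
    static Map<String, ColumnSettings> fromXml(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getElementsByTagName("column");
            Map<String, ColumnSettings> out = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element col = (Element) nodes.item(i);
                out.put(text(col, "name"), new ColumnSettings(
                    Boolean.parseBoolean(text(col, "store")),
                    Boolean.parseBoolean(text(col, "index")),
                    Boolean.parseBoolean(text(col, "tokenize"))));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("malformed index configuration", e);
        }
    }

    public static void main(String[] args) {
        Map<String, ColumnSettings> columns = new LinkedHashMap<>();
        columns.put("contents:", new ColumnSettings(false, true, true));
        columns.put("anchor:", new ColumnSettings(true, true, false));

        // The whole index configuration travels as one string property.
        Properties conf = new Properties();
        conf.setProperty(KEY, toXml(columns));

        Map<String, ColumnSettings> restored = fromXml(conf.getProperty(KEY));
        System.out.println(restored.keySet());
    }
}
```

Because the settings are just a string property, anything that ships the enclosing configuration to the tasks ships the index configuration with it, and multiple columns are simply repeated "column" elements.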

> Ning, have you tried your patch on a distributed cluster?  Does your column trick get
properly distributed out and your LuceneDocumentWrapper work in the distributed context?
> Did you use lucene 2.2 or something else?
> I had a problem compiling:

Oops. The compiling problem was my mistake (forgot to remove some unused code). All fixed
in take2. Yes, I included Lucene 2.2 in hbase/lib. And yes, I have tested on a distributed
cluster. Since the content of an index configuration is a property in an hbase configuration,
it does work properly in the distributed environment.

> [HBase] Build a Lucene index on an HBase table
> ----------------------------------------------
>                 Key: HADOOP-1913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1913
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Ning Li
>            Priority: Minor
>         Attachments: build_table_index.patch, build_table_index.take2.patch
> This patch provides a Reducer class and other related classes which help to build a Lucene
index on an HBase table. The index build part is similar to that of Nutch.
>   - Each row is modeled as a Lucene document: row key is indexed in its untokenized form,
column name-value pairs are Lucene field name-value pairs.
>   - IndexConf is used to configure various Lucene parameters, specify whether to optimize
an index and which columns to index and/or store, in tokenized or untokenized form, etc.
>   - The number of reduce tasks decides the number of indexes (partitions). The index(es)
are stored in the output path of the job configuration.
>   - The index build process is done in the reduce phase. Users can use the map phase
to join rows from different tables or to pre-parse/analyze column content, etc.
>   - A junit test is added to test the build of an index on an HBase table with an identity
mapper. It also serves as an example of how to use the new classes.
>   - BuildTableIndex is provided to help build an index on an HBase table. It should
be moved to an examples package if HBase decides to have one.
> This patch requires the inclusion of the Lucene library.
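
For illustration only (not code from the patch): a minimal Java sketch of the model in the description above, using plain stand-in classes instead of the Lucene API. Each row becomes a "document" whose first field is the untokenized row key, followed by one field per column, and each document is assigned to one of N partitions, where N is the number of reduce tasks. The field name "ROWKEY", the class names, and the choice of hash-based assignment are assumptions made for this sketch; in the patch, tokenization is configurable per column through IndexConf.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowIndexSketch {
    /** Stand-in for a Lucene field: name, value, and whether the value is tokenized. */
    static final class Field {
        final String name;
        final String value;
        final boolean tokenized;

        Field(String name, String value, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
        }
    }

    /** Stand-in for a Lucene document: an ordered list of fields. */
    static final class Doc {
        final List<Field> fields = new ArrayList<>();
    }

    /** Models one row: the row key is untokenized, columns become name-value fields. */
    static Doc toDocument(String rowKey, Map<String, String> columns) {
        Doc doc = new Doc();
        doc.fields.add(new Field("ROWKEY", rowKey, false)); // hypothetical field name
        for (Map.Entry<String, String> c : columns.entrySet()) {
            // Tokenized here for simplicity; assumed configurable in practice.
            doc.fields.add(new Field(c.getKey(), c.getValue(), true));
        }
        return doc;
    }

    /** Assumed hash-based assignment of a row key to a reduce task. */
    static int partitionFor(String rowKey, int numReduceTasks) {
        return (rowKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Builds one list of documents per reduce task, i.e. one index partition each. */
    static List<List<Doc>> buildPartitions(Map<String, Map<String, String>> rows,
                                           int numReduceTasks) {
        List<List<Doc>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            int p = partitionFor(row.getKey(), numReduceTasks);
            partitions.get(p).add(toDocument(row.getKey(), row.getValue()));
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (String key : new String[] {"row1", "row2", "row3", "row4"}) {
            Map<String, String> columns = new LinkedHashMap<>();
            columns.put("contents:", "text of " + key);
            rows.put(key, columns);
        }
        List<List<Doc>> partitions = buildPartitions(rows, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i).size() + " docs");
        }
    }
}
```

With two reduce tasks the sketch yields two partitions, matching the statement that the number of reduce tasks decides the number of indexes.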

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
