hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Using Hadoop for Record storage
Date Thu, 12 Apr 2007 16:46:38 GMT
I'm curious what others will say about Hadoop.  I'll just recommend BDB, as I have had good
experience combining Lucene indices in which only the id field is stored with BDBs that store
and retrieve the data for the set of ids in a given search result.
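
A minimal sketch of that split, in Python for brevity.  The two classes are hypothetical
in-memory stand-ins (a dict of postings in place of a Lucene index that stores only the id,
and a plain dict in place of a BDB keyed by id); none of the names below are Lucene or BDB
API calls, they only illustrate the search-then-fetch flow:

```python
from collections import defaultdict


class IdOnlyIndex:
    """Stand-in for a Lucene index whose only stored field is the document id."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        # Index the text, but keep nothing except the id for each term.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


class RecordStore:
    """Stand-in for a BDB-style key/value store keyed by document id."""

    def __init__(self):
        self._records = {}

    def put(self, doc_id, record):
        self._records[doc_id] = record

    def get_many(self, doc_ids):
        # Fetch the full records for the ids in one search result.
        return [self._records[i] for i in doc_ids if i in self._records]


def index_document(index, store, doc_id, record):
    """Write the searchable text to the index and the full record to the store."""
    index.add(doc_id, record["text"])
    store.put(doc_id, record)


def search_records(index, store, query):
    """Search the index for ids, then load the matching records from the store."""
    return store.get_many(index.search(query))
```

The point of the design is that the index never carries the document text or metadata, so the
record store can be sized, placed and tuned independently of it.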


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Andy Liu <andyliu1227@gmail.com>
To: hadoop-user@lucene.apache.org
Sent: Tuesday, April 10, 2007 5:41:36 PM
Subject: Using Hadoop for Record storage

Currently I'm working on a search application that uses Lucene.  Many of the
fields I index in Lucene are stored fields, because I need to retrieve the
actual text and metadata of each document, and subsequently present the data
to the user.

We're starting to work with tens of millions of documents, so scalability of
our application is a concern that we're currently addressing.  One specific
point we're looking at is whether or not it makes sense to use Lucene as
strictly an inverted index, and store the document text and metadata in a
different type of datastore.  From my understanding, the advantages of doing
this are:

1. Indexing will be faster, since stored fields would otherwise need to be
written and then rewritten during Lucene segment merging.
2. The separation affords more flexibility: if, say, I want multiple
indexes and a distributed search, the record data can be distributed
differently/separately from the Lucene index.
3. Maybe it is possible to select a datastore technology that would be
faster than Lucene at retrieving document data, especially in the 30M-50M
document collection range.

I'm exploring the possibility of using the Hadoop records framework to store
these document records on disk.  Here are my questions:

1. Is this a good application of the Hadoop records framework, keeping in
mind that my goals are speed and scalability?  I'm assuming the answer is
yes, especially considering Nutch uses the same approach.

2. Is Hadoop records the fastest and most scalable technology to tackle this
problem?  Are there other record storage technologies out there that you can
recommend?  I'm assuming traditional RDBMSs would not scale as gracefully,
although if anybody has successfully tackled this problem using a
traditional database, please let me know.

