lucene-general mailing list archives

From Srikant Jakilinki <sriks6...@bluebottle.com>
Subject Re: Lucene-based Distributed Index Leveraging Hadoop
Date Sat, 09 Feb 2008 07:15:09 GMT
Hi Ning,

Following up on our offline conversation, here is a public
expression of interest in your work and a description of ours. Apologies
in advance for the length. I hope folks here will be able to
collaborate, share experiences, or give us some pointers.

1) We are trying to leverage Lucene on Hadoop for blog archiving and
searching, i.e. ever-increasing data (in terabytes) on commodity hardware
in a generic LAN. These machines are neither high-spec nor dedicated;
they are used within the lab by users for day-to-day tasks.
Unfortunately, Nutch and Solr are not applicable to our situation, at
least not directly. Think of us as an academically oriented Technorati.

2) There are two aspects. One is that we want to archive the blog posts
that we hit under a UUID/timestamp taxonomy. This archive can be used
for many things, such as cached copies, diffing and surf acceleration.
The other aspect is to archive the indexes. You see, the indexes have a
lifecycle. For simplicity's sake, an index consists of one day's worth
of blog posts (roughly 15MM documents) and follows the <YYYYMMDD>
taxonomy. Ideally, we want to store an indefinite archive of blog posts
and their indexes side by side, but one year (365 days) is a start.
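
The naming scheme in point 2 could look something like the sketch
below. Only the UUID/timestamp idea and the <YYYYMMDD> index name come
from the description above; the function names, the path layout and the
use of UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone
from uuid import UUID, uuid4


def index_name(crawled_at: datetime) -> str:
    """Name of the daily index a post belongs to, in <YYYYMMDD> form."""
    return crawled_at.astimezone(timezone.utc).strftime("%Y%m%d")


def post_id(post_uuid: UUID, crawled_at: datetime) -> str:
    """Taxonomical name of one archived version of a post.

    The "<day>/<uuid>/<timestamp>" layout is hypothetical.
    """
    stamp = crawled_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{index_name(crawled_at)}/{post_uuid}/{stamp}"


if __name__ == "__main__":
    when = datetime(2008, 2, 9, 7, 15, 9, tzinfo=timezone.utc)
    print(index_name(when))
    print(post_id(uuid4(), when))
```

Keeping the day first means every post of one daily index shares a
prefix, so a post and its index can be matched up by name alone.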

3) We want to use the taxonomical name of the post as a dedicated ID
field in the Lucene index, and want to get away with not storing the
content of the post at all, only a file pointer/reference to it. We
hope this will keep the index sizes low, but the fact remains that this
is a case of multiple threads on multiple JVMs handling multiple indexes
on multiple machines. Further, the posts and indexes are mostly WORM
(write once, read many), but there may be situations where they have to
be updated: for example, if some blog posts have edited content, have to
be removed for copyright reasons, or are updated with metadata like
rank. There is some duplicate detection work to be done here, but it is
out of scope for now. And oh, the lab is a mixed Linux-Windows
environment.
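
The "index the content but store only a pointer" idea in point 3 can be
illustrated with a toy inverted index. This is not Lucene code; the
class and method names are hypothetical, and the whitespace tokenizer is
a deliberate simplification. In Lucene terms it roughly corresponds to
indexing the body with Field.Store.NO while storing the ID.

```python
from collections import defaultdict


class PointerOnlyIndex:
    """Toy inverted index: text is tokenized and indexed, never stored."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of post IDs
        self._pointers = {}                # post ID -> archive path

    def add(self, post_id, archive_path, text):
        # Re-adding an existing ID replaces the old entry (edited post).
        self.delete(post_id)
        self._pointers[post_id] = archive_path
        for term in text.lower().split():
            self._postings[term].add(post_id)

    def delete(self, post_id):
        # Covers removals, e.g. a post taken down for copyright reasons.
        if self._pointers.pop(post_id, None) is not None:
            for ids in self._postings.values():
                ids.discard(post_id)

    def search(self, term):
        # Returns (post ID, archive path) pairs; the caller fetches the
        # content from the archive through the pointer.
        ids = self._postings.get(term.lower(), set())
        return sorted((i, self._pointers[i]) for i in ids)
```

A search hit carries only the ID and the pointer, so showing a result
means one extra read from the archive.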

4) Our first port of call is to have Hadoop running on this group of
machines (without clustering or load balancing or grid or master/slave
mumbo jumbo) in the simplest way possible. The goal is to make
applications see the bunch of machines as a reliable, scalable,
fault-tolerant, average-performing file store with simple file CRUD
operations. For example, the blog crawler should be able to put the
blog posts into this HDFS in live or in batch mode. With about 20
machines, each fitted with a 240GB drive for the experiment, we have
about 4.5 TB of storage available.
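
A back-of-the-envelope check of the figures in points 2 and 4, as a
sketch only. It assumes decimal gigabytes and ignores HDFS replication,
filesystem overhead and whatever space the lab users need themselves, so
the real budget would be smaller.

```python
MACHINES = 20               # from point 4
DRIVE_GB = 240              # from point 4, assumed decimal GB
DAYS = 365                  # from point 2
POSTS_PER_DAY = 15_000_000  # from point 2 ("15MM")

raw_bytes = MACHINES * DRIVE_GB * 10**9
raw_tb = raw_bytes / 10**12   # decimal terabytes
raw_tib = raw_bytes / 2**40   # binary tebibytes
bytes_per_post = raw_bytes / (DAYS * POSTS_PER_DAY)

if __name__ == "__main__":
    print(f"raw capacity: {raw_tb:.2f} TB ({raw_tib:.2f} TiB)")
    print(f"budget per post over one year: {bytes_per_post:.0f} bytes")
```

The raw capacity comes out slightly above the quoted 4.5 TB in decimal
units and slightly below it in binary units. Spread over a year of
posts, it leaves under 1 KB per post for content and index together.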

5) Next we want to handle Lucene and exploit the structure of its index
and the algorithms behind it. Since a Lucene index is a directory of
files, we intend to 'tag' the files as belonging to one index and store
them on the HDFS. At any instant in time, an index can be regenerated
and used. The regenerated index is, however, not used directly from HDFS
but copied into the local filesystem of the indexer/searcher. This copy
is subject to change, and every once in a while the constituent files in
the HDFS are overwritten with the latest files. Hence, naming is quite
important to us. Even so, we feel that the number of files that have to
be updated is quite small and that we can use MD5 sums to make sure we
only update the files whose content has changed. However, this means
that out of the 4.5 TB available, we use half for archival and the other
half for searching. Even so, we should be able to store a year's worth
of posts and indexes. Disks are no problem.
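
The MD5-based update in point 5 might be sketched as follows. A plain
local directory stands in for HDFS here, the function names are
hypothetical, and deleting files that no longer exist locally is an
assumption (the description above only mentions overwriting).

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def md5_of(path: Path) -> str:
    """MD5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def sync_index(local_dir: Path, store_dir: Path) -> List[str]:
    """Push new or changed index files to the store; return their names."""
    store_dir.mkdir(parents=True, exist_ok=True)
    local = {p.name: p for p in local_dir.iterdir() if p.is_file()}
    copied = []
    for name, src in sorted(local.items()):
        dst = store_dir / name
        # Copy only when the file is missing or its content differs.
        if not dst.exists() or md5_of(dst) != md5_of(src):
            shutil.copyfile(src, dst)
            copied.append(name)
    # Assumption: files that vanished locally are dropped from the store.
    for stale in store_dir.iterdir():
        if stale.is_file() and stale.name not in local:
            stale.unlink()
    return copied
```

Hashing the stored copy on every sync means reading it back in full. A
real setup would more likely keep a small manifest of checksums next to
the index and compare against that.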

6) Right then. So, we have (365*15MM) posts and (365*LFi) Lucene file
segments on the HDFS. Suppose there are N machines online; then each
machine will have to own about 365/N indexes. N keeps changing, but at
any instant all 365 indexes should be live, and we are working on the
best way to achieve this kind of 'fair' autonomic computing cloud where,
when a machine goes down, the other machines add some indexes to their
kitty. If a machine is added, it relieves other machines of some
indexes. The searcher runs on each of these machines as a service
(IP:port), and queries are served using a ParallelMultiSearcher [across
the machines] and a MultiSearcher [within each machine] so that we need
not have an unmanageable number of JVMs per machine: at most 1 for
Hadoop, 1 for Cloud and 1 for Search. We are wondering whether Solr can
be used for search, if it supports multiple indexes on the same machine.
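
One possible way to get the behaviour described in point 6 is rendezvous
(highest-random-weight) hashing. This is a technique swapped in for
illustration, not the approach the project has settled on. The index
names and machine addresses below are made up.

```python
import hashlib
from collections import Counter


def owner(index_name, machines):
    """Machine with the highest hash score for this index."""
    def score(machine):
        digest = hashlib.md5(f"{machine}|{index_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(machines, key=score)


def assign(index_names, machines):
    """Map every index to its owning machine."""
    return {name: owner(name, machines) for name in index_names}


if __name__ == "__main__":
    # Stand-ins for the 365 <YYYYMMDD> index names.
    indexes = [f"day-{d:03d}" for d in range(365)]
    # Hypothetical searcher addresses (IP:port).
    machines = [f"10.0.0.{i}:9000" for i in range(1, 21)]

    before = assign(indexes, machines)
    after = assign(indexes, machines[1:])  # the first machine goes down
    moved = [n for n in indexes if before[n] != after[n]]

    print(Counter(before.values()))
    print(len(moved), "indexes changed owner")
```

When a machine leaves, only the indexes it owned are handed to others; a
machine that joins takes indexes from the existing ones and nothing else
moves. The split is only roughly 365/N per machine, so strict fairness
would need an extra balancing step.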

As you can see, this is not a simple endeavour, and it is obvious, I
suppose, that we are still at the theory stage and only now getting to
know the Lucene projects better. There is a huge body of work, albeit
not acknowledged in the scientific community as it should be, and I want
to say kudos to all who have been responsible for it.

I wish and hope to draw on the collective consciousness to mount our
challenge. Any pointers, code, help or collaboration on any of the
6 points above are, it goes without saying, welcome, and we look
forward to sharing our experiences in a formal written discourse as and
when we have them.

Cheers,
Srikant

Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene
>
