hadoop-common-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Distributed indexing
Date Mon, 28 Apr 2008 15:21:37 GMT

You probably want to mail core-user, not core-dev....

Here is what I wrote on java-user@lucene yesterday (in answer to Samuel Gao's question there):

There are actually several distributed indexing or searching projects in Lucene (the top-level
ASF Lucene project, not Lucene Java), and it's time to start thinking about the possibility
of bringing them together, finding commonalities, etc.

Here is the summary:
- Lucene - distributed search via ParallelMultiSearcher.  How you split indices/shards is
up to you.
- Solr - distributed search via SOLR-303 (see DistributedSearch on its Wiki).  How you split
indices/shards is up to you.
- Nutch - distributed search via its org.apache.nutch.ipc (I think).  How you split indices/segments
is up to you.
- Nutch - see the bottom of http://wiki.apache.org/nutch/Nutch2Architecture for a new push
to come up with shard management tools
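
The distributed search options above share one pattern: send the query to every shard, collect each shard's top hits, and merge them into one ranked list. Below is a minimal sketch of that pattern in Python. It does not use the Lucene, Solr, or Nutch APIs; the in-memory shards, the term-frequency scoring, and the names search_shard and search_all are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Return up to k (score, doc_id) hits from one shard.

    A shard here is a plain dict of doc_id -> text (an assumption);
    the score is the raw term frequency.
    """
    term = term.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def search_all(shards, term, k=3):
    """Query every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda s: search_shard(s, term, k), shards)
        merged = [hit for hits in partial for hit in hits]
    # Keep only the k best hits across all shards.
    return heapq.nlargest(k, merged)

# Hypothetical data, split into two shards by hand.
shards = [
    {"d1": "genome index genome", "d2": "hadoop cluster"},
    {"d3": "genome sequencing", "d4": "genome genome genome data"},
]
top_hits = search_all(shards, "genome", k=2)
print(top_hits)
```

Each shard only needs to return its own top k hits, since no document outside a shard's top k can appear in the global top k. How documents are assigned to shards is left open here, just as in the projects listed above.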

There is also Hadoop:
- Using MapReduce + HDFS to build a single Lucene index in a distributed fashion (see contrib/index
in Hadoop).
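
To give a rough idea of how index building maps onto MapReduce, here is a toy sketch in plain Python. It is not the contrib/index code and it does not produce a Lucene index; it only builds a simple term -> document list, with a dict standing in for the shuffle step. The function names and the sample documents are assumptions.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(term, doc_ids):
    """Reduce: collapse all doc ids for a term into one sorted posting list."""
    return term, sorted(doc_ids)

def build_index(docs):
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())

# Hypothetical input documents.
docs = {"d1": "human genome", "d2": "genome index", "d3": "index build"}
index = build_index(docs)
print(index["genome"])
```

In a real job the map calls would run on many machines and the grouping would be done by the framework. This sketch covers indexing only; serving queries is a separate concern.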

There is also a GridLucene project somewhere on the web...
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Matt Wood <matt.wood@sanger.ac.uk>
> To: core-user@hadoop.apache.org
> Cc: core-dev@hadoop.apache.org
> Sent: Monday, April 28, 2008 4:50:00 PM
> Subject: Distributed indexing
> Hello all,
> I was wondering if someone in the know could tell me about the current  
> state of play with building and searching large indices with hadoop?
> Some background: I work on the human genome project, and we're  
> currently setting up a new facility based around the next generation  
> of DNA sequencing. We're currently producing around 50Tb of data a  
> week, some of which we would like to provide fast access to via an  
> index.
> Having read up on hadoop, it appears that it could play a central part  
> in our infrastructure, and that others have tried (and succeeded) in  
> building a distributed indexing and retrieval system with hadoop. I'd  
> be interested if anyone could point me in the right direction to more  
> information or examples of such a system. Yahoo! (with webmap) seems  
> to be close to the sort of thing we would need.
> Would map/reduce be a suitable approach for indexing _and_ retrieval,  
> or just indexing? Would Solr/Lucene be a good fit? Any help or  
> pointers to more information would be much appreciated!
> If you would like any more details, I'd be more than happy to supply  
> them!
> Many thanks,
> ~ Matt
> -------------
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
> -- 
>  The Wellcome Trust Sanger Institute is operated by Genome Research 
>  Limited, a charity registered in England with number 1021457 and a 
>  company registered in England with number 2742969, whose registered 
>  office is 215 Euston Road, London, NW1 2BE. 
