hadoop-common-dev mailing list archives

From Matt Wood <matt.w...@sanger.ac.uk>
Subject Distributed indexing
Date Mon, 28 Apr 2008 14:50:00 GMT
Hello all,

I was wondering if someone in the know could tell me about the current  
state of play with building and searching large indices with Hadoop?

Some background: I work on the human genome project, and we're  
currently setting up a new facility based around the next generation  
of DNA sequencing. We're currently producing around 50Tb of data a  
week, some of which we would like to provide fast access to via an  

Having read up on Hadoop, it appears that it could play a central part  
in our infrastructure, and that others have tried (and succeeded in)  
building a distributed indexing and retrieval system with Hadoop. I'd  
be interested if anyone could point me in the right direction to more  
information or examples of such a system. Yahoo! (with webmap) seems  
to be close to the sort of thing we would need.

Would map/reduce be a suitable approach for indexing _and_ retrieval,  
or just indexing? Would Solr/Lucene be a good fit? Any help or  
pointers to more information would be much appreciated!

If you would like any more details, I'd be more than happy to supply  

Many thanks,

~ Matt


Matt Wood
Sequencing Informatics // Production Software

 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
