hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Distributed indexing
Date Mon, 28 Apr 2008 15:49:52 GMT

Check out the Bailey and Katta projects on SourceForge.

Also take a look at Nutch.

Hadoop is certainly good for indexing, and it isn't that hard to put
distributed search alongside Hadoop, with indexes being pulled from HDFS to
local storage or RAM for speed.
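
To make the split between batch indexing and online search concrete, here is
a minimal sketch in plain Python. It is not Hadoop, Lucene, Bailey or Katta
code; every name in it (map_tokens, build_shard_index, partition, search) is
hypothetical and the in-memory dicts only stand in for index shards. The
assumption is that a map/reduce job builds one inverted index per shard, and
that separate search servers each load one shard and answer queries by
scatter-gather.

```python
from collections import defaultdict


def map_tokens(doc_id, text):
    """Map step: emit (term, doc_id) pairs for one document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def build_shard_index(docs):
    """Reduce step: group doc ids by term into one shard's inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term, emitted_id in map_tokens(doc_id, text):
            index[term].add(emitted_id)
    return dict(index)


def partition(docs, num_shards):
    """Assign documents to shards round-robin over sorted ids (deterministic)."""
    shards = [{} for _ in range(num_shards)]
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % num_shards][doc_id] = docs[doc_id]
    return shards


def search(shard_indexes, query):
    """Send the query to every shard and merge the hits (AND over terms)."""
    terms = query.lower().split()
    hits = set()
    for index in shard_indexes:
        postings = [index.get(term, set()) for term in terms]
        if postings:
            hits |= set.intersection(*postings)
    return sorted(hits)


if __name__ == "__main__":
    # Toy records standing in for real documents (made-up data).
    docs = {
        "r1": "ACGT read chr1",
        "r2": "TTGA read chr2",
        "r3": "ACGT contig chr2",
    }
    # Batch side: what the indexing job would produce, one index per shard.
    indexes = [build_shard_index(shard) for shard in partition(docs, 2)]
    # Online side: each shard would live in local storage or RAM on a server.
    print(search(indexes, "acgt"))
    print(search(indexes, "read chr2"))
```

In this sketch only the index build is shaped like map/reduce; the search
function is an ordinary fan-out over shards that are already loaded, which is
the part that would run outside Hadoop.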

On 4/28/08 7:50 AM, "Matt Wood" <matt.wood@sanger.ac.uk> wrote:

> Hello all,
> I was wondering if someone in the know could tell me about the current
> state of play with building and searching large indices with hadoop?
> Some background: I work on the human genome project, and we're
> currently setting up a new facility based around the next generation
> of DNA sequencing. We're currently producing around 50 TB of data a
> week, some of which we would like to provide fast access to via an
> index.
> Having read up on hadoop, it appears that it could play a central part
> in our infrastructure, and that others have tried (and succeeded) in
> building a distributed indexing and retrieval system with hadoop. I'd
> be interested if anyone could point me in the right direction to more
> information or examples of such a system. Yahoo! (with webmap) seems
> to be close to the sort of thing we would need.
> Would map/reduce be a suitable approach for indexing _and_ retrieval,
> or just indexing? Would Solr/Lucene be a good fit? Any help or
> pointers to more information would be much appreciated!
> If you would like any more details, I'd be more than happy to supply
> them!
> Many thanks,
> ~ Matt
> -------------
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
