hadoop-common-dev mailing list archives

From Samuel Guo <guosi...@gmail.com>
Subject Re: Distributed indexing
Date Mon, 28 Apr 2008 16:11:25 GMT
Ted Dunning wrote:
> Check out the bailey and katta projects on sourceforge.
>   
I get nothing when checking out the katta project on sourceforge :(
> Also take a look at Nutch.
>
> Hadoop is certainly good for indexing and it isn't that hard to put
> distributed search alongside hadoop with indexes being pulled from HDFS to
> local storage or RAM for speed.
>
>
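
A minimal sketch of the layout described above, as I understand it: index
shards are written to a shared store, each search node copies its shard to
local disk and loads it into RAM, and a front end fans the query out and
merges the hits. Everything here is an assumption for illustration: a plain
directory stands in for HDFS, a JSON term-to-doc-id map stands in for a real
Lucene index, and the function names (build_shard, publish_shards,
load_local, search) are made up, not part of Hadoop, Nutch, or katta.

```python
import json
import shutil
import tempfile
from pathlib import Path


def build_shard(docs):
    """Build a toy inverted index: term -> sorted list of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def publish_shards(shared_dir, shards):
    """Write each shard to the shared store (a stand-in for HDFS here)."""
    for n, docs in enumerate(shards):
        path = Path(shared_dir) / f"shard-{n}.json"
        path.write_text(json.dumps(build_shard(docs)))


def load_local(shared_dir, local_dir, n):
    """Copy one shard to local disk, then load it into RAM."""
    name = f"shard-{n}.json"
    local = Path(local_dir) / name
    shutil.copy(Path(shared_dir) / name, local)
    return json.loads(local.read_text())


def search(nodes, term):
    """Fan the query out to every node's in-memory shard and merge the hits."""
    hits = []
    for index in nodes:
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)


def demo():
    # Two hypothetical shards; in practice the indexing job would produce them.
    shards = [
        {"d1": "human genome reads", "d2": "sequencing run"},
        {"d3": "genome assembly", "d4": "index of reads"},
    ]
    with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
        publish_shards(shared, shards)
        nodes = [load_local(shared, local, n) for n in range(len(shards))]
        return search(nodes, "genome")
```

Calling demo() builds both shards, pulls them to a temporary local directory,
and searches them for "genome". In this split the batch side only builds and
publishes shards, while queries are answered from the local copies.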
> On 4/28/08 7:50 AM, "Matt Wood" <matt.wood@sanger.ac.uk> wrote:
>
>   
>> Hello all,
>>
>> I was wondering if someone in the know could tell me about the current
>> state of play with building and searching large indices with hadoop?
>>
>> Some background: I work on the human genome project, and we're
>> currently setting up a new facility based around the next generation
>> of DNA sequencing. We're currently producing around 50Tb of data a
>> week, some of which we would like to provide fast access to via an
>> index.
>>
>> Having read up on hadoop, it appears that it could play a central part
>> in our infrastructure, and that others have tried (and succeeded) in
>> building a distributed indexing and retrieval system with hadoop. I'd
>> be interested if anyone could point me in the right direction to more
>> information or examples of such a system. Yahoo! (with webmap) seems
>> to be close to the sort of thing we would need.
>>
>> Would map/reduce be a suitable approach for indexing _and_ retrieval,
>> or just indexing? Would Solr/Lucene be a good fit? Any help or
>> pointers to more information would be much appreciated!
>>
>> If you would like any more details, I'd be more than happy to supply
>> them!
>>
>> Many thanks,
>>
>> ~ Matt
>>
>>
>> -------------
>>
>> Matt Wood
>> Sequencing Informatics // Production Software
>> www.sanger.ac.uk
>>
>>
>>     
>
>   

