hadoop-common-user mailing list archives

From Samuel LEMOINE <samuel.lemo...@lingway.com>
Subject Re: lucene with hadoop but without nutch, looking for documentation
Date Thu, 19 Jul 2007 07:40:16 GMT
I'm well aware of the two possibilities you're proposing, but I don't 
think either would fit with the existing software of the company I work 
for. I guess I'll have to crawl through Nutch's guts to find what I'm 
looking for and extract it. Once I've managed that, I'll try to 
write the tutorial that I'm missing today.
> Nutch is intended to handle large collections.  The simplest way to get hold
> of large collections is to simply search the web.
> But Nutch is not just a web search engine.  It also provides distributed
> creation of indexes and distributed search which is the motivation of my
> comment about it being the networked version of Lucene.
> So, while I agree with your statement that Nutch was "especially designed to
> deal with web documents", I would strongly disagree that this is a
> limitation.  For one thing, if you actually have gobs of documents, you
> probably will have to store them in a networked form somehow.  That
> networked form is probably pretty easy to make accessible via HTTP and that
> makes a web-oriented search engine like Nutch just what you need.
> Another way to say this is that if you need a general purpose
> networked/distributed search engine and you have a web-oriented distributed
> search engine, you can either adapt the search engine to not be web
> oriented, or you can adapt your collection to be web-oriented.
> On 7/18/07 8:32 AM, "Samuel LEMOINE" <samuel.lemoine@lingway.com> wrote:
>> You quote Nutch as being "the networked version of Lucene", but from
>> what I've seen it's more precise than that, especially designed to deal
>> with web documents... am I wrong in assuming this?
