nutch-user mailing list archives

From Andrew Naylor <naylor...@gmail.com>
Subject Re: desktop search
Date Tue, 16 Aug 2011 04:15:37 GMT
> If immediate reindexing of modified documents is strictly required you may
> need to drop Nutch and go for a stand-alone Solr with a lot of scripting and
> some file alteration monitor you can use cross-platform.

Thanks Markus, I'll see if I really need that.  One thing I might do is
simply use an existing desktop indexer and just use Tika to parse files
(mostly I want to get a list of indexed terms).
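
As a rough sketch of what "a list of indexed terms" could look like once Tika has already turned a file into plain text: the snippet below only tokenizes and counts. The function name `term_counts` and the simple lowercase alphanumeric tokenizer are assumptions made for illustration; this is not Lucene's or Solr's analysis chain, and raw counts are only a stand-in for real term weights.

```python
import re
from collections import Counter


def term_counts(text, min_length=2):
    """Count lowercase alphanumeric tokens in already-extracted text.

    Hypothetical helper: a crude stand-in for a real analyzer.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop very short tokens; min_length is an arbitrary illustrative cutoff.
    return Counter(t for t in tokens if len(t) >= min_length)


counts = term_counts("Nutch crawls files; Solr indexes files.")
```

Calling `counts.most_common()` then gives the terms ordered by frequency, which may be enough to decide whether a full indexer is needed at all.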

On Mon, Aug 15, 2011 at 1:43 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:

>
> > The KDE thing is very interesting, thanks for the link!  I was hoping for
> > something cross-platform though.
>
> KDE is almost pure Qt so most of it is cross-platform. You might want to
> check with their lists for details and feasibility.
>
> >
> > As regards using Nutch: how would it handle file updates?  It seems to me a
> > Web crawler would only get new files and changes on each crawl, whereas a
> > desktop search engine like Spotlight for instance indexes a file as soon as
> > it gets made or modified.
>
> Nutch will crawl a (local) URL and then advance its fetch time by a constant
> interval (default 30 days) or by an amount based on some algorithm. Once that
> fetch time is reached, the URL becomes eligible for refetch and all the
> associated processing.
>
> You can also hook up some file alteration monitor daemon that can run some
> script to reindex a specific file in Solr. This cannot be done through Nutch:
> it will not recrawl and index a URL that is not eligible for fetch.
> Bypassing Nutch is not a big problem, as both Nutch and Solr use the Tika libs
> for document parsing, but it may become a problem if the two use different
> versions and if you have custom Nutch plugins.
> In short: forced reindexing of a given URL cannot go through Nutch.
>
> >
> > There's also this document I found on the Web: it describes some problems
> > with using Nutch on the personal scale owing to its specialization for web
> > crawling. It says there is a limit on files crawled per directory, and on
> > the size of files crawled.  This was all I was able to find under "Nutch
> > desktop search" in Google.  However, now that I look at it more closely
> > it's from 2004, so it seems to me Nutch might have gotten rid of these
> > problems in the interim.
>
> There are limits indeed, but they are configurable: the number of outlinks
> (which applies to directory listings as well), the max content limit and such.
>
> If immediate reindexing of modified documents is strictly required you may
> need to drop Nutch and go for a stand-alone Solr with a lot of scripting and
> some file alteration monitor you can use cross-platform.
>
> Good luck
>
> >
> > http://docs.google.com/viewer?a=v&q=cache:bDjjs__eYPcJ:www.commercenet.com/images/0/06/CN-TR-04-04.pdf+nutch+desktop+search&hl=en&gl=us&pid=bl&srcid=ADGEESg12Bq0VDGk3FpevwOHIdbfr1bCkEZ3CH1yojEliyfeCJv_3JhGRe1gMPx66LiywsUYFWJhKKzsLBVoCtATNcghrW4DRLWlT5sd4YhIWMVaQjMKs5xN-8vqTOHFV2pw9bzCtoQY&sig=AHIEtbTpxSL0xmZJxa5CWm8MzDWD4vyAAg
> >
> > Thanks,
> >
> > Andrew
> >
> > On Mon, Aug 15, 2011 at 6:07 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > > With Nutch you can crawl your FS with ease and index to a Solr instance.
> > > It'll surely work. But you may also be interested in the cool KDE
> > > technologies that are specifically built for desktop search.
> > >
> > > http://thomasmcguire.wordpress.com/2009/10/03/akonadi-nepomuk-and-strigi-explained/
> > >
> > > On Monday 15 August 2011 04:41:11 Andrew Naylor wrote:
> > > > Any suggestions for the best way to get desktop search in the
> > > > Lucene/Solr/Nutch/Tika ecosystem?  I want to be able to access (from my
> > > > own program) lists of terms that are indexed and weights for each file,
> > > > for example, but if a filesystem indexer and index updater already exists
> > > > somewhere I'd like to use it rather than write my own.
> > > >
> > > > I'm planning on working in Clojure, btw, not that that should make any
> > > > difference.
> > > >
> > > > Thanks,
> > > >
> > > > Andrew
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
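
For reference, the "file alteration monitor plus script" idea from the thread could look roughly like the sketch below. It polls modification times instead of using a platform-specific notification API, which keeps it cross-platform at the cost of latency. It assumes a Solr instance with the extracting request handler enabled at `/update/extract`; the helper names (`snapshot`, `changed_files`, `extract_url`, `reindex`, `watch`), the base URL, and the use of the file path as the document id are all illustrative assumptions, not something taken from the thread.

```python
import os
import time
import urllib.parse
import urllib.request


def snapshot(root):
    """Map every regular file under root to its modification time (ns)."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except OSError:
                continue  # file vanished between listing and stat
    return mtimes


def changed_files(old, new):
    """Return, sorted, the paths that are new or whose mtime differs."""
    return sorted(p for p, m in new.items() if old.get(p) != m)


def extract_url(solr_base, path):
    """Build the update URL; using the path as literal.id is an assumption."""
    query = urllib.parse.urlencode({"literal.id": path, "commit": "true"})
    return solr_base.rstrip("/") + "/update/extract?" + query


def reindex(solr_base, path):
    """POST the raw file to Solr so it is parsed (via Tika) and indexed."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        extract_url(solr_base, path),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def watch(root, solr_base, interval=5.0):
    """Poll root forever, reindexing whatever changed since the last pass."""
    seen = snapshot(root)
    while True:
        time.sleep(interval)
        current = snapshot(root)
        for path in changed_files(seen, current):
            reindex(solr_base, path)
        seen = current
```

Something like `watch("/home/me/Documents", "http://localhost:8983/solr")` would start the loop (both arguments are placeholders). Deleted files are not handled here; a fuller version would also send delete requests for paths that disappear between snapshots.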
