lucene-general mailing list archives

From Veselin Kantsev
Subject Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Date Sat, 27 Dec 2008 20:42:46 GMT
I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.

Could you please point me to any other instructions/HowTos on integrating Tika or
maybe RichDocumentHandler with Solr that I can find online?
Apart from the Solr Wiki, that is, as following those examples did not help in my case.

Thank you.

Veselin K.

On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
> Thank you Erik, Hoss.
> - If using either Solr's "stream.file" or Nutch's crawler,
>   what is the procedure for adding new files?
>   That is to say, if I did not know which files in a specific
>   folder were new and so passed all files to Solr/Nutch, would it
>   skip the ones that have already been indexed?
> - Also, what if a file gets modified: would Solr/Nutch detect
>   the change and re-index just the modified file?
>   Or should some kind of cache be cleared and everything re-indexed?
> - In order to let the user search the indexes of
>   two separate Solr/Nutch servers, do I need to link both servers
>   somehow and join their indexes into one, or is it just a question of
>   designing the web front-end so that it offers the choice to send the
>   search query to one or multiple servers?
> Thank you,
> Veselin K
> On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
> > 
> > : the easiest way to get rolling.  A simple script that recurses your folders
> > : and issues a simple request posting each file in turn to Solr will give you a
> > : full text searchable index in no time (well, ok, it'll take a little time, but
> > : it'll be as fast as anything else out there).
> > 
> > if all the files are "local" on the machine that Solr is running on you 
> > don't even need to POST them, Solr can be configured to read the files by 
> > local filename using the "stream.file" param...
> > 
> >
> >
> > that said: if your fileserver implementation already exposes all of the 
> > files over HTTP, then using Nutch and its crawler might be an easier way 
> > to get started on indexing all of them ... hard to say without being in 
> > your shoes.  you may want to experiment with both.
> > 
> > 
> > 
> > -Hoss
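
For illustration, the "recurse your folders" script combined with the
"stream.file" param quoted above might look roughly like the sketch below.
It only builds the request URLs. The Solr base URL and the handler path are
assumptions made up for this example and are not taken from the thread;
substitute whatever your installation actually exposes.

```python
import os
from urllib.parse import urlencode

# Assumptions (not from the thread): the Solr base URL and the handler path.
SOLR_BASE = "http://localhost:8080/solr"  # hypothetical Tomcat deployment
HANDLER = "/update/rich"                  # hypothetical rich-document handler path


def stream_file_urls(root, extensions=(".pdf",)):
    """Yield one request URL per matching file under root.

    Each URL carries the absolute local path in the "stream.file"
    parameter, so Solr would read the file itself and nothing is uploaded.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                path = os.path.abspath(os.path.join(dirpath, name))
                # urlencode escapes the path so it is safe inside a query string
                yield SOLR_BASE + HANDLER + "?" + urlencode({"stream.file": path})
```

Each URL could then be requested in turn, for example with
urllib.request.urlopen(url). This only works if the files are local to the
Solr machine and Solr has been configured to read files by local filename,
as noted above.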
> > 
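
On the question of not knowing which files in a folder are new or modified:
one option that does not depend on what Solr or Nutch detect is to keep a
small manifest on the client side and pass along only the files that have
changed. The sketch below is an assumption-laden illustration, not a feature
of either project. The function name and the manifest format are invented
for this example, and the manifest file should live outside the indexed
folder.

```python
import json
import os


def changed_files(root, manifest_path):
    """Return files under root that are new or modified since the last run.

    The manifest is a JSON file mapping path -> [mtime_ns, size]. It is
    rewritten on every call to reflect the current state of the folder.
    """
    try:
        with open(manifest_path) as fh:
            seen = json.load(fh)
    except FileNotFoundError:
        seen = {}  # first run: every file counts as new

    current, changed = {}, []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            stamp = [info.st_mtime_ns, info.st_size]
            current[path] = stamp
            if seen.get(path) != stamp:
                changed.append(path)  # unseen path, or size/mtime differs

    with open(manifest_path, "w") as fh:
        json.dump(current, fh)
    return sorted(changed)
```

Note that this version updates the manifest whether or not indexing later
succeeds; a more careful script would record a file only after the server
has accepted it.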
