lucene-general mailing list archives

From Veselin K <vese...@campbell-lange.net>
Subject Re: Indexing local PDF/Doc/XLS files with Solr?
Date Sun, 05 Apr 2009 12:09:26 GMT
Hello, I think the latest tarball worked for me out of the box.

I'm trying to design my schema at present.
My goal is to index PDF/Doc/XLS files with the following fields:

0. ID number
1. Filename
2. File path
3. Modification date
4. File contents 
5. Number of pages

- Any tips on what field types I should use to get this data indexed?
  (See the rough schema sketch below for what I have in mind.)

- Is there a way to get the ID number incremented automatically by Solr,
  each time a document is added to the index?

- Would I be able to extract all of the information above using just
  the Solr/Tika features, or would I have to source every value myself
  except "file contents" and pass them to Solr when indexing?


Thank you very much.

Regards,
Veselin K


On Sat, Dec 27, 2008 at 09:29:05PM -0500, Grant Ingersoll wrote:
> Can you provide details about the parts of the examples that weren't
> clear?  Perhaps I can clean up the docs or help you figure it out.
>
> -Grant
>
> On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:
>
>> Hello,
>> I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.
>>
>> Could you please point me to any other instructions/HowTos available
>> online on integrating Tika or maybe the RichDocumentHandler with Solr?
>> Apart from the Solr Wiki, that is, as following those examples did not
>> help in my case.
>>
>>
>> Thank you.
>>
>> Veselin K.
>>
>>
>> On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
>>> Thank you Erik, Hoss.
>>>
>>> - If using either Solr's "stream.file" or Nutch's crawler,
>>>  what is the procedure for adding new files?
>>>  That is to say, if I did not know which files in a specific folder
>>>  were new and simply passed all of them to Solr/Nutch, would it skip
>>>  the ones that have already been indexed?
>>>
>>> - Also, if a file gets modified, would Solr/Nutch detect the change
>>>  and re-index just that modified file? Or would some kind of cache
>>>  need to be cleared and everything re-indexed?
>>>
>>> - In order to give the user the option to search the indexes of two
>>>  separate Solr/Nutch servers, do I need to link both servers somehow
>>>  and join their indexes into one, or is it just a question of
>>>  designing the web front-end so that it offers the choice of sending
>>>  the search query to one or several different servers?
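>>>
>>>  (For what it's worth, I've seen a "shards" request parameter
>>>  mentioned for querying several Solr instances from a single
>>>  front-end, along the lines of
>>>
>>> 	http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr
>>>
>>>  but I don't know whether that's the right tool here, so please
>>>  correct me if not.)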
>>>
>>>
>>> Thank you,
>>> Veselin K
>>>
>>>
>>> On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
>>>>
>>>> : the easiest way to get rolling.  A simple script that recurses your
>>>> : folders and issues a simple request posting each file in turn to
>>>> : Solr will give you a full text searchable index in no time (well,
>>>> : ok, it'll take a little time, but it'll be as fast as anything else
>>>> : out there).
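>>>>
>>>> for instance, a rough, untested sketch of such a script (it assumes
>>>> the SOLR-284 extracting handler is installed and mapped at
>>>> /update/extract, and that the file paths need no URL escaping):
>>>>
>>>> 	#!/bin/sh
>>>> 	# post each PDF under /data/docs to Solr, one file per request
>>>> 	find /data/docs -type f -name '*.pdf' | while read -r f; do
>>>> 	  curl -s "http://localhost:8983/solr/update/extract?literal.id=$f" \
>>>> 	       -F "file=@$f"
>>>> 	done
>>>> 	# commit once at the end
>>>> 	curl -s http://localhost:8983/solr/update \
>>>> 	     -H 'Content-Type: text/xml' --data-binary '<commit/>'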
>>>>
>>>> if all the files are "local" on the machine that Solr is running on,
>>>> you don't even need to POST them; Solr can be configured to read the
>>>> files by local filename using the "stream.file" param...
>>>>
>>>> 	http://wiki.apache.org/solr/ContentStream
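>>>>
>>>> e.g. something along these lines (again assuming the extracting
>>>> handler from above, plus enableRemoteStreaming="true" in
>>>> solrconfig.xml, which stream.file requires):
>>>>
>>>> 	curl "http://localhost:8983/solr/update/extract?stream.file=/data/docs/report.pdf&literal.id=report1&commit=true"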
>>>>
>>>> that said: if your fileserver implementation already exposes all of
>>>> the files over HTTP, then using Nutch and its crawler might be an
>>>> easier way to get started on indexing all of them ... hard to say
>>>> without being in your shoes.  you may want to experiment with both.
>>>>
>>>>
>>>>
>>>> -Hoss
>>>>
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
