lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Help on DOCX and XLSX
Date Wed, 07 Mar 2012 11:26:20 GMT
So you want to index different fields and search on those fields, and
are asking whether you can do that in Lucene?  The answer is yes.

I still think you should look at Solr but if you are determined to use
Lucene, get hold of a copy of the second edition of Lucene In Action
http://www.manning.com/hatcher3/.
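
To make the idea concrete, here is a minimal plain-Java sketch of storing the key values next to the content and applying the restrictions in the same pass as the text match. It deliberately uses no Lucene API: the class FieldFilterSketch, the Doc type, the sample file names and the field values are all hypothetical, and the linear scan only stands in for a real index lookup. In Lucene the equivalent would be to add each key value as its own field on the document and to combine the text query with the restrictions in one query, so the follow-up database lookup should no longer be needed.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FieldFilterSketch {

    /** One indexed document: free-text content plus two of the key values from the thread. */
    static final class Doc {
        final String name;      // hypothetical file name
        final int typeId;       // "Document Type ID"
        final LocalDate date;   // "Document Date"
        final String content;   // extracted text

        Doc(String name, int typeId, LocalDate date, String content) {
            this.name = name;
            this.typeId = typeId;
            this.date = date;
            this.content = content;
        }
    }

    /**
     * Returns the names of documents whose content contains the term AND whose
     * type ID matches AND whose date lies in [from, to] (both ends inclusive).
     * This is a linear scan, not an inverted index; it only models the logic.
     */
    static List<String> search(List<Doc> index, String term, int typeId,
                               LocalDate from, LocalDate to) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Doc d : index) {
            boolean textMatch = d.content.toLowerCase(Locale.ROOT).contains(needle);
            boolean typeMatch = d.typeId == typeId;
            boolean dateMatch = !d.date.isBefore(from) && !d.date.isAfter(to);
            if (textMatch && typeMatch && dateMatch) {
                hits.add(d.name);
            }
        }
        return hits;
    }

    /** Made-up sample data, for illustration only. */
    static List<Doc> sampleIndex() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("a.docx", 1, LocalDate.of(2012, 2, 10), "Invoice for Q1 services"));
        docs.add(new Doc("b.xlsx", 2, LocalDate.of(2012, 2, 15), "Invoice totals by region"));
        docs.add(new Doc("c.pdf", 1, LocalDate.of(2011, 11, 30), "Invoice archive 2011"));
        docs.add(new Doc("d.txt", 1, LocalDate.of(2012, 3, 1), "Meeting notes"));
        return docs;
    }

    public static void main(String[] args) {
        // Text match on "invoice", restricted to type 1 and the first quarter of 2012.
        List<String> hits = search(sampleIndex(), "invoice", 1,
                LocalDate.of(2012, 1, 1), LocalDate.of(2012, 3, 31));
        System.out.println(hits); // prints [a.docx]
    }
}
```

Only a.docx passes all three checks: b.xlsx has a different type ID, c.pdf falls outside the date range, and d.txt does not contain the search term. The remaining key values (author ID, status, group ID) would be handled the same way as the type ID.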


--
Ian.


On Wed, Mar 7, 2012 at 11:13 AM, Prasad KVSH <Prasad.Kokepudi@ness.com> wrote:
> Hi Ian,
>
> Thanks for your quick reply.
>
> Our documents will have the following common key information like
>
> 1. Document Type ID,
> 2. Document Date,
> 3. Document Author ID,
> 4. Document Status
> 5. Document Group ID.
>
> While creating the index, we would like to add the above key values
> alongside the content, so that a search on Document Type ID or Date
> Range does not have to read the entire index.  Can we implement this
> approach?
>
> Currently the text search runs against the index, and we then filter
> the matching documents by reading each document's record from a
> database table to check the above key values.
>
> Thanks
> Prasad
>
>
>
> -----Original Message-----
> From: Ian Lea [mailto:ian.lea@gmail.com]
> Sent: Wednesday, March 07, 2012 4:03 PM
> To: java-user@lucene.apache.org
> Subject: Re: Help on DOCX and XLSX
>
> You'll have to find something that parses the formats you are interested
> in and extracts the text you want.  Apache Tika comes to mind.
>
> Why are you using such an old version of Lucene?  Why aren't you using
> Solr?  That might just work for you out of the box.  See also
> http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika
>
> As for the size, I wouldn't worry about it.  Disk space is cheap.  If
> you really do care, scan the FAQ at
> http://wiki.apache.org/lucene-java/LuceneFAQ.  Lots of useful info on
> all sorts of things.
>
>
> --
> Ian.
>
>
> On Wed, Mar 7, 2012 at 9:40 AM, Prasad KVSH <Prasad.Kokepudi@ness.com>
> wrote:
>> Dear All,
>>
>>
>>
>> We have started using Lucene version 3.0.3.  We have different types
>> of documents (PDF, XLS, XLSX, DOC, DOCX, TXT, etc.) in a specified
>> folder.
>>
>>
>>
>> We have created an index on these files (using IndexFiles.java).  The
>> index took 17.2 MB for 69.4 MB of documents.  It was created using
>> StandardAnalyzer with a limited set of index fields.  We are able to
>> search for a given text in PDF (text content only), *.doc and *.xls
>> (MS Word 1997-2003 versions) files only.
>>
>>
>>
>> Now I need help with indexing .docx and .xlsx files.  How can I run
>> indexing on these files?  They are ignored when we do a string
>> search.
>>
>>
>>
>> Writer is defined as below:
>>
>> IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new
>> StandardAnalyzer(Version.LUCENE_CURRENT), true,
>> IndexWriter.MaxFieldLength.LIMITED);
>>
>>
>>
>> Another question is about the size of the index folder: can we
>> optimize it?
>>
>>
>>
>> Thanks
>>
>> Prasad
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

