lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Help on DOCX and XLSX
Date Wed, 07 Mar 2012 10:32:44 GMT
You'll have to find something that parses the formats you are
interested in and extracts the text you want.  Apache Tika comes to
mind.
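
To illustrate what "parsing the format" means here, below is a minimal JDK-only sketch. It is not Tika and not a recommended implementation. It relies on the fact that a .docx file is a ZIP package whose main text lives in the part word/document.xml, inside w:t elements grouped by w:p paragraphs. The class name DocxTextSketch and the helper tinyDocx are hypothetical names made up for this example. It ignores headers, footers, footnotes and embedded objects, which a real parser such as Tika would handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxTextSketch {
    // WordprocessingML namespace used by the w:p and w:t elements.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, one paragraph per line. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return paragraphs(readFully(zip));
                }
            }
        }
        return ""; // no main document part found
    }

    // Copies the current ZIP entry into memory before handing it to the XML parser.
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t text runs of each w:p paragraph.
    private static String paragraphs(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the files may be untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        StringBuilder text = new StringBuilder();
        NodeList paras = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    /** Hypothetical helper: builds a minimal in-memory .docx for the demo. */
    public static byte[] tinyDocx(String... paragraphs) throws IOException {
        StringBuilder xml =
            new StringBuilder("<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = tinyDocx("Hello Lucene", "Second paragraph");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

The returned string is what you would put into the "contents" field of the Lucene document.  An .xlsx file is the same kind of ZIP package, but most of its cell text sits in a different part (xl/sharedStrings.xml), which is one reason a library that already knows all these formats is the easier route.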

Why are you using such an old version of Lucene?  Why aren't you using
Solr?  That might just work for you out of the box.  See also
http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika

As for the size, I wouldn't worry about it.  Disk space is cheap.  If
you really do care, scan the FAQ at
http://wiki.apache.org/lucene-java/LuceneFAQ, which has lots of useful
info on this and all sorts of other things.


--
Ian.


On Wed, Mar 7, 2012 at 9:40 AM, Prasad KVSH <Prasad.Kokepudi@ness.com> wrote:
> Dear All,
>
> We have started using Lucene version 3.0.3. We have different types of
> documents (PDF, XLS, XLSX, DOC, DOCX, TXT, etc.) in a specified
> folder.
>
> We created an index on these files using IndexFiles.java. The index
> takes 17.2 MB for 69.4 MB of documents. It was created using
> StandardAnalyzer with limited index fields. We are able to search for
> a given text in PDF (text content only), *.doc and *.xls
> (MS Office 1997-2003 versions) files only.
>
> Now I need help indexing .docx and .xlsx files. How can I run
> indexing on these files? They are ignored when we do a string
> search.
>
> The writer is defined as follows:
>
> IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new
> StandardAnalyzer(Version.LUCENE_CURRENT), true,
> IndexWriter.MaxFieldLength.LIMITED);
>
> Another question concerns the size of the index folder: can we
> reduce it?
>
> Thanks
>
> Prasad
>
