lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Does Lucene Java 2.3.2 supports parsing of Microsoft office 2007 documents...
Date Fri, 27 Jun 2008 13:13:10 GMT
Lucene itself doesn't parse any document format. What happens is that
some other program extracts the text from each file into an indexable
stream, and that stream is what gets indexed. In the old days that
program was POI.
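
To make the "parse first, then index" split concrete, here is a minimal
sketch of the extraction half for one format, using only the JDK. It
assumes a .docx file is a zip archive whose main text lives in
word/document.xml, with paragraphs in w:p elements and text runs in w:t
elements. The class and method names are made up for illustration, and
this is not what the Lucene demo does internally; a real setup would
more likely rely on an existing parser library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: turns a .docx stream into plain text for an indexer. */
final class DocxTextExtractor {
    // Assumed location of the main document part inside the zip archive.
    private static final String MAIN_PART = "word/document.xml";
    // WordprocessingML namespace used by the w:p and w:t elements.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text, one line per paragraph, or "" if the part is missing. */
    static String extractText(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return paragraphsToText(readFully(zip));
                }
            }
        }
        return "";
    }

    // Copies the current zip entry into memory so the XML parser cannot
    // close the underlying archive stream.
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t runs of each w:p paragraph, one paragraph per line.
    private static String paragraphsToText(byte[] xmlBytes) throws IOException {
        Document xml;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmlBytes));
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("Could not parse " + MAIN_PART, e);
        }
        StringBuilder text = new StringBuilder();
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }
}
```

The returned string is the "indexable stream": it would then be handed
to Lucene as the contents of a text field. That indexing step is left
out here so the sketch stays free of external dependencies.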

I confess I haven't used the latest demo, but I assume that under the
covers there's some program installed that Microsoft documents are
pushed through to get indexable tokens. So the real question is
whether that program handles the documents you're interested in.

I know this isn't very helpful, but you'll have to dig into this in some
detail if you really want to index Microsoft documents. If you don't
need to, then you don't need to waste time on this issue.

Best
Erick

On Fri, Jun 27, 2008 at 7:08 AM, Kumar Gaurav <gaurav.kumar@spsoftindia.com>
wrote:

> Dear all,
>
>
>
> Currently I am using the Lucene Java 2.3.2 demo to parse Microsoft 2003 and
> 2007 docs and PDF files.
>
> It is able to parse files with *.pdf, *.doc, *.xls etc.
>
> But it does not find matches in Microsoft 2007 documents.
>
> It does show that it is indexing *.docx and other Microsoft 2007 files.
>
>
>
> Does Lucene Java support parsing of the extensions *.docx, *.pptx, *.mpp, i.e.
> Microsoft Office 2007 documents?
>
> If it does, what should be done in the Lucene 2.3.2 demo to run search queries
> on files with the above-mentioned extensions?
>
>
>
> Thanks
>
> Kumar
>
>
