lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lukas Vlcek" <>
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 14:23:35 GMT
Does it make sense to consider using OpenOffice to convert from MS formats
to PDF or HTML before indexing. Would this yield me a lower fail rate as
opposed to pure POI approach? I don't care about formating now I care about
content in the first place. Formating would be important only in the case
that Nutch or other piece of software would be able to accommodate this
information into Lucene index (such that words in headline would yield
higher boost for example).

Couple of words about my motivation:

We released SharePoint 2007 in our company. We are not very satisfied with
its search capabilities so I started to looking for some alternatives. The
first thing I was looking at is Google Search Appliance as they claim it can
crawl, index and search SharePoint portals.

I realized that their integration with outer world is done via connector
manager which is itself an open source project written in Java and
Sharepoint connector implementation is as well released as a open source in
Java. This makes me think that I should be able to test Sharepoint connector
replacing GSA black box with Lucene,Nutch,Solr or whatever and test how well
this connector thing works. This would be a perfect test before we invest
more in GSA. On the other hand if I would be able to run the Sharepoint
connector without GSA (replaced by Lucene based product) then text
extraction from MS family formats can be the main impenetrable barrier.


On Tue, May 13, 2008 at 4:13 PM, Andrzej Bialecki <> wrote:

> Grant Ingersoll wrote:
> > I've used POI, as well as commercial providers.  As always, it depends
> > :-)  I wasn't particularly impressed with the commercial providers given the
> > amount of money they wanted for it.   PDF was particularly tricky, but you
> > weren't asking about that.   At least w/ POI, you have the opportunity to
> > fix things that don't work based on your priorities.  I don't know what the
> > failure rate is for the commercial providers, but my experience is they will
> > all fail at least once, so you better plan on it.  I'd look to use a
> > framework like Tika or Aperture, where you can easily upgrade or plug in new
> > or different libraries (including commercial providers) as needed w/o
> > rewriting your code.  Additionally, with something like Tika or Aperture,
> > you could easily mix and match your solutions, such that you use one for
> > Word and a different one for PPT or PDF.
> >
> > One issue with any of them is how you plan to use them.  If you need
> > more than bag of words, they all get less reliable, especially when it comes
> > to PDFs and Office docs.  Dealing with things like tables, columns,
> > captions, labels, etc. has always been problematic in my experience when one
> > wants to do higher level processing (beyond keyword search).
> >
> Yet another option ... In the past I used a licensed copy of MS Office to
> extract things that I wanted, using a bit of OLE automation and VBscript.
> Worked reasonably well, in the sense that I had no issues whatsoever with
> extracting the content _and_ formatting from any documents that could be
> normally opened with MS Office - however, performance was an issue, ie. it
> was slow, CPU/memory hog, and occasionally it would get stuck in a weird
> state when only complete reboot would help.
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message