lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay O'Leary" <jay.ole...@gmail.com>
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 15:49:40 GMT
If it's windows only, you can roll your own with IFilters (
http://www.ifilter.org/).

On Tue, May 13, 2008 at 10:23 AM, Lukas Vlcek <lukas.vlcek@gmail.com> wrote:

> Does it make sense to consider using OpenOffice to convert from MS formats
> to PDF or HTML before indexing. Would this yield me a lower fail rate as
> opposed to pure POI approach? I don't care about formating now I care
> about
> content in the first place. Formating would be important only in the case
> that Nutch or other piece of software would be able to accommodate this
> information into Lucene index (such that words in headline would yield
> higher boost for example).
>
> Couple of words about my motivation:
>
> We released SharePoint 2007 in our company. We are not very satisfied with
> its search capabilities so I started to looking for some alternatives. The
> first thing I was looking at is Google Search Appliance as they claim it
> can
> crawl, index and search SharePoint portals.
>
> I realized that their integration with outer world is done via connector
> manager which is itself an open source project written in Java and
> Sharepoint connector implementation is as well released as a open source
> in
> Java. This makes me think that I should be able to test Sharepoint
> connector
> replacing GSA black box with Lucene,Nutch,Solr or whatever and test how
> well
> this connector thing works. This would be a perfect test before we invest
> more in GSA. On the other hand if I would be able to run the Sharepoint
> connector without GSA (replaced by Lucene based product) then text
> extraction from MS family formats can be the main impenetrable barrier.
>
> Lukas
>
> On Tue, May 13, 2008 at 4:13 PM, Andrzej Bialecki <ab@getopt.org> wrote:
>
> > Grant Ingersoll wrote:
> >
> > > I've used POI, as well as commercial providers.  As always, it depends
> > > :-)  I wasn't particularly impressed with the commercial providers
> given the
> > > amount of money they wanted for it.   PDF was particularly tricky, but
> you
> > > weren't asking about that.   At least w/ POI, you have the opportunity
> to
> > > fix things that don't work based on your priorities.  I don't know
> what the
> > > failure rate is for the commercial providers, but my experience is
> they will
> > > all fail at least once, so you better plan on it.  I'd look to use a
> > > framework like Tika or Aperture, where you can easily upgrade or plug
> in new
> > > or different libraries (including commercial providers) as needed w/o
> > > rewriting your code.  Additionally, with something like Tika or
> Aperture,
> > > you could easily mix and match your solutions, such that you use one
> for
> > > Word and a different one for PPT or PDF.
> > >
> > > One issue with any of them is how you plan to use them.  If you need
> > > more than bag of words, they all get less reliable, especially when it
> comes
> > > to PDFs and Office docs.  Dealing with things like tables, columns,
> > > captions, labels, etc. has always been problematic in my experience
> when one
> > > wants to do higher level processing (beyond keyword search).
> > >
> >
> > Yet another option ... In the past I used a licensed copy of MS Office
> to
> > extract things that I wanted, using a bit of OLE automation and
> VBscript.
> > Worked reasonably well, in the sense that I had no issues whatsoever
> with
> > extracting the content _and_ formatting from any documents that could be
> > normally opened with MS Office - however, performance was an issue, ie.
> it
> > was slow, CPU/memory hog, and occasionally it would get stuck in a weird
> > state when only complete reboot would help.
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>  --
> http://blog.lukas-vlcek.com/
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message