lucene-solr-dev mailing list archives

From "Chris Harris (JIRA)"
Subject [jira] Commented: (SOLR-284) Parsing Rich Document Types
Date Wed, 07 May 2008 19:16:58 GMT


Chris Harris commented on SOLR-284:

I'm not sure this patch entirely reinvents the wheel, as it does most of the heavy lifting
with preexisting components, namely PDFBox, POI, and Solr's own HTMLStripReader. It also has
the advantage of already existing, whereas tying Solr to Tika or Aperture would take additional
work.

Tika and Aperture do look really nice, though. The most obvious advantage these projects have
over this patch is that they can already extract text from more file formats than this patch can,
and their developers will probably continue to add more file formats over time. Are you
thinking of additional advantages on top of this, Grant? Do you have any cool ideas about
how Tika/Aperture's metadata extraction facilities might be integrated into Solr? Is there
a potentially interesting interface between Aperture's crawling facilities and Solr?

> Parsing Rich Document Types
> ---------------------------
>                 Key: SOLR-284
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>         Attachments: rich.patch, rich.patch, rich.patch
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports
> streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> There is a wiki page with information here:
