lucene-solr-dev mailing list archives

From "Chris Harris (JIRA)" <>
Subject [jira] Commented: (SOLR-284) Parsing Rich Document Types
Date Tue, 25 Mar 2008 21:33:24 GMT


Chris Harris commented on SOLR-284:

I'm thinking it would be handy if RichDocumentRequestHandler could support indexing plain text and
HTML files, in addition to the fancier formats (PDF, DOC, etc.). That way I could use RichDocumentRequestHandler
for all my indexing needs (except commits and optimizes), rather than using it for some
doc types while still having to use XmlUpdateRequestHandler for text and HTML docs. Would anyone
else find this useful?

I skimmed the source, and adding support for text files looks trivial. (It's just a pass-through.)
And if you had this, then I guess you'd have at least one version of HTML support for free;
in particular, you could upload your HTML file to RichDocumentRequestHandler, telling the
handler that the document is in plain text format, and then strip off the HTML tags later
by using the HTMLStripStandardTokenizer in your schema.xml.

Alternatively, RichDocumentRequestHandler could provide its own explicit HTML to text conversion.
There would probably be some advantages to this, but I'm not sure exactly what they would
be. One, I guess, would be that you could use tokenizers that didn't make use of HTMLStripReader.
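
To make the second option a bit more concrete, here is a rough sketch of what an explicit
HTML-to-text step before indexing might look like. It is only an illustration in Python
(standard library only), not code from the handler, which is Java; the names TextExtractor
and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed.

    Text chunks are joined with a space so that adjacent block elements do
    not run together; the trade-off is that inline markup in the middle of
    a word would split that word.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello &amp; <b>world</b></p>"))
```

With a step like this, the text handed to the index is already tag-free, so the field could
use any tokenizer, and a stored copy of the field would hold the converted text rather than
the raw markup. When tags are only stripped at analysis time, the stored value keeps
whatever was uploaded.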

> Parsing Rich Document Types
> ---------------------------
>                 Key: SOLR-284
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>         Attachments:, rich.patch,,,
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports
> streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> There is a wiki page with information here:

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.