jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Created: (JCR-2219) Improved background text extraction
Date Thu, 16 Jul 2009 14:54:15 GMT
Improved background text extraction

                 Key: JCR-2219
                 URL: https://issues.apache.org/jira/browse/JCR-2219
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: indexing, jackrabbit-core
            Reporter: Jukka Zitting
            Priority: Minor

As recently discussed on the mailing list (see http://markmail.org/message/syt7lc2guzapt7la),
the current approach to text extraction in background threads doesn't work well, especially
with the Tika-based extractors that support streamed parsing of many document types.

Also, currently *all* of the extracted text streams are buffered into Strings before being
passed into the Lucene index. It would be good if we could somehow get back to passing just
Readers to Lucene.
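
To illustrate the difference between the two approaches, here is a minimal sketch using only
java.io. The class ReaderVsString and both of its methods are hypothetical names invented for
this example; they are not Jackrabbit or Lucene code, and countTokens is only a stand-in for
whatever consumer would read the extracted text. bufferToString holds the whole text in memory
at once, while countTokens keeps only a fixed-size buffer live while it reads.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of Jackrabbit or Lucene.
final class ReaderVsString {

    // Buffered approach: drains the Reader into a String, so the whole
    // extracted text is held in memory before anything can consume it.
    static String bufferToString(Reader reader) {
        try (Reader in = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streaming approach: a stand-in consumer that counts whitespace-separated
    // tokens while reading, keeping only one fixed-size char buffer in memory.
    static int countTokens(Reader reader) {
        try (Reader in = reader) {
            char[] buf = new char[4096];
            int tokens = 0;
            boolean inToken = false;
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean ws = Character.isWhitespace(buf[i]);
                    if (!ws && !inToken) {
                        tokens++; // first character of a new token
                    }
                    inToken = !ws;
                }
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "streamed parsing of many document types";
        System.out.println(bufferToString(new StringReader(text)).length());
        System.out.println(countTokens(new StringReader(text)));
    }
}
```

With a large document the first method's memory use grows with the size of the extracted text,
whereas the second stays bounded by the buffer size, which is the property that passing Readers
would preserve.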

