From: "Marcel Reutegger (JIRA)"
To: dev@jackrabbit.apache.org
Date: Wed, 5 Aug 2009 12:30:14 -0700 (PDT)
Subject: [jira] Commented: (JCR-2219) Improved background text extraction

[ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739690#action_12739690 ]

Marcel Reutegger commented on
JCR-2219:
---------------------------------------

Re-applied some of the 801135 changes to make test execution more reliable.

svn revision: 801375

> Improved background text extraction
> -----------------------------------
>
>                 Key: JCR-2219
>                 URL: https://issues.apache.org/jira/browse/JCR-2219
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>             Fix For: 2.0.0
>
>         Attachments: JCR-2219.patch
>
>
> As recently discussed on the mailing list (see http://markmail.org/message/syt7lc2guzapt7la), the current approach to text extraction in background threads doesn't work that well, especially with the Tika-based extractors that support streamed parsing of many document types.
> Also, currently *all* of the extracted text streams are buffered into Strings before being passed into the Lucene index. It would be good if we could somehow get back to passing just Readers to Lucene.
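
The issue description contrasts buffering extracted text into a String with handing a Reader to the consumer. The difference can be sketched roughly as follows. This is an illustration only, using plain java.io: the class ReaderVsString and its methods are hypothetical names, and the token counter merely stands in for a streaming consumer such as an indexer. None of it is Lucene or Jackrabbit API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration; not part of Jackrabbit or Lucene.
class ReaderVsString {

    // Buffered approach: the whole stream is copied into memory
    // before anything downstream can start working on it.
    static String bufferFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Streaming approach: the consumer pulls fixed-size chunks from the
    // Reader and never holds more than one chunk at a time. Counting
    // whitespace-separated tokens stands in for real indexing work.
    static int countTokensStreaming(Reader reader) throws IOException {
        char[] buf = new char[4096];
        int tokens = 0;
        boolean inToken = false; // carried across chunk boundaries
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (Character.isWhitespace(buf[i])) {
                    inToken = false;
                } else if (!inToken) {
                    inToken = true;
                    tokens++;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "extracted text from a document";
        String buffered = bufferFully(new StringReader(text));
        int streamed = countTokensStreaming(new StringReader(text));
        System.out.println(buffered.length() + " chars buffered, "
                + streamed + " tokens streamed");
    }
}
```

With the buffered variant, memory use grows with the size of the extracted text. The streaming variant keeps it bounded by the chunk size, which is the property that passing Readers through to the index would preserve.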