Date: Thu, 18 Feb 2010 10:23:51 +0100
Subject: Re: [jr3] Search index in content
From: Ard Schrijvers
To: dev@jackrabbit.apache.org

On Thu, Feb 18, 2010 at 8:39 AM, Thomas Müller wrote:
>
> The fulltext index is (potentially) slow, specially fulltext
> extraction. Therefore, fulltext index should be done asynchronously if

Would this be in line with the spec?

> it takes too long. Also, in a clustered environment, at least text
> extraction should only be done in one cluster node. I would still use
> Apache Tika and Apache Lucene for this.

PDF extraction in particular can kill the performance of an entire cluster. In our structure a PDF can be part of a document, and the document is node-scope indexed again every time it is saved. We therefore store an extracted text version of the PDF as a binary (so that it ends up in the DataStore) and index this extracted version instead of the PDF itself. Now only one node in the cluster does the extraction, and only one user is blocked. The other nodes just index the extracted text version, which is quite fast.

I am not sure whether this kind of option should be part of JR.

Regards
Ard

>
> Regards,
> Thomas
>
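
To make the "extract once, index the stored text everywhere" idea above a bit more concrete, here is a minimal sketch in plain Java. It is not Jackrabbit code: the class ExtractedTextStore, its methods, and the in-memory map are hypothetical, and the map only stands in for a store shared by the cluster (in our setup the extracted text is saved as a binary next to the PDF). The extractor is passed in as a function, so something like Tika could be plugged in, but that is an assumption as well.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: keep the extracted text next to the binary, keyed by content hash. */
class ExtractedTextStore {
    // Stands in for a shared, content-addressed store; an in-memory map is an assumption.
    private final Map<String, String> extractedByHash = new ConcurrentHashMap<>();
    // The expensive extraction step (for example PDF to text), supplied by the caller.
    private final Function<byte[], String> extractor;
    // Counts how often the expensive step actually ran.
    private final AtomicInteger extractions = new AtomicInteger();

    ExtractedTextStore(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Returns the stored text for this binary, extracting only if none is stored yet. */
    String textFor(byte[] binary) {
        String key = sha1Hex(binary);
        // computeIfAbsent runs the extraction at most once per key.
        return extractedByHash.computeIfAbsent(key, k -> {
            extractions.incrementAndGet();
            return extractor.apply(binary);
        });
    }

    int extractionCount() {
        return extractions.get();
    }

    /** Hex-encoded SHA-1 of the content, used as the lookup key. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

The indexer would then call textFor(...) on every save and feed the returned text to the index. Re-saving the document with an unchanged PDF does not trigger a second extraction, which is the effect described above.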