From: Patrick Welfringer
Date: Wed, 18 Dec 2013 10:51:06 +0100
Subject: Can Lucene be configured to avoid downloading file contents?
To: users@jackrabbit.apache.org

Hi,

*Can anyone familiar with Lucene please share their insight?*

The question is this: *is there any way to configure Lucene to index only certain whitelisted metadata*, or to exclude blacklisted metadata?

We believe that excluding the “file” metadata could dramatically reduce the time it takes Lucene to download and process the large number of PDF files in our particular setup. We don’t need file contents to be indexed, only other metadata like “creation date”, “keywords” etc.

The “Luke” tool tells us that none of the file contents are indexed. Yet during the hour-long indexing, we see all of the metadata being downloaded and written to disk, including document contents.

If you can help us find a way to prevent Lucene from indexing the entire Jackrabbit repository, you’ll cheer up many mailing list subscribers who have similar issues!

Cheers,
Patrick