Date: Mon, 22 Feb 2010 11:29:41 -0500
Message-ID: <843920a31002220829v24cfcd96uabce0fb5c32b8106@mail.gmail.com>
Subject: Scanning docs at index time
From: Nigel
To: java-user@lucene.apache.org

I'd like to scan documents as they're being indexed, to find out
immediately whether any of them match certain queries. The goal is to find
out if there are any new hits for these queries as soon as possible,
without re-searching the index over and over (which would be inefficient
and would add latency). The documents still need to be indexed (not just
scanned) so they can be searched later with different queries not known at
index time.

The indexing throughput is in the tens of millions of documents per day,
and there are maybe a thousand queries or so to be matched. So this has to
work pretty fast. (-: Fortunately the number and size of fields are both
fairly small.

This scanning could of course be completely decoupled from the indexing
process. But my thinking was that since we already have the documents in
hand, and we'll be analyzing various fields in the course of indexing, we
could ideally reuse those token streams somehow for this on-the-fly
scanning process.

I took a look at the org.apache.lucene.index.memory.MemoryIndex class in
contrib. It looks like that would work, but I'm not sure if it's the most
appropriate solution (for one thing, it would have to re-analyze all the
fields).

Has anyone here done something similar, and/or does anyone know of other
classes that would be suitable?

Thanks,
Chris
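
To make the idea of reusing the analyzed tokens concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not use MemoryIndex: the QueryScanner class, its register/analyze/match methods, and the restriction to simple "all terms must be present" queries are assumptions made purely for illustration. The document is tokenized once, and the same token list could then be handed both to the indexer and to the matcher. The registered queries are themselves inverted by term, so each document is checked only against queries that share at least one term with it rather than against all thousand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for index-time query matching. Hypothetical; not a Lucene API. */
class QueryScanner {
    // Query id -> terms that must all appear in a document (AND semantics).
    private final Map<String, Set<String>> queries = new HashMap<>();
    // Term -> ids of the queries that require it (an inverted index over queries).
    private final Map<String, Set<String>> queriesByTerm = new HashMap<>();

    /** Registers a conjunctive term query under the given id. */
    void register(String id, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String t : requiredTerms) {
            terms.add(t.toLowerCase(Locale.ROOT));
        }
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("query needs at least one term");
        }
        queries.put(id, terms);
        for (String t : terms) {
            queriesByTerm.computeIfAbsent(t, k -> new HashSet<>()).add(id);
        }
    }

    /** Simplistic analyzer: lowercases and splits on non-alphanumeric characters. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns the ids of all registered queries matched by the already-analyzed tokens. */
    Set<String> match(List<String> tokens) {
        Set<String> docTerms = new HashSet<>(tokens);
        // Step 1: collect only queries that share at least one term with the document.
        Set<String> candidates = new HashSet<>();
        for (String term : docTerms) {
            Set<String> ids = queriesByTerm.get(term);
            if (ids != null) {
                candidates.addAll(ids);
            }
        }
        // Step 2: verify each candidate fully against the document's terms.
        Set<String> hits = new TreeSet<>();
        for (String id : candidates) {
            if (docTerms.containsAll(queries.get(id))) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

In use, one would call QueryScanner.analyze(text) once per field, pass the resulting tokens to match(...), and feed the same tokens to the indexer. Real queries (phrases, ranges, wildcards) would need a proper query engine for the verification step; this sketch only shows the candidate-selection shape.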