From: "Chris Lu"
To: "java-user@lucene.apache.org"
Date: Mon, 29 Dec 2008 01:55:14 -0800
Subject: duplication checking while indexing
Message-ID: <6e3ae6310812290155n1fa218c0na314757f28dae3fe@mail.gmail.com>

I am wondering whether there is an easy way to avoid duplicates while indexing, using only the index being created, without building other data structures.

In some cases the incoming document list can contain duplicates, for example when creating spell-checking indexes for phrases. Each phrase is one document, so I want to check whether a phrase has already been indexed.

One option is to keep a hash map of all the indexed phrases, but that hash map would consume a lot of memory.

A possible alternative is to search the existing index. But the index is still being created, and not all of its contents have been flushed to disk yet.

Is it possible to query the not-yet-closed index?

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!
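
For reference, the in-memory approach mentioned above (remembering every phrase already indexed) might look roughly like the sketch below. It uses only the Java standard library and a set rather than a map, since only membership matters. The class name `PhraseDeduplicator`, its methods, and the case/whitespace normalization are assumptions made for illustration; none of this is Lucene API, and the actual indexing call is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Lucene): remembers every phrase seen so far.
// Memory grows with the number of distinct phrases, which is the cost the
// message is worried about.
final class PhraseDeduplicator {
    private final Set<String> seen = new HashSet<>();

    // Assumption: phrases that differ only in case or surrounding
    // whitespace count as duplicates.
    static String normalize(String phrase) {
        return phrase.trim().toLowerCase(Locale.ROOT);
    }

    // Returns true the first time a phrase is offered, false for repeats.
    boolean markIfNew(String phrase) {
        return seen.add(normalize(phrase));
    }

    // Keeps only the phrases that have not been seen before, in input order.
    // The caller would index each returned phrase as one document.
    List<String> filterNew(List<String> incoming) {
        List<String> fresh = new ArrayList<>();
        for (String phrase : incoming) {
            if (markIfNew(phrase)) {
                fresh.add(phrase);
            }
        }
        return fresh;
    }
}
```

The set holds one entry per distinct phrase for the whole indexing run, so this only illustrates the baseline being asked about; it does not avoid the memory cost.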