Date: Fri, 6 Nov 2015 08:59:22 +0100
Subject: Re: index size growing while deleting
From: Rob Audenaerde
To: "java-user@lucene.apache.org"
In-Reply-To: <563B5F0F.7060805@gmail.com>

Hi will, others,

Thanks for your reply.

As far as I understand it, deleting a document just sets a deleted bit, and the document is only actually removed when segments are merged. (I'm not really sure what that means exactly; I guess the document gets removed from the store and the terms no longer refer to it. I'm not sure whether terms get removed when they are no longer needed, etc.) If there are resources I could read to improve my understanding, I have not found them (yet); if you could point me to some, that would be great!

I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I never close my IndexWriter; as I use NRT searching, I just always keep it open.
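The delete-then-merge behaviour described above can be pictured with a small toy model. This is only a sketch of that understanding, not Lucene code: `ToyIndex`, `Segment`, and every method name are made up for illustration, and a real merge policy picks a few segments at a time rather than merging everything at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model only: NOT Lucene code. All names here are hypothetical.
class ToyIndex {
    // A "segment" is a write-once list of documents plus a bitset of deleted slots.
    static final class Segment {
        final List<String> docs;
        final BitSet deleted = new BitSet();

        Segment(List<String> docs) {
            this.docs = docs;
        }
    }

    final List<Segment> segments = new ArrayList<>();

    // Flushing a batch of documents creates a new segment.
    void addSegment(List<String> docs) {
        segments.add(new Segment(new ArrayList<>(docs)));
    }

    // Deleting only flips a bit; the document still occupies its slot.
    int delete(String doc) {
        int marked = 0;
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (s.docs.get(i).equals(doc) && !s.deleted.get(i)) {
                    s.deleted.set(i);
                    marked++;
                }
            }
        }
        return marked;
    }

    // Documents physically stored, including those marked deleted.
    int storedDocs() {
        int n = 0;
        for (Segment s : segments) {
            n += s.docs.size();
        }
        return n;
    }

    // Documents still visible to a search.
    int liveDocs() {
        int n = 0;
        for (Segment s : segments) {
            n += s.docs.size() - s.deleted.cardinality();
        }
        return n;
    }

    // Merging copies only live documents into a new segment and drops the old ones.
    void mergeAll() {
        List<String> live = new ArrayList<>();
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) {
                    live.add(s.docs.get(i));
                }
            }
        }
        segments.clear();
        segments.add(new Segment(live));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addSegment(Arrays.asList("a", "b", "c"));
        index.addSegment(Arrays.asList("d", "e"));
        index.delete("a");
        index.delete("d");
        index.delete("e");
        System.out.println("before merge: stored=" + index.storedDocs() + " live=" + index.liveDocs());
        index.mergeAll();
        System.out.println("after merge:  stored=" + index.storedDocs() + " live=" + index.liveDocs());
    }
}
```

In this model the stored count only drops once `mergeAll()` runs, which is why an index whose deleted documents are never merged away could keep growing even though documents are being deleted.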
My two guesses are that: a) old segments are not removed from disk, or b) deletes are not cleaned up as well as I thought they would be.

I have made a test case which indexes 5 million rows (five iterations, five indexing threads, indexing and then deleting all such documents after each iteration with deleteByQuery), with the rows randomly generated. I see the taxonomy ever growing (which is logical, because facet ordinals are never removed, as far as I understand); the index grows, but also shrinks when deleting. So I cannot reproduce my problem easily :(

I will start diving into the Lucene source code, but I was hoping I just did something wrong.

Any hints are appreciated!

-Rob

On Thu, Nov 5, 2015 at 2:52 PM, will wrote:

> Hi Rob:
>
> Do you understand how deletes work and how an index is compacted?
>
> There are some configuration/runtime activities you don't mention.... And
> you make the testing process sound like a mirror of production? (Including
> configuration?)
>
> -will
>
> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
>
>> Hi all,
>>
>> I'm currently investigating an issue we have with our index. It keeps
>> getting bigger, and I don't get why.
>>
>> Here is our use case:
>>
>> We index a database of about 4 million records, spread over a few hundred
>> tables. The data consists of a mix of text, dates, numbers etc. We also
>> add all these fields as facets.
>> Each night we delete about 90% of the data, which in testing reduces the
>> index size significantly.
>> We store the data as StoredFields as well, to prevent having to access the
>> database at all.
>> We use FloatAssociatedFacet fields for the facets.
>>
>> In production, however, it seems the index is only growing, up to 71 GB for
>> these records after a month of running.
>>
>> It seems that Lucene's index is just getting bigger there.
>>
>> We use Lucene 5.3 on CentOS, Java 8 64-bit.
>>
>> The taxonomy index does not grow significantly.
>>
>> How should I go about checking what is wrong?
>>
>> Thanks!