From: Kudrettin Güleryüz
Date: Tue, 09 May 2017 17:13:45 +0000
Subject: Re: Lucene update performance
To: "java-user@lucene.apache.org"

Fair enough, however, I see this:

$ cat log
Tue May 9 07:19:45 EDT 2017: Indexing starts
Tue May 9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
Tue May 9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
$ date
Tue May 9 13:12:58 EDT 2017

I am using a two-phase commit model. The deletion logic above uses
writer.deleteDocuments(query), and addition uses writer.addDocument(doc).
Judging simply from this log, deletion doesn't seem to be taking long. What
am I missing?

On Tue, May 9, 2017 at 10:23 AM Adrien Grand wrote:

> addDocument can be a significant gain compared to updateDocument as doing a
> PK lookup on a unique field has a cost that is not negligible compared to
> indexing a document, especially if the indexing chain is simple (no large
> text fields with complex analyzers). Reindexing in place will also cause
> more merging. Overall I find the 3x factor a bit high, but not too
> surprising if documents and the analysis chain are simple, and/or if
> storage is slow.
>
> On Tue, May 9, 2017 at 16:06, Rob Audenaerde wrote:
>
> > As far as I know, the updateDocument method on the IndexWriter deletes
> > and adds. See also the javadoc:
> >
> > [..] Updates a document by first deleting the document(s)
> > containing term and then adding the new
> > document. The delete and then add are atomic as seen
> > by a reader on the same index (flush may happen only after
> > the add). [..]
> >
> > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz wrote:
> >
> > > I do update the entire document each time. Furthermore, this sometimes
> > > means deleting compressed archives, which are stored as multiple
> > > documents for each compressed archive file, and re-adding them.
> > >
> > > Is there an update method, and does it perform better than remove then
> > > add? I was simply removing modified files from the index (which doesn't
> > > seem to take long), and re-adding them.
> > >
> > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <rob.audenaerde@gmail.com>
> > > wrote:
> > >
> > > > Do you update each entire document? (vs updating numeric docvalues?)
> > > >
> > > > That is implemented as 'delete and add' so I guess that will be
> > > > slower than clean sheet indexing. Not sure if it is 3x slower, that
> > > > seems a bit much?
> > > >
> > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz
> > > > <kudrettin@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > For a 5.2.1 index that contains around 1.2 million documents,
> > > > > updating the index with 1.3 million files seems to take 3X longer
> > > > > than doing a scratch indexing. (Files are crawled over NFS, indexes
> > > > > are stored on a mechanical disk locally (Btrfs))
> > > > >
> > > > > Is this expected for Lucene's update index logic, or should I
> > > > > further debug my part of the code for update performance?
> > > > >
> > > > > Thank you,
> > > > > Kudret
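
For reference, the phase durations implied by the log timestamps in the reply above can be worked out with a small java.time sketch. The class and method names here (LogTimings, elapsed, format) are made up for illustration and are not part of Lucene or of the indexing code discussed in this thread. The sketch assumes all four timestamps fall on the same day and in the same time zone, as they do in the log.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, for illustration only.
final class LogTimings {

    // Elapsed time between two same-day wall-clock timestamps (HH:mm:ss).
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    // Formats a non-negative duration as h:mm:ss.
    static String format(Duration d) {
        long s = d.getSeconds();
        return String.format("%d:%02d:%02d", s / 3600, (s % 3600) / 60, s % 60);
    }

    public static void main(String[] args) {
        // Timestamps copied from the log and the `date` output in the reply.
        System.out.println("start to deletion:  " + format(elapsed("07:19:45", "07:32:33")));
        System.out.println("deletion phase:     " + format(elapsed("07:32:33", "07:49:47")));
        System.out.println("addition so far:    " + format(elapsed("07:49:47", "13:12:58")));
    }
}
```

By this arithmetic the deletion phase took about 17 minutes, while the addition phase had already been running for over five hours when `date` was run, which is consistent with the observation that deletion is not where the time goes.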