From: Michael McCandless
Date: Wed, 28 Sep 2016 12:11:58 +0200
Subject: Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager
 synchronization
To: Lucene Users

On Tue, Sep 27, 2016 at 7:05 AM, Shai Erera wrote:

> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
>
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).

Hmm, but if you commit TIW first and IW after, isn't it possible that,
after the TIW commit finishes, I index a few more documents into IW that
add new taxonomy nodes/labels/ordinals, so that when I call IW.commit
those last few documents reference taxonomy nodes that do not exist in
the TIW commit point?

Mike McCandless

http://blog.mikemccandless.com

>> On Tue, Sep 27, 2016 at 2:08 AM, William Moss
>> wrote:
>> > We're using Lucene 5.2.0 (I know it's old, we're in the process of
>> > upgrading) to handle searching over our listings here at Airbnb.
>>
>> 6.2.1 is a compelling upgrade because of more efficient indexing and
>> searching of numerics (among many other things!)...
>>
>> > I've been digging into our realtime indexing code and how we use
>> > Lucene, and I wanted to check a few assumptions around
>> > synchronization, since we see some periodic exceptions[1] that I
>> > can't quite explain.
>> >
>> > First, a tiny bit of background
>> > 1. We use facets and therefore are writing realtime updates using both
>> > an IndexWriter and DirectoryTaxonomyWriter.
>> > 2. We have multiple update threads, consuming messages (from Kafka) and
>> > updating the index.
>> > 3. Once we process a batch of messages, we call commit (first on
>> > DirectoryTaxonomyWriter then on IndexWriter).
>>
>> I see TaxonomyWriter's javadocs say that is the correct order, but I
>> would have expected the opposite, if you are concurrently indexing
>> documents.
>>
>> > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
>> > 5. We periodically call forceMerge on our IndexWriter (to improve
>> > performance).
>>
>> This is dubious: if your index continues to receive changes, you
>> should skip forceMerge and let Lucene's natural merging run at
>> defaults. forceMerge is an incredibly costly operation and it's
>> unclear you get that much speedup at search time.
>>
>> > So, now to a few questions:
>> > 1. My understanding is the right way to handle a DirectoryTaxonomyWriter and
>> > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
>> > IndexWriter. Is this correct? Since we're using multiple threads, we need
>> > to synchronize these calls within the process regardless, but curious to
>> > understand the design.
>>
>> You should not have to block index updates while committing, if you
>> don't need/want to.
>>
>> If you don't block updates, I would think you need to commit the
>> DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
>> tree, referenced by the main index, are guaranteed to be present in
>> the DirectoryTaxonomyWriter's commit.
>>
>> Maybe Shai can shed some more light here...
>>
>> > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
>> > need to be synchronized with the commit calls to either IndexWriter or
>> > DirectoryTaxonomyWriter?
>>
>> No.
>>
>> Commit can be a costly, slow operation (calling fsync on N files), and
>> it's designed internally in IndexWriter to not block operations like
>> merging and refreshing.
>>
>> > Do we need to call it after every call to
>> > commit?
>> > The comment suggests we call it "periodically," but I'm not clear
>> > on how often that should be or what conditions trigger the index to
>> > change in a way that this would be required.
>>
>> You don't have to call refresh on every commit. When you call it is
>> entirely up to you.
>>
>> Commit makes changes durable on disk, so an OS crash, power loss,
>> etc., won't lose those changes (a bad disk WILL lose them of course).
>>
>> Refresh makes changes visible for searching.
>>
>> The two ops are entirely separate.
>>
>> Some apps call commit periodically and never refresh, others call
>> refresh periodically and never commit :) It's your call.
>>
>> > 3. Lastly, what about forceMerge? Is there any worry there or can that
>> > just safely happen in the background? Is there any need to call commit
>> > afterward? Or does forceMerge effectively do that?
>>
>> Force merge does not call commit itself.
>>
>> If you do force merge, then it is a good idea to both commit and
>> refresh afterwards, as this will let Lucene free up resources (files,
>> file descriptors) held by the old un-merged segments.
>>
>> > Presumably, we would not
>> > see the new index until maybeRefresh was called the next time?
>>
>> Exactly.
>>
>> > Sorry, that was a lot of questions, would love help on any and all of
>> > them.
>>
>> No worries, keep them coming!
>>
>> > [1] When calling maybeRefresh, we've seen errors that look like:
>> > java.nio.file.NoSuchFileException: /6/_vj1.cfe
>>
>> Need the full stack trace / context here to understand what's happening...
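
The commit/refresh split described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it uses plain Java lists rather than any Lucene class, and every name in it (CommitVsRefreshSketch, crashAndRestart, and so on) is hypothetical. It shows only that durability and search visibility move independently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: three lists stand in for the writer's buffer, the disk, and the searcher's view. */
final class CommitVsRefreshSketch {
    private final List<String> buffered = new ArrayList<>(); // every change accepted so far
    private List<String> durable = new ArrayList<>();        // state as of the last commit
    private List<String> searchable = new ArrayList<>();     // state as of the last refresh

    void add(String doc) { buffered.add(doc); }

    /** Models commit: changes become durable; the search view is untouched. */
    void commit() { durable = new ArrayList<>(buffered); }

    /** Models refresh: changes become searchable; nothing is made durable. */
    void refresh() { searchable = new ArrayList<>(buffered); }

    /** Models a crash and restart: everything falls back to the last commit. */
    void crashAndRestart() {
        buffered.clear();
        buffered.addAll(durable);
        searchable = new ArrayList<>(durable);
    }

    int searchableCount() { return searchable.size(); }
    int durableCount() { return durable.size(); }

    public static void main(String[] args) {
        CommitVsRefreshSketch s = new CommitVsRefreshSketch();
        s.add("doc1");
        s.refresh(); // visible to searches, but not yet durable
        System.out.println("after refresh: searchable=" + s.searchableCount()
                + " durable=" + s.durableCount());
        s.crashAndRestart(); // the refreshed-but-uncommitted doc is gone
        System.out.println("after crash:   searchable=" + s.searchableCount()
                + " durable=" + s.durableCount());
        s.add("doc2");
        s.commit(); // durable, but searches do not see it until a refresh
        System.out.println("after commit:  searchable=" + s.searchableCount()
                + " durable=" + s.durableCount());
    }
}
```

In this model an app that only refreshes loses its changes on a crash, and an app that only commits never sees them in search, which is the trade-off the reply leaves to the application.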
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
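
The interleaving asked about at the top of this message (a document indexed between the TIW commit and the IW commit) can be replayed with a toy model. This is a minimal sketch, not Lucene code: plain collections stand in for the two indexes, each document is represented only by its facet label, and all names (CommitOrderSketch, run, danglingReferences) are hypothetical. It assumes, as stated in the thread, that a facet is added to the taxonomy before its document is added to the main index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: plain collections stand in for the taxonomy index and the main index. */
final class CommitOrderSketch {
    // Live (uncommitted) state.
    private final Set<String> liveTaxonomy = new HashSet<>();
    private final List<String> liveDocs = new ArrayList<>(); // one facet label per document

    // State captured by the most recent commit of each index.
    private Set<String> committedTaxonomy = new HashSet<>();
    private List<String> committedDocs = new ArrayList<>();

    /** Indexing order from the thread: taxonomy first, then the document. */
    void index(String facetLabel) {
        liveTaxonomy.add(facetLabel);
        liveDocs.add(facetLabel);
    }

    void commitTaxonomy() { committedTaxonomy = new HashSet<>(liveTaxonomy); }

    void commitIndex() { committedDocs = new ArrayList<>(liveDocs); }

    /** Committed documents whose facet label is missing from the committed taxonomy. */
    long danglingReferences() {
        return committedDocs.stream()
                .filter(label -> !committedTaxonomy.contains(label))
                .count();
    }

    /** Replays one interleaving: a new document arrives between the two commit calls. */
    static long run(boolean taxonomyFirst) {
        CommitOrderSketch s = new CommitOrderSketch();
        s.index("city/paris");
        if (taxonomyFirst) s.commitTaxonomy(); else s.commitIndex();
        s.index("city/berlin"); // concurrent update lands between the commits
        if (taxonomyFirst) s.commitIndex(); else s.commitTaxonomy();
        return s.danglingReferences();
    }

    public static void main(String[] args) {
        System.out.println("taxonomy first, then index: " + run(true) + " dangling");
        System.out.println("index first, then taxonomy: " + run(false) + " dangling");
    }
}
```

The model covers only this one interleaving with unblocked updates; if updates are blocked for the duration of both commits, no document can arrive in between and the question does not arise.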