Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 47A42200C1B for ; Tue, 14 Feb 2017 12:50:17 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 463A1160B5F; Tue, 14 Feb 2017 11:50:17 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 6ACCF160B52 for ; Tue, 14 Feb 2017 12:50:16 +0100 (CET) Received: (qmail 60102 invoked by uid 500); 14 Feb 2017 11:50:15 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 60090 invoked by uid 99); 14 Feb 2017 11:50:14 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 14 Feb 2017 11:50:14 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 5AEFFC0D33 for ; Tue, 14 Feb 2017 11:50:14 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.38 X-Spam-Level: ** X-Spam-Status: No, score=2.38 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id nFCeY5XXZpL0 for ; Tue, 14 Feb 2017 11:50:12 +0000 (UTC) Received: from mail-ot0-f173.google.com (mail-ot0-f173.google.com [74.125.82.173]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 11E795FB44 for ; Tue, 14 Feb 2017 11:50:07 +0000 (UTC) Received: by mail-ot0-f173.google.com with SMTP id 73so90076019otj.0 for ; Tue, 14 Feb 2017 03:50:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=0q3zLvo5CziLNKExDGPp3wEtOFYjmkSYW89giCsINCE=; b=pFJbF+GpyXv5oCJvRwY8U7yQg+gV7njbt8Izo30PiDXZta1XM53IinHBuinbcMKw/U l0LyLNmi/71zxZGhgqnAshs7gEt80CnJ/qUqtZVCsoI1cO3bnpdrYwHCX41/Q7OXN+d8 FHQEDCykWlaJL5GBnnImNGoFE9QyXKvb8hcpi5hbZSuivHObyTh4pCW0m0e5e3MO0YAf 2P6khUFxPL8jG72YTYzX+tJMPJzx6u7ioozfI3oDyLCmNvGdDsjT5lpwp1U3qlJsn49a hjZCsKTsxhepbqvtvbmGwZUTK43ZNh7TrdykxTtJQvU72Vg6h7qgvSsJgKPGFeUxdsXp Ikvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=0q3zLvo5CziLNKExDGPp3wEtOFYjmkSYW89giCsINCE=; b=jsLSCepTkkpHaz9X4NbDtdg+KYP3/Ogw4XfyWzeummWeVZCrLWTtRyDEa3OuSYHu49 npxLOC8P8sNJ36AkxVg3aIgDkwd6pKk4GrKFLzJd2eeoS13v5MSYuaHYfYrWjC/+Teyf J8Jf4JAFF9RZNQMMK737Ldhqxy03WQNx9cH834WDVYeGAqoYxj0DglnjD5JIeo9zVMO0 3N6isMBldabzmdcnLEKD/cLln2OF9pUsXkC8fXryL9kiQ0lHP0YRsr+SvlTSw5c6nlgQ DrSV7TZB9/AfskAbkhQ3/7e4chjlH8as6a3tlLoOVJ4pE9PGU5GiZf84XcfolghtHTsu 8W/A== X-Gm-Message-State: AMke39nFBwTmmClsNZa/b2vk34c3g6ZcgwedF+YctRDf4kDTdnir2LUBEKT7nNVLwnONzADNunY57KpY9q7+Lw== X-Received: by 10.157.17.5 with SMTP id g5mr18362364ote.118.1487073003200; Tue, 14 Feb 2017 03:50:03 -0800 (PST) MIME-Version: 1.0 Received: by 10.157.38.105 with HTTP; Tue, 14 Feb 2017 03:50:02 -0800 (PST) Received: by 10.157.38.105 with HTTP; Tue, 14 Feb 2017 03:50:02 -0800 (PST) In-Reply-To: References: From: Oliver Mannion Date: Tue, 14 Feb 2017 22:50:02 +1100 Message-ID: Subject: Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton) To: java-user@lucene.apache.org Content-Type: multipart/alternative; boundary=001a1145009e1c6cf305487c2a88 archived-at: Tue, 14 Feb 2017 11:50:17 -0000 --001a1145009e1c6cf305487c2a88 Content-Type: text/plain; charset=UTF-8 Thanks Mike for getting back to me, sounds like I'm on the right track. I'm building the automaton from around 1.7million strings, and it ends up with about 3.8million states and it turns out building a CharacterRunAutomaton from that takes up about 2gig of heap (I was quite suprised!), with negligible performance difference at run time. At your suggestion I tried the ByteRunAutomaton and it was similar to the CharacterRunAutomaton in terms of heap and run time. So for now I'm going to just stick to an Automaton. On 14 February 2017 at 00:41, Michael McCandless wrote: > On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion > wrote: > > > I'd like to construct an Automaton to prefix match against a large set of > > strings. I gather a RunAutomation is immutable, thread safe and faster > than > > Automaton. > > That's correct. > > > Are there any other differences between the three Automaton > > classes, for example, in memory usage? > > CompiledAutomaton is just a thin wrapper to hold onto both the > original Automaton and the RunAutomaton, plus some other corner-casey > things that are likely not interesting for your usage. > > > Would the general approach for such a problem be to use > > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list > of > > my string set, set all the states to accept (to enable prefix matching), > > then pass the Automaton into the constructor of a CharacterRunAutomaton, > > and use the run method on the CharacterRunAutomaton to match any queries? > > That sounds right. > > You could also try doing everything in UTF8 space instead, and use > ByteRunAutomaton: it will likely be faster since it will do faster > lookups on each transition. It should still be safe to set all states > as accept, even though some of those states will be inside a single > Unicode character, as long as the strings you test against are whole > UTF-8 sequences? > > > As it seems like I'm building up the Automaton at least three times, and > > keeping a reference to the Automaton in the CharacterRunAutomaton, is > this > > the most memory efficient way of building such an Automaton? > > Yeah, it is. The RunAutomaton will likely require substantial heap in > your case, probably more than the original automaton. > > I suppose you don't actually need to keep the Automaton around once > the RunAutomaton is built, but Lucene doesn't make this possible > today, since the RunAutomaton holds onto the Automaton... > > > Thanks in advance, > > You're welcome! > > Mike McCandless > > http://blog.mikemccandless.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org > > --001a1145009e1c6cf305487c2a88--