Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 8D6AD200C1B for ; Tue, 14 Feb 2017 12:57:20 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 8C200160B5F; Tue, 14 Feb 2017 11:57:20 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id AEE77160B52 for ; Tue, 14 Feb 2017 12:57:19 +0100 (CET) Received: (qmail 83108 invoked by uid 500); 14 Feb 2017 11:57:18 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 83095 invoked by uid 99); 14 Feb 2017 11:57:18 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 14 Feb 2017 11:57:18 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 9214CC0D33 for ; Tue, 14 Feb 2017 11:57:17 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.219 X-Spam-Level: X-Spam-Status: No, score=-0.219 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=mikemccandless-com.20150623.gappssmtp.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id UxyQi0wACYYa for ; Tue, 14 Feb 2017 11:57:16 +0000 (UTC) Received: from mail-it0-f43.google.com (mail-it0-f43.google.com [209.85.214.43]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 152A25F570 for ; Tue, 14 Feb 2017 11:57:15 +0000 (UTC) Received: by mail-it0-f43.google.com with SMTP id x75so32770326itb.0 for ; Tue, 14 Feb 2017 03:57:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mikemccandless-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=4NCX6va0CVsQFFwewMcP94esZ3F+MXeDoN6gPWx9CpE=; b=wth8GHYVyZ6U+4eE7+ddPYKyfmpNPz6Zp89k7o207B6ugUnPIvY8GAoWaEHB7ti+Mk qt0F1uqPyscihwJ0WHR5V3rEF7ovkWODnxpI1xM0qQOv/xP7JM6epd9bBSGKIDt+EhmS ubwGyfsVpw+TXdX9oWzAewW+pL38XcBN/X9h1/Z+PnvJ/ZQRIjdePi9q+LKnEb3tvKAV ZHcU3cw6OuSSzsJI/wYh3JZ5xrZxWSGMPJPvD9ZS72BknyiQQzNui7gQZXXv5Z20Ogjr fuxqKkio+z0781LFNCq6RIzMoZnRSdiwhU2R65SjfjCLJzxvP/zYtflaxGBiaa8VDfrM XOeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=4NCX6va0CVsQFFwewMcP94esZ3F+MXeDoN6gPWx9CpE=; b=LeY8HGVKwuFegkWqueJpBuNHTPGBtbsCVKjj2sr0CBsF+TJteuQFfnd7a94EmueyQx d6mc37g16q3IY7q5s7d8IZR++2/P3GFWmmd0jc1mbGGggQGaSmbVorSN9erBQTBnwnGV TA5RBws9U9DsHFczEA54InTNMbsBQ7eLH9oVnsTalMw7jNl/gkRPVeipMz/O/wwwPrMI mwZ6BUMyy6lK56FBCJ0/C67kFSA22MjpPmHrvF0EooDUR5/P8XbplXUXkn9qiYkowPZv /RxLdImfpi0/1qsPm0gx4K2doCHoI43TOXgL2rn4pNGiQ2H20A4QTF5IbpAh/EgSYQ0g N4+Q== X-Gm-Message-State: AMke39nCFqSpYP2QI3veEywQPmGDKGIdE5JskUjYA0apZtOPNAnQkEQtkU1r8dtSW8mvJf2UIZlyyc+6ZL1Hmw== X-Received: by 10.107.129.39 with SMTP id c39mr26410482iod.57.1487073433036; Tue, 14 Feb 2017 03:57:13 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.132.38 with HTTP; Tue, 14 Feb 2017 03:56:52 -0800 (PST) In-Reply-To: References: From: Michael McCandless Date: Tue, 14 Feb 2017 06:56:52 -0500 Message-ID: Subject: Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton) To: Lucene Users , o.mannion@gmail.com Content-Type: text/plain; charset=UTF-8 archived-at: Tue, 14 Feb 2017 11:57:20 -0000 Wow, 2G heap, that's horrible! How much heap does the automaton itself take? You can use the automaton's step method to transition from a state given the next input character to another state (or -1 if that state doesn't accept that character); it will be slower than the 2 GB run automaton, but perhaps fast enough? Mike McCandless http://blog.mikemccandless.com On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion wrote: > Thanks Mike for getting back to me, sounds like I'm on the right track. > > I'm building the automaton from around 1.7million strings, and it ends up > with about 3.8million states and it turns out building a > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite > suprised!), with negligible performance difference at run time. At your > suggestion I tried the ByteRunAutomaton and it was similar to the > CharacterRunAutomaton > in terms of heap and run time. So for now I'm going to just stick to an > Automaton. > > On 14 February 2017 at 00:41, Michael McCandless > wrote: > >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion >> wrote: >> >> > I'd like to construct an Automaton to prefix match against a large set of >> > strings. I gather a RunAutomation is immutable, thread safe and faster >> than >> > Automaton. >> >> That's correct. >> >> > Are there any other differences between the three Automaton >> > classes, for example, in memory usage? >> >> CompiledAutomaton is just a thin wrapper to hold onto both the >> original Automaton and the RunAutomaton, plus some other corner-casey >> things that are likely not interesting for your usage. >> >> > Would the general approach for such a problem be to use >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list >> of >> > my string set, set all the states to accept (to enable prefix matching), >> > then pass the Automaton into the constructor of a CharacterRunAutomaton, >> > and use the run method on the CharacterRunAutomaton to match any queries? >> >> That sounds right. >> >> You could also try doing everything in UTF8 space instead, and use >> ByteRunAutomaton: it will likely be faster since it will do faster >> lookups on each transition. It should still be safe to set all states >> as accept, even though some of those states will be inside a single >> Unicode character, as long as the strings you test against are whole >> UTF-8 sequences? >> >> > As it seems like I'm building up the Automaton at least three times, and >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is >> this >> > the most memory efficient way of building such an Automaton? >> >> Yeah, it is. The RunAutomaton will likely require substantial heap in >> your case, probably more than the original automaton. >> >> I suppose you don't actually need to keep the Automaton around once >> the RunAutomaton is built, but Lucene doesn't make this possible >> today, since the RunAutomaton holds onto the Automaton... >> >> > Thanks in advance, >> >> You're welcome! >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org >> For additional commands, e-mail: java-user-help@lucene.apache.org >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org