Date: Thu, 22 Mar 2018 14:44:00 +0000 (UTC)
From: "Christian Ziech (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Created] (LUCENE-8219) LevenshteinAutomata should estimate the number of states and transitions

Christian Ziech created LUCENE-8219:
---------------------------------------

             Summary: LevenshteinAutomata should estimate the number of states and transitions
                 Key: LUCENE-8219
                 URL: https://issues.apache.org/jira/browse/LUCENE-8219
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Christian Ziech

Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of Automaton, although it knows exactly how many states the automaton will have and can also make a reasonable estimate of how many transitions it will need.

I suggest changing the lines
{code:language=java|firstline=154|linenumbers=true}
    // the number of states is based on the length of the word and n
    int numStates = description.size();

    Automaton a = new Automaton();
    int lastState;
{code}
to
{code:language=java|firstline=154|linenumbers=true}
    // the number of states is based on the length of the word and n
    final int numStates = description.size();
    final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
    final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;

    final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
    int lastState;
{code}

For my test data this cut the total amount of memory needed for int arrays by a factor of 4.
The estimate of "1 + 2 * n" transitions per state (n being the edit distance) should perhaps be replaced by a value taken from the ParametricDescription in use.
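
To make the sizing arithmetic in the proposal concrete, here is a minimal standalone sketch of the two estimates. It does not depend on Lucene; the class name CapacityEstimate, the method names, and the sample values are hypothetical and only mirror the expressions in the suggested change.

```java
/** Standalone sketch of the proposed capacity estimates (hypothetical names, no Lucene dependency). */
final class CapacityEstimate {

    /** Estimated transition count: numStates * min(1 + 2 * n, alphabet size), as in the proposal. */
    static int estimateTransitions(int numStates, int n, int alphabetLength) {
        return numStates * Math.min(1 + 2 * n, alphabetLength);
    }

    /** Number of extra states for an optional prefix: one per Unicode code point, 0 if there is no prefix. */
    static int prefixStates(String prefix) {
        return prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;
    }

    public static void main(String[] args) {
        // Assumed sample values: 10 description states, edit distance 2, 26 distinct characters.
        int numStates = 10;
        int numTransitions = estimateTransitions(numStates, 2, 26);
        int extraStates = prefixStates("ab");

        System.out.println("state capacity:      " + (numStates + extraStates));
        System.out.println("transition capacity: " + numTransitions);
    }
}
```

For a word with few distinct characters the alphabet size becomes the smaller term, so the estimate shrinks accordingly. Counting the prefix by code points rather than chars keeps surrogate pairs from being counted twice.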