From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Date: Tue, 6 Apr 2010 04:32:29 +0000 (UTC)
Subject: [jira] Updated: (LUCENE-2265) improve automaton performance by running on byte[]

[ https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2265:
--------------------------------

    Attachment:
LUCENE-2265.patch

OK, I think I made some serious progress here, but I did find a bug in the UTF-32 -> UTF-8 DFA converter. It does not handle, at least, the case where the initial state is an accept state. I created a test case for this (TestUTF32SpecialCase) and put the Python code back in, as I figure you will probably fix it there first.

I deleted the surrogate-seeking tests. As with other nuances, if we switch to byte[] these won't behave the same, since those regexps are no longer defined.

What remains is to switch the slow fuzzy to use code point calculations (to be consistent with the fast one). By the way, it's really silly that we have to unicode-convert just to get the length in chars for that score calculation... ugh!

> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch
>
>
> Currently, when enumerating terms, the automaton must first convert each entire term from flex's native UTF-8 byte[] to char[], then step each char through the state machine.
> We can make this more efficient by allowing the state machine to run on byte[] directly, so it can return true/false faster.
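
On the remark above about unicode-converting just to get a length: if the term bytes are well-formed UTF-8, the code point count can be read off the bytes without decoding, because every code point has exactly one byte that is not a continuation byte (10xxxxxx). The sketch below only illustrates that idea; it is not code from the patch, and the class and method names (Utf8CodePoints, count) are made up for this example. Note that it counts code points, not UTF-16 chars, which matches the "codepoint calculations" mentioned for the fuzzy scoring.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class Utf8CodePoints {
    private Utf8CodePoints() {}

    /**
     * Counts code points in utf8[offset, offset + length).
     * Assumes the input is well-formed UTF-8; malformed input is not detected.
     */
    static int count(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    /** Convenience overload for a whole array. */
    static int count(byte[] utf8) {
        return count(utf8, 0, utf8.length);
    }

    /** Reference result via a full decode, useful only for cross-checking. */
    static int countByDecoding(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.codePointCount(0, s.length());
    }
}
```

The loop touches each byte once and allocates nothing, whereas the decoding variant builds a String first. Whether this would be a measurable win for the score calculation is untested here.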