Return-Path: Delivered-To: apmail-lucene-java-commits-archive@www.apache.org Received: (qmail 85396 invoked from network); 17 Jul 2008 16:31:14 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 17 Jul 2008 16:31:14 -0000 Received: (qmail 58793 invoked by uid 500); 17 Jul 2008 16:31:14 -0000 Delivered-To: apmail-lucene-java-commits-archive@lucene.apache.org Received: (qmail 58766 invoked by uid 500); 17 Jul 2008 16:31:14 -0000 Mailing-List: contact java-commits-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-commits@lucene.apache.org Received: (qmail 58757 invoked by uid 99); 17 Jul 2008 16:31:14 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Jul 2008 09:31:14 -0700 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.130] (HELO eos.apache.org) (140.211.11.130) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Jul 2008 16:30:29 +0000 Received: from eos.apache.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id ECF401115B for ; Thu, 17 Jul 2008 16:30:23 +0000 (GMT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: Apache Wiki To: java-commits@lucene.apache.org Date: Thu, 17 Jul 2008 16:30:23 -0000 Message-ID: <20080717163023.14784.7107@eos.apache.org> Subject: [Lucene-java Wiki] Update of "FastRegexp" by RobertMuir X-Virus-Checked: Checked by ClamAV on apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification. The following page has been changed by RobertMuir: http://wiki.apache.org/lucene-java/FastRegexp New page: One use case we had was the need to perform very fast regular expression queries against the term dictionary. The easiest improvement was to integrate Brics Automaton package around the RegexCapabilities interface. This gave a significant improvement in performance (because of the underlying DFA) but was not enough. The biggest problem was that most of our regular expressions do not have any constant prefix. We are searching Arabic which means a common regular expression would look like (ا|إ) at the beginning of a word because hamza is so frequently omitted. This meant it was doing regexp comparison on the entire term dictionary (although still significantly faster than the JDK implementation) The final trick was to take advantage of the fact that Brics allows you to do a 'step' operation against a built regular expression and determine if it is in a reject state of the DFA. Because of this, if you organize terms into buckets by their first N characters (N=1 or 2 is typically enough) then you can simply use this step operation against each bucket "prefix" and only do the full regex comparison against the items in buckets that will not definitely end in a reject state. For our implementation we simply loaded up all the terms into this bucket organized structure at startup with a TermEnum, but it might be very likely that the way terms are stored in Lucene is compatible with this trick and a more elegant integration method would work. Obviously this does not help for regular expressions where the beginning of the string can match *anything*, but it is more flexible than requiring just a constant prefix for reasonable performance. On a 400M doc index with 2.1M unique terms we can do a regex expansion in 15ms on average with this method on my desktop computer.