From: Steven Rowe
Date: Mon, 09 Oct 2006 10:58:05 -0400
To: java-user@lucene.apache.org
Subject: Re: Performing a like query

Hi Rahil,

Rahil wrote:
> I was just wondering whether there is a
> difference between the regular expression you sent me i.e.
>   (i)  \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
>
> and
>   (ii)  \\b
>
> as they lead to the same output. For example, the string search "testing
> a-new string=3/4" results in the same output :
[...]

There is a difference for strings like "testing a- -new string=3/4" -- with (ii), you will get:

    ..., "a", "- -", "new", ...

but with (i), you will get:

    ..., "a", "-", "-", "new", ...

> What I'd like to do though is remove the split over space characters
[...]

From my reading of org.apache.lucene.index.memory.PatternAnalyzer (assuming you're using this class), I don't think this is necessary, since it just throws away zero-length tokens. Actually, given the below-discussed algorithm for PatternAnalyzer, I don't think it's even possible to do what you want.

Here's the PatternAnalyzer.next() method definition (from ):

    public Token next() {
      if (matcher == null) return null;

      while (true) { // loop takes care of leading and trailing boundary cases
        int start = pos;
        int end;
        boolean isMatch = matcher.find();
        if (isMatch) {
          end = matcher.start();
          pos = matcher.end();
        } else {
          end = str.length();
          matcher = null; // we're finished
        }

        if (start != end) { // non-empty match (header/trailer)
          String text = str.substring(start, end);
          if (toLowerCase) text = text.toLowerCase(locale);
          return new Token(text, start, end);
        }
        if (!isMatch) return null;
      }
    }

This method finds token breakpoints, remembering the end of the previous breakpoint (in instance field "pos"), then compares the beginning of the current breakpoint with the end of the previous breakpoint ("if (start != end)"), creating a Token *only* if the text between breakpoints has non-zero length.
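To make the difference between (i) and (ii) concrete, here is a minimal standalone sketch of the same loop using only java.util.regex. It is not the Lucene class: the names PatternSplitSketch, tokenize, REGEX_I and REGEX_II are made up for illustration, plain Strings stand in for Token objects, and lowercasing is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the PatternAnalyzer.next() loop; not Lucene code.
class PatternSplitSketch {
    // Regex (i) from this thread, escaped for a Java string literal.
    static final String REGEX_I = "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*";
    // Regex (ii) from this thread.
    static final String REGEX_II = "\\b";

    // Breaks str on every match of regex and keeps only the
    // non-empty pieces found between consecutive breakpoints.
    static List<String> tokenize(String regex, String str) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(str);
        int pos = 0; // end of the previous breakpoint
        while (true) {
            int start = pos;
            int end;
            boolean isMatch = matcher.find();
            if (isMatch) {
                end = matcher.start();
                pos = matcher.end();
            } else {
                end = str.length(); // trailing text after the last breakpoint
            }
            if (start != end) {
                tokens.add(str.substring(start, end)); // drop zero-length pieces only
            }
            if (!isMatch) {
                return tokens;
            }
        }
    }

    public static void main(String[] args) {
        String text = "testing a- -new string=3/4";
        System.out.println("(i):  " + tokenize(REGEX_I, text));
        System.out.println("(ii): " + tokenize(REGEX_II, text));
    }
}
```

Run on the example string, (ii) should keep "- -" as a single token and also emit the single-space tokens between words, while (i) should yield "-" and "-" separately and no whitespace tokens, because its leading and trailing \s* swallow the spaces into the breakpoint itself.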

If you're familiar with Perl, this class emulates a Perl regex idiom:

    (iii)  @tokens = grep { length > 0 } split /my-regex/, $text;

That is, return a list of tokens generated by breaking text on a regex, filtering out zero-length tokens. Actually, the way I usually write this in Perl is:

    (iv)  @tokens = grep { /\S/ } split /my-regex/, $text;

In the above version, tokens are kept only if they contain at least one non-space character (this also filters out zero-length tokens). PatternAnalyzer, OTOH, *will* emit whitespace-only tokens - it implements (iii), not (iv).

Hope it helps,
Steve