Message-ID: <452A637D.3090400@syr.edu>
Date: Mon, 09 Oct 2006 10:58:05 -0400
From: Steven Rowe <sarowe@syr.edu>
User-Agent: Mail/News 1.5.0.4-GroupWise-IMAP-fix (Windows/20060619)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: Performing a like query
References: <451FAEF8.2000300@yahoo.co.uk>
	 <359a92830610010452k6a7f0fdao295a4217d37219b1@mail.gmail.com>
	 <452045A4.8000501@yahoo.co.uk>
 <359a92830610011652o5936b0c4y8466be8aa7f98570@mail.gmail.com>
 <452662B5.8000905@yahoo.co.uk> <45268499.8030105@syr.edu>
 <452A1C6B.80107@yahoo.co.uk>
In-Reply-To: <452A1C6B.80107@yahoo.co.uk>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi Rahil,

Rahil wrote:
> I was just wondering whether there is a
> difference between the regular expression you sent me i.e.
> (i)   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
> 
>    and
>   (ii)   \\b
> 
> as they lead to the same output. For example, the string search "testing
> a-new string=3/4" results in the same output : [...]

There is a difference for strings like "testing a- -new string=3/4" --
with (ii), you will get:

   ..., "a", "- -", "new", ...

but with (i), you will get:

   ..., "a", "-", "-", "new", ...

> What I'd like to do though is remove the split over space characters [...]

From my reading of org.apache.lucene.index.memory.PatternAnalyzer
(assuming you're using this class), I don't think this is necessary,
since it just throws away zero-length tokens.  Actually, given
PatternAnalyzer's algorithm (discussed below), I don't think it's even
possible to do what you want.

Here's the PatternAnalyzer.next() method definition (from
<http://svn.apache.org/viewvc/lucene/java/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/PatternAnalyzer.java?revision=450725&view=markup>):

public Token next() {
  if (matcher == null) return null;

  while (true) { // loop takes care of leading and trailing boundary cases
    int start = pos;
    int end;
    boolean isMatch = matcher.find();
    if (isMatch) {
      end = matcher.start();
      pos = matcher.end();
    } else {
      end = str.length();
      matcher = null; // we're finished
    }

    if (start != end) { // non-empty match (header/trailer)
      String text = str.substring(start, end);
      if (toLowerCase) text = text.toLowerCase(locale);
      return new Token(text, start, end);
    }
    if (!isMatch) return null;
  }
}

This method finds token breakpoints, remembering where the previous
breakpoint ended (in instance field "pos").  It then compares the start
of the current breakpoint with the end of the previous one ("if (start
!= end)"), and creates a Token *only* if the text between the two
breakpoints is longer than zero characters.
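
If you want to try this outside Lucene, here is a minimal standalone
sketch of the same loop.  It is not Lucene code: the class name
BreakpointDemo and the method tokenize() are made up for illustration,
it returns plain Strings instead of Tokens, and it leaves out the
lower-casing step.  It runs regexes (i) and (ii) from above over the
"a- -new" example string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
final class BreakpointDemo {

  // Regex (i) from earlier in this message, as a Java string literal.
  static final String REGEX_I =
      "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*";

  // Regex (ii).
  static final String REGEX_II = "\\b";

  // Collects the text between successive regex matches (breakpoints),
  // skipping zero-length pieces, following the loop in next() above.
  static List<String> tokenize(String regex, String str) {
    List<String> tokens = new ArrayList<String>();
    Matcher matcher = Pattern.compile(regex).matcher(str);
    int pos = 0; // end of the previous breakpoint

    while (true) {
      int start = pos;
      int end;
      boolean isMatch = matcher.find();
      if (isMatch) {
        end = matcher.start();
        pos = matcher.end();
      } else {
        end = str.length(); // no more breakpoints: take the remainder
      }

      if (start != end) { // non-empty text between breakpoints
        tokens.add(str.substring(start, end));
      }
      if (!isMatch) return tokens;
    }
  }

  public static void main(String[] args) {
    String text = "testing a- -new string=3/4";
    System.out.println("(i):  " + tokenize(REGEX_I, text));
    System.out.println("(ii): " + tokenize(REGEX_II, text));
  }
}
```

With (i) the list contains "-" and "-" as two separate entries and no
whitespace; with (ii) it contains the single entry "- -", plus the
single-space strings between words.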

If you're familiar with Perl, this class emulates a Perl regex idiom:

(iii) @tokens = grep { length > 0 } split /my-regex/, $text;

That is, return a list of tokens generated by breaking text on a regex,
filtering out zero-length tokens.

Actually, the way I usually write this in Perl is:

(iv) @tokens = grep { /\S/ } split /my-regex/, $text;

In the above version, tokens are kept only if they contain at least one
non-space character (this also filters out zero-length tokens).
PatternAnalyzer, OTOH, *will* emit whitespace-only tokens - it
implements (iii), not (iv).
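
For comparison, here is a rough Java analogue of the two Perl idioms,
using java.util.regex.Pattern.split() followed by a filter.  This is
only a sketch of the idea, not something PatternAnalyzer provides: the
class name SplitFilterDemo and the methods keepNonEmpty() and
keepNonBlank() are hypothetical, and it needs Java 8 or later for the
stream calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of Lucene.
final class SplitFilterDemo {

  // Matches any string containing at least one non-space character.
  private static final Pattern NON_SPACE = Pattern.compile("\\S");

  // Analogue of (iii): split on the regex, drop zero-length pieces.
  static List<String> keepNonEmpty(String regex, String text) {
    return Arrays.stream(Pattern.compile(regex).split(text))
        .filter(t -> !t.isEmpty())
        .collect(Collectors.toList());
  }

  // Analogue of (iv): split on the regex, keep only pieces that
  // contain a non-space character (this also drops empty pieces).
  static List<String> keepNonBlank(String regex, String text) {
    return Arrays.stream(Pattern.compile(regex).split(text))
        .filter(t -> NON_SPACE.matcher(t).find())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String text = "testing a- -new";
    System.out.println("(iii): " + keepNonEmpty("\\b", text));
    System.out.println("(iv):  " + keepNonBlank("\\b", text));
  }
}
```

Splitting on \b, the (iii) version keeps the single-space string
between "testing" and "a", while the (iv) version drops it; both keep
"- -", since it contains non-space characters.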

Hope it helps,
Steve
