Message-ID: <4700FC1C.6010707@propylon.com>
Date: Mon, 01 Oct 2007 14:54:36 +0100
From: John Byrne <john.byrne@propylon.com>
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: java-user@lucene.apache.org
References: <4700F72B.1010609@propylon.com>
 <0C09F3B2-8C01-4C61-970E-B4673971B526@gmail.com>
In-Reply-To: <0C09F3B2-8C01-4C61-970E-B4673971B526@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: Indexing punctuation and symbols

WhitespaceAnalyzer does preserve those symbols, but not as separate
tokens. It simply leaves them attached to the adjacent term.

As an example of what I'm talking about, consider a document that 
contains (without the quotes) "foo, ".

Now, using WhitespaceAnalyzer, I could only get that document by 
searching for "foo,". Using StandardAnalyzer or any analyzer that 
removes punctuation, I could only find it by searching for "foo".
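
To make the difference concrete, here is a rough plain-Java sketch of the
two behaviours. It is not Lucene code: the class and method names are made
up for illustration, and the second method only imitates the
punctuation-dropping part of an analyzer like StandardAnalyzer (no
lowercasing, no stop words).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of this is Lucene API.
class SplitComparison {

    // Imitates whitespace-only splitting: punctuation stays glued to the word.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Imitates a punctuation-dropping analyzer: keeps only runs of
    // letters or digits and silently discards everything else.
    static List<String> alphanumericTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("foo, "));   // prints [foo,]
        System.out.println(alphanumericTokens("foo, ")); // prints [foo]
    }
}
```

In the first case the only indexed term is "foo,"; in the second it is
"foo", and the comma is gone from the index altogether.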

I want an analyzer that will allow me to find it if I build a phrase 
query with the term "foo" followed immediately by ",". After all, the 
comma may be relevant to the search, but is definitely not part of the 
word.
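
What I have in mind could be sketched roughly like this, again in plain
Java rather than against the Lucene API. The names are hypothetical, and
the phrase check is just a contiguous sub-list search standing in for a
phrase query over adjacent token positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of this is Lucene API.
class PunctuationTokens {

    // Emits each run of letters/digits as one token and each
    // non-whitespace symbol as its own single-character token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Stand-in for a phrase query: true if the phrase tokens occur
    // in the document at consecutive positions, in order.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("foo", ",");
        System.out.println(containsPhrase(tokenize("foo, bar"), phrase)); // prints true
        System.out.println(containsPhrase(tokenize("foo bar,"), phrase)); // prints false
    }
}
```

With tokens like these, a search for plain "foo" would still match, and
the comma only matters when the query asks for it.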

Extending StandardAnalyzer is what I had in mind, but I don't know where
to start. I also wonder why no one seems to have done it before; it
makes me suspect that there's some reason I haven't seen yet that makes
it impossible or impractical.


Karl Wettin wrote:
>
> 1 okt 2007 kl. 15.33 skrev John Byrne:
>
>> Has anyone written an analyzer that preserves punctuation and
>> symbols ("$", "%" etc.) as tokens?
>
> WhitespaceAnalyzer?
>
> You could also extend the lexical rules of StandardAnalyzer.
>
>

