lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Byrne <john.by...@propylon.com>
Subject Re: Indexing puncuation and symbols
Date Mon, 01 Oct 2007 14:37:13 GMT
Well, the size wouldn't be a problem, we could afford the extra field. 
But it would seem to complicate the search quite a lot. I'd have to run 
the search terms through both analyzers. It would be much simpler if the 
characters were indexed as separate tokens.

Patrick Turcotte wrote:
> Hi,
>
> Don't know the size of your dataset. But, couldn't you index in 2
> fields, with PerFieldAnalyzer, tokenizing with Standard for 1 field,
> and WhiteSpace for the other.
>
> Then use multiple field query (there is a query parser for that, just
> don't remember the name right now).
>
> Patrick
>
> On 10/1/07, John Byrne <john.byrne@propylon.com> wrote:
>   
>> Whitespace analyzer does preserve those symbols, but not as tokens. It
>> simply leaves them attached to the original term.
>>
>> As an example of what I'm talking about, consider a document that
>> contains (without the quotes) "foo, ".
>>
>> Now, using WhitespaceAnalyzer, I could only get that document by
>> searching for "foo,". Using StandardAnalyzer or any analyzer that
>> removes punctuation, I could only find it by searching for "foo".
>>
>> I want an analyzer that will allow me to find it if I build a phrase
>> query with the term "foo" followed immediately by ",". After all, the
>> comma may be relevant to the search, but is definitely not part of the
>> word.
>>
>> Extending StandardAnalyer is what I had in mind, but I don't know where
>> to start. I also wonder why no-one seems to have done it before- it
>> makes me suspect that there's some reason I haven't seen yet that makes
>> it impossible ot impractical.
>>
>>
>>
>> Karl Wettin wrote:
>>     
>>> 1 okt 2007 kl. 15.33 skrev John Byrne:
>>>
>>>       
>>>> Has anyone written an analyzer that preserves puncuation and
>>>> synmbols ("£", "$", "%" etc.) as tokens?
>>>>         
>>> WhitespaceAnalyzer?
>>>
>>> You could also extend the lexical rules of StandardAnalyzer.
>>>
>>>
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message