lucene-java-user mailing list archives

From "Patrick Turcotte" <pat...@gmail.com>
Subject Re: Indexing punctuation and symbols
Date Mon, 01 Oct 2007 14:03:15 GMT
Hi,

I don't know the size of your dataset, but couldn't you index the text in
two fields with PerFieldAnalyzerWrapper, tokenizing one field with
StandardAnalyzer and the other with WhitespaceAnalyzer?
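
To make the two-field idea concrete, here is a minimal sketch in plain Java
of what each field would end up holding for the same text. It does not use
Lucene: the two methods are simplified stand-ins for WhitespaceAnalyzer and
StandardAnalyzer, and the names TwoFieldTokens, whitespaceTokens and
standardLikeTokens are made up for illustration. The real StandardAnalyzer
applies more elaborate rules than the regex used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified stand-ins for two analyzers; NOT Lucene's real tokenizers.
final class TwoFieldTokens {

    // Roughly what WhitespaceAnalyzer does: split on whitespace only,
    // so punctuation stays attached to the neighbouring word.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Crude approximation of StandardAnalyzer: lowercase the text and
    // treat every run of non-letter, non-digit characters as a separator.
    static List<String> standardLikeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "foo, bar";
        System.out.println("exact field: " + whitespaceTokens(doc));
        System.out.println("text field:  " + standardLikeTokens(doc));
    }
}
```

With both token lists indexed, "foo," is a term in one field and "foo" is a
term in the other, so either spelling can find the document.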

Then use a multiple-field query (there is a query parser for that, I just
don't remember the name right now).
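
The parser meant here is most likely Lucene's MultiFieldQueryParser, though
that is an assumption. The sketch below is not Lucene code. It is a tiny
in-memory index in plain Java that only illustrates the matching rule: a
document is a hit if the term occurs in any of the listed fields. The class
MultiFieldSearch and the field names "exact" and "text" are hypothetical,
and unlike a real query parser it does not analyze the query term.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy two-field inverted index; a stand-in for a real Lucene index.
final class MultiFieldSearch {

    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Index the same text twice: "exact" splits on whitespace only,
    // "text" lowercases and strips punctuation.
    void add(int docId, String text) {
        for (String term : text.trim().split("\\s+")) {
            post("exact", term, docId);
        }
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            post("text", term, docId);
        }
    }

    private void post(String field, String term, int docId) {
        if (term.isEmpty()) {
            return;
        }
        index.computeIfAbsent(field, f -> new HashMap<>())
             .computeIfAbsent(term, t -> new TreeSet<>())
             .add(docId);
    }

    // A document matches if the term occurs in ANY of the given fields.
    Set<Integer> search(String term, String... fields) {
        Set<Integer> hits = new TreeSet<>();
        for (String field : fields) {
            hits.addAll(index.getOrDefault(field, Collections.emptyMap())
                             .getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        MultiFieldSearch search = new MultiFieldSearch();
        search.add(1, "foo, bar");
        System.out.println(search.search("foo,", "text", "exact"));
        System.out.println(search.search("foo", "text", "exact"));
    }
}
```

Searching both fields lets "foo" and "foo," each find the document, but the
comma is still never a token of its own in either field.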

Patrick

On 10/1/07, John Byrne <john.byrne@propylon.com> wrote:
> WhitespaceAnalyzer does preserve those symbols, but not as separate
> tokens. It simply leaves them attached to the original term.
>
> As an example of what I'm talking about, consider a document that
> contains (without the quotes) "foo, ".
>
> Now, using WhitespaceAnalyzer, I could only get that document by
> searching for "foo,". Using StandardAnalyzer or any analyzer that
> removes punctuation, I could only find it by searching for "foo".
>
> I want an analyzer that will allow me to find it if I build a phrase
> query with the term "foo" followed immediately by ",". After all, the
> comma may be relevant to the search, but is definitely not part of the
> word.
>
> Extending StandardAnalyzer is what I had in mind, but I don't know where
> to start. I also wonder why no one seems to have done it before; it
> makes me suspect that there's some reason I haven't seen yet that makes
> it impossible or impractical.
>
>
>
> Karl Wettin wrote:
> >
> > On 1 Oct 2007, at 15:33, John Byrne wrote:
> >
> >> Has anyone written an analyzer that preserves punctuation and
> >> symbols ("£", "$", "%" etc.) as tokens?
> >
> > WhitespaceAnalyzer?
> >
> > You could also extend the lexical rules of StandardAnalyzer.
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
