lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Hits.score mystery
Date Thu, 01 Nov 2007 23:01:37 GMT
Well, you might have to pre-process your strings before you
give them to an analyzer. Or roll your own analyzer.

What you're asking for, in effect, is an analyzer "that does
exactly what I want it to, nothing more and nothing less". But
the problem is that there is nothing general about what you want.
That is, leaving in # and ++ is completely arbitrary so I don't
think there are any canned analyzers out there that'll do what you
want.

But it's pretty simple to write a regular expression that'll remove
(actually, replace with spaces), anything that you want to. So I'd
think about that approach and then feed your lower-case/whitespace
analyzer the results.

Best
Erick

On 11/1/07, Tom Conlon <tomc@2ls.com> wrote:
>
> The reason seems to be that I found I needed to implement an analyser that
> lowercases terms as well as *not* ignoring trailing characters such as #, +.
> (i.e. I needed to match C# and C++)
>
> public final class LowercaseWhitespaceAnalyzer extends Analyzer
> {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     return new LowercaseWhitespaceTokenizer(reader);
>   }
> }
>
> Problem now exists that "system," etc is not matched against "system".
>
> Can anyone point to an example of a combination of analyser/tokeniser (or
> other method) that gets around this please?
>
> Thanks,
> Tom
>
>
> -----Original Message-----
> From: Tom Conlon [mailto:tomc@2ls.com]
> Sent: 01 November 2007 09:18
> To: java-user@lucene.apache.org
> Subject: RE: Hits.score mystery
>
> Thanks Daniel,
>
> I'm using Searcher.explain() & luke to try to understand the reasons for
> the score.
>
> -----Original Message-----
> From: Daniel Naber [mailto:lucenelist2007@danielnaber.de]
> Sent: 01 November 2007 08:19
> To: java-user@lucene.apache.org
> Subject: Re: Hits.score mystery
>
> On Wednesday 31 October 2007 19:14, Tom Conlon wrote:
>
> > 119.txt17.865013 97% (13 occurences) 45.txt8.600986 47%
> > (18 occurences)
>
> 45.txt might be a document with more therms so that its score is lower
> although it contains more matches.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message