lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Taylor <paul_t...@fastmail.fm>
Subject Is there any difference in a document between one added field with a number of terms and a field added a number of times ?
Date Tue, 12 Jan 2010 20:29:00 GMT
Been doing some analysis with Luke (BTW doesnt work with 
StandardAnalyzer since Version field introduced) and discovered a 
problem with field lenghth boosting for me.

I have a document that represents a recording artist (i.e Madonna, The 
Beatles ectera) it contains an artist and an alias field, the alias 
field contains other names that the artist is maybe known as, and so 
there can be multiple aliases for an artist.

PseudoCode:
(
doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
for (String alias : aliases.get(artistId)) {
      doc.addField(ArtistIndexField.ALIAS, alias);
}
)

 Im finding that when I search by for the artist by the alias field if 
the value matches an alias in two different documents the document with 
the least number of aliases get the best score because the boost of the 
alias is split between the aliases on the other doc, if I 
ANALYSED_NO_NORMS then both documents return the same score.

The trouble is I don't want to disable norms because I want a match on a 
single field containing less terms to score better than one with more 
scores.

Full example:

http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1
return two results , the second result only has score of 8 because it 
more aliases than the first result, even the alias it matched on was an 
exact single term match.
http://musicbrainz.org/show/artist/aliases.html?artistid=174327

but if I remove norms then the following query (which is currently working)

http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1

would stop working, in that  searching for 'The beatles' would no longer 
score rate artist 'The Beatles' better than 'The Beatles revival Band'

So isn't there any way to recognise that repeated calls to addField() is 
not creating a single field with many terms,but many fields with few terms.

thanks Paul




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message