lucene-java-user mailing list archives

From "charlie w" <spambait...@gmail.com>
Subject documents with large numbers of fields
Date Fri, 18 May 2007 20:01:43 GMT
Hi all,

I am trying to create an index where I can apply specific boost values to
documents, based on something like multiple tags per document.  Each tag
would wind up having a different boost value in each document.

My original approach was to add many fields with the same name, and the
value of each field would be the tag.  I applied a different boost to each
tag.  So something like:
tag=foo        ^2
tag=bar       ^1.2
tag=foobar   ^1.8
and then searching for "tag:bar", for example.  Different documents might
have different boost values for "tag=bar", influencing the scoring.  A
document might have hundreds of these tags, and I might be searching for
hundreds of different values in these tag fields.

I have discovered this doesn't work, seemingly because of the way
DefaultSimilarity creates the fieldNorm.  It seems like internally Lucene is
effectively combining these into one tag field, with 3 terms and a boost
value that is the product of each of the 3 boost values.  If I did indeed
have a document with a great many of these tag fields, I'd wind up with a
freakishly huge boost on that document.
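
To make the arithmetic concrete, here is a rough sketch in plain Java (no
Lucene classes involved).  It assumes the norm for a field is the product of
the boosts of every same-named field instance, multiplied by a length factor
of 1/sqrt(number of terms), which is my understanding of what
DefaultSimilarity does.  The class and method names are made up for
illustration only.

```java
class FieldNormSketch {
    // Hypothetical helper: multiply the boosts of all same-named field
    // instances, as the combined "tag" field appears to do.
    static float combinedBoost(float[] boosts) {
        float product = 1.0f;
        for (float b : boosts) {
            product *= b;
        }
        return product;
    }

    // Assumed length factor: 1 / sqrt(number of terms in the field).
    static float lengthFactor(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed norm for the combined field: boost product times length factor.
    // Each tag instance is treated as contributing exactly one term.
    static float fieldNorm(float[] boosts) {
        return combinedBoost(boosts) * lengthFactor(boosts.length);
    }

    public static void main(String[] args) {
        // The three tags from the example: foo^2, bar^1.2, foobar^1.8
        float[] three = {2.0f, 1.2f, 1.8f};
        System.out.println("product of 3 boosts: " + combinedBoost(three));
        System.out.println("assumed field norm:  " + fieldNorm(three));

        // 200 tags with a modest boost of 1.5 each: the product explodes.
        float[] many = new float[200];
        java.util.Arrays.fill(many, 1.5f);
        System.out.println("product of 200 boosts: " + combinedBoost(many));
    }
}
```

With only three tags the product is already about 4.32, and it grows
exponentially with the number of tags, so every "tag:..." query against that
document would be scaled by the same inflated value.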

So now I have the idea to invert the field name and value thusly:
foo=tag     ^2
bar=tag     ^1.2
foobar=tag    ^1.8
and search "foo:tag".
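
As a rough illustration of why this inversion should sidestep the problem,
here is a plain-Java sketch (again no Lucene classes, and the names are
invented for the example).  It assumes a norm is kept separately for each
field name, and that each inverted field holds a single term, so the length
factor is 1 and each field's norm is just its own boost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InvertedTagSketch {
    // Hypothetical model: one norm per field name.  With one term per field,
    // the assumed length factor 1/sqrt(1) is 1, so the norm equals the boost.
    static Map<String, Float> perFieldNorms(Map<String, Float> tagBoosts) {
        Map<String, Float> norms = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : tagBoosts.entrySet()) {
            float lengthFactor = (float) (1.0 / Math.sqrt(1));
            norms.put(e.getKey(), e.getValue() * lengthFactor);
        }
        return norms;
    }

    public static void main(String[] args) {
        // Field name is the tag; the value is always the constant "tag".
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("foo", 2.0f);
        boosts.put("bar", 1.2f);
        boosts.put("foobar", 1.8f);

        // Each field keeps its own boost; nothing is multiplied together.
        for (Map.Entry<String, Float> e : perFieldNorms(boosts).entrySet()) {
            System.out.println(e.getKey() + ":tag -> norm " + e.getValue());
        }
    }
}
```

Under those assumptions a search on "bar:tag" would only ever see the boost
attached to the "bar" field, regardless of how many other tags the document
carries.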

Intuitively, I would expect Lucene to be optimized for searching the values
of fields, and not really the names of fields.  In a somewhat large index,
say 10 million documents, will Lucene search performance continue to be
acceptable if I load up documents with many fields like this?

Is there an upper limit on the number of fields comprising a document, and
if so what is it?

Or, is there some way to make my original approach work after all?

Regards and thanks,
Charlie
