lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject getting term offset information for fields with multiple value entiries
Date Thu, 16 Aug 2007 09:50:15 GMT
Hash: SHA1


I have an index with an 'actor' field, for each actor there exists an single field value entry,

stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras

I try to get the term offset, e.g. for 'angelina' with

termPositionVector = (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);

I get one TermVectorOffsetInfo for the field - with offset numbers that are bigger than one
Field entry.
I guessed that Lucene gives the offset number for the situation that all values were concatenated,
which is for the single (virtual) string:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel
SanchísAngelina Llongueras

This fits in nearly no situation, so my second guess was that lucene adds some virtual delimiters
between the single
field entries for offset calculation. I added a delimiter, so the result would be:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran)
Raquel Sanchís Angelina Llongueras
(note the ' ' between each actor name)

..this also fits not for each situation - there are too much delimiters there now, so, further,
I guessed that Lucene don't add
a delimiter in each situation. So I added only one when the last character of an entry was
no alphanumerical one, with:
StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
   if(strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
      strbAttContent.append(' ');

where I get the result (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel
Sanchís Angelina Llongueras

this fits in ~96% of all my queries....but still its not 100% the way lucene calculates the
offset value for fields with multiple
value entries.

..maybe the problem is that there are special characters inside my database (e.g. the 'é'
at 'Bosé'), where my '\w' don't matches.
I have looked to this specific situation, but considering this one character don't solves
the problem.

How do Lucene calculates these offsets? I also searched inside the source code, but can't
find the correct place.

Thanks in advance!

Christian Reuschling

- --
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125

- ------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313=
Version: GnuPG v1.4.2 (GNU/Linux)


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message