lucene-java-user mailing list archives

Subject Re: getting term offset information for fields with multiple value entiries
Date Fri, 17 Aug 2007 16:44:20 GMT

Hello community, dear Grant

I have built a JUnit test case that illustrates the problem - there, I try to cut
out the right substring with the offset values given by Lucene - and fail :(

A few remarks:

In this example, the 'é' in 'Bosé' does not match the '\w' pattern - unlike in
StandardAnalyzer, it is treated as a delimiter character.
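For illustration (a small sketch; note that `Pattern.UNICODE_CHARACTER_CLASS` requires Java 7 or later - by default, Java's '\w' covers only ASCII word characters):

```java
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    public static void main(String[] args) {
        // Default \w is [a-zA-Z0-9_]: 'é' does not match, so a
        // "\w"-based check treats it as a delimiter
        System.out.println("é".matches("\\w"));  // false

        // With UNICODE_CHARACTER_CLASS, \w covers Unicode letters too
        System.out.println(Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher("é").matches());        // true
    }
}
```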

Analysis: It seems that Lucene calculates the offset values by adding a virtual
delimiter between every two field values.
But Lucene forgets the last characters of a field value when they are
analyzer-specific delimiter characters. (I conclude this from DocumentWriter, line
245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;')
With this line of code, only the end offset of the last token is considered - any
trailing delimiter characters trimmed by the analyzer are forgotten.
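The effect can be reproduced outside Lucene (a minimal sketch, not Lucene's actual code - a hypothetical ASCII '\w' tokenizer stands in for the analyzer, so 'é' acts as a delimiter):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetSketch {
    // Hypothetical tokenizer: tokens are runs of ASCII word characters
    static final Pattern TOKEN = Pattern.compile("\\w+");

    // Returns [start, end] pairs, advancing the base per field value the way
    // the quoted DocumentWriter line does: offset += lastToken.endOffset() + 1
    static List<int[]> offsets(String... values) {
        List<int[]> result = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            Matcher m = TOKEN.matcher(value);
            int lastEnd = -1;
            while (m.find()) {
                result.add(new int[] { base + m.start(), base + m.end() });
                lastEnd = m.end();
            }
            // Trailing delimiter chars after the last token (the 'é') are NOT counted
            if (lastEnd >= 0) base += lastEnd + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> o = offsets("Miguel Bosé", "Anna Lizaran");
        // Token "Anna" is reported at 11, although in the virtual string
        // "Miguel Bosé Anna Lizaran" it really starts at 12
        System.out.println(o.get(2)[0]);  // prints 11
    }
}
```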

Thus, a solution would be:
1. Add a single delimiter character between the field values
2. Subtract (from the Lucene offset) the number of analyzer-specific delimiter
   characters at the end of all field values before the match

For this, one needs to know what counts as a delimiter for a specific analyzer.
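A sketch of these two steps (the helper names are hypothetical, and `isDelimiter` uses a plain ASCII '\w' definition that would have to be adapted to the actual analyzer - StandardAnalyzer's rules are more involved):

```java
import java.util.Arrays;
import java.util.List;

public class OffsetCorrection {
    // Hypothetical: what the analyzer treats as a delimiter (ASCII \w here)
    static boolean isDelimiter(char c) {
        return !String.valueOf(c).matches("\\w");
    }

    static int trailingDelimiters(String value) {
        int n = 0;
        for (int i = value.length() - 1; i >= 0 && isDelimiter(value.charAt(i)); i--) {
            n++;
        }
        return n;
    }

    // Maps a Lucene offset into the virtual string built by joining the field
    // values with a single delimiter (step 1): re-add the trimmed trailing
    // delimiters of every field value that ends before the offset (step 2)
    static int correctOffset(List<String> values, int luceneOffset) {
        int luceneBase = 0;
        int correction = 0;
        for (String value : values) {
            int trailing = trailingDelimiters(value);
            int lastTokenEnd = value.length() - trailing;
            if (lastTokenEnd == 0) continue; // no tokens: Lucene does not advance the base
            int nextBase = luceneBase + lastTokenEnd + 1; // DocumentWriter's rule
            if (luceneOffset < nextBase) break;
            correction += trailing;
            luceneBase = nextBase;
        }
        return luceneOffset + correction;
    }

    public static void main(String[] args) {
        // Lucene reports "Anna" at 11; in "Miguel Bosé Anna Lizaran" it
        // starts at 12 (one trimmed 'é' lies before it)
        System.out.println(correctOffset(Arrays.asList("Miguel Bosé", "Anna Lizaran"), 11));  // prints 12
    }
}
```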

The other possibility, of course, is to change the behaviour inside Lucene, because
the current offset values are more or less useless / hard to use (I currently have
no idea how to get the analyzer-specific delimiter characters).

For me, this looks like a bug - am I wrong?

Any ideas/hints/remarks? I would be very glad about them :)



Grant Ingersoll schrieb:
> Hi Christian,
> Is there any way you can post a complete, self-contained example,
> preferably as a JUnit test? I think it would be useful to know more
> about how you are indexing (i.e. what Analyzer, etc.).
> The offsets should be taken from whatever is set on the Token during
> Analysis. I, too, am trying to remember where in the code this is
> taking place.
> Also, what version of Lucene are you using?
> -Grant
> On Aug 16, 2007, at 5:50 AM, wrote:
> Hello,
> I have an index with an 'actor' field; for each actor there exists a
> single field value entry, e.g.
> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition
> <movie_actors>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
> movie_actors:Miguel Bosé
> movie_actors:Anna Lizaran (as Ana Lizaran)
> movie_actors:Raquel Sanchís
> movie_actors:Angelina Llongueras
> I try to get the term offset, e.g. for 'angelina' with
> termPositionVector = (TermPositionVector)
> reader.getTermFreqVector(docNumber, "movie_actors");
> int iTermIndex = termPositionVector.indexOf("angelina");
> TermVectorOffsetInfo[] termOffsets =
> termPositionVector.getOffsets(iTermIndex);
> I get one TermVectorOffsetInfo for the term - with offset numbers
> that are bigger than any single field entry.
> I guessed that Lucene gives the offset numbers as if all values were
> concatenated into a single (virtual) string:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
> This fits in nearly no situation, so my second guess was that Lucene
> adds some virtual delimiter between the single field entries for
> offset calculation. I added a delimiter, so the result would be:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna
> Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
> (note the ' ' between each actor name)
> ..this also doesn't fit in every situation - there are too many
> delimiters now. So, further, I guessed that Lucene doesn't add
> a delimiter in every situation, and added one only when the last
> character of an entry was alphanumeric, with:
> StringBuilder strbAttContent = new StringBuilder();
> for (String strAttValue : m_luceneDocument.getValues(strFieldName)) {
>     strbAttContent.append(strAttValue);
>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>         strbAttContent.append(' ');
> }
> where I get the result (virtual) entry:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
> this fits for ~96% of all my queries....but it's still not 100% the
> way Lucene calculates the offset values for fields with multiple
> value entries.
> ..maybe the problem is that there are special characters in my
> database (e.g. the 'é' in 'Bosé') that my '\w' doesn't match.
> I have looked at this specific situation, but considering this one
> character doesn't solve the problem.
> How does Lucene calculate these offsets? I also searched the
> source code, but couldn't find the right place.
> Thanks in advance!
> Christian Reuschling
> --
> ______________________________________________________________________________
> Christian Reuschling, Dipl.-Ing.(BA)
> Software Engineer
> Knowledge Management Department
> German Research Center for Artificial Intelligence DFKI GmbH
> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
> Phone: +49.631.20575-125
> ------------Legal Company Information Required by German
> Law------------------
> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
> (Vorsitzender)
>                   Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
> ______________________________________________________________________________

> --------------------------
> Grant Ingersoll



