lucene-java-user mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: getting term offset information for fields with multiple value entries
Date Fri, 17 Aug 2007 21:18:12 GMT
What version of Lucene are you using?


On Aug 17, 2007, at 12:44 PM, duiduder@web.de wrote:

>
> Hello community, dear Grant
>
> I have built a JUnit test case that illustrates the problem - there, I
> try to cut out the right substring using the offset values given by
> Lucene - and fail :(
>
> A few remarks:
>
> In this example, the 'é' in 'Bosé' keeps the '\w' pattern from
> matching - unlike in StandardAnalyzer, it is treated as a delimiter
> sign.
>
> Analysis: It seems that Lucene calculates the offset values by adding
> a virtual delimiter between every two field values. But Lucene drops
> the last characters of a field value when they are analyzer-specific
> delimiter characters. (I conclude this from DocumentWriter, line 245:
> 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
> With this line of code, only the end offset of the last token is
> considered - any trailing, trimmed delimiter chars are forgotten.
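>
> Paraphrased, I read the surrounding loop roughly like this (a sketch of
> mine, not the actual DocumentWriter code - 'fieldValues', 'analyzer' and
> storeOffsets() are placeholders):
>
>     int offset = 0; // running base across all values of the field
>     for (String value : fieldValues)
>     {
>         TokenStream ts = analyzer.tokenStream(fieldName, new StringReader(value));
>         Token lastToken = null;
>         for (Token t = ts.next(); t != null; t = ts.next())
>         {
>             // stored offsets are shifted by the accumulated base
>             storeOffsets(offset + t.startOffset(), offset + t.endOffset());
>             lastToken = t;
>         }
>         // line 245: only endOffset() + 1 is added, so trailing delimiter
>         // chars after the last token are silently dropped from the base
>         if (lastToken != null) offset += lastToken.endOffset() + 1;
>     }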
>
> Thus, a solution would be to:
> 1. Add a single delimiter char between the field values
> 2. Subtract (from the Lucene offset) the count of analyzer-specific
>    delimiters that appear at the end of all field values before the
>    match (a rough sketch follows below)
>
> For this, one needs to know what counts as a delimiter for a specific
> analyzer.
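>
> A rough sketch of that correction - here mapping a Lucene offset back
> into the delimiter-joined string, i.e. adding the trimmed counts back
> (hypothetical code of mine; countTrailingDelimiters() is exactly the
> open, analyzer-specific piece):
>
>     // map a Lucene start offset into a string built by joining the
>     // field values with one explicit delimiter char each
>     int mapOffset(int luceneStart, String[] values)
>     {
>         int luceneBase = 0, joinedBase = 0;
>         for (String v : values)
>         {
>             int trimmed = countTrailingDelimiters(v); // e.g. 1 for "...Lizaran)"
>             int nextBase = luceneBase + (v.length() - trimmed) + 1;
>             if (luceneStart < nextBase) break; // the match lies in this value
>             luceneBase = nextBase;
>             joinedBase += v.length() + 1;
>         }
>         return joinedBase + (luceneStart - luceneBase);
>     }
>
>     // naive guess at "what the analyzer trims" - the real rule is
>     // analyzer-specific, which is exactly the open question
>     int countTrailingDelimiters(String value)
>     {
>         int n = 0;
>         for (int i = value.length() - 1; i >= 0 && !Character.isLetterOrDigit(value.charAt(i)); i--)
>             n++;
>         return n;
>     }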
>
> The other possibility, of course, is to change the behaviour inside
> Lucene, because the current offset values are more or less useless /
> hard to use (I currently have no idea how to get the analyzer-specific
> delimiter chars).
>
> For me, this looks like a bug - am I wrong?
>
> Any ideas/hints/remarks? I would be very happy about them :)
>
> Greetings
>
> Christian
>
>
>
> Grant Ingersoll wrote:
>> Hi Christian,
>>
>> Is there any way you can post a complete, self-contained example,
>> preferably as a JUnit test?  I think it would be useful to know more
>> about how you are indexing (i.e. what Analyzer, etc.).
>> The offsets should be taken from whatever is set on the Token during
>> analysis.  I, too, am trying to remember where in the code this takes
>> place.
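>>
>> For example (a minimal sketch of mine against the 2.x Token API; the
>> numbers are made up):
>>
>>     Token t = new Token("angelina", 121, 129); // termText, startOffset, endOffset
>>     int start = t.startOffset(); // whatever the tokenizer recorded
>>     int end = t.endOffset();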
>>
>> Also, what version of Lucene are you using?
>>
>> -Grant
>>
>> On Aug 16, 2007, at 5:50 AM, duiduder@web.de wrote:
>>
>> Hello,
>>
>> I have an index with an 'actor' field; for each actor there exists a
>> single field value entry, e.g.
>>
>> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition<movie_actors>
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
>> movie_actors:Miguel Bosé
>> movie_actors:Anna Lizaran (as Ana Lizaran)
>> movie_actors:Raquel Sanchís
>> movie_actors:Angelina Llongueras
>>
>> I try to get the term offset, e.g. for 'angelina' with
>>
>> TermPositionVector termPositionVector =
>>         (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
>> int iTermIndex = termPositionVector.indexOf("angelina");
>> TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
>>
>>
>> I get one TermVectorOffsetInfo for the field - with offset numbers
>> that are bigger than any single field entry.
>> I guessed that Lucene gives offset numbers as if all values were
>> concatenated, i.e. for the single (virtual) string:
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
>> Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>>
>> This fits in almost no situation, so my second guess was that Lucene
>> adds a virtual delimiter between the single field entries for offset
>> calculation. I added a delimiter, so the result would be:
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé  
>> Anna
>> Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
>> (note the ' ' between each actor name)
>>
>> ..this also does not fit in every situation - there are too many
>> delimiters now. So, further, I guessed that Lucene doesn't add a
>> delimiter in every situation, and I added one only when the last
>> character of an entry was an alphanumerical one, with:
>>
>> StringBuilder strbAttContent = new StringBuilder();
>> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
>> {
>>     strbAttContent.append(strAttValue);
>>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>>         strbAttContent.append(' ');
>> }
>>
>> where I get the result (virtual) entry:
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
>> Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>>
>> this fits for ~96% of all my queries... but it is still not 100% the
>> way Lucene calculates the offset values for fields with multiple
>> value entries.
>>
>>
>> ..maybe the problem is that there are special characters inside my
>> database (e.g. the 'é' in 'Bosé') that my '\w' doesn't match.
>> I have looked into this specific situation, but handling this one
>> character doesn't solve the problem.
>>
>>
>> How does Lucene calculate these offsets? I also searched the source
>> code, but couldn't find the right place.
>>
>>
>> Thanks in advance!
>>
>> Christian Reuschling
>>
>>
>>
>>
>>
>> --
>> ______________________________________________________________________________
>>
>> Christian Reuschling, Dipl.-Ing.(BA)
>> Software Engineer
>>
>> Knowledge Management Department
>> German Research Center for Artificial Intelligence DFKI GmbH
>> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>>
>> Phone: +49.631.20575-125
>> mailto:reuschling@dfki.de  http://www.dfki.uni-kl.de/~reuschling/
>>
>> ------------Legal Company Information Required by German Law------------------
>> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
>> (Vorsitzender)
>>                   Dr. Walter Olthoff
>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>> Amtsgericht Kaiserslautern, HRB 2313
>> ______________________________________________________________________________
>>
>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>
> package org.dynaq.index;
>
>
>
> import static org.junit.Assert.assertTrue;
>
> import java.io.IOException;
> import java.util.LinkedList;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.TermPositionVector;
> import org.apache.lucene.index.TermVectorOffsetInfo;
> import org.apache.lucene.queryParser.ParseException;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.LockObtainFailedException;
> import org.apache.lucene.store.RAMDirectory;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
>
>
>
> public class TokenizerTest
> {
>
>     IndexReader m_indexReader;
>
>     Analyzer m_analyzer = new StandardAnalyzer();
>
>
>     @Before
>     public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
>     {
>         Directory ramDirectory = new RAMDirectory();
>
>         // we create a first little set of actor names
>         LinkedList<String> llActorNames = new LinkedList<String>();
>         llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
>         llActorNames.add("Miguel Bosé");
>         llActorNames.add("Anna Lizaran (as Ana Lizaran)");
>         llActorNames.add("Raquel Sanchís");
>         llActorNames.add("Angelina Llongueras");
>
>
>         // store them into a single document with multiple values for one Field
>         Document testDoc = new Document();
>         for (String strActorsName : llActorNames)
>         {
>             Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES,
>                     Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
>
>             testDoc.add(testEntry);
>         }
>
>         // now we write it into the index
>         IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);
>
>         indexWriter.addDocument(testDoc);
>         indexWriter.close();
>
>         m_indexReader = IndexReader.open(ramDirectory);
>     }
>
>
>
>     @Test
>     public void checkOffsetValues() throws ParseException, IOException
>     {
>
>         // first, we search for 'angelina'
>         String strSearchTerm = "Angelina";
>
>
>         Searcher searcher = new IndexSearcher(m_indexReader);
>
>         QueryParser parser = new QueryParser("movie_actors", m_analyzer);
>         Query query = parser.parse(strSearchTerm);
>
>         Hits hits = searcher.search(query);
>         Document resultDoc = hits.doc(0);
>
>         String[] straValues = resultDoc.getValues("movie_actors");
>
>         // now, we get the field values and build a single string value out of them
>         StringBuilder strbSimplyConcatenated = new StringBuilder();
>         StringBuilder strbWithDelimiters = new StringBuilder();
>         StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
>         for (String strActorName : straValues)
>         {
>             // first situation: we simply concatenate all field value entries
>             strbSimplyConcatenated.append(strActorName);
>             // second: we add a single delimiter char between the field values
>             strbWithDelimiters.append(strActorName).append('$');
>             // third try: we add a single delimiter, but only if the last char
>             // of the preceding actor name was an alphanumeric char
>             strbWithDelimitersAfterAlphaNumChar.append(strActorName);
>             String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
>             if (strLastChar.matches("\\w")) strbWithDelimitersAfterAlphaNumChar.append('$');
>         }
>
>
>         // this is the offset value from Lucene. This should be the place of
>         // 'angelina' in one of the concatenated value strings above
>         TermPositionVector termPositionVector =
>                 (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
>         int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
>         TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
>         int iStartOffset = termOffsets[0].getStartOffset();
>         int iEndOffset = termOffsets[0].getEndOffset();
>
>         // we create the substrings according to the offset values given by Lucene
>         String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
>         String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
>         String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);
>
>         System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
>         System.out.println("simply concatenated:");
>         System.out.println(strbSimplyConcatenated);
>         System.out.println("SubString for offset: '" + strSubString1 + "'");
>         System.out.println();
>         System.out.println("with delimiters:");
>         System.out.println(strbWithDelimiters);
>         System.out.println("SubString for offset: '" + strSubString2 + "'");
>         System.out.println();
>         System.out.println("with delimiter after alphanum character:");
>         System.out.println(strbWithDelimitersAfterAlphaNumChar);
>         System.out.println("SubString for offset: '" + strSubString3 + "'");
>
>
>         // is the offset value correct for one of the concatenated strings?
>
>         // this fails for all situations
>         assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm)
>                 || strSubString3.equals(strSearchTerm));
>
>         /*
>          * Comments: In this example, the 'é' in 'Bosé' keeps the '\w' pattern
>          * from matching - unlike in StandardAnalyzer, it is treated as a
>          * delimiter sign.
>          *
>          * Analysis: It seems that Lucene calculates the offset values by adding
>          * a virtual delimiter between every two field values. But Lucene drops
>          * the last characters of a field value when they are analyzer-specific
>          * delimiter characters. (I conclude this from DocumentWriter, line 245:
>          * 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
>          * With this line of code, only the end offset of the last token is
>          * considered - any trailing, trimmed delimiter chars are forgotten.
>          *
>          * Thus, a solution would be to:
>          * 1. Add a single delimiter char between the field values
>          * 2. Subtract (from the Lucene offset) the count of analyzer-specific
>          *    delimiters that appear at the end of all field values before the
>          *    match
>          *
>          * For this, one needs to know what counts as a delimiter for a
>          * specific analyzer.
>          */
>
>
>
>     }
>
>
>
>     @After
>     public void closeIndex() throws CorruptIndexException, IOException
>     {
>         m_indexReader.close();
>     }
>
>
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


