lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: getting term offset information for fields with multiple value entiries
Date Fri, 17 Aug 2007 00:53:29 GMT
Hi Christian,

Is there anyway you can post a complete, self-contained example  
preferably as a JUnit test?  I think it would be useful to know more  
about how you are indexing (i.e. what Analyzer, etc.)
The offsets should be taken from whatever is set in on the Token  
during Analysis.  I, too, am trying to remember where in the code  
this is taking place

Also, what version of Lucene are you using?

-Grant

On Aug 16, 2007, at 5:50 AM, duiduder@web.de wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,
>
> I have an index with an 'actor' field, for each actor there exists  
> an single field value entry, e.g.
>
> stored/ 
> compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPo 
> sition <movie_actors>
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
> movie_actors:Miguel Bosé
> movie_actors:Anna Lizaran (as Ana Lizaran)
> movie_actors:Raquel Sanchís
> movie_actors:Angelina Llongueras
>
> I try to get the term offset, e.g. for 'angelina' with
>
> termPositionVector = (TermPositionVector) reader.getTermFreqVector 
> (docNumber, "movie_actors");
> int iTermIndex = termPositionVector.indexOf("angelina");
> TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets 
> (iTermIndex);
>
>
> I get one TermVectorOffsetInfo for the field - with offset numbers  
> that are bigger than one single
> Field entry.
> I guessed that Lucene gives the offset number for the situation  
> that all values were concatenated,
> which is for the single (virtual) string:
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel  
> BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>
> This fits in nearly no situation, so my second guess was that  
> lucene adds some virtual delimiters between the single
> field entries for offset calculation. I added a delimiter, so the  
> result would be:
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé  
> Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
> (note the ' ' between each actor name)
>
> ..this also fits not for each situation - there are too much  
> delimiters there now, so, further, I guessed that Lucene don't add
> a delimiter in each situation. So I added only one when the last  
> character of an entry was no alphanumerical one, with:
> StringBuilder strbAttContent = new StringBuilder();
> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
> {
>    strbAttContent.append(strAttValue);
>    if(strbAttContent.substring(strbAttContent.length() - 1).matches 
> ("\\w"))
>       strbAttContent.append(' ');
> }
>
> where I get the result (virtual) entry:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel  
> BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>
> this fits in ~96% of all my queries....but still its not 100% the  
> way lucene calculates the offset value for fields with multiple
> value entries.
>
>
> ..maybe the problem is that there are special characters inside my  
> database (e.g. the 'é' at 'Bosé'), where my '\w' don't matches.
> I have looked to this specific situation, but considering this one  
> character don't solves the problem.
>
>
> How do Lucene calculates these offsets? I also searched inside the  
> source code, but can't find the correct place.
>
>
> Thanks in advance!
>
> Christian Reuschling
>
>
>
>
>
> - --
> ______________________________________________________________________ 
> ________
> Christian Reuschling, Dipl.-Ing.(BA)
> Software Engineer
>
> Knowledge Management Department
> German Research Center for Artificial Intelligence DFKI GmbH
> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>
> Phone: +49.631.20575-125
> mailto:reuschling@dfki.de  http://www.dfki.uni-kl.de/~reuschling/
>
> - ------------Legal Company Information Required by German  
> Law------------------
> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster  
> (Vorsitzender)
>                   Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313=
> ______________________________________________________________________ 
> ________
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2 (GNU/Linux)
>
> iD8DBQFGxB3XQoTr50f1tpcRAti+AKCH0YgcHjA+bO9NTbuxaAlKb8dO5gCfSfSK
> oVOiAdWYROqXOMqHv176xBY=
> =b2jO
> -----END PGP SIGNATURE-----
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message