lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Issue with indexed tokens position
Date Fri, 17 Aug 2007 18:11:18 GMT
Sure. I'd recommend that you start by taking out your custom
tokenizer and looking at what Lucene does rather than what you've
tried to emulate. For instance, when tokens are separated by a single
space, the StandardTokenizer returns a start offset that is one more
than the end offset of the previous token. That is, the following
program (Lucene 2.1)


import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;


public class Analysis
{
    public static void main(String[] args)
    {
        try {
            Reader r = new StringReader("this is some text");
            Tokenizer tzer = new StandardTokenizer(r);

            Token t;

            while ((t = tzer.next()) != null) {
                System.out.println(
                        String.format(
                                "Text: %s, start: %d, end: %d",
                                t.termText(),
                                t.startOffset(),
                                t.endOffset()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

outputs:
Text: this, start: 0, end: 4
Text: is, start: 5, end: 7
Text: some, start: 8, end: 12
Text: text, start: 13, end: 17


Which, if I'm reading your code correctly, is different in that the end
of one token is the same offset as the beginning of the next token in
your example. So the off-by-one error you're claiming is perhaps the
result of an off-by-one error in your tokenizer.
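
The offset comparison can also be checked without Lucene. What follows
is a minimal sketch, not Lucene code: OffsetCheck, Tok, tokenize and
spanMatchesText are hypothetical names, and the whitespace splitting
only imitates the offsets printed above for this one input. It assumes
that the hyphen in "pink-i" sits at offset 4, so that "pink" would be
expected to span 0 to 4 rather than 0 to 5.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetCheck {
    /** Hypothetical stand-in for a token: text plus [start, end) character offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records where each token starts and ends. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Skip the separator(s) before the next token.
            while (i < n && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself; i ends one past its last character.
            while (i < n && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(input.substring(start, i), start, i));
            }
        }
        return out;
    }

    /** True when the offset span is exactly as long as the token text. */
    static boolean spanMatchesText(Tok t) {
        return t.end - t.start == t.text.length();
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("this is some text")) {
            System.out.println(String.format(
                    "Text: %s, start: %d, end: %d", t.text, t.start, t.end));
        }
        // Offsets as reported for the custom filter in the quoted message.
        System.out.println("pink (0,5) consistent: "
                + spanMatchesText(new Tok("pink", 0, 5)));   // false
        System.out.println("i (5,6) consistent: "
                + spanMatchesText(new Tok("i", 5, 6)));      // true
    }
}
```

A check like spanMatchesText only makes sense for tokens copied
straight from the input; a synthesized token such as "pinki" has no
single matching span in "pink-i".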

In general, a lot of people depend on offset positions and phrase queries,
so I'd be very surprised if a bug this basic were out there without anyone
having noticed it. But you never know.
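
For reference, an exact phrase query matches terms at consecutive
token positions, and each position is the running sum of the position
increments (character offsets play no part in that). The sketch below
is not Lucene code: PositionSketch, positions and phraseMatches are
hypothetical names. It assumes the first token lands at position 0 and
uses the three tokens from the quoted message with increments 1, 0, 1.

```java
class PositionSketch {
    /** Turns position increments into absolute positions (running sum from -1). */
    static int[] positions(int[] increments) {
        int[] pos = new int[increments.length];
        int current = -1; // assumption: an increment of 1 puts the first token at 0
        for (int k = 0; k < increments.length; k++) {
            current += increments[k];
            pos[k] = current;
        }
        return pos;
    }

    /** True if some occurrence of 'second' sits exactly one position after 'first'. */
    static boolean phraseMatches(String[] terms, int[] pos,
                                 String first, String second) {
        for (int a = 0; a < terms.length; a++) {
            if (!terms[a].equals(first)) {
                continue;
            }
            for (int b = 0; b < terms.length; b++) {
                if (terms[b].equals(second) && pos[b] == pos[a] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"pink", "pinki", "i"};

        // Increments as intended by the filter: "pinki" stacked on "pink".
        int[] intended = positions(new int[] {1, 0, 1});      // 0, 0, 1
        System.out.println(phraseMatches(terms, intended, "pink", "i"));   // true
        System.out.println(phraseMatches(terms, intended, "pinki", "i"));  // true

        // Same tokens if the increment of 0 were lost along the way.
        int[] lost = positions(new int[] {1, 1, 1});          // 0, 1, 2
        System.out.println(phraseMatches(terms, lost, "pink", "i"));       // false
        System.out.println(phraseMatches(terms, lost, "pinki", "i"));      // true
    }
}
```

The second case happens to reproduce the reported behaviour, which is
only a possibility and not a diagnosis; looking at the indexed
positions in Luke would confirm or rule it out.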

Of course, I may be way off. If so, can you post a self-contained program
using standard analyzers/tokenizers that illustrates the problem? Most
often, when I try to create such a thing I can't, and that points me back
to my own code.

Best
Erick

On 8/17/07, Ramana Jelda <ramana.jelda@ciao-group.com> wrote:
>
> Hi Erick,
> Thanks.
> Here I try my best to provide pseudocode.
>
> Indexed Value: "pink-i"
>
> I have used a custom Analyzer. My Analyzer looks a little bit like the
> following:
> public class KeyWordFilter extends TokenFilter{
>         public KeyWordFilter(TokenStream in) {
>         super(in);
>         keywordStack = new LinkedList<Token>();
>         }
>
>         org.apache.lucene.analysis.Token next(){
>                 if(keywordStack.size() > 0){
>                 return (Token) keywordStack.poll();
>                 }
>                 //token = "pink-i"
>                 makeTokens(token);
>         }
>
>         void makeTokens(Token token){
>                 // make the following tokens and add them to the stack:
>                 // [(pink,0,5,type=HYPENWORD_DIVIDED),
>                 //  (pinki,0,5,type=HYPENWORD_DIVIDED,posIncr=0),
>                 //  (i,5,6,type=HYPENWORD_DIVIDED)]
>         }
> }
>
>
> I am 100% sure that there is a problem with token positions. The
> PhraseQuery "pink i" is not working, whereas the PhraseQuery "pinki i"
> works. It seems positions are totally ignored by PhraseQuery.
>
> Any thoughts?
>
> Thx,
> Jelda
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > Sent: Friday, August 17, 2007 3:31 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Issue with indexed tokens position
> >
> > You'd get much better answers if you posted a concise example
> > (or possibly code snippets), especially including the
> > analyzers you used.
> >
> > Have you used Luke to examine your index and see if it's
> > indexed as you expect?
> >
> > Best
> > Erick
> >
> > On 8/17/07, Ramana Jelda <ramana.jelda@ciao-group.com> wrote:
> > >
> > > Strangely..
> > > My lucene query: fieldName:"pinki i"  finds document. (see "i"
> > > in  "pinki")
> > >
> > > Jelda
> > >
> > > > -----Original Message-----
> > > > From: Ramana Jelda [mailto:ramana.jelda@ciao-group.com]
> > > > Sent: Friday, August 17, 2007 12:33 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Issue with indexed tokens position
> > > >
> > > > Hi,
> > > > > Lucene doesn't find the following value. There seems to be an
> > > > > issue with PhraseQuery.
> > > >
> > > > indexed value: pink-I
> > > > > Indexed tokens: 1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
> > > > > (explanation: "pink" is the term, "0->5" its term position)
> > > >
> > > > And I have indexed in a field called "fieldName".
> > > > My lucene search with the query [fieldName:"pink i"] can't find
> > > > above indexed value.
> > > >
> > > > > Can anyone help me out here?
> > > >
> > > > Thx in advance,
> > > > Jelda
> > > >
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
