lucene-java-user mailing list archives

From Shoba Ramachandran <shoba_duru...@yahoo.com>
Subject Re: RERE: Highlighting, startOffset, endOffset
Date Tue, 13 May 2003 15:16:12 GMT
Hi Arpad,

I tested your code with this example, and it seems to run
fine.


public static void main(String[] args)
  {

      String p_sText = "i have to " +
              "go to Sainte-Foy this afternoon but the urban transports are on strike " +
              "just today in Lyon, they are protesting against the project of law on " +
              "pensions and thus they are cassing me the nenette with their betises.";


      TokenStream tokenStream = null;
      try
      {
        Analyzer analyzer = new StandardAnalyzer();
        Query query = org.apache.lucene.queryParser.QueryParser.parse(
            "pension AND protest", "Contents", analyzer);
        System.out.println("Search String is :  " + query.toString("contents"));
        ArrayList p_aop = new ArrayList();


        HashSet terms = new HashSet();
        org.apache.lucene.analysis.Token token;
        int startOffset, endOffset;

        getTerms(query, terms, false);

        //StringReader sr = new StringReader(p_sText);
        //stream = analyzer.tokenStream(sr);

        tokenStream = analyzer.tokenStream(p_sText, new StringReader(p_sText));
        while((token = tokenStream.next()) != null) {
          startOffset = token.startOffset();
          endOffset = token.endOffset();
          if(terms.contains(token.termText())) {
            System.out.println("startOffset : " + startOffset);
            System.out.println("endOffset : " + endOffset);
          }
        }
      }
      catch(Exception e)
      {
        e.printStackTrace();
      }
  }


Output is:
Search String is :  +Contents:pension +Contents:protest
startOffset : 110
endOffset : 120
startOffset : 151
endOffset : 159
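
Once offsets like these are available, turning them into highlighted
text is a matter of copying the text and wrapping each [startOffset,
endOffset) span in markers. The following is only a minimal sketch in
plain Java, not part of Lucene or of the LuceneTools class: the class
name OffsetHighlighter, the method highlight and the <b>...</b> markers
are hypothetical, and the spans are assumed to be sorted and
non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or LuceneTools.
class OffsetHighlighter {

    // Wraps each [start, end) span of text in the given markers.
    // Assumes spans are sorted by start offset and do not overlap.
    static String highlight(String text, List<int[]> spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previously copied region
        for (int[] span : spans) {
            out.append(text, last, span[0]);                 // text before the match
            out.append(open).append(text, span[0], span[1]).append(close);
            last = span[1];
        }
        out.append(text, last, text.length());               // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "i have to " +
                "go to Sainte-Foy this afternoon but the urban transports are on strike " +
                "just today in Lyon, they are protesting against the project of law on " +
                "pensions and thus they are cassing me the nenette with their betises.";
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {110, 120}); // offsets reported in the output above
        spans.add(new int[] {151, 159});
        System.out.println(highlight(text, spans, "<b>", "</b>"));
    }
}
```

With the two offset pairs from the output above, the words wrapped in
markers are "protesting" and "pensions".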

-Shoba


--- Arpad KATONA <a.katona@ever-team.com> wrote:
> Hi Shoba,
> 
> thank you for your response.
> 
> Imagine an "indexing server" that may receive from a
> client via a socket three types of command : insert,
> search and retrieve.
> 
> Receiving an insert command e.g. "insert 123,
> d:/tempo/test.txt", the server reads the named file,
> extracts its content, creates a Lucene Document with
> three Lucene Fields KEY=123,
> DOCPATH=d:/tempo/test.txt and CONTENT=<the content
> of the file> and adds the document to the index. KEY
> is stored and indexed, DOCPATH is just stored and
> CONTENT is just indexed. Let us suppose, this
> test.txt file contains the text : "i have to go to
> Sainte-Foy this afternoon but the urban transports
> are on strike just today in Lyon, they are
> protesting against the project of law on pensions
> and thus they are cassing me the nenette with their
> betises."
> 
> Receiving a search command e.g. "search tra* AND
> pro*", the server constructs a Query object
>   Analyzer azer = new StandardAnalyzer();
>   QueryParser qp = new QueryParser("CONTENT", azer);
>   Query q = qp.parse("tra* AND pro*");
>   m_LastQuery = q; //server memorises the last data query
> etc, etc. The server responds to the client with an
> array of KEYs, in this example "123".
> 
> Receiving a retrieve command e.g. "retrieve 123", the
> server constructs another Query "KEY:123", retrieves
> the DOCPATH from the obtained Document, re-reads the
> file and re-extracts its content (this is perhaps
> what Tom Dunstan calls in his mail "reparsing at
> runtime"?) and writes it to a String, let us
> suppose, a variable called sText contains the
> content of the found file, i.e. sText="i have to
> go...". At this moment i would like to apply a tool
> in order to construct an "array of positions"
> indicating the positions and the lengths of the
> tokens of the last memorised data query. In this
> example the last memorised data query is "tra* AND
> pro*", the tokens of this query are "transports" and
> "protesting", the "array of positions" must be :
> "57,10;111,10", since the first 't' of 'transports'
> is the 57th character in the string "i have to
> go..." and the length of the word "transports" is 10
> etc, etc.
> 
> So i call the function highlightTerms in
> de.iqcomputing.lucene.LuceneTools (Maik Schreiber's
> class, lightly modified) :
>   ArrayList aop = new ArrayList();
>   LuceneTools.highlightTerms(sText, m_LastQuery,
> azer, aop);
> 
> A schematised excerpt of his function :
> 
> public String highlightTerms(
>   String p_sText, //"i have to go..."
>   Query p_Query, //"tra* AND pro*"
>   Analyzer p_Azer,
>   ArrayList p_aop) //"array of pos"
> {
> ...
>   TokenStream stream = null;
>   HashSet terms = new HashSet();
>   org.apache.lucene.analysis.Token token;
>   int startOffset, endOffset;
> 
>   getTerms(p_Query, terms, false);
> 
>   StringReader sr = new StringReader(p_sText);
>   stream = analyzer.tokenStream(sr);
>   while ((token = stream.next()) != null) {
>     startOffset = token.startOffset();
>     endOffset = token.endOffset();
>     if (terms.contains(token.termText())) {
>       //add startOffset to p_aop
>       //but it is always 0
>     }
>   }
> ...
> }
> 
> The call to "getTerms" collects all the terms
> appearing in the Query (this is why many Lucene
> classes must be modified in order to access
> "private" information within the classes), i.e.
> after the call to getTerms we have in the HashSet
> "terms" the two terms of the last Query :
> "transports" and "protesting". Then a TokenStream is
> created from p_sText etc etc. The problem is,
> startOffset is always 0 (zero) and, therefore, the
> "array of positions" will be completely wrong
> (contains only zeros). You say, you have never
> encountered a similar problem, do you see why it
> goes wrong here and why it works correctly in your
> context? I would be grateful to anybody for any
> help, because we are in big trouble
> without this highlighting feature...
> 
> Arpad KATONA
> --
> a.katona@ever-team.com
> 
> PS: sorry for the lamentable English, i hope it is
> comprehensible anyway...
> 
> -----Original Message-----
> From: Shoba Ramachandran
> [mailto:shoba_duruvan@yahoo.com]
> Sent: Monday, 12 May 2003 20:59
> To: Lucene Users List
> Subject: Re: Highlighting, startOffset, endOffset
> 
> 
> I'm using this termhighlighter and have never had any
> problem. Could you elaborate on how you are using
> it?
> 
> -Shoba
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 
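
For the "array of positions" format described in the quoted message
(1-based position, then length, pairs joined by ';'), the conversion
from token offsets can be sketched as follows. This is only an
illustration under stated assumptions: a simple regular expression
stands in for the Lucene analyzer, the query is assumed to be already
expanded to the concrete terms "transports" and "protesting", and the
names PositionArraySketch and positions are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the token loop; it does not use Lucene.
class PositionArraySketch {

    // Runs of letters and digits; an assumption, not StandardAnalyzer's rules.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns "position,length" pairs joined by ';', with 1-based positions.
    static String positions(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (terms.contains(word)) {
                if (out.length() > 0) {
                    out.append(';');
                }
                // m.start() is a 0-based offset; the requested format is 1-based
                out.append(m.start() + 1).append(',').append(m.end() - m.start());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "i have to " +
                "go to Sainte-Foy this afternoon but the urban transports are on strike " +
                "just today in Lyon, they are protesting against the project of law on " +
                "pensions and thus they are cassing me the nenette with their betises.";
        Set<String> terms = new HashSet<>(Arrays.asList("transports", "protesting"));
        System.out.println(positions(text, terms)); // prints 57,10;111,10
    }
}
```

With Lucene tokens, the same pair would presumably be built from
token.startOffset() + 1 and token.endOffset() - token.startOffset().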

