lucene-java-user mailing list archives

From "Arpad KATONA" <a.kat...@ever-team.com>
Subject RERE: Highlighting, startOffset, endOffset
Date Tue, 13 May 2003 09:30:47 GMT
Hi Shoba,

thank you for your response.

Imagine an "indexing server" that may receive three types of command from a client via a socket:
insert, search and retrieve.
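A minimal sketch of how such a server might split an incoming command line into its parts. The wire format is not specified in this message, so the sketch assumes the literal forms used in the examples here ("insert <key>, <path>", "search <query>", "retrieve <key>"); the class name CommandSketch and its fields are hypothetical.

```java
import java.util.Locale;

/** Hypothetical parser for the three command types; not part of Lucene. */
final class CommandSketch {
    enum Type { INSERT, SEARCH, RETRIEVE }

    final Type type;
    final String key;      // document key (insert and retrieve), otherwise null
    final String argument; // file path (insert) or query string (search), otherwise null

    private CommandSketch(Type type, String key, String argument) {
        this.type = type;
        this.key = key;
        this.argument = argument;
    }

    static CommandSketch parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("missing arguments: " + line);
        }
        String verb = trimmed.substring(0, space).toLowerCase(Locale.ROOT);
        String rest = trimmed.substring(space + 1).trim();
        switch (verb) {
            case "insert": {
                // Assumed form: insert <key>, <path>
                int comma = rest.indexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("expected 'insert <key>, <path>'");
                }
                return new CommandSketch(Type.INSERT,
                        rest.substring(0, comma).trim(),
                        rest.substring(comma + 1).trim());
            }
            case "search":
                // Everything after the verb is the query string.
                return new CommandSketch(Type.SEARCH, null, rest);
            case "retrieve":
                // Everything after the verb is the document key.
                return new CommandSketch(Type.RETRIEVE, rest, null);
            default:
                throw new IllegalArgumentException("unknown command: " + verb);
        }
    }
}
```

The query string of a search command is kept whole, so it can be handed to a query parser unchanged.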

Receiving an insert command, e.g. "insert 123, d:/tempo/test.txt", the server reads the named
file, extracts its content, creates a Lucene Document with three Lucene Fields (KEY=123, DOCPATH=d:/tempo/test.txt
and CONTENT=<the content of the file>) and adds the document to the index. KEY is stored
and indexed, DOCPATH is only stored and CONTENT is only indexed. Let us suppose this test.txt
file contains the text: "i have to go to Sainte-Foy this afternoon but the urban transports
are on strike just today in Lyon, they are protesting against the project of law on pensions
and thus they are cassing me the nenette with their betises."

Receiving a search command, e.g. "search tra* AND pro*", the server constructs a Query object:

  Analyzer azer = new StandardAnalyzer();
  QueryParser qp = new QueryParser("CONTENT", azer);
  Query q = qp.parse("tra* AND pro*");
  m_LastQuery = q; //server memorises the last data query
etc. The server responds to the client with an array of KEYs, in this example "123".

Receiving a retrieve command, e.g. "retrieve 123", the server constructs another Query, "KEY:123",
retrieves the DOCPATH from the resulting Document, re-reads the file and re-extracts its content
(this is perhaps what Tom Dunstan calls "reparsing at runtime" in his mail?) and writes it
to a String. Let us suppose a variable called sText holds the content of the found file,
i.e. sText="i have to go...". At this point I would like to apply a tool that constructs
an "array of positions" giving the positions and lengths of the tokens matched by the last
memorised data query. In this example the last memorised data query is "tra* AND pro*", the
matching tokens are "transports" and "protesting", and the "array of positions" must be
"57,10;111,10": the first 't' of "transports" is the 57th character of the string "i
have to go...", the length of the word "transports" is 10, and so on.
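To make the expected result concrete, here is a minimal sketch that computes the "array of positions" without Lucene. It assumes a deliberately simple tokenizer (a word is a run of letters or digits, lower-cased), which is not the same as StandardAnalyzer; the class and method names are hypothetical. Positions are 1-based, as in the example above, whereas a 0-based start offset would be one less.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; a stand-in for analyzer-based tokenization. */
final class PositionSketch {
    // Simplifying assumption: a word is a run of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns "pos,len;pos,len;..." with 1-based positions. */
    static String arrayOfPositions(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (terms.contains(word)) {
                if (out.length() > 0) {
                    out.append(';');
                }
                // Matcher.start() is 0-based; add 1 to count characters from 1.
                out.append(m.start() + 1).append(',').append(m.end() - m.start());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sText = "i have to go to Sainte-Foy this afternoon but the urban transports "
                + "are on strike just today in Lyon, they are protesting against the project "
                + "of law on pensions and thus they are cassing me the nenette with their betises.";
        Set<String> terms = new HashSet<>(Arrays.asList("transports", "protesting"));
        System.out.println(arrayOfPositions(sText, terms)); // prints 57,10;111,10
    }
}
```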

So I call the function highlightTerms in de.iqcomputing.lucene.LuceneTools (Maik Schreiber's
class, slightly modified):
  ArrayList aop = new ArrayList();
  LuceneTools.highlightTerms(sText, m_LastQuery, azer, aop);

A schematised excerpt of his function:

public String highlightTerms(
  String p_sText, //"i have to go..."
  Query p_Query, //"tra* AND pro*"
  Analyzer p_Azer,
  ArrayList p_aop) //"array of pos"
{
...
  TokenStream stream = null;
  HashSet terms = new HashSet();
  org.apache.lucene.analysis.Token token;
  int startOffset, endOffset;

  getTerms(p_Query, terms, false);

  StringReader sr = new StringReader(p_sText);
  stream = p_Azer.tokenStream(sr);
  while ((token = stream.next()) != null) {
    startOffset = token.startOffset();
    endOffset = token.endOffset();
    if (terms.contains(token.termText())) {
      //add startOffset to p_aop
      //but it is always 0
    }
  }
...
}

The call to "getTerms" collects all the terms appearing in the Query (this is why many Lucene
classes must be modified, in order to access "private" information within the classes), i.e.
after the call to getTerms the HashSet "terms" holds the two terms of the last Query:
"transports" and "protesting". Then a TokenStream is created from p_sText, and so on. The problem
is that startOffset is always 0 (zero), so the "array of positions" will be completely
wrong (it contains only zeros). You say you have never encountered a similar problem; do you
see why it goes wrong here and why it works correctly in your context? I would be grateful to
anybody for any help, because we are in big trouble without this highlighting
feature...
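The cause of the zero offsets is not established in this message. Independently of it, one possible workaround is to ignore the offsets reported by the tokens and recover them from the original text, by searching for each token's text from a moving cursor. The sketch below is plain Java with hypothetical names. It assumes the analyzer changes token text by lower-casing only (no stemming) and that lower-casing does not change the string length; it can also mis-align if a token's text occurs earlier as a substring of a word the analyzer dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical workaround: recompute offsets instead of trusting the tokens. */
final class OffsetRecoverySketch {
    /**
     * @param text       the original document text
     * @param tokenTexts token texts in document order (e.g. collected from termText())
     * @param terms      the query terms to highlight
     * @return {start, end} pairs, 0-based with end exclusive, for each matching token
     */
    static List<int[]> recoverOffsets(String text, List<String> tokenTexts, Set<String> terms) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<int[]> hits = new ArrayList<>();
        int cursor = 0; // never search before the end of the previous token
        for (String tokenText : tokenTexts) {
            int start = haystack.indexOf(tokenText, cursor);
            if (start < 0) {
                continue; // token text does not occur verbatim; skip it
            }
            int end = start + tokenText.length();
            cursor = end;
            if (terms.contains(tokenText)) {
                hits.add(new int[] {start, end});
            }
        }
        return hits;
    }
}
```

For the text "the urban transports are protesting" with tokens "urban", "transports", "protesting" (stop words removed) and the two query terms, this yields the pairs (10, 20) and (25, 35).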

Arpad KATONA
--
a.katona@ever-team.com

PS: sorry for the lamentable English, I hope it is comprehensible anyway...

-----Original Message-----
From: Shoba Ramachandran [mailto:shoba_duruvan@yahoo.com]
Sent: Monday, 12 May 2003 20:59
To: Lucene Users List
Subject: Re: Highlighting, startOffset, endOffset


I'm using this termhighlighter and have never had any
problem. Could you elaborate on how you are using
it?

-Shoba
