lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stepan Mik" <>
Subject RE: Can I retrieve token offsets from Hits?
Date Fri, 23 Jul 2004 08:07:46 GMT
Athough storing metadata would by very useful stuff, I believe that
Token (Term?) offsets should be integral part of index. Storing such
information would be optional (same as term frequencies) so users could
make thier own decision on index size. But this discussion probably
belongs to developer mailing list.
Reanalizing of original document content every time when excerpt is
needed is not best solution. With StandardAnalyzer it probably works
fine, but more complicated analyzers may be much slower. For example
(this is the case I'm trying to solve now) one may use lemmatizer as
part of tokenization process. In such case analyzing time of (even
small) set of documents is not acceptable for web application under
heavy load. 
In fact one can store token positions already. Original document may be
stored to unindexed field (contents_original), the text may be
preprocesed to form "token1[start,end] token2[start,end] ..." and stored
to another indexed and tokenized field (contents). Then very simple
analyzer that removes [start,end] text would be aplied to the field. Of
course this approach is the least efficient (in sense of memory
sonsumtion) from any other proposed in this discsussion.


> -----Original Message-----
> From: Grant Ingersoll []
> Sent: Thursday, July 22, 2004 9:37 PM
> To:
> Subject: Re: Can I retrieve token offsets from Hits?
> I am sensing a common theme throughout a variety of threads
> here:  Namely, a need for a pluggable set of Reader's and 
> Writers (think Interface) that can write metadata about an 
> Index/Document/Field/Term (which I see the TermVector stuff 
> as being a instance of) and can be given to Lucene from the 
> application level (or at least the application specifies 
> which ones to use)
> I proposed something like this a bit earlier, but didn't see
> any interest.  I suppose I should implement it as this is how 
> things get going, but would be nice to have some input on 
> requirements and whether the people who know Lucene better 
> than I think this is possible.
> Just my two cents on this one.  Doesn't help you w/ an
> immediate solution, but I think it would help us all in the 
> long run.  If this existed, one could easily implement a 
> Token position store and ask it for all of this information, 
> I think.  :-)
> -Grant
> >>> 07/22/04 03:19PM >>>
> > I wonder if the information in termPositions or termVector
> can be used
> > to restore token position from indicies?
> TermFreqVector gives you term frequencies (not positions).
> This can be of use in computing document 
> similarities.
> TermPositions gives you the sequence number . eg in the last 
> sentence the word "sequence" was 
> token number 5,  (not character position 5). This is used for 
> PhraseQueries to determine proximity.
> Character position is what is required to do highlighting and
> this isnt stored anywhere currently. 
> The requirements for such a store would be indexed access by 
> doc number, and a compact means of storing term/character 
> position info. This could add considerable size to the index.
> Previously we concluded that highlighting is only typically
> done on the first 10 or so records in a result set 
> anyway and that re-analyzing the text shouldnt add too much 
> of an overhead. If you want to limit the size of an 
> individual document's text to be tokenized use 
> highlighter.setMaxDocBytesToAnalyze().
> If you find tokenizing slow check you arent using 
> StandardAnalyzer - I have found that to be slow (see )



To unsubscribe, e-mail: 
For additional commands, e-mail: 

To unsubscribe, e-mail:
For additional commands, e-mail:

Pøíchozí zpráva neobsahuje viry.
Zkontrolováno antivirovým systémem AVG (
Verze: 6.0.718 / Virová báze: 474 - datum vydání: 9.7.2004

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message