lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 18927] - [PATCH] Term Vector support
Date Thu, 19 Aug 2004 12:10:10 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=18927>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support

------- Additional Comments From grant_ingersoll@yahoo.com  2004-08-19 12:10 -------
Term Vector support can now optionally store Token.getPositionIncrement(), 
Token.startOffset() and Token.endOffset() information.  This is controlled 
through the standard Field creation methods.  All options are backward 
compatible (position and offset information will _not_ be stored by default).  
I added many new test cases to demonstrate the functionality.  Two new files 
are needed: SegmentTermPositionVector and TermVectorOffsetInfo.  All tests 
pass as of the morning of 8/19/04.

Attached should be 1 patch file plus a zip containing 2 new files.

What is this info good for?
1. I think the highlighter could use the offset info instead of re-analyzing 
every document at runtime.
2. Many IR algorithms need character positions, etc.
3. Others??
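
To make point 1 concrete, here is a minimal sketch of highlighting driven only by stored offsets, with no re-analysis of the text.  It is not the Lucene highlighter and does not use any code from the patch; the class name OffsetHighlightSketch and the <b> markup are made up for illustration, and end offsets are assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or of this patch.
class OffsetHighlightSketch {

    // Wraps each {start, end} span in <b>...</b>. Spans are assumed to be
    // sorted by start offset and non-overlapping, with an exclusive end.
    static String highlight(String text, List<int[]> spans) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] span : spans) {
            out.append(text, last, span[0]);            // text before the hit
            out.append("<b>")
               .append(text, span[0], span[1])          // the hit itself
               .append("</b>");
            last = span[1];
        }
        out.append(text, last, text.length());          // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        // Offsets as they might have been stored for the term "be".
        List<int[]> offsetsForBe = new ArrayList<>();
        offsetsForBe.add(new int[] {3, 5});
        offsetsForBe.add(new int[] {16, 18});
        System.out.println(highlight(text, offsetsForBe));
    }
}
```

The only inputs are the original text and the per-term offsets, which is why storing offsets at index time could save the analysis pass at search time.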

Remember, the values stored are whatever your Analyzer sets on each Token 
(i.e. Token.startOffset(), Token.endOffset() and 
Token.getPositionIncrement()).  These values are controlled by the application 
author and can vary by application.
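
As a rough illustration of that point, the sketch below plays the role of a trivial analyzer and records, per term, the positions and character offsets it chose.  Everything here is an assumption made for the example: the names TermVectorSketch and TermEntry are hypothetical, tokens are split on whitespace, every position increment is 1, and end offsets are exclusive.  A different analyzer would produce different stored values for the same text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the classes added by this patch.
class TermVectorSketch {

    // Per-term data: token positions and {startOffset, endOffset} pairs.
    static class TermEntry {
        final List<Integer> positions = new ArrayList<>();
        final List<int[]> offsets = new ArrayList<>();
    }

    // Whitespace "analyzer": each token advances the position by 1 and
    // gets offsets equal to its character span in the original text.
    static Map<String, TermEntry> build(String text) {
        Map<String, TermEntry> vector = new LinkedHashMap<>();
        int position = -1;
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int end = i;        // exclusive end offset
            position += 1;      // the position increment chosen by this analyzer
            String term = text.substring(start, end);
            TermEntry entry = vector.computeIfAbsent(term, k -> new TermEntry());
            entry.positions.add(position);
            entry.offsets.add(new int[] {start, end});
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, TermEntry> e : build("to be or not to be").entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey());
            line.append(" positions=").append(e.getValue().positions).append(" offsets=");
            for (int[] o : e.getValue().offsets) {
                line.append("[").append(o[0]).append(",").append(o[1]).append(")");
            }
            System.out.println(line);
        }
    }
}
```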

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

