lucene-java-user mailing list archives

From "Mikko Noromaa" <mi...@noromaa.fi>
Subject RE: Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?
Date Wed, 24 Aug 2005 12:41:23 GMT
Hi,

I create my index with TermVector.WITH_POSITIONS_OFFSETS and get the term
offsets with the following code. The code collects two arrays: HFIDs (unique
IDs stored with the documents) and Highlights (one string per hit, holding
comma-separated start-end character offsets of the matching terms).

Please note that this code requires the patch from bug #36292
(http://issues.apache.org/bugzilla/show_bug.cgi?id=36292) to work with
prefix queries.


QueryParser parser = new QueryParser("text", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query=parser.parse(querystr);

IndexSearcher searcher=new IndexSearcher(reader);
Hits hits = searcher.search(query);

//System.out.println("query.getClass()=\""+query.getClass().toString()+"\"");
HashSet QueryTerms=new HashSet();
query.extractTerms(QueryTerms);

int NumHits=hits.length();
int[] HFIDs=new int[NumHits];
String[] Highlights=new String[NumHits];

for (int i = 0; i < NumHits; i++) {
	Document doc = hits.doc(i);
	HFIDs[i]=Integer.parseInt(doc.get("hfid"));
	String HiliString="";

	TermPositionVector tpv=(TermPositionVector)reader.getTermFreqVector(hits.id(i), "text");

	String[] DocTerms=tpv.getTerms();          
	int[] freq=tpv.getTermFrequencies();
	for (int t = 0; t < freq.length; t++) {
		if (QueryTerms.contains(new Term("text",DocTerms[t]))) {
		    TermVectorOffsetInfo[] offsets=tpv.getOffsets(t);
		    int[] pos=tpv.getTermPositions(t);

			for (int tp = 0; tp < pos.length; tp++) {
	
				// Test the length rather than using != on Strings, which compares references.
				HiliString+=(HiliString.length()>0?",":"")+offsets[tp].getStartOffset()+"-"+offsets[tp].getEndOffset();
			}
		}
	}

	Highlights[i]=HiliString;
}
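
As a rough sketch of how the Highlights strings could be consumed afterwards
(this part is not from my actual code, and OffsetHighlighter, parseOffsets and
highlight are made-up names, not Lucene classes): the example below parses a
"start-end,start-end" string and wraps each range of the original text in
square brackets. It assumes the offsets are 0-based character offsets with an
exclusive end, and that the ranges do not overlap. Because the loop above
walks the term vector term by term, the offsets are not in document order, so
the sketch sorts them first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): applies "start-end,start-end"
// offset strings, as built in the Highlights array, to the original text.
class OffsetHighlighter {

    // Parses e.g. "9-14,4-8" into {start, end} pairs; an empty string yields no pairs.
    static List<int[]> parseOffsets(String hili) {
        List<int[]> pairs = new ArrayList<int[]>();
        if (hili == null || hili.length() == 0) {
            return pairs;
        }
        for (String part : hili.split(",")) {
            int dash = part.indexOf('-');
            int start = Integer.parseInt(part.substring(0, dash));
            int end = Integer.parseInt(part.substring(dash + 1));
            pairs.add(new int[] { start, end });
        }
        return pairs;
    }

    // Wraps every offset range of text in [ and ].
    // Assumption: starts are 0-based, ends are exclusive, ranges do not overlap.
    static String highlight(String text, String hili) {
        List<int[]> pairs = parseOffsets(hili);
        // Offsets arrive grouped by term, so sort them by start offset first.
        pairs.sort((a, b) -> Integer.compare(a[0], b[0]));
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] p : pairs) {
            out.append(text, last, p[0]);                         // text before the match
            out.append('[').append(text, p[0], p[1]).append(']'); // the match itself
            last = p[1];
        }
        out.append(text, last, text.length());                    // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("the lazy brown fox", "9-14"));
    }
}
```

Under those assumptions, "brown" in "the lazy brown fox" corresponds to the
string "9-14", and the main method prints "the lazy [brown] fox".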


--

Mikko Noromaa (mikko@noromaa.fi) - tel. +358 40 7348034
Noromaa Solutions - see http://www.nm-sol.com/
 

> -----Original Message-----
> From: Sean O'Connor [mailto:sean@oconeco.com] 
> Sent: Wednesday, August 24, 2005 12:42 AM
> To: java-user@lucene.apache.org
> Subject: Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?
> 
> 
> Hello,
>     I am trying to work through term positions and how to get them from
> a collection of hits. Does setting TermVector.WITH_POSITIONS_OFFSETS to
> true save the start/end position of the term in the source text file?
> (I _think_ it does).
> 
>      If so, where would I start for trying to make that information
> accessible in a "result set"? I believe it would be extending a query,
> a scorer, a hit, and/or a weight object. I will be wanting to process
> ALL hits, so I think I will need to implement a HitCollector.
> 
>     As an example of what I want, if I were looking for the offset 
> position of "brown" in a properly indexed field containing "the lazy 
> brown fox", I would like to get:
> start==10
> end==15 (assuming my counting is right)
> 
>     Based on Paul Elschot's previous response to a similar question I
> had (which I am still working on), I _think_ I need to extend something
> like the ExactPhraseScorer. While debugging with my IDE (Eclipse) I can
> see that the weight object in the scorer contains a reference to the
> query. The query contains the fields:
>     Vector positions (just has ints of term positions in phrase?)
>     Vector terms (vector of Term, just field name and field contents?)
> 
>     The weight also seems to have an array of TermPositions, which have
> SegmentTermPositions. I thought this was what I wanted, but I don't see
> the proper start/end fields, or anything which seems to be on the right
> track.
> 
>     Can anyone point me in the right direction?
> Thanks,
> 
> Sean
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 



