lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: from docID to terms enumerator in O(1) ?
Date Sun, 15 Jul 2012 16:00:11 GMT
Enable term vectors while indexing and use the TermVector API.
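
To make the suggestion a little more concrete: once term vectors are stored, the terms of a document can be read directly from its docID, so the lookup cost depends on the document rather than on the size of the index. In Lucene 3.x this is, to the best of my knowledge, done by indexing the field with Field.TermVector.YES and calling IndexReader.getTermFreqVector(docID, field). The sketch below is not Lucene code. It assumes a plain Map from docID to that document's distinct terms as a hypothetical stand-in for stored term vectors (the class and method names are made up for illustration), and it counts, for each term, how many of the matching documents contain it. Note that this count is relative to the hits and may differ from the index-wide docFreq.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only: models stored term vectors with plain collections.
final class WordCloudSketch {

    // termVectors: hypothetical stand-in for stored term vectors
    //              (docID -> distinct terms of that document).
    // hits:        docIDs that matched the query.
    // Returns term -> number of matching documents containing the term.
    static Map<String, Integer> docFreqAmongHits(Map<Integer, Set<String>> termVectors,
                                                 List<Integer> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docID : hits) {
            // One direct lookup per hit; no scan over the whole term dictionary.
            Set<String> terms = termVectors.get(docID);
            if (terms == null) {
                continue; // no term vector stored for this document
            }
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> termVectors = new HashMap<>();
        termVectors.put(0, new HashSet<>(Arrays.asList("junit", "test")));
        termVectors.put(1, new HashSet<>(Arrays.asList("junit", "ant")));
        termVectors.put(2, new HashSet<>(Arrays.asList("lucene")));

        // Documents 0 and 1 matched the query; document 2 did not.
        System.out.println(docFreqAmongHits(termVectors, Arrays.asList(0, 1)));
    }
}
```

In a real Collector, the per-hit lookup would happen inside collect(int docID), with the reader passed to setNextReader kept in a field so that the segment-relative docID can be used as is.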

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Giovanni Gherdovich [mailto:g.gherdovich@gmail.com]
> Sent: Sunday, July 15, 2012 5:57 PM
> To: java-user@lucene.apache.org
> Subject: from docID to terms enumerator in O(1) ?
> 
> Hi all,
> 
> I'd like to know if I can get the list of indexed terms in a document from
> its document ID in constant time (say, in a time independent of the size of
> the index).
> 
> The reason I ask might be relevant
> (you might suggest a totally different way to achieve my goal).
> 
> I want to present the search results of a query as a word cloud, i.e. no
> scoring, no sorting, no nothing, just a visual representation of the array
> of pairs (term, docFreq) for all terms appearing in at least one of the docs
> that matched my query.
> 
> Skimming through the pages of "Lucene in Action",
> I found that I might need to call the method
> 
> void IndexSearcher.search(Query query, Collector results)
> 
> i.e. pass that method my own Collector class, which fetches and cooks the
> results the way I want.
> 
> The author provides a very clear code example for the Collector,
> 
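> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
> import java.io.IOException;
> import java.util.Collections;
> import java.util.HashMap;
> import java.util.Map;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.search.Collector;
> import org.apache.lucene.search.FieldCache;
> import org.apache.lucene.search.Scorer;
> 
> public class BookLinkCollector extends Collector {
>     private Map<String,String> documents = new HashMap<String,String>();
>     private Scorer scorer;
>     private String[] urls;    // "url" values of the current segment, by docID
>     private String[] titles;  // "title2" values of the current segment, by docID
> 
>     public boolean acceptsDocsOutOfOrder() {
>         return true;
>     }
> 
>     public void setScorer(Scorer scorer) {
>         this.scorer = scorer;
>     }
> 
>     // Called once per segment: reload the per-segment field values.
>     public void setNextReader(IndexReader reader, int docBase)
>         throws IOException {
>         urls = FieldCache.DEFAULT.getStrings(reader, "url");
>         titles = FieldCache.DEFAULT.getStrings(reader, "title2");
>     }
> 
>     // Called once per matching document; docID is relative to the segment.
>     public void collect(int docID) {
>         try {
>             String url = urls[docID];
>             String title = titles[docID];
>             documents.put(url, title);
>             System.out.println(title + ":" + scorer.score());
>         } catch (IOException e) {
>             // ignored in this example
>         }
>     }
> 
>     public Map<String,String> getLinks() {
>         return Collections.unmodifiableMap(documents);
>     }
> }
> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8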
> 
> which is then used like
> 
> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
> public void testCollecting() throws Exception {
>     Directory dir = TestUtil.getBookIndexDirectory();
>     TermQuery query = new TermQuery(new Term("contents", "junit"));
>     IndexSearcher searcher = new IndexSearcher(dir);
>     BookLinkCollector collector = new BookLinkCollector();
> 
>     searcher.search(query, collector);
>     Map<String,String> linkMap = collector.getLinks();
>     assertEquals("ant in action",
>                  linkMap.get("http://www.manning.com/loughran"));
>     searcher.close();
>     dir.close();
> }
> -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
> 
> What might not work for me is the use of FieldCache on the IndexReader to
> retrieve all field values on the current segment; those values are returned
> as String[],
> 
> while for me it would be more convenient to get a term enumerator:
> all the tokenizing and stopword removal work has already been done at
> indexing time, and I would like to leverage that.
> 
> How does that sound?
> 
> Cheers,
> Giovanni
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

