lucene-java-user mailing list archives

From Giovanni Gherdovich <g.gherdov...@gmail.com>
Subject from docID to terms enumerator in O(1) ?
Date Sun, 15 Jul 2012 15:56:32 GMT
Hi all,

I'd like to know if I can get the list of indexed terms in a document
from its document ID in constant time
(say, in time independent of the size of the index).

The reason I ask might be relevant
(you could suggest a totally different way to achieve my goal).

I want to present the search results of a query as a word cloud,
i.e. no scoring, no sorting, no nothing, just a visual representation
of the array of pairs (term, docFreq) for all terms appearing in
at least one of the docs that matched my query.

Skimming through the pages of "Lucene in Action"
I found that I might need to call the method

void IndexSearcher.search(Query query, Collector results)

i.e. pass that method my own Collector subclass,
which fetches and cooks results the way I want.

The author provides a very clear code example for
the Collector,

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
public class BookLinkCollector extends Collector {
    private Map<String,String> documents = new HashMap<String,String>();
    private Scorer scorer;
    private String[] urls;
    private String[] titles;

    public boolean acceptsDocsOutOfOrder() {
	return true;
    }

    public void setScorer(Scorer scorer) {
	this.scorer = scorer;
    }

    public void setNextReader(IndexReader reader, int docBase)
	throws IOException {
	urls = FieldCache.DEFAULT.getStrings(reader, "url");
	titles = FieldCache.DEFAULT.getStrings(reader, "title2");
    }

    public void collect(int docID) {
	try {
	    String url = urls[docID];
	    String title = titles[docID];
	    documents.put(url, title);
	    System.out.println(title + ":" + scorer.score());
	} catch (IOException e) {
	}
    }

    public Map<String,String> getLinks() {
	return Collections.unmodifiableMap(documents);
    }
}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

which is then used like

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
public void testCollecting() throws Exception {
    Directory dir = TestUtil.getBookIndexDirectory();
    TermQuery query = new TermQuery(new Term("contents", "junit"));
    IndexSearcher searcher = new IndexSearcher(dir);
    BookLinkCollector collector = new BookLinkCollector();

    searcher.search(query, collector);
    Map<String,String> linkMap = collector.getLinks();
    assertEquals("ant in action",
		 linkMap.get("http://www.manning.com/loughran"));
    searcher.close();
    dir.close();
}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

What might not work for me is the use of FieldCache
on the IndexReader to retrieve all field values for the current segment;
those values are returned as String[],

while for me it would be more convenient to get a term enumerator:
all the tokenizing and stopword-removal work has already been
done at indexing time, and I would like to leverage that.
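To make my idea more concrete: one direction I'm considering (please
tell me if this is the wrong way) is to store term vectors at indexing
time (Field.TermVector.YES) and read them back per document inside the
Collector, via IndexReader.getTermFreqVector(). A rough sketch; the
field name "contents" and the class name are just placeholders of mine:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Sketch: accumulate (term, frequency) pairs over all matching docs,
// assuming the "contents" field was indexed with TermVector.YES.
public class WordCloudCollector extends Collector {
    private IndexReader reader;  // current segment reader
    private final Map<String,Integer> cloud =
        new HashMap<String,Integer>();

    public boolean acceptsDocsOutOfOrder() {
        return true;  // order is irrelevant for a word cloud
    }

    public void setScorer(Scorer scorer) {
        // scores are not needed: no scoring, no sorting
    }

    public void setNextReader(IndexReader reader, int docBase) {
        // docIDs passed to collect() are relative to this segment,
        // so the segment reader is all we need to keep
        this.reader = reader;
    }

    public void collect(int docID) throws IOException {
        TermFreqVector tfv =
            reader.getTermFreqVector(docID, "contents");
        if (tfv == null) {
            return;  // no term vector stored for this doc/field
        }
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            Integer old = cloud.get(terms[i]);
            cloud.put(terms[i],
                      old == null ? freqs[i] : old + freqs[i]);
        }
    }

    public Map<String,Integer> getCloud() {
        return cloud;
    }
}
```

Reading the term vector should be proportional to the number of terms
in that one document, not to the index size, which is roughly the O(1)
I was asking about; but it does cost extra index space, which is why
I'd like to hear about alternatives.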

How does it sound?

Cheers,
Giovanni

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

