lucene-java-user mailing list archives

From Carsten Schnober <schno...@ids-mannheim.de>
Subject Term Statistics for MultiTermQuery
Date Tue, 12 Mar 2013 17:00:09 GMT
Hi,
here's another question involving MultiTermQuery instances. My aim is to
get a frequency count for a MultiTermQuery without having to execute the
query. The naive approach would be to create the Query, extract the
terms, and get each term's frequency, approximately as follows:

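IndexSearcher searcher = ...;
PrefixQuery query = new PrefixQuery(new Term("field", "abc"));
// Rewrite the prefix query against the index.
Query rewritten = searcher.rewrite(query);
// extractTerms() fills a caller-supplied set; it does not return one.
Set<Term> terms = new HashSet<Term>();
rewritten.extractTerms(terms);
...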

and then read the term frequency for each term. However, this
seems rather costly for a large number of terms, and I am actually
interested only in the total frequency, so there should be no need for a
term-by-term analysis.
My use case is that I have an index containing part-of-speech tags in
the form <tag>:<token> and I may be searching for <tag> frequencies.
My alternative solution would be to create a dedicated index in which
the original tokens are completely replaced by the tags, so that I would
have documents in the form "DET NN ..." and corresponding tokens. Would
you recommend this instead?
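
To make the <tag>:<token> case concrete, here is a minimal sketch of summing the frequencies of all terms that share a tag prefix in a single pass over a sorted term list. It assumes a plain java.util.TreeMap as a stand-in for the sorted term dictionary; the class name, method name, terms, and counts are all hypothetical and not Lucene API. In Lucene 4.x the analogous walk would presumably go through a TermsEnum (seekCeil to the prefix, then totalTermFreq for each term until the prefix no longer matches), but that is untested here.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TagFrequencySketch {

    /**
     * Sums the frequencies of all keys that start with the given prefix.
     * The sorted map only stands in for a sorted term dictionary (assumption).
     */
    static long totalFrequencyForPrefix(NavigableMap<String, Long> termFreqs, String prefix) {
        long total = 0;
        // tailMap starts at the first key >= prefix; in sorted order,
        // all keys sharing the prefix follow each other without gaps.
        for (Map.Entry<String, Long> e : termFreqs.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the prefix range, no later key can match
            }
            total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> termFreqs = new TreeMap<>();
        // Hypothetical <tag>:<token> terms with made-up frequencies.
        termFreqs.put("DET:a", 3L);
        termFreqs.put("DET:the", 5L);
        termFreqs.put("NN:dog", 2L);
        termFreqs.put("NN:house", 4L);
        termFreqs.put("VB:run", 1L);

        System.out.println("DET total: " + totalFrequencyForPrefix(termFreqs, "DET:"));
        System.out.println("NN total: " + totalFrequencyForPrefix(termFreqs, "NN:"));
    }
}
```

The scan still touches every term under the prefix, so its cost grows with the number of distinct tokens per tag. With a dedicated tag-only index each tag would be a single term, and the total would be a single lookup.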

Thanks,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
