lucene-java-user mailing list archives

From "deb.lucene" <>
Subject multiple phrase search for topic
Date Fri, 28 Oct 2011 16:13:16 GMT
Hi Group,

I am indexing and searching a large corpus of news articles. The indexing
process is straightforward: I use the StandardAnalyzer to analyze the
content of each news document.
document = new Document();
document.add(new Field("snum", snum, Field.Store.YES,Field.Index.NO));
document.add(new Field("content", conent,

where "snum" is the serial number of the news article and "content" is the
actual text of the document.

So far so good. The search process is a little more complex because I am
doing a multiple-phrase search. Let me explain the situation with an example.
Suppose I have to retrieve documents that belong to the category "Software
Technology" using phrases/query terms related to that topic. I have around
10k phrases that belong to this particular category (e.g. "data recovery
tool", ..., "C++ language", ..., "Steve Jobs", ..., "Mac Layer", ...,
"Grid Computing", etc.). My idea was to create a separate PhraseQuery for
each of these phrases and then add all of them to a BooleanQuery, much like
this:

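BooleanQuery bQuery = new BooleanQuery();
for (String phrase : allPhrases) {
    // split the phrase into words on whitespace
    String[] terms = phrase.split("\\s++");
    PhraseQuery pQuery = new PhraseQuery();
    for (int j = 0; j < terms.length; j++) {
        String word = terms[j].toLowerCase();
        pQuery.add(new Term("content", word));
    }
    // every phrase is an optional (SHOULD) clause
    bQuery.add(pQuery, BooleanClause.Occur.SHOULD);
}
int numOfSugg = 2000;
// searcher: an IndexSearcher over the index (variable name assumed)
TopDocs matches = searcher.search(bQuery, numOfSugg);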
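
To show exactly which terms end up in each phrase query, here is the splitting step on its own as a minimal plain-Java sketch (no Lucene involved). The class and method names (PhraseTerms, toTerms) are made up for illustration, and unlike the loop above it also trims the phrase and uses a fixed locale. Note that punctuation is kept, so "C++ language" yields the terms "c++" and "language"; whether the analyzer produced the same tokens at index time is something I have not checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PhraseTerms {
    // Splits a phrase on runs of whitespace and lowercases each piece,
    // roughly mirroring the inner loop above. Punctuation is left untouched.
    static List<String> toTerms(String phrase) {
        List<String> terms = new ArrayList<String>();
        for (String piece : phrase.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(toTerms("C++ language"));        // [c++, language]
        System.out.println(toTerms("Data  Recovery Tool")); // [data, recovery, tool]
    }
}
```

If the indexed tokens differ from these terms (for example because the analyzer drops punctuation), the corresponding phrase query would simply never match.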

Unfortunately, when I search the news content with this approach, the
results do not look very promising. Many of the top-ranked documents are not
good candidates for the "Software Technology" topic, even though they
contain the phrases (not very frequently). My questions are:

1) Is there anything wrong with this usage of the phrase/boolean query?
2) How can I guarantee that the most suitable news documents (i.e.
documents that contain many of the related phrases) appear at the top of the
search results? I used BooleanClause.Occur.SHOULD (instead of MUST) because
it is impossible to find a single document containing all of the 10k
phrases; with SHOULD, I surmise the best results will be those that contain
at least a few of the phrases.
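
To make the ranking I am hoping for concrete, here is a minimal plain-Java sketch. It is not Lucene's scoring, only an illustration of the desired order: documents that contain more of the listed phrases come first. The class and method names (PhraseCoverage, countPhrases, rank) are hypothetical, and it uses naive case-insensitive substring matching instead of an analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PhraseCoverage {
    // Counts how many of the listed phrases occur in the text.
    // Each listed phrase counts at most once, however often it appears.
    static int countPhrases(String text, List<String> phrases) {
        String haystack = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : phrases) {
            if (haystack.contains(phrase.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // Returns document ids ordered by descending number of matched phrases.
    static List<String> rank(final Map<String, String> docs, final List<String> phrases) {
        List<String> ids = new ArrayList<String>(docs.keySet());
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                return countPhrases(docs.get(b), phrases)
                     - countPhrases(docs.get(a), phrases);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("1", "Steve Jobs introduced a new Mac.");
        docs.put("2", "A data recovery tool written in the C++ language for grid computing.");
        docs.put("3", "Football results from the weekend.");
        List<String> phrases = Arrays.asList(
            "data recovery tool", "C++ language", "Steve Jobs", "Grid Computing");
        System.out.println(rank(docs, phrases)); // [2, 1, 3]
    }
}
```

In this toy example document 2 matches three phrases, document 1 matches one, and document 3 matches none, which is the kind of ordering I would like to see in the top results.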

Thanks in advance,
