lucene-java-user mailing list archives

From "deb.lucene" <deb.luc...@gmail.com>
Subject multiple phrase search for topic
Date Fri, 28 Oct 2011 16:13:16 GMT
Hi Group,

I am indexing and searching a large corpus of news articles. The indexing
process is straightforward: I use StandardAnalyzer to analyze the content of
each news document.
**************************
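// imports: org.apache.lucene.document.Document, org.apache.lucene.document.Field
document = new Document();
// "snum" is stored but not indexed; "content" is analyzed and indexed but not stored
document.add(new Field("snum", snum, Field.Store.YES, Field.Index.NO));
document.add(new Field("content", content,
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
indexWriter.addDocument(document);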

Here "snum" is the serial number of the news article and "content" is the
actual text of the document.

******************************
So far so good. The searching process is a little more complex, as I am
searching for multiple phrases at once. Let me explain with an example.
Suppose I have to retrieve documents that belong to the category "Software
Technology" using phrases/query terms related to that topic. I have around
10k phrases that belong to this particular category (e.g. "data
recovery tool",....., "C++ language",...."Steve Jobs",....."Mac
Layer",...."Grid Computing"...etc.). My idea was to create a separate phrase
query for each of these phrases and then add all of them to a boolean query,
like this:

****************************
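// imports: org.apache.lucene.index.Term,
//          org.apache.lucene.search.{BooleanClause, BooleanQuery, PhraseQuery, TopDocs}

// The default limit is 1024 clauses; setMaxClauseCount is a static method.
BooleanQuery.setMaxClauseCount(10000);
BooleanQuery bQuery = new BooleanQuery();

// allPhrases is assumed to be a collection of Strings, one phrase each.
for (String phrase : allPhrases)
{
    // Split on whitespace and lowercase, on the assumption that this yields
    // the same terms StandardAnalyzer produced at index time.
    String[] terms = phrase.trim().split("\\s+");

    PhraseQuery pQuery = new PhraseQuery();
    for (String term : terms)
    {
        pQuery.add(new Term("content", term.toLowerCase()));
    }
    pQuery.setSlop(0); // exact phrase match, no gaps allowed
    bQuery.add(pQuery, BooleanClause.Occur.SHOULD);
}
int numOfSugg = 2000;
TopDocs matches = isearcher.search(bQuery, numOfSugg);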

********************************
Unfortunately, when I search the news content with this approach, the
results do not look very promising. Many of the top-ranked documents
are not the best candidates for the "Software Technology" topic: they do
contain some of the phrases, but not very frequently. My questions are:

1) Is there anything wrong with this usage of the phrase/boolean queries?
2) How can I make sure that the most suitable news documents (i.e.
documents that contain many of the related phrases) appear at the top of the
search results? I used BooleanClause.Occur.SHOULD (instead of MUST)
because it is impossible to find a single document containing all of
the 10k phrases. With SHOULD, I would expect the best results to be the
documents that contain at least a few of the phrases.
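
To make question 2 concrete, the ranking I have in mind can be sketched in
plain Java. This is not Lucene code and not Lucene's scoring; the class and
method names are made up for illustration, and it assumes lowercased,
whitespace-separated tokens, which is cruder than what StandardAnalyzer does.
A document's score is simply the number of phrases from the list that it
contains, so a document mentioning three different phrases once each should
outrank one that repeats a single phrase three times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only; hypothetical names, not part of Lucene.
class PhraseCoverageSketch {

    // Lowercases the text and splits it on runs of whitespace.
    static List<String> tokenize(String text) {
        String normalized = text.trim().toLowerCase();
        if (normalized.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(normalized.split("\\s+"));
    }

    // Counts how many phrases from the list occur in the document as an
    // exact, contiguous token sequence (the analogue of slop 0). A phrase is
    // counted once, no matter how often it repeats in the document.
    static int countMatchedPhrases(String document, List<String> phrases) {
        List<String> docTokens = tokenize(document);
        int matched = 0;
        for (String phrase : phrases) {
            List<String> phraseTokens = tokenize(phrase);
            if (!phraseTokens.isEmpty()
                    && Collections.indexOfSubList(docTokens, phraseTokens) >= 0) {
                matched++;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> phrases =
                Arrays.asList("grid computing", "data recovery tool", "steve jobs");
        String docA = "Steve Jobs spoke about grid computing and a data recovery tool";
        String docB = "Steve Jobs Steve Jobs Steve Jobs visited the farm";

        System.out.println(countMatchedPhrases(docA, phrases)); // prints 3
        System.out.println(countMatchedPhrases(docB, phrases)); // prints 1
    }
}
```

Under this measure docA (three different phrases) ranks above docB (one
phrase repeated), which is the ordering I would like the search results to
follow.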

Thanks in advance,
--d

