lucene-java-user mailing list archives

From markharw00d <>
Subject Re: Reverse search
Date Wed, 28 Mar 2007 07:34:00 GMT
 >> I just want to make sure there is no API to either create the document from a term or to index the term directly.

No, but your code looks like it should do the job. That code can be
improved by something like this [pseudo code]:

if (query instanceof PhraseQuery) {
    // find the rarest term, using an existing index as a guide
    int rarestDf = Integer.MAX_VALUE;
    Term rarestTerm = null;
    for (Term term : terms) {
        int df = corpusReader.docFreq(term);
        if (df < rarestDf) {
            rarestDf = df;
            rarestTerm = term;
        }
    }
    // add just the rarest term
    doc.add(new Field(rarestTerm.field(), rarestTerm.text(),
            Field.Store.NO, Field.Index.TOKENIZED));
} else {
    // add all terms
    for (Term term : terms) {
        doc.add(new Field(term.field(), term.text(),
                Field.Store.NO, Field.Index.TOKENIZED));
    }
}
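That way a PhraseQuery such as "XYZ limited" is only shortlisted when a
document actually contains the rare term ("XYZ"), rather than every time
the far more commonplace "limited" appears - see the earlier mail further
down this thread.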

Melanie Langlois wrote:
> Mark,
> When I extract the terms from my query, can I not add them directly? I have to do something like:
> Set<Term> terms = new HashSet<Term>();
> query.extractTerms(terms);
> Document doc = new Document();
> for (Term term : terms) {
>     doc.add(new Field(term.field(), term.text(), Field.Store.NO, Field.Index.TOKENIZED));
> }
> I just want to make sure there is no API to either create the document from a term or to index the term directly.
> Thanks
> Mélanie 
> -----Original Message-----
> From: markharw00d [] 
> Sent: Monday, March 26, 2007 12:36 AM
> To:
> Subject: Re: Reverse search
> On app startup:
> 1) Parse all queries and place them in an array.
> 2) Create a RAMIndex containing a doc for each query, with content
> consisting of the query's terms (see Query.extractTerms). For optimal
> performance, only index the rarest term for queries with multiple
> mandatory criteria, e.g. PhraseQuerys. "Rarest" can be determined by
> looking at IndexReader.docFreq(t) using an existing index which is
> representative of your type of content.
> 3) For any queries that can't be handled by 2), e.g. FuzzyQueries, add
> them to a list of "run always" queries (a sketch of 1) and 2) follows
> this list).
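> A minimal sketch of steps 1) and 2), assuming Lucene 2.x-era APIs;
> parseAllQueries() and buildQueryDoc() are hypothetical helpers standing
> in for your query parsing and the rarest-term filtering from the pseudo
> code at the top of this message:
>
> // imports: org.apache.lucene.analysis.WhitespaceAnalyzer,
> //          org.apache.lucene.index.IndexWriter, org.apache.lucene.store.RAMDirectory
> Query[] queries = parseAllQueries(); // hypothetical: load and parse stored queries
> RAMDirectory queryDir = new RAMDirectory();
> IndexWriter writer = new IndexWriter(queryDir, new WhitespaceAnalyzer(), true);
> for (int i = 0; i < queries.length; i++) {
>     // buildQueryDoc: extractTerms plus rarest-term filtering (hypothetical helper)
>     writer.addDocument(buildQueryDoc(queries[i], corpusReader));
>     // doc id i lines up with queries[i] because docs are added in order
> }
> writer.optimize();
> writer.close();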
> Whenever you receive a new document:
> 1) Put it in a MemoryIndex.
> 2) Get a list of the document's terms by calling
> memoryIndex.getReader().terms();
> 3) For each term, hit your query RAMIndex and get
> queryIndexReader.termDocs(term) - this will give you the ids of the
> queries that need to be run - you can use the doc id to index straight
> into your parsed-queries array.
> 4) Run all queries found in 3), and all those held in your "run always"
> list, against the MemoryIndex containing your new document (see the
> sketch after this list).
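> A rough sketch of 1) to 4), assuming the queries array and queryDir
> from the startup sketch above; newDocumentText and analyzer are your
> own inputs (MemoryIndex lives in contrib, org.apache.lucene.index.memory):
>
> MemoryIndex memoryIndex = new MemoryIndex();
> memoryIndex.addField("content", newDocumentText, analyzer);
> IndexReader queryIndexReader = IndexReader.open(queryDir);
> Set<Integer> queryIds = new HashSet<Integer>();
> TermEnum termEnum = memoryIndex.createSearcher().getIndexReader().terms();
> while (termEnum.next()) {
>     TermDocs termDocs = queryIndexReader.termDocs(termEnum.term());
>     while (termDocs.next()) {
>         queryIds.add(new Integer(termDocs.doc())); // doc id == slot in queries[]
>     }
> }
> for (Integer id : queryIds) {
>     if (memoryIndex.search(queries[id.intValue()]) > 0.0f) {
>         // queries[id] matched the new document
>     }
> }
> // ...then run the "run always" queries against the MemoryIndex too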
> Hope this helps,
> Mark
> Melanie Langlois wrote:
>> Hi Mark,
>> If I follow you, I should list the key terms in my incoming document, then select the
>> queries which contain these key terms, and then run those queries on my index? If this
>> is correct, there are two things I don't understand:
>> - how do I know which term is a key term in my document?
>> - how can I select the queries? Should I index them in a separate index?
>> Thanks,
>> Mélanie Langlois 
>> -----Original Message-----
>> From: mark harwood [] 
>> Sent: Friday, March 23, 2007 11:19 PM
>> To:
>> Subject: Re: Reverse search
>> Bear in mind that the million queries you run on the MemoryIndex can be shortlisted
>> if you place those queries in a RAMIndex and use the source document's terms to "query
>> the queries". The list of unique terms for your document is readily available from the
>> MemoryIndex's reader.
>> You can take this list and find "likely related queries" to execute from your query index.
>> Note that for phrase queries or other forms of query with multiple mandatory terms you
>> should only index one of the terms (preferably the rarest) to ensure that your query is
>> not needlessly executed. For example - using this approach I need only run the phrase
>> query for "XYZ limited" whenever I encounter a document with the rare term "XYZ" in it,
>> rather than the much more commonplace "limited".
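>> For the "XYZ limited" example, the choice boils down to a docFreq
>> comparison (a sketch; "content" is an assumed field name and reader is
>> an IndexReader over an index representative of your content):
>>
>> Term xyz = new Term("content", "xyz");
>> Term limited = new Term("content", "limited");
>> // "xyz" is far rarer than "limited", so index only "xyz" for this query
>> Term toIndex = reader.docFreq(xyz) < reader.docFreq(limited) ? xyz : limited;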
>> Cheers
>> Mark
>> ----- Original Message ----
>> From: karl wettin <>
>> To:
>> Sent: Friday, 23 March, 2007 12:54:36 PM
>> Subject: Re: Reverse search
>> On 23 Mar 2007, at 09:57, Melanie Langlois wrote:
>>> Well, I thought to use the PerFieldAnalyzerWrapper, which has as its
>>> default the SnowballAnalyzer with English stopwords, and to use
>>> SnowballAnalyzer with language-specific stopwords for the fields
>>> which will be in different languages. But I'm seeing that in your
>>> MemoryIndexTest you commented out the use of SnowballAnalyzer; is it
>>> because it's too slow? In that case, I think I could use the
>>> StandardAnalyzer... what do you think?
>> I think that creating an index with a couple of documents takes a
>> fraction of the time it will take to place a million queries on that
>> index. There is no real need to optimize something that takes
>> milliseconds when, in the same process, you do something that takes
>> half a minute.
