lucene-dev mailing list archives

From starz10de <farag_ah...@yahoo.com>
Subject Re: conditional High Freq Terms in Lucene index
Date Fri, 30 Mar 2012 23:23:28 GMT
Thanks for your hint.

I tried a simple solution, as follows.
First, I determine the documents of type “A” by searching the document type
field in the index, and store their ids in a list:
// c and doc_id are declared elsewhere in the class (presumably a static
// int hit counter and a static List<String> of matching doc ids).
public static void doStreamingSearch(final Searcher searcher, Query query)
		throws IOException {

	Collector streamingHitCollector = new Collector() {
		// Doc id offset of the segment currently being collected.
		private int docBase;

		// Record the index-wide id of every matching document.
		@Override
		public void collect(int doc) throws IOException {
			c++;
			// doc is relative to the current segment; add docBase so the id
			// matches what TermDocs.doc() reports on the top-level reader.
			doc_id.add((docBase + doc) + "");
		}

		@Override
		public boolean acceptsDocsOutOfOrder() {
			return true;
		}

		@Override
		public void setNextReader(IndexReader reader, int docBase)
				throws IOException {
			this.docBase = docBase;
		}

		@Override
		public void setScorer(Scorer scorer) throws IOException {
			// Scores are not needed here.
		}
	};

	searcher.search(query, streamingHitCollector);
}
Then I modified HighFreqTerms in Lucene as follows:
while (terms.next()) {
	dok.seek(terms);
	while (dok.next()) {
		for (int i = 0; i < doc_id.size(); ++i) {
			if (doc_id.get(i).equals(dok.doc() + "")) {
				if (terms.term().field().equals(field)) {
					tiq.insertWithOverflow(new TermInfo(terms.term(), dok.freq()));
				}
			}
		}
	}
}
I could verify that I correctly get only the documents of type “A”. However,
the result is not correct, because a few terms appear twice in the ordered
list of high-frequency terms.

Any hints as to where the problem is?

Michael McCandless-2 wrote
> 
> You'd have to modify HighFreqTerm's sources...
> 
> Roughly...
> 
> First, make a bitset recording which docs are type A (eg, use
> FieldCache), second, change HighFreqTerms so that for each term, it
> walks the postings, counting how many type A docs there were, then...
> just use the rest of HighFreqTerms (priority queue, etc.).
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Mar 29, 2012 at 11:33 AM, starz10de <farag_ahmed@> wrote:
>> Hi,
>>
>> I am using the HighFreqTerms class to compute the high-frequency terms in
>> the Lucene index, and it works well. However, I would like to compute the
>> high-frequency terms under a condition: not for all documents in the
>> index, but only for documents of type “A”. Besides the “contents” field,
>> the index also has a “DocType” (document type) field.
>> So I should compute the high-frequency terms only if DocType=“A”.
>>
>> Any idea how to do this?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3868066.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
> 
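
The steps quoted above (bitset of type A docs, one pass over each term's postings, then the priority queue) can be sketched without Lucene as follows. This is only a model under stated assumptions: the class FilteredHighFreqTerms, the TermCount type, the topTerms method and the map of {docId, freq} pairs standing in for TermDocs are all hypothetical names, not Lucene API. The point it illustrates is that the count is accumulated per term and the term is queued once, after its postings have been walked. In the modified loop shown earlier, insertWithOverflow is called inside the per-document loop, so a term that occurs in several type A documents is queued several times, which would explain the repeated terms.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class FilteredHighFreqTerms {

    /** A term paired with its total frequency over the accepted documents. */
    static final class TermCount {
        final String term;
        final long freq;

        TermCount(String term, long freq) {
            this.term = term;
            this.freq = freq;
        }

        @Override
        public String toString() {
            return term + "=" + freq;
        }
    }

    /**
     * Returns the n most frequent terms, counting only accepted documents.
     * postings maps each term to {docId, freq} pairs (a stand-in for TermDocs).
     */
    static List<TermCount> topTerms(Map<String, int[][]> postings,
                                    BitSet accepted, int n) {
        // Min-heap on frequency: the weakest of the current top n is at the head.
        PriorityQueue<TermCount> queue =
                new PriorityQueue<>(Comparator.comparingLong((TermCount t) -> t.freq));

        for (Map.Entry<String, int[][]> entry : postings.entrySet()) {
            long total = 0;
            for (int[] posting : entry.getValue()) {
                // posting[0] is the doc id, posting[1] the within-document frequency.
                if (accepted.get(posting[0])) {
                    total += posting[1];   // use "total++" to count documents instead
                }
            }
            if (total == 0) {
                continue;                  // term does not occur in any accepted doc
            }
            // Exactly one queue entry per term, added after the postings walk.
            queue.add(new TermCount(entry.getKey(), total));
            if (queue.size() > n) {
                queue.poll();              // drop the current minimum
            }
        }

        List<TermCount> result = new ArrayList<>(queue);
        result.sort(Comparator.comparingLong((TermCount t) -> t.freq).reversed());
        return result;
    }

    public static void main(String[] args) {
        BitSet typeA = new BitSet();
        typeA.set(0);
        typeA.set(2);                      // docs 0 and 2 are type A, doc 1 is not

        Map<String, int[][]> postings = new TreeMap<>();
        postings.put("lucene", new int[][] {{0, 3}, {1, 5}, {2, 2}});
        postings.put("index", new int[][] {{0, 1}, {2, 1}});
        postings.put("search", new int[][] {{1, 9}});

        System.out.println(topTerms(postings, typeA, 2));   // prints [lucene=5, index=2]
    }
}
```

Mapped back to the snippet earlier in this message, the outer loop corresponds to terms.next(), the inner loop to dok.next(), and the single queue.add to a single tiq.insertWithOverflow placed after the inner loop. A BitSet lookup also replaces the linear scan over doc_id with string comparisons for every posting.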



--
View this message in context: http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3872298.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

