lucene-java-user mailing list archives

From Chris Brown <chri...@orangepics.com>
Subject Re: top n words within a results set?
Date Wed, 11 Jan 2006 16:38:11 GMT
Excellent!! Thank you so much!

----- Original Message ----- 
From: "Grant Ingersoll" <gsingers@syr.edu>
To: <java-user@lucene.apache.org>
Sent: Wednesday, January 11, 2006 12:07 PM
Subject: Re: top n words within a results set?


> Hey Chris,
>
> There is just such an analyzer, called the PerFieldAnalyzerWrapper.  The
> trick is that the Field name is always passed to the Analyzer when it is
> asked for the TokenStream.
>
> -Grant
>
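A minimal plain-Java sketch of the per-field dispatch idea behind PerFieldAnalyzerWrapper: the field name selects which analysis function runs. This is a model only, not Lucene code. The class `PerFieldDispatch`, the field names `keywords` and `keywords_display`, and the `plain` and `toyStem` functions are all hypothetical, and `toyStem` merely strips a trailing "s" as a stand-in for a real stemmer such as Snowball.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of per-field analyzer selection (not a Lucene class).
class PerFieldDispatch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register an analyzer that overrides the default for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    // The field name decides which analyzer tokenizes the text.
    List<String> tokens(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Unstemmed analysis: lowercase, then split on anything that is not a-z.
    static List<String> plain(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Toy stand-in for a stemmer: drops one trailing "s". NOT Snowball.
    static List<String> toyStem(String text) {
        List<String> out = new ArrayList<>();
        for (String t : plain(text)) {
            boolean strip = t.length() > 1 && t.endsWith("s");
            out.add(strip ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }
}
```

With a setup like this, the stemmed field can serve searching while the unstemmed field supplies displayable words for the term counts.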
> Chris Brown wrote:
>
>> Bear with me, I might be missing something.... My documents get indexed 
>> ( writer.addDocument(doc) ) with one IndexWriter created using one 
>> Analyzer (the SnowballAnalyzer). So unless you can somehow use a 
>> different Analyzer per field I don't see how the second field will help. 
>> If I get the TermFreqVector for a field for a document that was indexed 
>> using the SnowballAnalyzer, isn't it always going to return stemmed 
>> words?
>>
>> To confirm your assumption, I suppose I am trying to display the values 
>> of the indexed field. It doesn't matter to me whether I count "party" and 
>> "parties" as separate words or not but I cannot display "parti" to a user 
>> as it's not a word.
>>
>> I'm thinking I need a separate index with the field created using the 
>> StandardAnalyzer unless there's some other trick with mixing Analyzers 
>> I'm unaware of.
>>
>> Thanks again for your help,
>> Chris
>>
>> ----- Original Message -----
>> From: "Grant Ingersoll" <gsingers@syr.edu>
>> To: <java-user@lucene.apache.org>
>> Sent: Wednesday, January 11, 2006 8:32 AM
>> Subject: Re: top n words within a results set?
>>
>>
>>> I believe the usual solution is to have a separate field on the same 
>>> document for display purposes (I am assuming you are trying to display
>>> the values of the indexed field) that is not stemmed.   The tradeoff is 
>>> in disk space, of course.
>>>
>>> Chris Brown wrote:
>>>
>>>> Okay, I've taken Grant's advice and aggregated the TermFreqVectors for
>>>> each term in the applicable field. It works quite well; there's just
>>>> one glitch.
>>>>
>>>> Some words like "party" and "picture" appear as "parti" and "pictur". I
>>>> am using the SnowballAnalyzer, and I suspect that's what's changing the
>>>> words. Short of maintaining a second index using a different analyzer,
>>>> does anyone have any ideas?
>>>>
>>>> ----- Original Message -----
>>>> From: "Grant Ingersoll" <gsingers@syr.edu>
>>>> To: <java-user@lucene.apache.org>
>>>> Sent: Monday, January 09, 2006 12:34 PM
>>>> Subject: Re: top n words within a results set?
>>>>
>>>>
>>>>> You could use term vectors to accomplish this.  Get your hits for the
>>>>> website, then load the term vector for the field containing the
>>>>> keywords and add up the frequencies.
>>>>>
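A minimal plain-Java sketch of the "add up the frequencies" step, assuming each hit's term vector has already been read into two parallel arrays (in Lucene of this era those would come from `TermFreqVector.getTerms()` and `getTermFrequencies()`). The class name `TopTerms` and its methods are hypothetical, and the alphabetical tie-break is an arbitrary choice made here so the ordering is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sums per-document term frequencies over a result set.
class TopTerms {
    // Running total of term -> frequency across all documents added so far.
    private final Map<String, Integer> totals = new HashMap<>();

    // Add one document's term vector, given as parallel arrays of terms
    // and their frequencies within that document.
    void addVector(String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            totals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    // Return up to n terms, highest total first; ties are broken alphabetically.
    List<String> top(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Calling `addVector` once per hit and then `top(10)` would give the ten most frequent terms for one website's results. Sorting every distinct term is fine for modest vocabularies; a bounded priority queue would avoid the full sort for large ones.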
>>>>> Chris Brown wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Is it possible to retrieve the top 'n' most frequently appearing words
>>>>>> within the results of a search? I've seen the High Frequency Terms
>>>>>> code in the sandbox but it works across the whole index.
>>>>>>
>>>>>> To put this question into context: We're developing a website that
>>>>>> hosts users' photo websites. Searches can be specific to a
>>>>>> particular user's website or be performed globally across one, many
>>>>>> or all websites. I've accomplished this with a field in the index
>>>>>> called website. What I'd like to do is give each user the top ten
>>>>>> words that appear on their website.
>>>>>> Thanks,
>>>>>> Chris Brown
>>>>>>
>>>>>> http://www.orangepics.com/
>>>>>>
>>>>>>
>>>>>
>>>>> -- 
>>>>> ------------------------------------------------------------------- 
>>>>> Grant Ingersoll Sr. Software Engineer Center for Natural Language
>>>>> Processing Syracuse University School of Information Studies 337 Hinds
>>>>> Hall Syracuse, NY 13244
>>>>> http://www.cnlp.org Voice:  315-443-5484 Fax: 315-443-6886
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>
>>
>


