lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: TermVectorComponent for tag generation?
Date Sat, 01 Nov 2008 19:33:34 GMT



On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:

>
> On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:
>
>> How do you propose to distinguish those words from the other ones?
>
> ** They are field values from other documents

But so are many other words in that document. What separates
[Lucene, PDF, HTML, Microsoft Word] from the rest?  Your brain made
the distinction, but what information exists in that document that
would let a computer make it?  (This is a leading question. I have
some ideas, but I think hearing it from you will help me better
understand what you are trying to do.)

>
>
>> The problem you are addressing is often called keyword extraction.   
>> In general, it's a difficult problem, but you may have domain  
>> knowledge that can help.
>
> ** I'm finding it hard to believe that Lucene can do an amazing job
> at search, yet has nothing to tell me whether a generated list of
> content is present in a resulting document.

I think it can. What I'm missing is where the generated list comes
from.  Given the list, I think it's just another search, right?
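
To illustrate the "just another search" point, here is a minimal Python sketch, not Lucene or Solr code. It assumes the candidate list already exists and checks each entry against a document's text by token-level phrase matching. The function names `tokens`, `contains_phrase`, and `tags_present`, and the sample text, are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens; a real analyzer would do much more."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

def tags_present(candidates, text):
    """Return the candidate tags that occur in text, in candidate order."""
    doc_tokens = tokens(text)
    return [tag for tag in candidates if contains_phrase(doc_tokens, tokens(tag))]

# Invented sample data for illustration only.
candidates = ["Lucene", "PDF", "HTML", "Microsoft Word", "Excel"]
text = "Lucene can index PDF, HTML, and Microsoft Word documents."
print(tags_present(candidates, text))
```

Against a real index, each candidate would presumably be issued as a term or phrase query restricted to the document in question.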

So, I suppose you could get the term vector (TV) for your current
document, along with the doc freq (DF) of each term, to learn which
terms also occur in other documents.  Then you could fetch those
documents by searching for each of those terms.
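
As a rough illustration of the idea above, here is a minimal Python sketch that uses a toy in-memory corpus rather than Lucene or Solr itself. The names `tokenize`, `term_vector`, `doc_freq`, and `shared_terms` are hypothetical and the sample documents are invented. In a real setup the per-document term frequencies and the document frequencies would come from the index.

```python
from collections import Counter

# Toy corpus standing in for an index (invented sample data).
DOCS = {
    "doc1": "Lucene can index PDF and HTML files",
    "doc2": "Solr is built on Lucene",
    "doc3": "Convert Microsoft Word files to PDF",
}

def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does much more."""
    return text.lower().split()

def term_vector(doc_id):
    """Term -> frequency within one document (the 'TV')."""
    return Counter(tokenize(DOCS[doc_id]))

def doc_freq(term):
    """Number of documents containing the term (the 'DF')."""
    return sum(1 for text in DOCS.values() if term in tokenize(text))

def shared_terms(doc_id):
    """Map each term of doc_id with DF > 1 to the other documents containing it."""
    result = {}
    for term in term_vector(doc_id):
        if doc_freq(term) > 1:
            # "Search" for the term: every other document that contains it.
            result[term] = sorted(
                other for other, text in DOCS.items()
                if other != doc_id and term in tokenize(text)
            )
    return result

print(shared_terms("doc1"))
```

In this toy corpus, `files` comes back alongside `lucene` and `pdf`, which is the question raised earlier in this message: DF alone does not separate tag-like terms from ordinary words.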

However, I still suspect I'm missing something, so I'd say give it a  
try!  Maybe trying it out in code would be the best way to articulate  
it.

-Grant
