lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: TermVectorComponent for tag generation?
Date Sat, 01 Nov 2008 17:16:56 GMT
How do you propose to distinguish those words from the other ones?   
The problem you are addressing is often called keyword extraction.  In  
general, it's a difficult problem, but you may have domain knowledge  
that can help.
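
As a minimal sketch of the domain-knowledge route: if the names of the other documents (players, teams, products) are already known, tagging reduces to checking which of those names occur in the text. The Python below is plain standard-library code, not a Solr or Lucene API; `suggest_tags`, `known_names`, and the sample data are hypothetical names chosen for illustration, and the optional trailing "s" is only a crude stand-in for real analysis such as stemming.

```python
import re

def suggest_tags(text, known_names):
    """Return the known names found in text, ordered by first occurrence."""
    found = []
    for name in known_names:
        # Whole-word, case-insensitive match; the optional "s" is a crude
        # stand-in for the stemming a real analyzer would do ("PDFs" -> "PDF").
        pattern = r"(?<!\w)" + re.escape(name) + r"s?(?!\w)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), name))
    return [name for _, name in sorted(found)]

# Hypothetical data: names that would come from the other documents' fields.
article = ("Lucene has been widely recognized for its utility in search. "
           "Text from PDFs, HTML, Microsoft Word documents, as well as many "
           "others can all be indexed.")
names = ["HTML", "Solr", "Microsoft Word", "PDF", "Lucene"]
print(suggest_tags(article, names))
```

With a long name list, one regex per name gets slow, so a dedicated dictionary matcher would likely be a better fit than this loop.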


On Oct 31, 2008, at 6:35 PM, Jon Baer wrote:

> Well, for example, in any given text (which is a field on a document):
>
> "While suitable for any application which requires full text  
> indexing and searching capability, Lucene has been widely recognized  
> for its utility in the implementation of Internet search engines and  
> local, single-site searching.
>
> At the core of Lucene's logical architecture is the idea of a  
> document containing fields of text. This flexibility allows Lucene's  
> API to be independent of file format. Text from PDFs, HTML,  
> Microsoft Word documents, as well as many others can all be indexed  
> so long as their textual information can be extracted."
>
> I'd like to be able to say the tags for this article should be  
> [Lucene, PDF, HTML, Microsoft Word] because they are in field values  
> from other documents.  Basically how to generate tags from just a  
> single document based on other document field values.
>
> - Jon
>
>
> On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:
>
>> Hey Jon,
>>
>> Not following how the TVC (TermVectorComp) would help here.  I  
>> suppose you could use the "most important" terms, as defined by  
>> TF-IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to  
>> generate query terms.
>>
>> However, I'm not following the different filter query piece.  Can  
>> you provide a bit more detail?
>>
>> One thing you did make me think of, though, is that it might be interesting  
>> to extend TermVectorMapper so that it can output a NamedList and  
>> then allow people to implement their own SolrTermVectorMapper and  
>> have it customize the TV output...
>>
>> Thanks,
>> Grant
>>
>> On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:
>>
>>> Hi,
>>>
>>> So I'm looking to either use this or build a component which might  
>>> do what I'm looking for.  I'd like to figure out if it's possible to  
>>> use a single doc to get tag generation based on the matches within  
>>> that document, for example:
>>>
>>> 1 News Doc -> contains 5 Players and 8 Teams (show them as  
>>> possible tags for this article)
>>>
>>> In this case Players and Teams are also docs.  It's almost like I  
>>> want to use MoreLikeThis w/ a different filter query than what I'm  
>>> using.
>>>
>>> Is there any easy hack to get this going?
>>>
>>> Thanks.
>>>
>>> - Jon
>>
>> --------------------------
>> Grant Ingersoll
>> Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
>> http://www.lucenebootcamp.com
>>
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>
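
To make the TF-IDF suggestion quoted above concrete, here is a minimal sketch that ranks the terms of one document against a small corpus and keeps the top few as candidate tags. It is plain Python using a textbook smoothed TF-IDF, not Lucene's scoring formula and not what MoreLikeThis does internally; `tokenize`, `top_terms`, and the sample strings are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_terms(doc, corpus, k=3):
    """Rank the terms of doc by TF-IDF against corpus and return the top k."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for other in corpus:
        # Count each term once per document it appears in.
        doc_freq.update(set(tokenize(other)))
    term_freq = Counter(tokenize(doc))
    scores = {
        # Smoothed IDF so terms unseen in the corpus do not divide by zero.
        term: tf * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, tf in term_freq.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Hypothetical corpus and document.
corpus = ["the cat sat", "the dog ran", "the bird flew"]
print(top_terms("the lucene index the lucene", corpus, k=2))
```

Terms that appear in every corpus document score zero here, so common words drop to the bottom without a stop-word list.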

--------------------------
Grant Ingersoll