lucene-solr-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values
Date Thu, 04 Sep 2008 17:21:45 GMT

     [ https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-651:
---------------------------------

    Attachment: SOLR-651.patch

Here's a first crack at this.  It still needs more unit tests to exercise the various combinations
of options, but I think it is a reasonable start on the idea.

Questions to be answered/things to still do:
1. How do people like the output format?  It's basically broken down by doc, then field,
then term, then term information.  See the unit tests for some samples.
2. It would be good to have a more efficient lookup for IDF.  At a minimum, a cache of IDF values
would be useful, but its memory use would need to be bounded.  Lucene may do some caching under
the hood, so that should be investigated further.
3. It relies on the query component doing its thing.  That is, you send in a query, start
and rows, and this component just loops over the doc list and fetches.  I could see a case
for doing things separately, but that seems like duplication.  People using this can just
send explicit queries designed for this component.
4. It probably needs some error handling for documents that don't have term vectors, but I
haven't tested that yet.
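
To make the points above concrete, here is a minimal Python sketch of the idea, not the code in
the patch.  It pages through a doc list with start and rows, nests the output as doc, then
field, then term, then term information, keeps IDF values in a size-bounded cache, and reports
fields that have no term vector instead of failing.  The toy index, the function names, the
"warning" entry and the IDF formula are all assumptions made for illustration.

```python
import math
from functools import lru_cache

# Hypothetical toy index: doc id -> field -> term -> raw term frequency.
# A field mapped to None stands in for a field stored without term vectors.
TERM_VECTORS = {
    0: {"title": {"solr": 1, "search": 1}, "body": {"solr": 2, "lucene": 1}},
    1: {"title": {"lucene": 1}, "body": None},
    2: {"title": {"search": 1}, "body": {"search": 3, "solr": 1}},
}
NUM_DOCS = len(TERM_VECTORS)


def doc_freq(field, term):
    """Count the documents whose `field` contains `term`."""
    return sum(
        1
        for fields in TERM_VECTORS.values()
        if fields.get(field) and term in fields[field]
    )


@lru_cache(maxsize=1024)  # bounded, so the cache's memory use stays controlled
def idf(field, term):
    """Assumed weighting: 1 + ln(N / (df + 1)); not taken from the patch."""
    return 1.0 + math.log(NUM_DOCS / (doc_freq(field, term) + 1))


def term_vector_response(doc_list, start=0, rows=10):
    """Build doc -> field -> term -> info for one page of the doc list."""
    response = {}
    for doc_id in doc_list[start:start + rows]:
        per_field = {}
        for field, vector in TERM_VECTORS.get(doc_id, {}).items():
            if vector is None:
                # No term vector for this field: report it rather than fail.
                per_field[field] = {"warning": "no term vector"}
                continue
            per_field[field] = {
                term: {
                    "tf": tf,
                    "df": doc_freq(field, term),
                    "tf-idf": tf * idf(field, term),
                }
                for term, tf in vector.items()
            }
        response[doc_id] = per_field
    return response


# The doc list would come from the query; here it is given directly.
result = term_vector_response([2, 0, 1], start=0, rows=2)
```

Because the cache is keyed on (field, term), repeated terms across documents in the same page
cost one lookup each, and maxsize caps how many entries are kept.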



> A SearchComponent for fetching TF-IDF values
> --------------------------------------------
>
>                 Key: SOLR-651
>                 URL: https://issues.apache.org/jira/browse/SOLR-651
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-651.patch
>
>
> A SearchComponent that can return the TF-IDF vector for any given document in the Solr index
> Query : A Document Number / a query identifying a Document
> Response :  A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on the TF-IDF representation of
> documents, hence adding a Request Handler providing the TF-IDF representation
> will pave the way for incorporating Learning Paradigms into the Solr framework.
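
As an illustration of the response described in the issue (a map from term to TF-IDF value for
one selected document), here is a small Python sketch.  The three-document corpus, the function
name and the weighting (raw term frequency times ln(N / df)) are assumptions for the example;
Lucene's own scoring uses a different formula.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
CORPUS = {
    "doc1": ["solr", "search", "solr"],
    "doc2": ["lucene", "search"],
    "doc3": ["machine", "learning", "solr"],
}


def tfidf_vector(doc_id, corpus):
    """Return {term: tf * idf} for every term in the selected document."""
    num_docs = len(corpus)
    term_counts = Counter(corpus[doc_id])
    vector = {}
    for term, tf in term_counts.items():
        # Document frequency: how many documents contain the term at all.
        df = sum(1 for tokens in corpus.values() if term in tokens)
        # Assumed weighting: raw tf times ln(N / df).
        vector[term] = tf * math.log(num_docs / df)
    return vector


vector = tfidf_vector("doc1", CORPUS)
```

A map like this is the kind of per-document feature vector that a learning algorithm could
consume directly.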

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

