lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-868) Making Term Vectors more accessible
Date Thu, 19 Jul 2007 19:45:06 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513983
] 

Karl Wettin commented on LUCENE-868:
------------------------------------

Sorry for the delay, vacation time.

In short, I think this is a really nice improvement to the API. I also agree with Yonik about
the arrays constructed and passed down to the mapper. Perhaps your current implementation
could be moved one layer further up? Another thought is to reuse the array(s) and pass on the
data length, but that might just complicate things.
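
To make the array-reuse idea concrete, here is a minimal sketch in Java. It is not taken from any of the attached patches: the names ReusableArrayMapper and ReusingLoader are hypothetical, and the loader only simulates reading positions from an index. The point is just that the callback receives a shared buffer together with the number of valid entries.

```java
// Hypothetical callback, not part of the Lucene API: the mapper is handed a
// shared buffer and the number of entries in it that are valid for this call.
interface ReusableArrayMapper {
    void map(String term, int[] positions, int length);
}

// Hypothetical loader that reuses one buffer across calls instead of
// allocating a fresh array for every term.
final class ReusingLoader {
    private int[] buffer = new int[4];

    // Simulates loading one term's positions; "source" stands in for the
    // data that would be read from the index.
    void load(String term, int[] source, ReusableArrayMapper mapper) {
        if (buffer.length < source.length) {
            // Grow only when the current buffer is too small.
            buffer = new int[Math.max(source.length, buffer.length * 2)];
        }
        System.arraycopy(source, 0, buffer, 0, source.length);
        // Only the first source.length entries are meaningful to the mapper.
        mapper.map(term, buffer, source.length);
    }
}
```

The complication shows up on the mapper side: the buffer is overwritten by the next call, so a mapper that wants to keep the data has to copy the first "length" entries itself.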

I'll try to introduce these things next week and see how well it works. 

I use the term vectors for text classification. For each new classifier introduced (which happens
quite a lot) I iterate over the corpus and classify the documents. Potentially it could save me
quite a bit of ticks and bits to not create all those arrays; however, my gut tells me there
might be some JVM settings that do the same trick. I'll have to look into that.



> Making Term Vectors more accessible
> -----------------------------------
>
>                 Key: LUCENE-868
>                 URL: https://issues.apache.org/jira/browse/LUCENE-868
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch
>
>
> One of the big issues with term vector usage is that the information is loaded into parallel
> arrays as it is loaded, which are then often times manipulated again to use in the application
> (for instance, they are sorted by frequency).
> Adding a callback mechanism that allows the vector loading to be handled by the application
> would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The existing
> getTermFreqVectors will be reimplemented to use an implementation of TermVectorMapper that
> creates the parallel arrays.  Additionally, some simple implementations that automatically
> sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have a patch
> soon.
> See http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
> for related information.
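
As an illustration of the callback idea in the issue description, the following Java sketch shows how an application-side mapper might collect terms and return them sorted by frequency. It rests on assumptions: DraftTermVectorMapper is a local stand-in that copies the single-method draft signature quoted above, not the interface from the attached patches or from any Lucene release, and FrequencySortingMapper is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Local stand-in modelled on the draft signature in the issue description;
// an assumption for illustration, not the Lucene API.
interface DraftTermVectorMapper {
    void map(String term, int frequency, int offset, int position);
}

// Hypothetical mapper: collects terms as the reader reports them and returns
// them ordered by descending frequency, with no intermediate parallel arrays.
final class FrequencySortingMapper implements DraftTermVectorMapper {
    static final class Entry {
        final String term;
        final int frequency;

        Entry(String term, int frequency) {
            this.term = term;
            this.frequency = frequency;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    @Override
    public void map(String term, int frequency, int offset, int position) {
        // Offset and position are ignored here; only frequency matters.
        entries.add(new Entry(term, frequency));
    }

    // Returns a sorted copy; ties keep the order in which terms arrived.
    List<Entry> sortedByFrequency() {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.frequency).reversed());
        return sorted;
    }
}
```

Under the proposal, a reader would call map(...) once per term while loading the vector, so the application decides what to keep and in which shape.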

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

