lucene-dev mailing list archives

From "Thomas D'Silva (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1910) Extension to MoreLikeThis to use tag information
Date Sun, 04 Oct 2009 17:21:56 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761997#action_12761997 ]

Thomas D'Silva edited comment on LUCENE-1910 at 10/4/09 10:21 AM:
------------------------------------------------------------------

Mark,

I refactored the class to use more descriptive variable names. I also modified the code
so that, when calculating information gain, only terms belonging to documents that have been
tagged with the given tag are used (rather than all the terms in the index).
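
For readers unfamiliar with the metric, here is a minimal sketch of information gain for one
term and one tag. The patch's exact formulation is not shown in this message, so this uses the
usual text-categorization definition (entropy of the tag minus the entropy of the tag given
whether the term is present), computed from four document counts. The class, method and
parameter names are hypothetical and not taken from the patch.

```java
/** Sketch only: information gain of a term with respect to a tag, from document counts. */
final class InformationGainSketch {

    /** One entropy summand, -p * log2(p), with the 0 * log(0) case treated as 0. */
    private static double plogp(double p) {
        return p <= 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2.0));
    }

    /** Entropy in bits of a two-outcome distribution given the two counts. */
    private static double entropy(long a, long b) {
        long total = a + b;
        if (total == 0) {
            return 0.0;
        }
        return plogp((double) a / total) + plogp((double) b / total);
    }

    /**
     * Information gain H(tag) - H(tag | term present or absent).
     * All four arguments are document counts (hypothetical names).
     */
    static double informationGain(long taggedWithTerm, long untaggedWithTerm,
                                  long taggedWithoutTerm, long untaggedWithoutTerm) {
        long withTerm = taggedWithTerm + untaggedWithTerm;
        long withoutTerm = taggedWithoutTerm + untaggedWithoutTerm;
        long total = withTerm + withoutTerm;
        if (total == 0) {
            return 0.0;
        }
        double prior = entropy(taggedWithTerm + taggedWithoutTerm,
                               untaggedWithTerm + untaggedWithoutTerm);
        double conditional =
                ((double) withTerm / total) * entropy(taggedWithTerm, untaggedWithTerm)
              + ((double) withoutTerm / total) * entropy(taggedWithoutTerm, untaggedWithoutTerm);
        return prior - conditional;
    }

    public static void main(String[] args) {
        // A term that appears in exactly the tagged documents versus one spread evenly.
        System.out.println(informationGain(50, 0, 0, 50));
        System.out.println(informationGain(25, 25, 25, 25));
    }
}
```

Restricting the candidate terms to those found in documents carrying the tag means
taggedWithTerm is at least 1 for every term scored, and terms that never co-occur with the
tag are never examined.
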
I tested this class on a test index containing one million documents. The documents were tagged
with five tags (tag_0...tag_4). tag_0 was assigned to approximately 10% of the documents,
tag_1 to 20% and so on. 

tag name, number of documents, time in ms
tag_0, 10134, 137314
tag_1, 19996, 219527
tag_2, 30010, 315336
tag_3, 39907, 413615
tag_4, 50148, 507350

The time taken to generate the query for a tag depends on the number of documents in the index
containing the tag and scales roughly linearly with that number (about 10 to 14 ms per tagged
document in the runs above).
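
The linearity claim can be checked against the five measurements above with an ordinary
least-squares fit. This is a sketch only: the class and method names are hypothetical, and the
data points are copied from the table in this message.

```java
/** Sketch only: least-squares fit of query-generation time against tagged-document count. */
final class LinearScalingSketch {

    /** Returns {slope, intercept, rSquared} for the line y = slope * x + intercept. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxx = 0.0;
        double sxy = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            sxy += dx * dy;
            syy += dy * dy;
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        // For a simple linear fit, R^2 equals the squared correlation coefficient.
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] {slope, intercept, rSquared};
    }

    public static void main(String[] args) {
        double[] taggedDocs = {10134, 19996, 30010, 39907, 50148};
        double[] millis = {137314, 219527, 315336, 413615, 507350};
        double[] result = fit(taggedDocs, millis);
        System.out.printf("ms per document: %.3f%n", result[0]);
        System.out.printf("fixed cost in ms: %.1f%n", result[1]);
        System.out.printf("R^2: %.5f%n", result[2]);
    }
}
```

The slope is the marginal cost per tagged document and the intercept is the part of the cost
that does not depend on the tag's size. An R^2 close to 1 supports the linear reading.
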
The top document terms for a given tag are cached in a hashmap once they have been generated,
in order to speed up subsequent lookups.
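
As an illustration of the caching described above, here is a minimal sketch of a per-tag cache
backed by a HashMap. The class name, the method names and the Function used to stand in for
the expensive term computation are all assumptions made for the example. The patch itself may
be structured differently, and Map.computeIfAbsent requires Java 8 or later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: computes the top terms for a tag once and serves later lookups from a map. */
final class TagTermCacheSketch {

    private final Map<String, List<String>> topTermsByTag = new HashMap<>();
    private final Function<String, List<String>> computeTopTerms;

    /** computeTopTerms stands in for the expensive information-gain ranking (hypothetical). */
    TagTermCacheSketch(Function<String, List<String>> computeTopTerms) {
        this.computeTopTerms = computeTopTerms;
    }

    /** Returns the cached terms for the tag, computing them on the first request only. */
    List<String> topTermsFor(String tag) {
        return topTermsByTag.computeIfAbsent(tag, computeTopTerms);
    }

    public static void main(String[] args) {
        TagTermCacheSketch cache = new TagTermCacheSketch(tag -> {
            System.out.println("computing top terms for " + tag);
            return Arrays.asList(tag + "_term_a", tag + "_term_b");
        });
        cache.topTermsFor("tag_0"); // triggers the computation
        cache.topTermsFor("tag_0"); // served from the map
    }
}
```

A plain HashMap is not safe for concurrent use. If the generator were shared between threads,
a ConcurrentHashMap would be the usual substitute. The cache also needs to be discarded when
the index changes, since the top terms depend on the indexed documents.
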

Thanks,
Thomas

  
> Extension to MoreLikeThis to use tag information
> ------------------------------------------------
>
>                 Key: LUCENE-1910
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1910
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Thomas D'Silva
>            Priority: Minor
>         Attachments: LUCENE-1910.patch, LUCENE-1910.patch
>
>
> I would like to contribute a class based on the MoreLikeThis class in
> contrib/queries that generates a query based on the tags associated
> with a document. The class assumes that documents are tagged with a
> set of tags (which are stored in the index in a separate Field). The
> class determines the top document terms associated with a given tag
> using the information gain metric.
> While generating a MoreLikeThis query for a document, the tags
> associated with that document are used to determine the terms in the
> query. This class is useful for finding documents similar to one that
> has few relevant terms but has been tagged.
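
The description above can be sketched without any Lucene dependency: given a document's tags
and a map from each tag to its ranked top terms, collect the distinct terms that would go into
the query. Every name here (the class, the method, maxQueryTerms) is hypothetical and not taken
from the patch. In Lucene itself the resulting terms would typically become TermQuery clauses
of a BooleanQuery, which is how MoreLikeThis builds its query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch only: derives query terms for a document from the tags attached to it. */
final class TagQueryTermsSketch {

    /**
     * Walks the document's tags in order and gathers each tag's top terms,
     * skipping duplicates and stopping once maxQueryTerms terms are collected.
     */
    static List<String> queryTermsFor(Collection<String> documentTags,
                                      Map<String, List<String>> topTermsByTag,
                                      int maxQueryTerms) {
        Set<String> terms = new LinkedHashSet<>();
        for (String tag : documentTags) {
            List<String> topTerms = topTermsByTag.getOrDefault(tag, Collections.emptyList());
            for (String term : topTerms) {
                if (terms.size() >= maxQueryTerms) {
                    return new ArrayList<>(terms);
                }
                terms.add(term);
            }
        }
        return new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        Map<String, List<String>> topTermsByTag = new HashMap<>();
        topTermsByTag.put("tag_0", Arrays.asList("lucene", "index"));
        topTermsByTag.put("tag_1", Arrays.asList("index", "query", "search"));
        System.out.println(queryTermsFor(Arrays.asList("tag_0", "tag_1"), topTermsByTag, 3));
    }
}
```

Because the terms come from the tags rather than from the document's own text, a document with
little content can still produce a useful query as long as it carries at least one tag.
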

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

