lucene-dev mailing list archives

From "Thomas D'Silva (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
Date Fri, 27 Nov 2009 02:25:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783010#action_12783010 ]

Thomas D'Silva commented on LUCENE-1910:
----------------------------------------

Mark,

I refactored the code so that the tag and document probabilities are computed during the
index creation phase and used to find the most important document terms for each tag term.
These document terms (ranked by information gain) are stored as meta information in the
index when the index is created. I added a class, TagIndexWriter, which extends IndexWriter
and creates an index that can be used to run MoreLikeThisUsingTags queries.
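
As an illustration of the ranking step described above, here is a minimal sketch that scores
terms for one tag by information gain. It is not the code from the patch: the class
TagTermInfoGain, its method names, and the choice of the standard binary information gain
formula (entropy of the tag minus its entropy conditioned on term presence) are assumptions
made for the example. It works on plain document counts and does not touch the Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks document terms for a single tag by information gain. */
final class TagTermInfoGain {
    private TagTermInfoGain() {}

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy in bits of a binary variable with P(true) = p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0;
        }
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    /**
     * Information gain of "document contains the term" about "document has the tag".
     * Assumed formula: H(tag) - P(term) * H(tag | term) - P(no term) * H(tag | no term).
     */
    static double informationGain(long totalDocs, long tagDocs, long termDocs, long bothDocs) {
        if (totalDocs <= 0) {
            return 0.0;
        }
        long noTermDocs = totalDocs - termDocs;
        double pTag = (double) tagDocs / totalDocs;
        double pTerm = (double) termDocs / totalDocs;
        double pTagGivenTerm = termDocs == 0 ? 0.0 : (double) bothDocs / termDocs;
        double pTagGivenNoTerm = noTermDocs == 0 ? 0.0 : (double) (tagDocs - bothDocs) / noTermDocs;
        return entropy(pTag)
                - pTerm * entropy(pTagGivenTerm)
                - (1 - pTerm) * entropy(pTagGivenNoTerm);
    }

    /**
     * Returns the k terms with the highest information gain for the tag.
     * termCounts maps term -> {docs containing the term, docs containing the term AND the tag}.
     */
    static List<String> topTerms(long totalDocs, long tagDocs, Map<String, long[]> termCounts, int k) {
        Map<String, Double> gain = new HashMap<>();
        for (Map.Entry<String, long[]> e : termCounts.entrySet()) {
            long[] c = e.getValue();
            gain.put(e.getKey(), informationGain(totalDocs, tagDocs, c[0], c[1]));
        }
        List<String> terms = new ArrayList<>(termCounts.keySet());
        terms.sort((a, b) -> Double.compare(gain.get(b), gain.get(a)));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

A term that appears in every document, or one whose presence is independent of the tag, scores
(close to) zero under this formula, while a term that occurs only in tagged documents scores
highest.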

I recreated a test index with one million documents and assigned tags (tag_0, ..., tag_4) to
10%, 20%, and so on of the documents.

The time taken to generate a query on an index created using TagIndexWriter:

tag name   number of documents   time (ms)
tag_0      10134                 22
tag_1      19996                 29
tag_2      30010                 6
tag_3      39907                 6
tag_4      50148                 9

Since the document terms corresponding to a tag term are computed during the indexing phase,
the time taken to generate a MoreLikeThisUsingTags query is roughly constant, regardless of
how many documents carry the tag.
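
To illustrate why query generation no longer depends on the number of tagged documents, here
is a minimal sketch of the query-time side. The class PrecomputedTagTerms, its methods, and
the field name "body" are hypothetical and stand in for the meta information stored in the
index; the actual patch may store and read the terms differently. Generating the query is just
a lookup of the precomputed terms for each tag of the document.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the per-tag term lists stored at index time. */
final class PrecomputedTagTerms {
    private final Map<String, List<String>> termsByTag = new HashMap<>();

    /** Called once per tag while the index is being built. */
    void put(String tag, List<String> topTerms) {
        termsByTag.put(tag, new ArrayList<>(topTerms));
    }

    /**
     * Builds a query string of the form "field:term1 field:term2 ..." from the
     * precomputed terms of the given tags. Cost depends on the number of tags and
     * stored terms, not on how many documents carry each tag.
     */
    String queryFor(Collection<String> tags, String field) {
        Set<String> terms = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String tag : tags) {
            terms.addAll(termsByTag.getOrDefault(tag, Collections.emptyList()));
        }
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```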

Thanks,
Thomas

> Extension to MoreLikeThis to use tag information
> ------------------------------------------------
>
>                 Key: LUCENE-1910
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1910
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Thomas D'Silva
>            Priority: Minor
>
> I would like to contribute a class based on the MoreLikeThis class in
> contrib/queries that generates a query based on the tags associated
> with a document. The class assumes that documents are tagged with a
> set of tags (which are stored in the index in a separate Field). The
> class determines the top document terms associated with a given tag
> using the information gain metric.
> While generating a MoreLikeThis query for a document, the tags
> associated with the document are used to determine the terms in the
> query. This class is useful for finding documents similar to a document
> that does not have many relevant terms but was tagged.


