lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Updated: (LUCENE-1016) TermVectorAccessor, transparent vector space access
Date Wed, 03 Oct 2007 07:50:50 GMT


Karl Wettin updated LUCENE-1016:

    Attachment: LUCENE-1016-clusterer.txt

Sorry for flooding. This JIRA issue drifts further off topic with each post; I hope
you don't mind.

LUCENE-1016-clusterer.txt now contains a refactoring of the Tanimoto similarity. It does the
same thing, but with less messy code.
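
For readers unfamiliar with the measure, here is a minimal sketch of the Tanimoto coefficient over sparse term-frequency vectors, T(a, b) = a·b / (|a|² + |b|² − a·b). It is not the code from the attachment: the class name TanimotoSketch and the use of plain maps instead of Lucene term vectors are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class TanimotoSketch {

    /**
     * Tanimoto coefficient of two sparse term-frequency vectors.
     * Returns 1.0 for identical non-empty vectors and 0.0 when no term is shared.
     */
    public static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        double denominator = normA + normB - dot;
        // Two empty vectors have no defined similarity; treat them as dissimilar.
        return denominator == 0 ? 0 : dot / denominator;
    }

    /** A distance in [0, 1] derived from the similarity, convenient for clustering. */
    public static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1.0 - tanimoto(a, b);
    }
}
```

Using 1 − similarity as the distance is one common choice; whether the attachment does the same is not stated here.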

And as the filename hints, I thought it would be fun to demonstrate the similarity by adding
a very simple two dimensional decision tree clusterer.

For the test I feed it 17 news articles representing 3 news stories that I got from Google
News. Also attached is a Graphviz diagram that shows the tree with the news stories clustered
together. I have not yet looked at how to draw the line between the clusters, but I could probably
come up with something simple enough. Legend: the floating-point numbers represent the distance between
two children. The leaves are the actual articles, prefixed with the news story identity and suffixed
with the news article identity.
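
To make the shape of such a tree concrete, here is a rough sketch of a bottom-up clusterer that repeatedly merges the two closest nodes and prints the result in Graphviz DOT format, with inner nodes labeled by the distance between their two children. It is a generic average-linkage approach, not the algorithm in the attachment; the class ClusterTreeSketch, its methods, and the precomputed distance matrix are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ClusterTreeSketch {

    /** Either a leaf (one article) or an inner node joining two children. */
    public static final class Node {
        public final String label;     // non-null for leaves only
        public final Node left, right; // non-null for inner nodes only
        public final double distance;  // distance between the two children
        public final List<Integer> members = new ArrayList<>();

        Node(String label, int index) {
            this.label = label; this.left = null; this.right = null; this.distance = 0;
            members.add(index);
        }

        Node(Node left, Node right, double distance) {
            this.label = null; this.left = left; this.right = right; this.distance = distance;
            members.addAll(left.members);
            members.addAll(right.members);
        }
    }

    /** Average pairwise distance between the members of two nodes. */
    static double linkage(Node a, Node b, double[][] d) {
        double sum = 0;
        for (int i : a.members) for (int j : b.members) sum += d[i][j];
        return sum / (a.members.size() * b.members.size());
    }

    /** Builds the tree from leaf labels and a symmetric distance matrix. */
    public static Node cluster(String[] labels, double[][] d) {
        if (labels.length == 0) throw new IllegalArgumentException("no documents");
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) nodes.add(new Node(labels[i], i));
        while (nodes.size() > 1) {
            int bestI = 0, bestJ = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < nodes.size(); i++) {
                for (int j = i + 1; j < nodes.size(); j++) {
                    double dist = linkage(nodes.get(i), nodes.get(j), d);
                    if (dist < best) { best = dist; bestI = i; bestJ = j; }
                }
            }
            Node merged = new Node(nodes.get(bestI), nodes.get(bestJ), best);
            nodes.remove(bestJ); // remove the higher index first so bestI stays valid
            nodes.remove(bestI);
            nodes.add(merged);
        }
        return nodes.get(0);
    }

    /** Renders the tree as DOT. Labels are not escaped in this sketch. */
    public static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph clusters {\n");
        write(root, sb, new int[] {0});
        return sb.append("}\n").toString();
    }

    private static String write(Node n, StringBuilder sb, int[] counter) {
        String id = "n" + counter[0]++;
        String text = n.label != null ? n.label : String.format(Locale.ROOT, "%.3f", n.distance);
        sb.append("  ").append(id).append(" [label=\"").append(text).append("\"];\n");
        if (n.label == null) {
            String leftId = write(n.left, sb, counter);
            String rightId = write(n.right, sb, counter);
            sb.append("  ").append(id).append(" -> ").append(leftId).append(";\n");
            sb.append("  ").append(id).append(" -> ").append(rightId).append(";\n");
        }
        return id;
    }
}
```

The pair search is repeated from scratch after every merge, so this is only practical for small inputs such as the 17 articles mentioned above.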

(The clusterer sure needs optimization; use Carrot instead. This is just me fooling around.)

Have fun!

> TermVectorAccessor, transparent vector space access 
> ----------------------------------------------------
>                 Key: LUCENE-1016
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Term Vectors
>    Affects Versions: 2.2
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: LUCENE-1016-clusterer.txt, LUCENE-1016-Tanimoto.txt, LUCENE-1016.txt,
> This class visits TermVectorMapper and populates it with information transparently, either by
passing it down to the default terms cache (documents indexed with Field.TermVector) or by
resolving the inverted index.
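
The fallback described above can be illustrated with a toy model. This sketch does not use the Lucene API: ToyIndex, TermFrequencyVisitor, and accept are hypothetical names invented here, with in-memory maps standing in for the stored term vectors and the inverted index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class AccessorSketch {

    /** Hypothetical callback standing in for the role of a term vector mapper. */
    public interface TermFrequencyVisitor {
        void visit(String term, int frequency);
    }

    /** Hypothetical toy index: optional stored vectors plus postings (term -> doc -> freq). */
    public static final class ToyIndex {
        public final Map<Integer, Map<String, Integer>> storedVectors = new HashMap<>();
        public final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

        public void add(int doc, boolean storeVector, String... tokens) {
            Map<String, Integer> freqs = new TreeMap<>();
            for (String token : tokens) {
                freqs.merge(token, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                postings.computeIfAbsent(e.getKey(), k -> new HashMap<>()).put(doc, e.getValue());
            }
            if (storeVector) {
                storedVectors.put(doc, freqs);
            }
        }
    }

    /**
     * Feeds the visitor the term frequencies of one document. Uses the stored
     * vector when there is one, otherwise scans every term's postings for the document.
     */
    public static void accept(ToyIndex index, int doc, TermFrequencyVisitor visitor) {
        Map<String, Integer> stored = index.storedVectors.get(doc);
        if (stored != null) {
            stored.forEach(visitor::visit);
            return;
        }
        for (Map.Entry<String, Map<Integer, Integer>> e : index.postings.entrySet()) {
            Integer freq = e.getValue().get(doc);
            if (freq != null) {
                visitor.visit(e.getKey(), freq);
            }
        }
    }
}
```

The caller sees the same terms and frequencies either way; the second path is simply slower because it walks the whole term dictionary.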

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
