lucene-dev mailing list archives

From "Fernando" <>
Subject Related terms and SOM capabilities in Lucene
Date Thu, 06 Feb 2003 07:10:49 GMT
Hi all

I am a newcomer to the Lucene project. Since 1990 or so, I have been a very
confident user and developer of the CPL text engine, the first one to use
relevance ranking, from the company PLS, now part of AOL, which keeps the
software in the public domain but has left it frozen since 1997. While
searching for a substitute for CPL, I found Lucene. What I am looking for in
Lucene are mainly the following two features:

1) To supply an ordered list of related terms for a given query, as CPL
does, and very efficiently. You can check it at our following URL . If you try it, you will have to install
the SVG plugin from Adobe, the software used for the graphics, and select the
Agencia EFE database, since the News 91 one is not active. This is only a demo
based on a set of Spanish news from 1992 (32,000 docs in 90 MB). Sorry that it
is in Spanish. You can try the word "futbol", and you will see that it
dynamically suggests related words such as UEFA, FIFA and Havelange, so that
you can explore the connections between them. This form of relevance feedback
is very good because it gives the user better control over the process of
searching the database, and

2) To supply what CPL calls a weighted list of related/significant terms
for a given doc within the db. This function is very useful as initial
learning data if one wishes to build a SOM (Self-Organizing Map) from it.
In fact, the different versions of SOM for clustering documents always begin
with a set of multidimensional vectors representing the docs in the db.
This set of vectors is then processed using certain known algorithms
(mainly Kohonen's) for clustering. From this clustering it is possible to
build a graphical interface such as knowledge trees or islands. Do you know
if that is possible from the data that is stored in Lucene indexes?
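
To make the first feature concrete, here is a minimal sketch in Python of the
kind of computation I mean. It is only an illustration under my own
assumptions: the toy corpus, the names build_index and related_terms, and the
scoring formula (share of the hits containing a term, times its idf) are all
hypothetical. It is not CPL's algorithm and it does not use any Lucene API.

```python
import math

# Toy corpus standing in for an index (hypothetical data, illustration only).
DOCS = {
    1: "futbol uefa final madrid",
    2: "futbol fifa havelange mundial",
    3: "futbol uefa fifa sancion",
    4: "economia bolsa madrid",
    5: "economia inflacion bolsa",
}

def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def related_terms(index, n_docs, query_term, top_n=5):
    """Rank the terms that co-occur with query_term in the matching docs.

    Assumed score: (fraction of the hits containing the term) * idf(term),
    so terms common in the hits but rare in the whole db come first.
    """
    hits = index.get(query_term, set())
    if not hits:
        return []
    scores = {}
    for term, postings in index.items():
        if term == query_term:
            continue
        overlap = len(postings & hits)
        if overlap == 0:
            continue  # the term never appears together with the query term
        idf = math.log(n_docs / len(postings))
        scores[term] = (overlap / len(hits)) * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

INDEX = build_index(DOCS)
RELATED = related_terms(INDEX, len(DOCS), "futbol")
```

In this toy data, "fifa" and "uefa" come out on top for "futbol", and terms
from the economy docs never appear. The only inputs are the postings of each
term and the number of docs, which is the kind of data an inverted index holds.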

The question for the developers is whether these features can be easily
generated from Lucene indexes.
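
For the second feature, the following Python sketch shows the pipeline I have
in mind: a weighted term list per doc, used as the input vector for a small
Kohonen-style map. Everything here is an assumption for illustration: the toy
docs, the tf-idf weighting, the one-dimensional map of two units, and the
learning-rate and radius schedules. It is not CPL's weighting, and it reads
nothing from a Lucene index.

```python
import math
import random
from collections import Counter

# Toy documents (hypothetical data, illustration only).
DOCS = [
    "futbol uefa futbol final",
    "futbol fifa havelange",
    "bolsa economia inflacion",
    "economia bolsa bolsa",
]

def tfidf_vectors(docs):
    """Return (vocabulary, one unit-length tf-idf vector per doc)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * math.log(1 + n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def significant_terms(vocab, vec, top_n=3):
    """Weighted list of the most significant terms of one doc vector."""
    ranked = sorted(zip(vocab, vec), key=lambda p: (-p[1], p[0]))
    return [(t, w) for t, w in ranked[:top_n] if w > 0]

def best_unit(units, vec):
    """Index of the map unit closest to vec (squared Euclidean distance)."""
    return min(range(len(units)),
               key=lambda i: sum((u - x) ** 2 for u, x in zip(units[i], vec)))

def train_som(vectors, n_units=2, epochs=50, seed=0):
    """Train a 1-D self-organizing map: a line of n_units weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1 - frac)                      # learning rate decays to 0
        radius = max(n_units / 2 * (1 - frac), 0.5)  # neighbourhood shrinks
        for vec in vectors:
            bmu = best_unit(units, vec)
            for i, unit in enumerate(units):
                # Units near the best-matching unit on the line move more.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    unit[k] += rate * influence * (vec[k] - unit[k])
    return units

VOCAB, VECTORS = tfidf_vectors(DOCS)
UNITS = train_som(VECTORS)
```

Calling significant_terms(VOCAB, VECTORS[0]) gives the weighted term list for
the first doc, and best_unit(UNITS, vec) gives the map cell a doc falls into.
The statistics needed are the term frequency within each doc, the document
frequency of each term, and the number of docs.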

