lucene-dev mailing list archives

From "Adam Saltiel" <>
Subject RE: -> Grouping Search Results by Clustering Snippets:
Date Sun, 30 Jan 2005 04:12:37 GMT
I am very excited by your proposal.
I have looked at your web site and, as far as I can follow it, it looks
very interesting.
I notice that it involves text mining and clustering, and that an
indexer is used as well as clustering algorithms. This bears on a point
I make below: I am unsure how much duplication is entailed when such a
system works with Lucene.
It also seems to me that you use the approach outlined below of SVD and
matrix dimension reduction. I raise a few points about this.

I have read the PDF briefly.
I will try some searches and see what I think.
Reading about this on a link supplied by Furl to a Scientific American
article, it seems that the closest thing to what we are discussing is
implicit search, which will probably be a feature in Longhorn. That is,
more or less, restricted to the desktop. Of course, what we want is a
toolkit, so that we can offer others what they want, rather than the
rather prescriptive lowest-common-denominator approach that
characterises MS. The article goes on to discuss explicitly what we have
mentioned, pointing out its inherent difficulty. I actually think it
makes too much of the difficulty of gathering information on a user.
I have to say the paradigm use case I have in mind would be, say, that
of a nurse or a nurse in training. But this already tells us a lot about
that person, and there would be a lot more that would be easy to
establish, e.g. where they work, where they are in the course, what
other courses they have already taken, and so on. In fact, the
difficulty would be in how to usefully assemble such an abundance of
information, rather than in finding it in the first place.
A year ago I was involved in a similar thread on this list ("RE:
Probabilistic Model in Lucene - possible?") and, although I was invited
by Herb Chong to pursue it privately, I was unable to, for a very simple
reason (time availability aside): I do not have sufficient knowledge of
the maths and, more particularly, of the means necessary to code from
the formulae to Java. I guess I am not alone in this.
Now I know that there is some open source code out there (license
compatibility?): Carrot2, Encore, and the LSI code being developed by
the National Institute for Technology and Liberal Education (NITLE) as
CNG (Contextual Network Graphs), led by Maciej Ceglowski. That is C++
and Perl, but a hook is available in Java; their site has a very good
synopsis.
One point that emerges is that an LSI implementation is probably built
from the ground up, implying, I believe, that using Lucene would require
overlooking some features and re-engineering others. Note the point
about concept maps under mds_applications.htm.
I'm sure there are other open source implementations I haven't
mentioned.

Joaquin,
Aside from deriving a suitable function to express how data will be
indexed and manipulated, an essential technology is a means of matrix
dimension reduction. Matrix dimension reduction (and its two-dimensional
counterpart, multi-dimensional scaling) serves to amplify features,
allowing them to be compared. SVD and similar techniques (LSI, but
possibly not CNG) suffer from inflexibility in that the matrix must be
rebuilt on the subtraction of data (the removal of a document from a
term-document matrix). There are techniques to deal with this but, from
what I can see, they further add to the expense of the technique.
Sparse random indexes do not suffer from this problem while, I believe,
yielding similar results. I think they work by having a
restricted-dimensional array of (perhaps) 1,500 dimensions populated
with random noise, within which a tally of the terms in a document is
stored as sets of eight bits. (The techniques described below have been
modified by experimentation; one criticism of statistical techniques is
that experimentation must be relied on so much.) Without re-reading,
this is as far as I remember. However, the dimension reduction happens
at matrix creation; there is no further intervening step.

  Latent Semantic Analysis/Indexing (Landauer & Dumais 1997):

    54,000 x 37,600 --> SVD --> 54,000 x 300

  The  54,000 x 37,600 words-by-documents matrix F is reduced to,
  say, 54,000 x    300 with Singular-Value Decomposition (SVD).
  Each word is then represented by a 300-dimensional vector
  (down from 37,600-D).  Words with similar meaning will have
  vectors with a high correlation (cosine), and the test is made
  by comparing the correlations.  Among the four choices in each
  test question, the one with the highest correlation wins--it's
  the machine's answer.


  SVD is computationally "heavy," in terms of both memory size
  and compute time (approx. 3 hrs. for this problem).  SVD for
  a collection of a million documents is impractical today.


  Jiang & Littman (2000; includes a beautiful mathematical summary
  of vector-based information retrieval).
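To make the SVD step above concrete, here is a minimal sketch (mine, not
from the quoted paper; the toy matrix sizes, the numpy-based pipeline,
and all names are illustrative) of reducing a small words-by-documents
count matrix with truncated SVD and comparing word vectors by cosine
correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy words-by-documents count matrix (stand-in for 54,000 x 37,600).
F = rng.poisson(0.3, size=(200, 50)).astype(float)

# Truncated SVD: keep only the top r singular values/vectors
# (stand-in for reducing 37,600-D word vectors to 300-D).
r = 10
U, s, Vt = np.linalg.svd(F, full_matrices=False)
word_vecs = U[:, :r] * s[:r]   # each word -> r-dimensional vector

def cosine(a, b):
    # Correlation (cosine) between two word vectors; the epsilon
    # guards against all-zero rows.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Words with similar usage patterns get a high cosine; a test question
# would be answered by picking the choice with the highest cosine.
sim = cosine(word_vecs[0], word_vecs[1])
assert -1.0 <= sim <= 1.0
```

Even at this toy size, the full SVD is the dominant cost, which is the
"heaviness" the summary refers to.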

Complementary Answer:  RANDOM INDEXING

  54,000 x 37,600  --> Random Indexing -->  54,000 x 4,000

  Based on Sparse Distributed Representation.

  Instead of collecting word-use statistics into a 54,000 x 37,600
  (M x N) words-by-documents matrix F, collect them into a
  54,000 x 4,000 (M x N') matrix G.  How?

  0.  Set G = 0.

  1.  For each document d, make an N'-dimensional (e.g., 4,000-D)
      ternary INDEX VECTOR i_d by placing a small number k of
      -1s and +1s (e.g., ten of each; k = 10) at random among
      the 4,000; the rest will be 0s:
      i_d = 0000000+000...000+000000-000...000,
            that's 4,000 "trits," 3,980 of them 0s.

  2.  Scan through document d.  Every time the word w appears,
      add the index vector i_d to row w of matrix G.

  NOTE: The same procedure will produce the matrix F if the 37,600
     index vectors i_d are 37,600-bit and have only one 1, in the
     d-th position (i.e., unary or LOCAL representation):
     i_d = 00000000000...000+0000000000...000000000000000000000000000,
           that's 37,600 bits, 37,599 of them 0s.


  If the N random index vectors for documents are assembled into an
  N x N' matrix D, whose row d is the index vector for document d
  (i.e., D_d. = i_d), then G = FD.  The matrix D is "P-unitary,"
  meaning that it is probabilistic and DD` ~ I times a constant (2k;
  ` means `transpose'), so that F ~ GD`/2k (also, D`D ~ I times a
  constant).  Thus, F and G have approximately the same information.
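Steps 0-2 and the identity G = FD can be sketched as follows (a
toy-sized illustration I put together, not code from the paper; the
sizes and variable names are my own). It also shows the flexibility
point raised earlier: removing a document is a cheap update to G, with
no rebuild of the kind SVD requires.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, Nprime, k = 100, 40, 50, 5   # words, docs, reduced dim, +/-1 count

# Step 1: one ternary index vector per document, with k +1s and k -1s
# at random positions and the rest 0s. Row d of D is i_d.
D = np.zeros((N, Nprime))
for d in range(N):
    pos = rng.choice(Nprime, size=2 * k, replace=False)
    D[d, pos[:k]] = 1.0
    D[d, pos[k:]] = -1.0

# Ordinary words-by-documents count matrix F (toy random counts).
F = rng.poisson(0.5, size=(M, N)).astype(float)

# Steps 0 and 2 amount to G = F D: every occurrence of word w in
# document d adds i_d to row w of G.
G = F @ D

# D D' ~ 2k * I ("P-unitary"), so F ~ G D' / 2k.
F_hat = G @ D.T / (2 * k)

# Removing a document: because G is just a sum of per-occurrence index
# vectors, subtract document d's contribution; no matrix rebuild.
d = 3
G_removed = G - np.outer(F[:, d], D[d])
```

Here the diagonal of D D' is exactly 2k (each i_d has exactly 2k
nonzero entries of magnitude 1); the approximation lies in the small
random off-diagonal cross-terms.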

This is good stuff, the paper is worth the read.
I do not have the means to evaluate the Encore code; from inspection it
does not look robust, but I don't know. Nor do I know whether it *fully*
implements a sparse random array; I believe not, but rather SVD in
conjunction with a disc-based inverted file structure (that is, not a
matrix held in memory). The researcher Martin Svensson's home page is
I don't know why they don't use random indexing, nor whether they
should. They are both at the same institute, though.
Inspecting the code for the recommender system, it should accomplish
much, or most, of what has been mentioned so far here. How far it
crosses over to document indexing is unclear to me. The demo does use an
indexed document, but that should be just for the purposes of the demo,
decoupled from the function of the recommender.

I was unaware of the site Good luck with it!
A few points. I haven't meant to imply that I have any solutions up my
sleeve. I only wish! All I can do is try to nudge the discussion along
and maybe do some testing and integration later. I am tempted to use
it. I use Furl and, in fact, used it in composing this email, so when it
comes to sharing links and memory it works for me. I haven't used Simpy,
but I can offer a few criticisms of Furl.
1. Some of the UI is off, such as reposts of data leading to error
pages, and drop-down lists that are difficult to read.
2. The 1-to-5 ranking system seems like nonsense to me. I just put
everything in at 5. If I didn't think a link useful, I wouldn't have
bothered to save it in the first place.
3. There are a few nice touches to do with similar or identical pages
saved by others, which is what Amazon does with books. They seem a bit
hidden and obscure, though.
4. I find it deficient that there is no way to create hierarchical
categories.
5. Categories are entirely of one's own creation. This militates against
the idea that others might have something in common with you; after all,
the first place to find that would be by looking at each user's
categories.
This last point brings me on to something else. Obviously a recommender
system has something in common with the idea behind a FOAF ontology.
Alan Rector (a leading professor in the field at Manchester University)
has said that to attempt to find categories for an ontology using
statistics would be like alloying gold with brass. Perhaps he is right;
I think that remains to be seen. But it is clear to me that the two
types of technique operate in complementary spaces. Statistical
techniques should be used as a bottom-up approach, useful when it is
very difficult to anticipate the nature of the textual information
beforehand: they work by discovering patterns and creating mappings
between those patterns.
The top-down approach models a priori knowledge in terms of logical
relations expressed in ontologies. This method can therefore be used to
model the control of flow between pages, page elements, or *text
fragments* at the web tier. In other words, it is important to be able
to create a system that utilises a priori knowledge. My nurse example
illustrates this. It should run: given the context as defined by
ontological deduction about this person, where they are in their
course, and other factors, find relevant information. Depending on
context, this might be on either a push or a pull basis. At some point
there would have to be an interaction between ontological terms and the
concepts suggested in returned searches. This takes the idea suggested
by NITLE further. They suggest that the user might wish to interact
with the 2D representation of the semantic space and rearrange it
according to their understanding and current needs; I am saying yes,
and perhaps guided by an ontology, or, in doing this, modifying an
ontology?


> -----Original Message-----
> From: Joaquin Delgado []
> Sent: Friday, January 28, 2005 9:41 PM
> To: Lucene Developers List;
> Subject: RE: -> Grouping Search Results by Clustering Snippets:
> This is a very interesting thread. Please find attached a paper I
> published many years ago (1998) about RAAP, a bookmark recommender:
> DELGADO, J., ISHII, N. and URA, T., "Content-based Collaborative
> Information Filtering: Actively Learning to Classify and Recommend
> Documents", in M. Klusch, G. Weiß (Eds.), Cooperative Information
> Agents II: Learning, Mobility and Electronic Commerce for Information
> Discovery on the Internet, Springer-Verlag, Lecture Notes in
> Artificial Intelligence No. 1435, 1998.
> I'd like your opinion on the topic clustering you can find at
> This one uses title and snippets from the external engines and
> "concepts" from the documents that are indexed by the system.
> Cheers,
> -- Joaquin
> -----Original Message-----
> From: Otis Gospodnetic []
> Sent: Friday, January 28, 2005 2:47 PM
> To: Lucene Developers List;
> Subject: RE: -> Grouping Search Results by Clustering Snippets:
> This is very much of interest to me.  Although it's not in the UI, I
> did integrate Lucene and Carrot2 in Simpy ( ).
> Clustering is currently triggered only by a search.  Although you may
> not be able to tell (again, sucky UI) Simpy is designed in a way that
> will let me hook in a recommender system, much like you describe it.
> Users store links into their Simpy accounts, they tag them, perform
> searches, find other users, add them to their Topics (Simpy-specific
> thing), and so on, so there is a lot of knowledge about a user that
> can be derived from all that.  Currently, the only quasi-smart thing that
> goes beyond a simple search is 'More users like this', and even that
> has a small bug that I need to fix for the next release, but what you
> are describing sounds very much like one of the directions in which I
> want to take Simpy and its users. :)
> Otis
> --- Adam Saltiel <> wrote:
> > This has been implemented in open source, but not with lucene?
> >
> > and
> >
> > David Weiss is a Polish academic at Poznan University, Poland. He and
> > others have implemented a servlet-based web app that uses pipelined
> > components which communicate over HTTP and implement a couple of
> > clustering algorithms.
> > Clustering, of course, can go way beyond search result presentation,
> > and there are some very suggestive examples at
> > there are some very suggestive examples at
> >
> > where the Encore project (Martin Svensson) is based on orthogonal
> > transformations of a large sparse matrix (a possible method for
> > matrix dimension reduction). I think it would be interesting to hook
> > a recommender system into Lucene; clustering would then take place
> > on the basis of a user profile, which might be built up automatically
> > by accumulating clicks and comparing to other visitors, with some
> > intelligent weighting of node inputs.
> > This calls into question what a search really is: does it have to be
> > instigated by the user, or might their context and history suggest
> > enough to pull in additional material? So this would sit on top of
> > snippets and also influence which snippets are returned, as well as
> > their presentation.
> > Cooler still would be to be able to recognise the user without a
> > login. This might be implemented with cookies; but to deal with the
> > user in terms of types of interests, a series of faceted profiles
> > would be needed, so that portals could become fluidly dynamic. It
> > sounds far flung, but I actually think it is just around the corner.
> > Let me know if this is of interest.
> >
> > Adam
> >
> > > -----Original Message-----
> > > From: integer [daniel prawdzik] []
> > > Sent: Wednesday, January 26, 2005 5:17 PM
> > > To:
> > > Subject: -> Grouping Search Results by Clustering Snippets:
> > >
> > > Grouping Search Results by Clustering Snippets:
> > >
> > > The presentation of search engine results is typically a long,
> > > unsorted list. Finding the page you're looking for is often
> > > time-consuming and unsatisfying.
> > > Showing the results in groups of similar topics is a much more
> > > suitable way to give a user a quick overview of the results.
> > > This can be done with a technique called cluster analysis. I am
> > > currently working on my diploma (master's) thesis on this topic.
> > > In my view, it's too good to be destined only for the archive, so
> > > I want to implement this feature in open-source software. The
> > > coding of this program has already come pretty far; I have done
> > > some tests, and the results are impressive and might still get
> > > better [you can see some results on -> sorry, only in German]
> > >
> > > To make a long story short:
> > > I'm wondering: is this an attractive feature for the Lucene
> > > community?
> > >
> > > regards,
> > > integer
> > >
> > >
> > >
> >
> > > To unsubscribe, e-mail:
> > > For additional commands, e-mail:
> >
> >
> >
> >
> >
> >

