lucene-general mailing list archives

From Kaspar Fischer <kaspar.fisc...@dreizak.com>
Subject Keywords indexing, "top words", and co-occurrence
Date Fri, 18 Dec 2009 14:10:37 GMT
Hi everybody,

I need to do some text analysis and am looking for a software library (in Java, preferably)
to use for this. Lucene came to mind first, but I actually hope that there is some library
(based on Lucene, for example) that solves the problems directly.

What I want to do is the following:

1. In documents that get added to the system I need to find keywords from a predefined, fixed
set of keywords. For example, the user will make a query for all documents containing the
word "traffic" (this word need not be a keyword) and I want to show the number of keyword
hits in all documents that contain "traffic":

- car, cars, automobile, automobiles (3)
- - Mercedes (2)
- - Ferrari (2)
- train, trains (4) // one doc contains "TGV", 3 contain "train" or "trains"
- - TGV (1)
- - ICE (0)
- plane, planes (5)
- - Boeing (4)
- - Airbus (1)

In short: I want to count keyword hits in the documents returned by some query. Notice that
the keywords are hierarchically organized and may have synonyms ("car" = "cars" = "automobile").
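To make the counting in point 1 concrete, here is a rough plain-Java sketch of what I have in mind. It uses no Lucene calls at all; the class and method names (KeywordCounts, Keyword, tokens, matches, count) are made up for illustration, and the tokenizer is deliberately naive. A document counts for a keyword if it contains any of its synonyms or any keyword below it in the hierarchy, which is how "train, trains (4)" above includes the "TGV" document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: counts keyword hits in the documents returned by a query.
class KeywordCounts {

    /** A node in the keyword hierarchy: a set of synonyms plus child keywords. */
    static final class Keyword {
        final Set<String> synonyms = new LinkedHashSet<>();
        final List<Keyword> children = new ArrayList<>();

        Keyword(String... terms) {
            for (String t : terms) {
                synonyms.add(t.toLowerCase(Locale.ROOT));
            }
        }

        Keyword add(Keyword child) {
            children.add(child);
            return this;
        }
    }

    /** Naive tokenizer (an assumption): lower-case, split on anything that is not a letter or digit. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if the document mentions the keyword itself or any keyword below it. */
    static boolean matches(Keyword k, Set<String> docTokens) {
        for (String s : k.synonyms) {
            if (docTokens.contains(s)) {
                return true;
            }
        }
        for (Keyword c : k.children) {
            if (matches(c, docTokens)) {
                return true;
            }
        }
        return false;
    }

    /** Number of documents (already filtered by the user's query) that match the keyword's subtree. */
    static int count(Keyword k, List<String> docs) {
        int n = 0;
        for (String d : docs) {
            if (matches(k, tokens(d))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Keyword train = new Keyword("train", "trains")
                .add(new Keyword("TGV"))
                .add(new Keyword("ICE"));
        List<String> hits = Arrays.asList(
                "Traffic update: the train was late",
                "The TGV avoids road traffic");
        System.out.println("train, trains: " + count(train, hits));
    }
}
```

Because matching is case-insensitive here, a keyword like "ICE" would also match the ordinary word "ice"; a real solution would need something smarter, and presumably an index rather than re-tokenizing every document.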

2. If the user queries for free-input word A ("hamburger", say) I want to find all keywords
(from the above hierarchy) that are close to "hamburger" in some sense (word-distance or some
similar measure of distance in text) and order them by number of occurrences.
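For point 2, one possible reading of "close" is "within a fixed number of words". The following plain-Java sketch illustrates that reading only; it is not a claim about how Lucene would do it. The names (NearbyKeywords, tokenize, near) and the window parameter are invented for the example, and the keywords are assumed to be lower-case single words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: ranks keywords by how often they occur near a free-input word.
class NearbyKeywords {

    /** Naive tokenizer (an assumption) that keeps the tokens in document order. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /**
     * For each keyword, counts how often it appears at most `window` tokens away
     * from an occurrence of `query`. Returns the keywords with at least one hit,
     * most frequent first; ties are broken alphabetically.
     */
    static List<Map.Entry<String, Integer>> near(
            String query, Set<String> keywords, List<String> docs, int window) {
        String q = query.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> toks = tokenize(doc);
            for (int i = 0; i < toks.size(); i++) {
                if (!toks.get(i).equals(q)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(toks.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    String t = toks.get(j);
                    if (j != i && keywords.contains(t)) {
                        counts.merge(t, 1, Integer::sum);
                    }
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }
}
```

Synonyms and the hierarchy from point 1 are ignored here; they could be folded in by mapping each matched term to its keyword node before counting.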

Can this be done in Lucene? Or do you know of any frameworks that achieve such results?

Regarding size, I expect the queries (for "traffic" in 1., or "hamburger" in 2.) to return
at most 500 documents and each document to contain at most 50 keywords.

Many thanks,
Kaspar