lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Keywords indexing, "top words", and co-occurrence
Date Fri, 18 Dec 2009 18:52:07 GMT
Yes, Lucene will help you do this.  It won't do exactly what you want
without some effort on your part.

Sounds like what you want to do is

a) get a book on Lucene and SOLR

b) use standard indexers and a synonym lookup to produce multiple fields,
one based on the original text and one on the synonym-normalized text

c) use SOLR's support for faceting to get the counts you are after.
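
To make (b) and (c) a bit more concrete, here is a minimal sketch in plain
Python. It is not Lucene or SOLR code; it only imitates the idea of a
synonym-normalized keyword field plus facet counts restricted to the documents
that match a query. The keyword table, the hierarchy, and the names
`keyword_field` and `facet_counts` are all made up for illustration, and the
assumption that a parent keyword counts every document containing one of its
children is taken from the "train, trains (4)" example in the question.

```python
import re
from collections import Counter

# Hypothetical synonym table: surface form -> canonical keyword.
SYNONYMS = {
    "car": "car", "cars": "car", "automobile": "car", "automobiles": "car",
    "train": "train", "trains": "train",
    "plane": "plane", "planes": "plane",
    "mercedes": "mercedes", "ferrari": "ferrari",
    "tgv": "tgv", "ice": "ice",
    "boeing": "boeing", "airbus": "airbus",
}

# Hypothetical hierarchy: child keyword -> parent keyword.
PARENT = {
    "mercedes": "car", "ferrari": "car",
    "tgv": "train", "ice": "train",
    "boeing": "plane", "airbus": "plane",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_field(text):
    """Return the set of canonical keywords in a document.

    Each keyword also contributes its ancestors, so a document that
    mentions only "TGV" still counts towards "train".
    """
    found = set()
    for token in tokenize(text):
        keyword = SYNONYMS.get(token)
        while keyword is not None:
            found.add(keyword)
            keyword = PARENT.get(keyword)
    return found


def facet_counts(docs, query_term):
    """Count, per keyword, the documents that contain query_term."""
    counts = Counter()
    for text in docs:
        if query_term.lower() in tokenize(text):
            # A set is passed in, so each keyword is counted once per document.
            counts.update(keyword_field(text))
    return counts
```

Calling `facet_counts(docs, "traffic")` on a list of document strings returns a
`Counter` keyed by canonical keyword. In a real setup the keyword field would
be built at index time by the analyzer chain and the counting would be left to
the faceting component, so none of this would be hand-written.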

On Fri, Dec 18, 2009 at 6:10 AM, Kaspar Fischer
<kaspar.fischer@dreizak.com> wrote:

> Hi everybody,
>
> I need to do some text analysis and am looking for a software library (in
> Java, preferably) to use for this. Lucene came to my mind first, but I
> actually hope that there is some library (based on Lucene, for example) that
> solves the problems directly.
>
> What I want to do is the following:
>
> 1. In documents that get added to the system I need to find keywords from a
> predefined, fixed set of keywords. For example, the user will make a query
> for all documents containing the word "traffic" (this word need not be a
> keyword) and I want to show the number of keyword hits in all documents that
> contain "traffic":
>
> - car, cars, automobile, automobiles (3)
> - - Mercedes (2)
> - - Ferrari (2)
> - train, trains (4) // one doc contains "TGV", 3 contain "train" or
> "trains"
> - - TGV (1)
> - - ICE (0)
> - plane, planes (5)
> - - Boeing (4)
> - - Airbus (1)
>
> In short: I want to count keyword hits in the documents returned by some
> query. Notice that the keywords are hierarchically organized and may have
> synonyms ("car" = "cars" = "automobile").
>
> 2. If the user queries for free-input word A ("hamburger", say) I want to
> find all keywords (from the above hierarchy) that are close to "hamburger"
> in some sense (word-distance or some similar measure of distance in text)
> and order them by number of occurrences.
>
> Can this be done in Lucene? Or do you know of any frameworks that achieve
> such results?
>
> Regarding size, I expect the queries (for "traffic" in 1., or "hamburger"
> in 2.) to return at most 500 documents and each document to contain at most
> 50 keywords.
>
> Many thanks,
> Kaspar




-- 
Ted Dunning, CTO
DeepDyve
