lucene-java-user mailing list archives

From Uncle
Subject Reverse keyword search?
Date Fri, 27 Apr 2012 11:35:39 GMT

I am relatively new to Lucene, so this might be a noob question; if so, please redirect me. I'd
like some guidance on how to use Lucene to address a problem.

I have a set of a few hundred (and growing) user-defined keywords such as "spain" and "volkswagen",
each of which is associated with one of about 20 categories, such as "world" and "automotive".
My challenge is to use the summary (title, description, caption, meta-tags, keywords, but
not the entire content) from a news article such as what you might find on and look
for those keywords in the article, to identify the article's category. The article's summary
is often "dirty" with special characters, commas, hash tags, etc. and so needs to be tokenized.
I would also like to utilize Lucene's natural language processing to match "spanish" to "spain"
for example.

This appears to be somewhat the reverse of the typical Lucene use case -- rather than having
a set of, say, 1000 indexed articles and issuing a query with a few keywords to search
those articles, I have a set of, say, 1000 keywords and a single article, and I want to
determine which keyword best fits the article's summary. How can I best use Lucene to
handle this?

I have considered:

1) Creating a Lucene index of the keywords and topics, then tokenizing the summaries using
Lucene's tokenizers, then issuing queries with the tokens to find the best match
2) Indexing the article summary, then iterating over all of the keywords, issuing a query
for each of them, then keeping the best match.
3) Learning how Lucene does the individual keyword-to-keyword matching and writing some custom
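
To make option 1 concrete, here is a minimal sketch of the idea in plain Java. It does not use
Lucene's API: the regex split stands in for a Lucene tokenizer, and the variant table
(mapping "spanish" to "spain") is a hypothetical stand-in for whatever stemming or synonym
analysis Lucene would provide. The class name, method names, and sample data are all
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hand-rolled keyword lookup, not Lucene code.
class KeywordCategorizer {
    // "Index" of user-defined keywords: lowercase keyword -> category.
    private final Map<String, String> keywordToCategory = new HashMap<>();
    // Hypothetical variant table standing in for stemming/synonym analysis.
    private final Map<String, String> variantToKeyword = new HashMap<>();

    void addKeyword(String keyword, String category) {
        keywordToCategory.put(keyword.toLowerCase(Locale.ROOT), category);
    }

    void addVariant(String variant, String keyword) {
        variantToKeyword.put(variant.toLowerCase(Locale.ROOT),
                             keyword.toLowerCase(Locale.ROOT));
    }

    // Tokenizes the summary, looks each token up among the keywords, and
    // returns the category with the most hits (empty if nothing matched).
    // Ties are broken arbitrarily.
    Optional<String> categorize(String summary) {
        Map<String, Integer> hits = new HashMap<>();
        // Split on any run of characters that are neither letters nor digits,
        // which drops commas, hash signs, and other "dirty" characters.
        String[] tokens = summary.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields a leading empty token for "#Spanish..."
            }
            String keyword = variantToKeyword.getOrDefault(token, token);
            String category = keywordToCategory.get(keyword);
            if (category != null) {
                hits.merge(category, 1, Integer::sum);
            }
        }
        return hits.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        KeywordCategorizer categorizer = new KeywordCategorizer();
        categorizer.addKeyword("spain", "world");
        categorizer.addKeyword("volkswagen", "automotive");
        categorizer.addVariant("spanish", "spain");

        String summary = "#Spanish economy: Madrid, Spain reacts to Volkswagen news";
        System.out.println(categorizer.categorize(summary).orElse("unknown"));
    }
}
```

With the sample data above, "spanish" and "spain" both count toward "world" and "volkswagen"
counts once toward "automotive", so the summary is assigned to "world". Counting hits per
category is the simplest possible scoring; a real Lucene-based version would presumably rely
on relevance scores instead.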

I'd appreciate it if someone could point me in the right direction.

