lucene-java-user mailing list archives

From "Steven A Rowe"
Subject RE: Using lucene to search a bunch of keywords?
Date Wed, 23 Jul 2008 21:09:34 GMT
Hi Ryan,

Well, at 100 million+ keywords, Lucene might be the right tool.

One thing that you might check out for the query side is Karl Wettin's recently committed
ShingleMatrixAnalyzer (not in any Lucene release yet - only on the trunk).

The JUnit test class TestShingleMatrixFilter has an example of splitting an input string into
"shingles" (a.k.a. "token n-grams"). In this example, the input string "please divide this
sentence into shingles" is converted into the following terms by requesting a minimum shingle
size of one token and a maximum of two tokens, and using the space character to join the tokens:

"please", "please divide", "divide", "divide this", "this", "this sentence", "sentence", 
"sentence into", "into", "into shingles", "shingles"
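
To make the shingling step concrete, here is a minimal sketch in plain Java. It does not use
ShingleMatrixFilter or any other Lucene class; the class name ShingleSketch and the method
shingles() are made up for illustration, and the only assumption is that tokens are separated
by whitespace. It reproduces the term list above for a minimum size of one and a maximum of two.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shingling step; not the Lucene API.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens, joined by a space.
    static List<String> shingles(String input, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                StringBuilder sb = new StringBuilder(tokens[start]);
                for (int k = start + 1; k < start + size; k++) {
                    sb.append(' ').append(tokens[k]);
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one shingle per line for the example sentence.
        for (String s : shingles("please divide this sentence into shingles", 1, 2)) {
            System.out.println(s);
        }
    }
}
```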

You could index your keywords list as-is with no tokenization; break up your queries using
a WhitespaceTokenizer connected to a ShingleMatrixFilter, with the minimum shingle size set
to one and the maximum set to the number of tokens in the keyword with the most tokens; and then
build a BooleanQuery with one clause per shingle, each set to BooleanClause.Occur.SHOULD.
Because each keyword is indexed as a single untokenized term, a clause matches only when its
shingle is exactly equal to a keyword.
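
The same idea can be sketched without Lucene, which may help when checking the logic on a small
list. In the sketch below a HashSet stands in for the untokenized keyword index, and looking up
each shingle in the set stands in for one SHOULD clause per shingle. The names
KeywordMatchSketch, maxTokens() and match() are hypothetical, and whitespace tokenization is
an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for "index keywords as-is, query with shingles".
class KeywordMatchSketch {

    // Number of tokens in the keyword with the most tokens (the maximum shingle size).
    static int maxTokens(Collection<String> keywords) {
        int max = 1;
        for (String keyword : keywords) {
            max = Math.max(max, keyword.trim().split("\\s+").length);
        }
        return max;
    }

    // Returns the keywords that exactly equal some shingle of the query, in query order.
    static List<String> match(Set<String> keywords, String query) {
        List<String> hits = new ArrayList<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return hits;
        }
        String[] tokens = trimmed.split("\\s+");
        int maxSize = maxTokens(keywords);
        for (int start = 0; start < tokens.length; start++) {
            for (int size = 1; size <= maxSize && start + size <= tokens.length; size++) {
                String shingle = String.join(" ", Arrays.copyOfRange(tokens, start, start + size));
                if (keywords.contains(shingle) && !hits.contains(shingle)) {
                    hits.add(shingle);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("big boy", "red ball", "computer"));
        System.out.println(match(keywords, "the girl likes the red ball and the computer"));
    }
}
```

Note that with exact matching like this, the query "the girl likes red balls" would not match
the keyword "red ball"; some normalization (for example stemming, applied to both the keywords
and the query tokens) would be needed for that, and it is not shown here.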


On 07/23/2008 at 4:05 PM, Ryan D wrote:
> Heh, actually I'm using Perl but I've always associated text-search with
> Lucene, I'm not sure if it's the best solution or not. On the small side
> there are 1.6 million keywords, on the large side there are well over
> 100 million but I might find another way to break down the searches into
> smaller searches (send A-G to server1, H-R to server2, etc.).
> Is there another search tool that might be better suited for this...the
> only thing I can relate this to is how AdWords works. A user enters a
> query in the Google search box and they search their database for people
> who've purchased those keywords to find the appropriate ads.  What I'm doing
> is similar but without the payday. :-{
> Currently I'm using a (huge) hash table and regular expressions
> ($query =~ /$keyword/) going down the list from largest to smallest
> but I know this is not a long term solution especially if I have to
> load the large 100 million+ list in.
> Thanks.
> On Jul 23, 2008, at 3:54 PM, Steven A Rowe wrote:
> > Hi Ryan,
> > 
> > I'm not sure Lucene's the right tool for this job.
> > 
> > I have used regular expressions and ternary search trees in the past to
> > do similar things.
> > 
> > Is the set of keywords too large for an in-memory solution like these? 
> > If not, consider using a tool like the Perl package Regex::PreSuf - it
> > can convert a list of strings into a compact set of alternations, which
> > you can then import into a Java program.  (I'm not aware of any similar
> > Java tools.)
> > 
> > Steve
> > 
> > On 07/23/2008 at 3:30 PM, Ryan Detzel wrote:
> > > Everything I've read and seen about Lucene is about searching for keywords in
> > > documents; I want to do the reverse. I have a huge list of
> > > keywords ("big boy", "red ball", "computer") and I have phrases that I
> > > want to check for those keywords. For example, using the small
> > > keyword list above (stored as documents in Lucene), what's the best
> > > approach to pass in a query "the girl likes red balls" and have it
> > > match the keyword "red ball"?
