lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Using lucene to search a bunch of keywords?
Date Wed, 23 Jul 2008 21:09:34 GMT
Hi Ryan,

Well, at 100 million+ keywords, Lucene might be the right tool.

One thing that you might check out for the query side is Karl Wettin's recently committed
ShingleMatrixAnalyzer (not in any Lucene release yet - only on the trunk).

The JUnit test class TestShingleMatrixFilter has an example of splitting an input string into
"shingles" (a.k.a. "token n-grams") - in this example the input string "please divide this
sentence into shingles" is converted into the following terms by requesting a minimum shingle
size of one token and a maximum of two tokens, and using the space character to join the tokens
together:

"please", "please divide", "divide", "divide this", "this", "this sentence", "sentence", 
"sentence into", "into", "into shingles", "shingles"
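
To make the shingle output above concrete, here is a minimal plain-Java sketch that reproduces the same list of terms. It is not the ShingleMatrixFilter implementation and does not use any Lucene API; the class name ShingleSketch and the method shingles are hypothetical names chosen for illustration, and it assumes tokens are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    /**
     * Splits text on whitespace and returns every run of minSize..maxSize
     * consecutive tokens, joined with a single space, grouped by start position.
     * (Hypothetical helper for illustration; not part of Lucene.)
     */
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // no tokens, no shingles
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one token at a time, up to maxSize tokens.
            for (int size = 1; size <= maxSize && start + size <= tokens.length; size++) {
                if (size > 1) {
                    shingle.append(' ');
                }
                shingle.append(tokens[start + size - 1]);
                if (size >= minSize) {
                    out.add(shingle.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Minimum shingle size 1, maximum 2, as in the test example.
        System.out.println(shingles("please divide this sentence into shingles", 1, 2));
    }
}
```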

You could index your keywords list as-is with no tokenization; break up your queries using
a WhitespaceTokenizer connected to a ShingleMatrixFilter, with the minimum shingle size set
to one and the maximum set to the number of tokens in the keyword with the most tokens; and then
build a BooleanQuery with one clause per shingle, each set to BooleanClause.Occur.SHOULD.
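
The lookup logic behind that approach can be sketched without Lucene. In the sketch below a HashSet stands in for the index of untokenized keywords, and testing each query shingle against the set plays the role of one SHOULD clause per shingle. The class name KeywordMatchSketch, the method matchKeywords, and the sample keywords are assumptions made for illustration; a real setup would run the shingles as term queries against a Lucene index instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordMatchSketch {

    /**
     * Returns the keywords that occur in the query as an exact run of
     * 1..maxTokens whitespace-separated tokens. The Set stands in for an
     * index of untokenized keywords. (Hypothetical helper, not a Lucene API.)
     */
    static List<String> matchKeywords(Set<String> keywords, String query, int maxTokens) {
        List<String> hits = new ArrayList<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return hits;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder shingle = new StringBuilder();
            for (int size = 1; size <= maxTokens && start + size <= tokens.length; size++) {
                if (size > 1) {
                    shingle.append(' ');
                }
                shingle.append(tokens[start + size - 1]);
                String candidate = shingle.toString();
                // Each shingle is an optional clause: any single match is a hit.
                if (keywords.contains(candidate) && !hits.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("big boy", "red ball", "computer"));
        // maxTokens = 2 because the longest keyword here has two tokens.
        System.out.println(matchKeywords(keywords, "the big boy has a red ball", 2));
    }
}
```

Note that matching here is exact: the query "the girl likes red balls" would not match the keyword "red ball" unless both the keywords and the query tokens were normalized (for example, stemmed) the same way first.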

Steve

On 07/23/2008 at 4:05 PM, Ryan D wrote:
> Heh, actually I'm using Perl but I've always associated text-search with
> Lucene, I'm not sure if it's the best solution or not. On the small side
> there are 1.6 million keywords, on the large side there are well over
> 100 million but I might find another way to break down the searches into
> smaller searches (send A-G to server1, H-R to server2, etc.).
> 
> Is there another search tool that might be better suited for this...the
> only thing I can relate this to is how AdWords works. A user enters a
> query in the Google search box and they search their database for people
> who've purchased those keywords to the appropriate ads.  What I'm doing
> is similar but without the payday. :-{
> 
> Currently I'm using a (huge) hash table and regular expressions
> ($query =~ /$keyword/) going down the list from largest to smallest
> but I know this is not a long term solution especially if I have to
> load the large 100 million+ list in.
> 
> Thanks.
> 
> 
> On Jul 23, 2008, at 3:54 PM, Steven A Rowe wrote:
> 
> > Hi Ryan,
> > 
> > I'm not sure Lucene's the right tool for this job.
> > 
> > I have used regular expressions and ternary search trees in the past to
> > do similar things.
> > 
> > Is the set of keywords too large for an in-memory solution like these? 
> > If not, consider using a tool like the Perl package Regex::PreSuf
> > <http://search.cpan.org/dist/Regex-PreSuf/> - it can convert a list of
> > strings into a compact set of alternations, which you can then import
> > into a Java program.  (I'm not aware of any similar Java tools.)
> > 
> > Steve
> > 
> > On 07/23/2008 at 3:30 PM, Ryan Detzel wrote:
> > > Everything I've read and seen about Lucene is searching for keywords in
> > > documents; I want to do the reverse. I have a huge list of
> > > keywords ("big boy", "red ball", "computer") and I have phrases that I
> > > want to check for those keywords. For example, using the small
> > > keyword list above (stored as documents in Lucene), what's the best
> > > approach to pass in a query "the girl likes red balls" and have it
> > > match the keyword "red ball"?


