lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Using lucene for substring matching
Date Fri, 23 Jul 2010 13:04:57 GMT
So, if I've understood this correctly, you've got some text and want
to loop through a list of words and/or phrases, and see which of those
match the text.

e.g.

text "some random article about something or other of some random length"

words

some - matches
many - no match
article - matches
word - no match

You can certainly do that with lucene.  Load the text into a document
and loop round the words or phrases searching for each.  You are
likely to need to look into analyzers depending on your requirements
around stop words, punctuation, case, etc.  And phrase/span queries
for phrases.
There are also probably some lucene techniques for speeding this up,
but as ever, start simple - lucene is usually plenty fast enough.
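
To make the term-versus-substring point concrete, here is a minimal sketch
in plain Java. It does not use Lucene at all: tokenize() is a crude,
hypothetical stand-in for an analyzer (lower-casing and splitting on
punctuation), and the sublist check only loosely mimics what a term or
exact phrase query would do. The class and method names are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class DictionaryMatcher {

    // Crude stand-in for an analyzer (assumption, not Lucene code):
    // lower-case the input and split on anything that is not a letter or digit.
    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the dictionary entries whose tokens occur in the text as a
    // contiguous run, i.e. whole-word and exact-phrase matches only.
    public static List<String> findMatches(String text, List<String> dictionary) {
        List<String> textTokens = tokenize(text);
        List<String> matches = new ArrayList<>();
        for (String entry : dictionary) {
            List<String> entryTokens = tokenize(entry);
            // indexOfSubList returns 0 for an empty target, so skip empty entries.
            if (!entryTokens.isEmpty()
                    && Collections.indexOfSubList(textTokens, entryTokens) >= 0) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "some random article about something or other of some random length";
        List<String> dictionary =
                Arrays.asList("some", "many", "article", "word", "Random Article", "thing");
        // prints [some, article, Random Article]
        System.out.println(findMatches(text, dictionary));
    }
}
```

Note that "thing" does not match even though it is a substring of
"something": matching happens on whole tokens, which is also how an
analyzed term query behaves, so it is not a drop-in replacement for a raw
substring test.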


--
Ian.


On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
<geirgp@gmail.com> wrote:
> Hi,
>
> I'm about to write an application that does very simple text analysis,
> namely dictionary-based entity extraction. The alternative is to do
> in-memory substring matching:
>
> String text; // could be any size, but normally "newspaper length"
> List<String> matches = new ArrayList<String>();
> for (String wordOrPhrase : dictionary) {
>   // indexOf returns -1 when the substring is absent
>   if (text.indexOf(wordOrPhrase) >= 0) {
>      matches.add(wordOrPhrase);
>   }
> }
>
> I am concerned the above code will be quite CPU intensive; it will also be
> case sensitive and not leave any room for fuzzy matching.
>
> I thought this task could also be solved by indexing every bit of text that
> is to be analyzed, and then executing a query per dictionary entry:
>
> (pseudo)
>
> lucene.index(text)
> List matches
> for( String wordOrPhrase : dictionary ) {
>   if( lucene.search( wordOrPhrase, text_id ) gives hit ) {
>      matches.add(wordOrPhrase)
>   }
> }
>
> I have not used lucene very much, so I don't know if it is a good idea or
> not to use lucene for this task at all. Could anyone please share their
> thoughts on this?
>
> Thanks,
> Geir
>
