lucene-java-user mailing list archives

From Geir Gullestad Pettersen <gei...@gmail.com>
Subject Re: Using lucene for substring matching
Date Tue, 27 Jul 2010 20:34:01 GMT
Thanks for your feedback, Ian.

I have written a first implementation of this service, and it works well. You
mentioned that there are probably some Lucene techniques for speeding this up,
which I would like to know more about. Would you, or anyone else, mind
elaborating a bit or giving me some pointers?

For the record, I am using the in-memory RAMDirectory instead of a file-based
index. I don't know if that is relevant in terms of speeding things up, but
thought I'd mention it just to be safe.
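
As a rough illustration of the index-then-query idea, here is a minimal sketch
in plain Java. It is not Lucene's API: the class TinyTextIndex and its methods
are made up for this example, and the tokenization (lowercase, split on
anything that is not a letter or digit) is an assumption standing in for
whatever analyzer is actually used. The text is tokenized once into a map from
term to positions, so each dictionary entry becomes a map lookup instead of a
scan over the whole text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy positional index over a single text; illustrative only, not Lucene. */
final class TinyTextIndex {
    // term -> 0-based token positions at which the term occurs
    private final Map<String, List<Integer>> positions = new HashMap<>();

    TinyTextIndex(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i < tokens.size(); i++) {
            positions.computeIfAbsent(tokens.get(i), k -> new ArrayList<>()).add(i);
        }
    }

    /** Lowercases the input and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if the word occurs, or if all words of the phrase occur adjacent and in order. */
    boolean contains(String wordOrPhrase) {
        List<String> terms = tokenize(wordOrPhrase);
        if (terms.isEmpty()) {
            return false;
        }
        List<Integer> starts =
                positions.getOrDefault(terms.get(0), Collections.<Integer>emptyList());
        for (int start : starts) {
            boolean allFollow = true;
            for (int i = 1; i < terms.size() && allFollow; i++) {
                List<Integer> p = positions.get(terms.get(i));
                allFollow = p != null && p.contains(start + i);
            }
            if (allFollow) {
                return true;
            }
        }
        return false;
    }

    /** Builds the index once, then checks every dictionary entry against it. */
    static List<String> findMatches(String text, List<String> dictionary) {
        TinyTextIndex index = new TinyTextIndex(text);
        List<String> matches = new ArrayList<>();
        for (String entry : dictionary) {
            if (index.contains(entry)) {
                matches.add(entry);
            }
        }
        return matches;
    }
}
```

With Ian's example text from the quoted message, TinyTextIndex.findMatches
reports "some" and "article" as matches and skips "many" and "word". Because it
works on whole tokens, "some" does not match inside "something".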

Thank you,

Geir

2010/7/23 Ian Lea <ian.lea@gmail.com>

> So, if I've understood this correctly, you've got some text and want
> to loop through a list of words and/or phrases, and see which of those
> match the text.
>
> e.g.
>
> text "some random article about something or other of some random length"
>
> words
>
> some - matches
> many - no match
> article - matches
> word - no match
>
> You can certainly do that with Lucene.  Load the text into a document
> and loop round the words or phrases, searching for each.  You are
> likely to need to look into analyzers, depending on your requirements
> around stop words, punctuation, case, etc., and into phrase/span
> queries for phrases.
> There are also probably some Lucene techniques for speeding this up,
> but as ever, start simple - Lucene is usually plenty fast enough.
>
>
> --
> Ian.
>
>
> On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
> <geirgp@gmail.com> wrote:
> > Hi,
> >
> > I'm about to write an application that does very simple text analysis,
> > namely dictionary-based entity extraction. The alternative is to do
> > in-memory matching with substring:
> >
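> > import java.util.ArrayList;
> > import java.util.List;
> >
> > String text; // could be any size, but normally "newspaper length"
> > List<String> matches = new ArrayList<String>();
> > for (String wordOrPhrase : dictionary) {
> >   // indexOf returns -1 when wordOrPhrase does not occur in text
> >   if (text.indexOf(wordOrPhrase) >= 0) {
> >     matches.add(wordOrPhrase);
> >   }
> > }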
> >
> > I am concerned that the above code will be quite CPU intensive. It will
> > also be case sensitive and not leave any room for fuzzy matching.
> >
> > I thought this task could also be solved by indexing every bit of text
> > that is to be analyzed, and then executing a query per dictionary entry:
> >
> > (pseudo)
> >
> > lucene.index(text)
> > List matches
> > for( String wordOrPhrase : dictionary ) {
> >   if( lucene.search( wordOrPhrase, text_id ) gives hit ) {
> >      matches.add(wordOrPhrase)
> >   }
> > }
> >
> > I have not used Lucene very much, so I don't know if it is a good idea or
> > not to use Lucene for this task at all. Could anyone please share their
> > thoughts on this?
> >
> > Thanks,
> > Geir
> >
>
>
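
For comparison with the substring loop in the original question, the case
sensitivity concern can also be addressed without an index. The following is a
minimal sketch in plain Java; the class name SubstringMatcher and its methods
are hypothetical, chosen only for this example. It uses String.regionMatches
with ignoreCase set to true, so no lowercased copy of the text is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Case-insensitive variant of the substring loop; illustrative only. */
final class SubstringMatcher {

    /** Like text.indexOf(needle) >= 0, but ignoring case. */
    static boolean containsIgnoreCase(String text, String needle) {
        int lastStart = text.length() - needle.length();
        for (int i = 0; i <= lastStart; i++) {
            // Compare needle against the region of text starting at offset i
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return true;
            }
        }
        return false;
    }

    /** Returns the dictionary entries that occur somewhere in text. */
    static List<String> findMatches(String text, List<String> dictionary) {
        List<String> matches = new ArrayList<>();
        for (String wordOrPhrase : dictionary) {
            if (containsIgnoreCase(text, wordOrPhrase)) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```

This still scans the text once per dictionary entry and matches inside words
(for example, "art" matches "article"), and it does no fuzzy matching, so it
addresses only the case sensitivity part of the concern.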
