lucene-java-user mailing list archives

From Geir Gullestad Pettersen <gei...@gmail.com>
Subject Using lucene for substring matching
Date Thu, 22 Jul 2010 22:30:33 GMT
Hi,

I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. The alternative is to do
in-memory substring matching:

String text; // could be any size, but normally "newspaper length"
List<String> matches = new ArrayList<String>();
for ( String wordOrPhrase : dictionary ) {
   // indexOf returns -1 when the phrase does not occur in the text
   if ( text.indexOf( wordOrPhrase ) >= 0 ) {
      matches.add( wordOrPhrase );
   }
}

I am concerned that the above code will be quite CPU intensive. It will also
be case sensitive and will not leave any room for fuzzy matching.
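
A minimal sketch of how the case-sensitivity concern could be handled while
staying with the in-memory loop. The class and method names here are
hypothetical, and the dictionary is assumed to be a List<String>. The text is
lower-cased once, so the extra cost is per dictionary entry only. It does not
address the CPU cost or fuzzy matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class CaseInsensitiveMatcher {

    // Returns the dictionary entries that occur in the text, ignoring case.
    static List<String> findMatches(String text, List<String> dictionary) {
        // Lower-case the text once instead of once per dictionary entry.
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            if (haystack.contains(wordOrPhrase.toLowerCase(Locale.ROOT))) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```

Note that this is still plain substring matching, so an entry such as "art"
would also match inside "party".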

I thought this task could also be solved by indexing every piece of text that
is to be analyzed, and then executing one query per dictionary entry:

(pseudo)

lucene.index(text)
List matches
for( String wordOrPhrase : dictionary ) {
   if( lucene.search( wordOrPhrase, text_id ) gives hit ) {
      matches.add(wordOrPhrase)
   }
}
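
To make the index-then-query idea concrete without assuming anything about
the Lucene API, here is a rough plain-Java stand-in. It is not Lucene: the
class TinyPositionalIndex, its tokenizer and its phrase lookup are all
hypothetical, and only show the shape of "index the text once, then run one
query per dictionary entry". Unlike the substring loop, it matches whole
tokens and ignores case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a real index; maps each token to its positions.
final class TinyPositionalIndex {
    private final Map<String, List<Integer>> positions =
            new HashMap<String, List<Integer>>();

    // Lower-cases and splits on anything that is not a letter or digit.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<String>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Indexing": record the position of every token in the text.
    TinyPositionalIndex(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i < tokens.size(); i++) {
            List<Integer> list = positions.get(tokens.get(i));
            if (list == null) {
                list = new ArrayList<Integer>();
                positions.put(tokens.get(i), list);
            }
            list.add(i);
        }
    }

    // "Querying": true if the tokens of the phrase occur consecutively.
    boolean containsPhrase(String wordOrPhrase) {
        List<String> terms = tokenize(wordOrPhrase);
        if (terms.isEmpty() || !positions.containsKey(terms.get(0))) {
            return false;
        }
        for (int start : positions.get(terms.get(0))) {
            boolean ok = true;
            for (int i = 1; i < terms.size() && ok; i++) {
                List<Integer> p = positions.get(terms.get(i));
                ok = p != null && p.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // One query per dictionary entry, as in the pseudocode above.
    static List<String> findMatches(String text, List<String> dictionary) {
        TinyPositionalIndex index = new TinyPositionalIndex(text);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            if (index.containsPhrase(wordOrPhrase)) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```

Each lookup touches only the positions of the entry's own tokens rather than
scanning the whole text. Fuzzy matching is not covered by this sketch.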

I have not used Lucene very much, so I don't know whether it is a good idea
to use Lucene for this task at all. Could anyone please share their
thoughts on this?

Thanks,
Geir
