lucene-java-user mailing list archives

From Ilya Zavorin <izavo...@caci.com>
Subject Lucene.NET based text triage
Date Tue, 21 Aug 2012 17:50:43 GMT
I need to implement the following task in .NET. I get a block of text and need to assess
whether it is mostly readable or mostly unreadable garbage. The text is generated by
processes like OCR. I am not looking to detect or correct small errors. Instead, I need to
"triage" the text block and return TRUE if the whole block is more or less readable (and
therefore searchable, etc.) or FALSE if it's mostly garbage.

My current plan is to:

1. Use Lucene.NET to index a large dictionary of English words

2. Tokenize the text, throwing out stopwords and words shorter than some minimum number
of characters

3. Query each token against the index using some sort of fuzzy match that would give me
not only the closest dictionary match for a given token but also the distance to it

4. Somehow combine the individual distances into a cumulative measure for the whole block
of text

5. Compare that measure against some threshold and return FALSE if it is above the
threshold and TRUE otherwise.
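
To make steps 2 through 5 concrete, here is a rough sketch in plain Java with no Lucene dependency. It is only an illustration under assumptions: a brute-force Levenshtein scan over an in-memory word set stands in for the fuzzy index lookup of step 3, the per-token distance is divided by token length and averaged for step 4, and the class name, method names, minimum token length and threshold value are all hypothetical placeholders rather than anything taken from Lucene or Lucene.NET.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; none of these names come from Lucene or Lucene.NET.
class TextTriage {
    static final int MIN_TOKEN_LENGTH = 3; // assumed minimum token length (step 2)
    static final double THRESHOLD = 0.3;   // assumed cutoff, would need tuning (step 5)

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 2: lowercase, split on non-alphanumerics, drop stopwords and short tokens.
    static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.length() >= MIN_TOKEN_LENGTH && !stopwords.contains(raw)) tokens.add(raw);
        }
        return tokens;
    }

    // Step 3 stand-in: distance to the closest dictionary word, capped at the
    // token length and scaled to [0, 1]. A real index would avoid the full scan.
    static double normalizedDistance(String token, Set<String> dictionary) {
        if (dictionary.contains(token)) return 0.0;
        int best = token.length();
        for (String word : dictionary) best = Math.min(best, editDistance(token, word));
        return (double) best / token.length();
    }

    // Step 4: mean normalized distance; a block with no usable tokens scores 1.0.
    static double garbageScore(String text, Set<String> dictionary, Set<String> stopwords) {
        List<String> tokens = tokenize(text, stopwords);
        if (tokens.isEmpty()) return 1.0;
        double sum = 0.0;
        for (String token : tokens) sum += normalizedDistance(token, dictionary);
        return sum / tokens.size();
    }

    // Step 5: TRUE unless the score is above the threshold.
    static boolean isReadable(String text, Set<String> dictionary, Set<String> stopwords) {
        return garbageScore(text, dictionary, stopwords) <= THRESHOLD;
    }
}
```

Dividing by token length keeps long and short tokens on the same scale, and averaging is just one possible way to combine the distances; a median or the fraction of tokens above some per-token cutoff might turn out to be more robust.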

Here are some questions:

1. Is there anything special I need to do when indexing the dictionary to make the fuzzy
matching work better?

2. What sort of fuzzy matching methods are available in Lucene.NET querying? Do they
return distances for the closest matches? Does the choice of matching method affect how
indexing should be done?

3. Is there a way of running the whole block of text against the index at once rather
than tokenizing and looping over tokens?

Thanks much,

Ilya Zavorin
