lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Schon <aaron_sc...@yahoo.com>
Subject Re: Extremely Large Strings Comparison (slightly off-topic)
Date Sat, 15 Nov 2008 00:53:35 GMT
Thanks for responding Jonathan. I will look into k-grams approach. 

The objects could differ by small local changes. To provide some business context, the application
requires indexing email messages and attachments. If the attachments differ by some threshold
(users making edits/reviews), the attachment needs to be flagged and major/minor versioned.

Thanks
AS


----- Original Message ----
From: Jonathan Young <JYoung@attivio.com>
To: Aaron Schon <Aaron_Schon@yahoo.com>
Sent: Friday, November 14, 2008 5:52:17 PM
Subject: RE: Extremely Large Strings Comparison (slightly off-topic)

Aaron - Although a naïve implementation of a Levenshtein distance metric takes O(n*m) time,
if you are willing to bound the maximum distance by k << n,m (e.g. if you aren't interested
in distances greater than k) then the distance calculation can take O(k*min(n,m)).

Another approach would be to use shingles or k-grams from the original document.  I believe
Lucene has some support for that.

It depends a lot on the ways in which you expect the two strings to differ.  The fact that
it is Base64-encoded MIME content doesn't matter - are they actually objects which are going
to have small local changes, or are there likely to be changes which cascade (e.g. if the
data is compressed).

I hope this helps,

--- Jonathan

-----Original Message-----
From: Aaron Schon [mailto:aaron_schon@yahoo.com]
Sent: Friday, November 14, 2008 5:00 PM
To: java-user@lucene.apache.org
Subject: Extremely Large Strings Comparison (slightly off-topic)

hi I need to compare two Base64 representation strings of some MIME content that I am storing
within a Lucene index. I need to efficiently compare them to find the closest match to a query
Base64 string , post Lucene query.

I am not sure of the best way to approach this, could I compare the hashes and compute their
similarity? Levenshtein distance seems hard because of the size of ths strings and seems inefficient?
Is there any other method you could suggest?

n.b: The idea is to not to determine exact match or not, it is to compute a similarity metric.
for example

John & Johnson (closer)
vs,
John & Jimmy (farther)

tia,
Aaron




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message