lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Bet you didn't know Lucene can...
Date Mon, 31 Oct 2011 20:32:02 GMT
On 22/10/2011 11:11, Grant Ingersoll wrote:
> Hi All,
>
> I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396).
> It's based on my observation, that over the years, a number of us in the community have done
> some pretty cool things using Lucene that don't fit under the core premise of full text search.
> I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to
> reach out to hear your stories of ways you've (ab)used Lucene and Solr to see if we couldn't
> extend the conversation to a bit more than the conference and also see if I can't inject more
> ideas beyond the ones I have.  I don't need deep technical details, but just high level use
> case and the basic insight that led you to believe Lucene could solve the problem.

Better late than never ... :) I briefly mentioned this use case to you 
at Eurocon, but here it is for the record.

I used Lucene in a duplicate-detection scenario where, instead of 
whole documents, individual sentences were indexed (with a fuzz). A 
similarity-preserving hash function was calculated on each sentence, and 
the hash was added as a field. The property of the hash was that similar 
documents (sentences) would produce similar hashes, differing only by 
some bit-level perturbation. The challenge was to find a ranked list of 
possible duplicates with similar (not exactly the same) hashes, which in 
this case meant finding a ranked list of documents whose hashes have the 
smallest bit-level distance from the query hash.
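
To make "smallest bit-level distance" concrete, here is a minimal sketch 
of the ranking idea outside of Lucene. It is not the SOLR-1918 code: it 
assumes 64-bit hashes, takes the bit-level distance to be the Hamming 
distance (the number of differing bits), and does a brute-force scan 
rather than using an index. The class and method names are made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
final class BitDistanceRanker {

    /** Number of bits that differ between two 64-bit hashes (Hamming distance). */
    static int bitDistance(long a, long b) {
        // XOR leaves a 1 in every position where the hashes disagree.
        return Long.bitCount(a ^ b);
    }

    /**
     * Returns the positions of the candidate hashes closest to queryHash,
     * nearest first, limited to at most topN entries.
     * Assumption: a plain linear scan over all candidates.
     */
    static List<Integer> rank(long queryHash, long[] candidates, int topN) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.length; i++) {
            order.add(i);
        }
        // List.sort is stable, so equally distant candidates keep their original order.
        order.sort(Comparator.comparingInt(i -> bitDistance(queryHash, candidates[i])));
        return new ArrayList<>(order.subList(0, Math.min(topN, order.size())));
    }
}
```

A scan like this costs one XOR and one bit count per stored hash, so it 
only illustrates the scoring; doing the same ranking at index scale is 
the part that needed support inside the search engine.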

The solution is described in SOLR-1918 - Bit-wise scoring field type.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



