lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Frank Burough" <fburo...@elixirpharm.com>
Subject RE: String similarity search vs. typcial IR application...
Date Fri, 06 Jun 2003 15:14:10 GMT
The method I mention was based on using lempel-ziv (I expect my spelling is way off on this)
algorithms used in lz compression. It relied only on exact matches of short stretches of DNA
separated by non-matching sequence. The idea was to find stretches of sequence that had patterns
in common, then generate the original sequences back and display alignments. I never paid
any attention to its match scores since they seemed pretty arbitrary. The focus on this work
was in looking for common repeats in human sequence to assist in masking them prior to doing
analysis. I have lost touch with the original author of this but I am digging to see if I
can extract the papers from my "filing system". I will post them shortly (I hope!).

Frank


> -----Original Message-----
> From: Leo Galambos [mailto:Leo.G@seznam.cz] 
> Sent: Friday, June 06, 2003 10:00 AM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
> 
> 
> Exact matches are not ideal for DNA applications, I guess. I am not a 
> DNA expert, but those guys often need a feature that is termed 
> ``fuzzy''[*] in Lucene. They need Levenstein's and Hamming's metrics, 
> and I think that Lucene has many drawbacks which disallow effective 
> implementations. On the other hand, I am very interested in a 
> method you 
> mentioned. Could you give me a reference, please? Thank you.
> 
> -g-
> 
> [*] why do you use the label ``fuzzy''? It has nothing to do 
> with fuzzy 
> logic or fuzzy IR, I guess.
> 
> Frank Burough wrote:
> 
> >I have seen some interesting work done on storing DNA 
> sequence as a set 
> >of common patterns with unique sequence between them. If one uses an 
> >analyzer to break sequence into its set of patterns and 
> unique sequence 
> >then Lucene could be used to search for exact pattern 
> matches. I know 
> >of only one sequence search tool that was based on this approach. I 
> >don't know if it ever left the lab and made it into the 
> mainstream. If 
> >I have time I will explore this a bit.
> >
> >Frank Burough
> >
> >
> >
> >  
> >
> >>-----Original Message-----
> >>From: Leo Galambos [mailto:Leo.G@seznam.cz]
> >>Sent: Thursday, June 05, 2003 5:55 PM
> >>To: Lucene Users List
> >>Subject: Re: String similarity search vs. typcial IR application...
> >>
> >>
> >>AFAIK Lucene is not able to look DNA strings up effectively.
> >>You would 
> >>use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
> >>
> >>-g-
> >>
> >>Jim Hargrave wrote:
> >>
> >>    
> >>
> >>>Our application is a string similarity searcher where the
> >>>      
> >>>
> >>query is an
> >>    
> >>
> >>>input string and we want to find all "fuzzy" variants of the
> >>>      
> >>>
> >>input string in the DB.  The Score is basically dice's
> >>coefficient: 2C/Q+D, where C is the number of terms (n-grams) 
> >>in common, Q is the number of unique query terms and D is the 
> >>number of unique document terms. Our documents will be sentences.
> >>    
> >>
> >>>I know Lucene has a fuzzy search capability - but I assume
> >>>      
> >>>
> >>this would
> >>    
> >>
> >>>be very slow since it must search through the entire term
> >>>      
> >>>
> >>list to find candidates.
> >>    
> >>
> >>>In order to do the calculation I will need to have 'C' - the
> >>>      
> >>>
> >>number of
> >>    
> >>
> >>>terms in common between query and document. Is there an API
> >>>      
> >>>
> >>that I can call to get this info? Any hints on what it will
> >>take to modify Lucene to handle these kinds of queries?
> >>    
> >>
> >>> 
> >>>
> >>>      
> >>>
> >>
> >>------------------------------------------------------------
> ---------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>
> >>    
> >>
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
> >
> >  
> >
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message