lucene-java-user mailing list archives

From Stephen Rudd <sagr...@gmail.com>
Subject substring query
Date Thu, 05 Mar 2015 05:16:28 GMT
I have created a slightly hairy document collection that contains tens of millions of DNA sequence
words that I wish to process to find the rarer and unique words. Each word is between
100 and 1000 characters (nucleotides) in length.

I have been able to use WildcardQuery and FuzzyQuery to select words: using the query
“*ubst*”, for example, I can recover subst, substring etc.

I am a little challenged in selecting words in the reciprocal direction - if I start with
a long word such as “sequence”, what would be the most appropriate way to select the
words in the database that are contained within it, e.g. sequ, quenc and ence?
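
To make the reciprocal lookup concrete, here is a minimal sketch in plain Java of the behaviour being asked for. It is not Lucene code: the class name ContainedWords, the method containedIn and the in-memory Set standing in for the index are all hypothetical, and the length bounds are an assumption taken from the word lengths described above. The idea is simply to enumerate every substring of the long word and test each one as an exact match; in Lucene each candidate could presumably be looked up as an exact term (for instance with a TermQuery) instead of a set membership test.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only: not part of Lucene.
public class ContainedWords {

    /**
     * Returns every entry of indexedWords that occurs as a contiguous
     * substring of query and whose length lies in [minLen, maxLen].
     * indexedWords stands in for the terms stored in the index.
     */
    public static Set<String> containedIn(String query, Set<String> indexedWords,
                                          int minLen, int maxLen) {
        Set<String> hits = new TreeSet<>();
        // A substring can never be longer than the query word itself.
        int longest = Math.min(maxLen, query.length());
        for (int len = minLen; len <= longest; len++) {
            // Slide a window of the current length across the query word.
            for (int start = 0; start + len <= query.length(); start++) {
                String candidate = query.substring(start, start + len);
                // Exact-match lookup; no wildcard expansion is needed.
                if (indexedWords.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }
}
```

Calling ContainedWords.containedIn("sequence", words, 4, 5) on a set holding sequ, quenc, ence and subst would return the first three. The number of candidates grows quadratically with the length of the query word, so the length bounds matter for words of up to 1000 characters.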

Is there a simple logical way that this could or should be done? A few pointers would be very
much appreciated.

Cheers

Stephen




