lucene-java-user mailing list archives

From Valentin Popov <valentin...@gmail.com>
Subject Re: multiterm numbers regexp search
Date Tue, 16 Dec 2014 07:36:20 GMT
Thanks, will try. 
> On 15 Dec 2014, at 21:02, Allison, Timothy B. <tallison@mitre.org> wrote:
> 
> If you can't change the analyzer, you can programmatically build a MultiPhraseQuery (you'd
> have to fill in the alternatives ... not a great option) or a SpanNearQuery composed of span-wrapped
> RegexpQueries (rewrites are taken care of for you).
> 
> You might also want to look into using the ComplexPhraseQueryParser:
> 
> "/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/"
> 
> Make sure to "or" that with the regex to capture the "phrase" without spaces/hyphens:
> "5{1}<1-5>{1}<0-9>{14}"
> 
> I can't vouch for performance with the above options...
> 
> Whichever path you take, make sure that the MultiTermQuery.RewriteMethod and/or maxBooleanClauses
> are set appropriately.
> 
> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com] 
> Sent: Monday, December 15, 2014 8:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: multiterm numbers regexp search
> 
> Mike, thanks. 
> 
> The problem is that we can't change the analyzer: the bank needs search for compliance, not
> only for card numbers, and the existing storage already holds hundreds of millions of emails.
> My thinking is to build a multiterm regexp search query, or to combine several regexp queries
> with some distance between them. The main idea is to search for possible combinations of
> digits, since they follow a rule: a MasterCard number starts with 5, the second digit must be
> between 1 and 5, and the other 14 characters must be digits.
> 
> Thanks 
> 
> 
>> On 15 Dec 2014, at 16:00, Michael Sokolov <msokolov@safaribooksonline.com> wrote:
>> 
>> You probably don't want to use StandardAnalyzer: maybe try WhitespaceAnalyzer, but
>> you'll need to enhance your regex a little to deal with punctuation since WA may give you
>> tokens like:
>> 
>> 5106-7922-9469-8422.
>> 
>> "5106-7922-9469-8422"
>> 
>> etc
>> 
>> -Mike
>> 
>> On 12/15/14 3:45 AM, Valentin Popov wrote:
>>> I need to find MasterCard numbers with a regular expression.
>>> 
>>> I’m using Query query = new RegexpQuery(new Term("body", "5{1}<1-5>{1}<0-9>{14}"),
>>> RegExp.ALL) to search for numbers in the email body, and StandardAnalyzer is used to index
>>> the body. A number like 5106792294698422 is indexed as is, so all such MasterCard numbers
>>> show up in the search results, but a number like 5106 7922 9469 8422 is indexed as 4 tokens
>>> (5106, 7922, 9469, 8422), and similarly for 5106-7922-9469-8422.
>>> 
>>> Any ideas on how to find the sequence of numbers with spaces, dashes, etc.? Maybe a
>>> multiterm regexp search query?
>>> 
>>> 
>>> Regards,
>>> Valentin Popov
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> Regards,
> Valentin Popov
> 
> 
> 
> 
> 

Regards,
Valentin Popov
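
For reference, the digit rule described in the quoted messages (a leading 5, a second digit between 1 and 5, then 14 more digits) and the token shapes mentioned there (one unbroken number, four groups split by spaces or hyphens, and tokens with a trailing period or surrounding quotes) can be checked in plain Java. This is only a minimal sketch under stated assumptions: it uses java.util.regex syntax, so it writes [1-5] instead of the <1-5> form used in the thread's Lucene patterns, it does not touch the index at all, and the class and method names (CardNumberRule, trimPunctuation, looksLikeMasterCard) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: checks a piece of text against
// the MasterCard digit rule exactly as stated in the thread.
final class CardNumberRule {

    // 5, then a digit 1-5, then two digits (first group of four), followed by
    // three more groups of four digits, each optionally preceded by a single
    // space or hyphen.
    private static final Pattern CARD =
            Pattern.compile("5[1-5][0-9]{2}(?:[ -]?[0-9]{4}){3}");

    // Removes non-digit characters from both ends of the text, such as the
    // quotes or trailing period that whitespace tokenization can leave attached.
    static String trimPunctuation(String text) {
        return text.replaceAll("^[^0-9]+|[^0-9]+$", "");
    }

    // True if the whole text, after trimming, matches the rule.
    static boolean looksLikeMasterCard(String text) {
        return CARD.matcher(trimPunctuation(text)).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeMasterCard("5106792294698422"));        // true
        System.out.println(looksLikeMasterCard("5106 7922 9469 8422"));     // true
        System.out.println(looksLikeMasterCard("5106-7922-9469-8422."));    // true
        System.out.println(looksLikeMasterCard("\"5106-7922-9469-8422\"")); // true
        System.out.println(looksLikeMasterCard("5606 7922 9469 8422"));     // false
    }
}
```

The single pattern covers both the unbroken and the separated forms because each separator is optional. A Lucene query operates on individual indexed tokens instead, which is why the quoted advice combines a per-token phrase with the 16-digit regex.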




