lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Spaces in regular expressions
Date Fri, 26 Feb 2016 08:53:17 GMT
Hi,

In general, you can implement everything described in this paper using Lucene. You
don't need to customize Lucene for this; just use its official APIs and tokenizers.

You have to build your own Analyzer that builds trigrams and does *not* tokenize on whitespace
and so on. To me it looks like you have to tokenize the source code on newlines (one token
per line of code) and then make trigrams out of it. Here are the components for an Analyzer
(Lucene 6 syntax with Java 8; it can be rewritten for 5.5, it is just easier to show this way.
Lucene 5 needs a CharTokenizer subclass):

Tokenizer: CharTokenizer.fromSeparatorCharPredicate(ch -> ch == '\n')  // expand for other
newline characters, too
TokenFilters: maybe "new LowerCaseFilter(tokenizer)", and finally "new NGramTokenFilter(lowercased,
3, 3)", where "lowercased" is the LowerCaseFilter from the previous step (trigrams on the lowercased tokens)

After that you can index using this analyzer and have trigrams in your index.
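
To make the output of that chain concrete, here is a minimal plain-Java sketch of the same three steps (split on '\n', lowercase, emit 3-character windows). It does not use Lucene at all; the class and method names are made up for this illustration, and it works on UTF-16 chars, which is a simplification compared to a real TokenStream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: mimics "one token per line,
// lowercase, then trigrams" on a plain String.
final class LineTrigramSketch {

    /** Returns all lowercased trigrams of every line, in order of appearance. */
    static List<String> trigrams(String source) {
        List<String> grams = new ArrayList<>();
        // Step 1: one token per line (only '\n' is treated as a separator here).
        for (String line : source.split("\n")) {
            // Step 2: lowercase the whole line token.
            String token = line.toLowerCase(Locale.ROOT);
            // Step 3: slide a window of exactly 3 chars over the token.
            // Lines shorter than 3 chars produce no trigram at all.
            for (int i = 0; i + 3 <= token.length(); i++) {
                grams.add(token.substring(i, i + 3));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("A = foo() {\nA=foo(){"));
    }
}
```

Note that whitespace stays inside the trigrams, so "A = foo" and "A=foo" share the trigram "foo" but differ in the others.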

You can then use the algorithms described in the paper to derive TermQuery instances from the
regular expression and combine them with BooleanQuery to select the candidate documents.
In a final step you can loop through the candidate results and filter them by applying the "real
regex". This would be done outside of Lucene code.
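
The two-phase idea (trigram lookup for candidates, then the real regex as a filter) can be sketched without Lucene in plain Java. Everything here is an assumption made for illustration: the class name, the in-memory postings map standing in for the index, and the fact that the caller passes a literal that every match must contain instead of deriving the trigram query from the regex as the paper does. The AND over trigram postings plays the role of a BooleanQuery of required TermQuery clauses:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical in-memory stand-in for a trigram index; not Lucene code.
final class TrigramCandidateSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Stores the text and indexes all of its trigrams; returns the doc id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (int i = 0; i + 3 <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + 3), k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Phase 1: docs containing every trigram of the literal (logical AND). */
    Set<Integer> candidates(String requiredLiteral) {
        Set<Integer> result = null;
        for (int i = 0; i + 3 <= requiredLiteral.length(); i++) {
            Set<Integer> hits = postings.getOrDefault(
                    requiredLiteral.substring(i, i + 3), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        if (result == null) {
            // Literal shorter than 3 chars: no trigram constraint, so all docs are candidates.
            result = new TreeSet<>();
            for (int id = 0; id < docs.size(); id++) {
                result.add(id);
            }
        }
        return result;
    }

    /** Phase 2: run the real regex only over the candidates. */
    List<Integer> search(String requiredLiteral, Pattern regex) {
        List<Integer> matches = new ArrayList<>();
        for (int id : candidates(requiredLiteral)) {
            if (regex.matcher(docs.get(id)).find()) {
                matches.add(id);
            }
        }
        return matches;
    }
}
```

For example, with the literal "foo" and the pattern A\s*=\s*foo\s*\(\)\s*\{ (parentheses and brace escaped), a line such as "x = foods" is a candidate because it contains the trigram "foo", and is then rejected by the regex in the second phase.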

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Kudrettin Güleryüz [mailto:kudrettin@gmail.com]
> Sent: Thursday, February 25, 2016 7:58 PM
> To: java-user@lucene.apache.org
> Subject: Re: Spaces in regular expressions
> 
> Thank you, I had looked at that article a little, some time ago. I was
> thinking I may have to change some lower level Lucene classes to be able to
> work like that. Plus I don't have much clue if that would break things.
> 
> I am primarily looking for a Lucene solution at this point.
> 
> On Thu, Feb 25, 2016 at 10:53 AM Greg Bowyer <gbowyer@fastmail.co.uk>
> wrote:
> 
> > Possibly not helpful but some time ago Russ Cox implemented a code
> > search at Google.
> >
> > His design is documented here:
> > https://swtch.com/~rsc/regexp/regexp4.html
> >
> > On Wed, Feb 24, 2016, at 08:01 AM, Kudrettin Güleryüz wrote:
> > > I appreciate the pointers Jack. More on that, where can I read more on
> > > enabling full regexp support on indexed source code documents using
> > > Lucene?
> > >
> > > Any suggestions regarding cases where developers implemented this kind of
> > > capability using Lucene/Solr/ElasticSearch/... would be more than
> > > welcome.
> > >
> > > Thank you,
> > > Kudret
> > >
> > > On Mon, Feb 15, 2016 at 10:22 AM Jack Krupansky
> > > <jack.krupansky@gmail.com>
> > > wrote:
> > >
> > > > You can have two parallel fields, one tokenized as a programming language
> > > > would (identifiers, operators) and one using the keyword tokenizer for each
> > > > line. You have to decide whether to treat each line as a separate Lucene
> > > > document or treat each source file as a multivalued field, one value per
> > > > source line. And then there is the issue of code sequences that span source
> > > > lines.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz <kudrettin@gmail.com>
> > > > wrote:
> > > >
> > > > > Since documents are source code, I am considering matching on operators
> > > > > too.
> > > > >
> > > > > Using whitespace analyzer, A=foo(){ would be a single term, A = foo () {
> > > > > would be five terms. Different documents can have a different combination
> > > > > of the identifiers and operators in the example. A regexp query like
> > > > > /A\s*=\s*foo\s*()\s*{/ could match all of them if multi term regexp was
> > > > > allowed. Is it not allowed by default, or not possible at all? Any
> > > > > suggestions?
> > > > >
> > > > > On Sat, Feb 13, 2016 at 5:09 PM Jack Krupansky <jack.krupansky@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Just to be clear, the whitespace tokenizer would treat "A=foo(){" as a
> > > > > > single token. I presume you want "A" and "foo" to be separate terms.
> > > > > >
> > > > > > You still haven't indicated what regex you were considering. Try explaining
> > > > > > your query in plain English. I mean, do you want to search for two keywords
> > > > > > with any operator sequence between them? Or... do you want to match on
> > > > > > operators as well but simply want to ignore whitespace?
> > > > > >
> > > > > > Generally, the standard analyzer/tokenizer is better/easier - you can
> > > > > > simply query "A foo" and it will match all three of your statements.
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Sat, Feb 13, 2016 at 4:29 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > As mentioned, the document is source code. As you know, all the
> > > > > > > statements below are equal:
> > > > > > > A = foo() {
> > > > > > > A=foo(){
> > > > > > > A= foo(){
> > > > > > > ...
> > > > > > >
> > > > > > > With the standard whitespace analyzer in action, the statements I want
> > > > > > > to match can be one to five terms in this case. If spacing were definite,
> > > > > > > I could use either a phrase search or a regexp. Any suggestions for
> > > > > > > this case?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Feb 13, 2016 at 1:34 PM Jack Krupansky <jack.krupansky@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Obviously you wouldn't need to do a regex for simple terms like foo and
> > > > > > > > bar - just use simple terms and a quoted phrase to match "foo bar". If you
> > > > > > > > really do need to do complex pattern regexes and match across adjacent
> > > > > > > > terms, your best bet is to keep a copy of the source text in a separate
> > > > > > > > string (not tokenized text) field and then you can do a complex regex that
> > > > > > > > spans terms (and only do that if normal span queries don't do what you
> > > > > > > > need.)
> > > > > > > >
> > > > > > > > What does your typical cross-term regex actually look like?
> > > > > > > >
> > > > > > > >
> > > > > > > > -- Jack Krupansky
> > > > > > > >
> > > > > > > > On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > That's very easy to explain: Regexp queries only work on terms, you
> > > > > > > > > already said it in your introduction. There is no phrase query in
> > > > > > > > > Lucene that accepts regular expressions.
> > > > > > > > >
> > > > > > > > > Uwe
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Kudrettin Güleryüz [mailto:kudrettin@gmail.com]
> > > > > > > > > > Sent: Saturday, February 13, 2016 7:14 PM
> > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > Subject: Spaces in regular expressions
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I am using the standard whitespace analyzer to index a source code
> > > > > > > > > > document using Lucene 5.
> > > > > > > > > >
> > > > > > > > > > I understand that a document with content foo bar would have only two
> > > > > > > > > > terms: foo and bar. When I search for "foo bar" it normally matches
> > > > > > > > > > the document. Similarly a regexp query /foo/ or /bar/ also matches
> > > > > > > > > > the document.
> > > > > > > > > >
> > > > > > > > > > Can you help me understand why a regexp query like /foo bar/
> > > > > > > > > > doesn't match the document?
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > > Kudret
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org

