lucene-java-user mailing list archives

From Kudrettin Güleryüz <kudret...@gmail.com>
Subject Re: Spaces in regular expressions
Date Thu, 25 Feb 2016 18:57:41 GMT
Thank you. I looked at that article briefly some time ago. I was
thinking I might have to change some lower-level Lucene classes to make it
work like that, and I don't know whether that would break things.

I am primarily looking for a Lucene solution at this point.

On Thu, Feb 25, 2016 at 10:53 AM Greg Bowyer <gbowyer@fastmail.co.uk> wrote:

> Possibly not helpful but some time ago Russ Cox implemented a code
> search at Google.
>
> His design is documented here https://swtch.com/~rsc/regexp/regexp4.html
>
> On Wed, Feb 24, 2016, at 08:01 AM, Kudrettin Güleryüz wrote:
> > I appreciate the pointers Jack. More on that, where can I read more on
> > enabling full regexp support on indexed source code documents using
> > Lucene?
> >
> > Any suggestions regarding cases where developers implemented this kind of
> > capability using Lucene/Solr/ElasticSearch/... would be more than
> > welcome.
> >
> > Thank you,
> > Kudret
> >
> > On Mon, Feb 15, 2016 at 10:22 AM Jack Krupansky <jack.krupansky@gmail.com> wrote:
> >
> > > You can have two parallel fields, one tokenized as a programming language
> > > would (identifiers, operators) and one using the keyword tokenizer for each
> > > line. You have to decide whether to treat each line as a separate Lucene
> > > document or treat each source file as a multivalued field, one value per
> > > source line. And then there is the issue of code sequences that span source
> > > lines.
> > >
> > > -- Jack Krupansky
> > >
> > > On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz <kudrettin@gmail.com> wrote:
> > >
> > > > Since documents are source code, I am considering matching on operators
> > > > too.
> > > >
> > > > Using whitespace analyzer, A=foo(){ would be a single term, A = foo () {
> > > > would be five terms. Different documents can have a different combination
> > > > of the identifiers and operators in the example. A regexp query like
> > > > /A\s*=\s*foo\s*()\s*{/ could match all of them if multi term regexp was
> > > > allowed. Is it not allowed by default, or not possible at all? Any
> > > > suggestions?
> > > >
> > > > On Sat, Feb 13, 2016 at 5:09 PM Jack Krupansky <jack.krupansky@gmail.com> wrote:
> > > >
> > > > > Just to be clear, the whitespace tokenizer would treat "A=foo(){" as a
> > > > > single token. I presume you want "A" and "foo" to be separate terms.
> > > > >
> > > > > You still haven't indicated what regex you were considering. Try explaining
> > > > > your query in plain English. I mean, do you want to search for two keywords
> > > > > with any operator sequence between them? Or... do you want to match on
> > > > > operators as well but simply want to ignore whitespace?
> > > > >
> > > > > Generally, the standard analyzer/tokenizer is better/easier - you can
> > > > > simply query "A foo" and it will match all three of your statements.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Sat, Feb 13, 2016 at 4:29 PM, Kudrettin Güleryüz <kudrettin@gmail.com> wrote:
> > > > >
> > > > > > As mentioned, the document is source code. As you know, all of the
> > > > > > statements below are equivalent:
> > > > > > A = foo() {
> > > > > > A=foo(){
> > > > > > A= foo(){
> > > > > > ...
> > > > > >
> > > > > > With the standard whitespace analyzer in action, the statements I want to
> > > > > > match can be one to five terms in this case. If spacing were definite, I
> > > > > > could go with either a phrase search or a regexp. Any suggestions for this
> > > > > > case?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Feb 13, 2016 at 1:34 PM Jack Krupansky <jack.krupansky@gmail.com> wrote:
> > > > > >
> > > > > > > Obviously you wouldn't need to do a regex for simple terms like foo and
> > > > > > > bar - just use simple terms and a quoted phrase to match "foo bar". If you
> > > > > > > really do need to do complex pattern regexes and match across adjacent
> > > > > > > terms, your best bet is to keep a copy of the source text in a separate
> > > > > > > string (not tokenized text) field and then you can do a complex regex that
> > > > > > > spans terms (and only do that if normal span queries don't do what you
> > > > > > > need.)
> > > > > > >
> > > > > > > What does your typical cross-term regex actually look like?
> > > > > > >
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > That's very easy to explain: Regexp queries only work on terms, you
> > > > > > > > already said it in your introduction. There is no phrase query in Lucene
> > > > > > > > that accepts regular expressions.
> > > > > > > >
> > > > > > > > Uwe
> > > > > > > >
> > > > > > > > -----
> > > > > > > > Uwe Schindler
> > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > http://www.thetaphi.de
> > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Kudrettin Güleryüz [mailto:kudrettin@gmail.com]
> > > > > > > > > Sent: Saturday, February 13, 2016 7:14 PM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Spaces in regular expressions
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am using standard whitespace analyzer to index a source code document
> > > > > > > > > using Lucene 5.
> > > > > > > > >
> > > > > > > > > I understand that a document with content foo bar would have only two
> > > > > > > > > terms: foo and bar.  When I search for "foo bar" it normally matches the
> > > > > > > > > document. Similarly a regexp query /foo/ or /bar/ also matches the
> > > > > > > > > document.
> > > > > > > > >
> > > > > > > > > Can you help me understand why a regexp query like /foo bar/
> > > > > > > > > doesn't match the document?
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > > Kudret
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
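
Editor's note: Uwe's point in the thread (regexp queries are evaluated against individual terms) and Jack's suggestion (keep an untokenized copy of each line) can be illustrated without Lucene at all. The sketch below is only a stand-in that uses java.util.regex. The class and method names are hypothetical, splitting on whitespace merely approximates what a whitespace tokenizer produces, and java.util.regex syntax is not the same as Lucene's regexp syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TermRegexDemo {
    // Split on runs of whitespace, roughly what a whitespace tokenizer would emit.
    public static List<String> whitespaceTerms(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Term-level matching: the pattern must match one single term in full.
    public static boolean anyTermMatches(String regex, List<String> terms) {
        Pattern p = Pattern.compile(regex);
        return terms.stream().anyMatch(t -> p.matcher(t).matches());
    }

    // Untokenized matching: the pattern is applied to the whole stored value.
    public static boolean wholeValueMatches(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }

    public static void main(String[] args) {
        String doc = "foo bar";
        List<String> terms = whitespaceTerms(doc); // [foo, bar]

        System.out.println(anyTermMatches("foo", terms));        // true
        System.out.println(anyTermMatches("foo bar", terms));    // false: no term contains a space
        System.out.println(wholeValueMatches("foo bar", doc));   // true
    }
}
```

The second call fails for the same reason /foo bar/ fails in the original question: after tokenization no single term contains a space, so a pattern that requires one has nothing to match against.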

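Editor's note: Jack's "two parallel fields" idea can be sketched the same way. One view tokenizes a line the way a programming language would, so spacing no longer changes the token sequence. The other keeps the whole line as a single value and applies a whitespace-tolerant pattern to it. This is an assumption-laden illustration, not Lucene code: the token rule, the class name, and the method names are made up for the example, and the line pattern is written in java.util.regex syntax with the parentheses and brace escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeTokenDemo {
    // Hypothetical token rule: an identifier, a run of digits,
    // or any single non-space character (operator or bracket).
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    // Whole-line pattern that tolerates optional whitespace between tokens.
    private static final Pattern LINE =
            Pattern.compile("A\\s*=\\s*foo\\s*\\(\\s*\\)\\s*\\{");

    // "Programming language" view: identifiers and operators as separate tokens.
    public static List<String> codeTokens(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // "Keyword" view: the untokenized line must match the pattern in full.
    public static boolean lineMatches(String line) {
        return LINE.matcher(line.trim()).matches();
    }

    public static void main(String[] args) {
        String[] lines = { "A = foo() {", "A=foo(){", "A= foo(){", "A = foo () {" };
        for (String line : lines) {
            // Every spelling yields the same tokens and matches the line pattern.
            System.out.println(codeTokens(line) + " " + lineMatches(line));
        }
    }
}
```

With the tokenized view, a plain phrase-style comparison of token sequences is enough. The untokenized view is only needed when the pattern itself has to span what would otherwise be several terms.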