MIME-Version: 1.0
References: <c252359b394b4701967b6fb73d0a9d8e@sap.com> <8c75e2eacdd74dc284cb2a442de45317@sap.com>
 <CAF8TkC640z2wtw+tM-02OsJMz28GFgYxyZHGqj4svT8-JqV2MQ@mail.gmail.com> <7d53e065967444debf647d94f2f0c4af@sap.com>
In-Reply-To: <7d53e065967444debf647d94f2f0c4af@sap.com>
From: Mikhail Khludnev <mkhl@apache.org>
Date: Tue, 26 Jun 2018 16:04:20 +0300
X-Gmail-Original-Message-ID: <CAF8TkC4bgo24XhzE6R89HNc5TFzJodTkOO1XTONpnwAYSaTuhQ@mail.gmail.com>
Message-ID: <CAF8TkC4bgo24XhzE6R89HNc5TFzJodTkOO1XTONpnwAYSaTuhQ@mail.gmail.com>
Subject: Re: How search code files for words which contains a given substrings?
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="0000000000009d84c6056f8b2376"

--0000000000009d84c6056f8b2376
Content-Type: text/plain; charset="UTF-8"

I mean that you need offsets rather than positions, but I don't have
anything definite to suggest.
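
To make the distinction concrete, here is a minimal sketch in plain Java. It
deliberately does not use any Lucene API: it only mimics whitespace
tokenization to show how a token position (the index of the token) differs
from character offsets, and how the "additional search in the returned token"
option could yield offsets, line numbers and columns for the substring
itself. The class name SubstringOffsets, the Match type and the find method
are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringOffsets {

    /** One occurrence of the searched substring inside a whitespace-delimited token. */
    public static final class Match {
        public final String token;     // the token that contains the substring
        public final int position;     // 0-based index of the token in the text
        public final int startOffset;  // character offset in the full text, inclusive
        public final int endOffset;    // character offset in the full text, exclusive
        public final int line;         // 1-based line number
        public final int column;       // 0-based character offset inside the line

        Match(String token, int position, int startOffset, int endOffset, int line, int column) {
            this.token = token;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.line = line;
            this.column = column;
        }
    }

    /** Splits text on whitespace and reports every occurrence of needle inside each token. */
    public static List<Match> find(String text, String needle) {
        List<Match> matches = new ArrayList<>();
        if (needle.isEmpty()) {
            return matches;
        }
        int line = 1;       // current line number
        int lineStart = 0;  // offset of the first character of the current line
        int position = 0;   // index of the next token
        int i = 0;
        int n = text.length();
        while (i < n) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (c == '\n') {
                    line++;
                    lineStart = i + 1;
                }
                i++;
                continue;
            }
            // Consume one token: a maximal run of non-whitespace characters.
            int tokenStart = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String token = text.substring(tokenStart, i);
            // The additional search inside the token; overlapping occurrences are reported too.
            for (int k = token.indexOf(needle); k >= 0; k = token.indexOf(needle, k + 1)) {
                int start = tokenStart + k;
                matches.add(new Match(token, position, start, start + needle.length(),
                        line, start - lineStart));
            }
            position++;
        }
        return matches;
    }

    public static void main(String[] args) {
        for (Match m : find("Mama loves banana", "a")) {
            System.out.println(m.token + " position=" + m.position
                    + " offsets=[" + m.startOffset + "," + m.endOffset + ")"
                    + " line=" + m.line + " column=" + m.column);
        }
    }
}
```

For 'Mama loves banana' and 'a' this reports five matches: two in 'Mama'
(token position 0) and three in 'banana' (token position 2). The position
only says which token matched, while the offsets locate each 'a' in the text.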

On Tue, Jun 26, 2018 at 1:29 PM Gordin, Ira <ira.gordin@sap.com> wrote:

> Hello Mikhail,
>
> I see in the link you sent that PositionIncrementAttribute determines the
> position of this token relative to the previous Token in a TokenStream,
> used in phrase searching.
> I am not doing phrase searching.
> Could you explain how it can help me?
>
> Thanks,
> Ira
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhl@apache.org]
> Sent: Tuesday, June 26, 2018 12:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: How search code files for words which contains a given
> substrings?
>
> Hello, Ira.
> Note the difference between offset
>
> https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html
> and
> position
>
> https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
> in Lucene terminology.
> Please make sure you don't rebuild existing functionality:
>
> https://lucene.apache.org/core/7_3_1/highlighter/org/apache/lucene/search/highlight/package-summary.html#package.description
>
>
> On Tue, Jun 26, 2018 at 10:57 AM Gordin, Ira <ira.gordin@sap.com> wrote:
>
> > Hi all,
> > I have started working on a project which currently searches code files
> > for words which contain a given substring.
> > Currently it uses WhitespaceTokenizer and a regex query which wraps the
> > searched substring with '.*'.
> > For example, if one searches for 'a', the query will be '/.*a.*/'. In this
> > way, in the text 'Mama loves banana', it will find the tokens 'Mama' and
> > 'banana'.
> > Now I need to get the start and end positions of matched tokens in
> > the line and the line number.
> > With TokenStream I can get the start and end positions of 'Mama' and
> > 'banana' in the full text. But I need the positions of 'a'.
> > I see 2 options.
> > Option 1: perform an additional search in the returned token.
> > Option 2: use NGramTokenizer or NGramTokenFilter (I am not sure which of
> > them), hoping that in this way I will get the 'a' positions in the
> > TokenStream.
> > An additional question: how can I get the line numbers and the positions
> > inside the line?
> > Many thanks in advance for your help,
> > Ira
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Sincerely yours
Mikhail Khludnev

--0000000000009d84c6056f8b2376--