MIME-Version: 1.0
References: <CADMrN_dmpCr3CH07q_x-FEjJB58c4KTAzys7Jsv3D6pfb7dPSw@mail.gmail.com>
 <CANCAXEFzByzinrHAhPmK_kMxFkw8yFvQQaPBG+zcjU81FYSbNw@mail.gmail.com>
 <CAK0qHnpD1Lr3MgUZ0frpiD-a=DxdY45PZFmObNvABx5YHtJDsA@mail.gmail.com>
 <CAMegbc+jvTa_zW_eAghrXp6HSfvh7Aavbx-KeM4Fjm+MB=sPqw@mail.gmail.com>
 <CADMrN_efc17thy4ardkhM_1GCgsjY98aAUmExHYhD3Rahe21Ow@mail.gmail.com>
 <CAK0qHnrPqKCzFCFBByX-WnvkN4TxHT5SMSzBjF+eAp1JitbCiw@mail.gmail.com>
 <CAMegbcKCcQQTLhNH3n+N4wEmj=uVTpkwfr3DcJuDonz01g+xeQ@mail.gmail.com>
 <CAOykqKccuN=8_pOpaVLRuJ9_xZdsP36y1H_sjCTHrUCt8kX1pw@mail.gmail.com>
 <CADMrN_fvfesZMO+msRx780hjzuLnexprBXYNVfKsANt+tf9CeQ@mail.gmail.com> <CAOykqKfWU8zjDrO=-_SZEev1tUhh28OcaLa45O9OR2TQRhkgcA@mail.gmail.com>
In-Reply-To: <CAOykqKfWU8zjDrO=-_SZEev1tUhh28OcaLa45O9OR2TQRhkgcA@mail.gmail.com>
From: Denis Magda <dmagda@apache.org>
Date: Mon, 30 Sep 2019 13:34:57 -0700
X-Gmail-Original-Message-ID: <CAK0qHnqYK1xePDucUi2avpoz4pk3Y9nJ2k_eMSPoOd+ZPHmCYQ@mail.gmail.com>
Message-ID: <CAK0qHnqYK1xePDucUi2avpoz4pk3Y9nJ2k_eMSPoOd+ZPHmCYQ@mail.gmail.com>
Subject: Re: Text queries/indexes (GridLuceneIndex, @QueryTextField)
To: dev <dev@ignite.apache.org>
Content-Type: multipart/alternative; boundary="000000000000ee25c70593cb2c75"

--000000000000ee25c70593cb2c75
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Yuriy,

I see you've opened a pull request with the first changes:
https://issues.apache.org/jira/browse/IGNITE-12189

Alex Scherbakov and Ivan, are you the right guys to do the review?

-
Denis


On Fri, Sep 27, 2019 at 8:48 AM Ivan Pavlukhin <vololo100@gmail.com> wrote:

> Yuriy,
>
> Thank you for providing details! Quite interesting.
>
> Yes, we already support distributed limit and merging of sorted
> subresults for SQL queries. E.g. ReduceIndexSorted and
> MergeStreamIterator are used for merging sorted streams.
>
> Could you please also clarify the score/relevance part? Is it provided by
> the Lucene engine for each query result? I am thinking about how to do a
> sorted merge properly in this case.
>
> On Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <shuliga@gmail.com> wrote:
> >
> > Ivan,
> >
> > Thank you for the interesting question!
> >
> > Text searches (or full-text searches) are mostly human-oriented, and the
> > user's main interest is the topmost part of the response.
> > The user can then read it, evaluate it, and use the returned records for
> > further purposes.
> >
> > In our case in particular, we use Ignite for operations on financial data,
> > and there is a lot of text there: asset names, financial instruments,
> > companies, etc.
> > To work with this data quickly and reliably, users are used to working with
> > text search, type-ahead completion, and suggestions.
> >
> > For these purposes we index particular string data in separate caches.
> >
> > Sorting capabilities and response size limits are very important there,
> > because our API has to provide the most relevant information within a
> > limited response size.
> >
> > Now let me comment on the Ignite/Lucene perspective.
> > Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already
> > sorted by *score* (relevance), so the most relevant documents are at the top.
> > Currently, distributed query responses from different nodes are merged
> > into the final query cursor queue in arbitrary order.
> > So in fact the score order is already ruined here. Also, Ignite requests
> > all possible documents from Lucene, which is redundant and bad for
> > performance.
> >
> > I'm implementing a *limit* parameter as part of *TextQuery* and have to
> > note that we still need to add sorting to text query processing in
> > order to get usable results.
> >
> > The *limit* parameter by itself should mitigate some of the issues above,
> > but sorting by document score, at least, should definitely be implemented
> > along with the limit.
> >
> > This is a pretty short commentary; if you still have any questions, please
> > do not hesitate to ask.
> >
> > BR,
> > Yuriy Shuliha
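
To illustrate the merge discussed above (each node returns hits already sorted by score, and the client merges them and stops at the limit), here is a minimal sketch in plain Java. It does not use any Ignite or Lucene classes: ScoredHit, NodeCursor and mergeTopK are hypothetical names, and it assumes that scores coming from different nodes are directly comparable, which is exactly the open question in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ScoreMergeSketch {
    /** Hypothetical holder for one search hit: a cache key plus a relevance score. */
    public static final class ScoredHit {
        public final String key;
        public final float score;

        public ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's response, assumed to be sorted by descending score. */
    private static final class NodeCursor {
        final List<ScoredHit> hits;
        int pos;

        NodeCursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit current() {
            return hits.get(pos);
        }
    }

    /** Merges per-node lists (each sorted by descending score), returning at most limit hits. */
    public static List<ScoredHit> mergeTopK(List<List<ScoredHit>> perNode, int limit) {
        // The heap orders cursors by the score of their current head, highest first.
        PriorityQueue<NodeCursor> heap = new PriorityQueue<>(
            (a, b) -> Float.compare(b.current().score, a.current().score));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new NodeCursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (merged.size() < limit && !heap.isEmpty()) {
            NodeCursor cur = heap.poll();
            merged.add(cur.current());

            // Re-insert the cursor only if this node still has hits left.
            if (++cur.pos < cur.hits.size())
                heap.add(cur);
        }

        return merged;
    }

    public static void main(String[] args) {
        List<List<ScoredHit>> perNode = Arrays.asList(
            Arrays.asList(new ScoredHit("a", 0.9f), new ScoredHit("b", 0.4f)),
            Arrays.asList(new ScoredHit("c", 0.7f), new ScoredHit("d", 0.1f)));

        for (ScoredHit hit : mergeTopK(perNode, 3))
            System.out.println(hit.key + " " + hit.score);
    }
}
```

With this shape, each node only needs to send its own top "limit" hits, and the client reads only as many entries as it returns.
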
> >
> > On Thu, Sep 19, 2019 at 11:38, Ivan Pavlukhin <vololo100@gmail.com> wrote:
> >
> > > Yuriy,
> > >
> > > Greatly appreciate your interest.
> > >
> > > Could you please elaborate a little bit on sorting? What tasks does
> > > it help to solve, and how? It would be great if you could provide an example.
> > >
> > > On Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > >
> > > > Denis,
> > > >
> > > > I like the idea of throwing an exception when text queries are enabled
> > > > on persistent caches.
> > > >
> > > > Also, I'm fine with the proposed limit for unsorted searches.
> > > >
> > > > Yury, please proceed with ticket creation.
> > > >
> > > > On Tue, Sep 17, 2019 at 22:06, Denis Magda <dmagda@apache.org> wrote:
> > > >
> > > > > Igniters,
> > > > >
> > > > > I see nothing wrong with Yury's proposal regarding the full-text search
> > > > > API evolution, as long as Yury is ready to push it forward.
> > > > >
> > > > > As for the in-memory-only mode, it makes total sense for in-memory data
> > > > > grid deployments where Ignite caches data of an underlying DB like
> > > > > Postgres.
> > > > > As part of the changes, I would simply throw an exception (by default)
> > > > > if someone attempts to use text indices with native persistence enabled.
> > > > > If the person is ready to live with that limitation, then an explicit
> > > > > configuration change is needed to get around the exception.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > >
> > > > > -
> > > > > Denis
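
As a rough illustration of the "throw by default, allow with explicit opt-in" idea above, here is a minimal sketch in plain Java. CacheSettings and its allowNonPersistentTextIndex flag are hypothetical and are not part of the Ignite configuration API; the sketch only shows the shape of the check.

```java
public class TextIndexValidationSketch {
    /** Hypothetical, simplified view of a cache configuration (not the real Ignite class). */
    public static final class CacheSettings {
        public final boolean persistenceEnabled;
        public final boolean textIndexEnabled;
        /** Hypothetical opt-in flag: the user accepts that text indexes are not persisted. */
        public final boolean allowNonPersistentTextIndex;

        public CacheSettings(boolean persistenceEnabled, boolean textIndexEnabled,
            boolean allowNonPersistentTextIndex) {
            this.persistenceEnabled = persistenceEnabled;
            this.textIndexEnabled = textIndexEnabled;
            this.allowNonPersistentTextIndex = allowNonPersistentTextIndex;
        }
    }

    /** Fails fast unless the user explicitly opted in to non-persistent text indexes. */
    public static void validate(CacheSettings cfg) {
        if (cfg.textIndexEnabled && cfg.persistenceEnabled && !cfg.allowNonPersistentTextIndex) {
            throw new IllegalStateException(
                "Text indexes are not persisted; set the explicit opt-in flag to use them "
                    + "with native persistence enabled.");
        }
    }

    public static void main(String[] args) {
        // In-memory cache with a text index: accepted.
        validate(new CacheSettings(false, true, false));

        // Persistent cache with a text index and explicit opt-in: accepted.
        validate(new CacheSettings(true, true, true));

        try {
            // Persistent cache with a text index and no opt-in: rejected.
            validate(new CacheSettings(true, true, false));
        }
        catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```
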
> > > > >
> > > > >
> > > > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > >
> > > > > > Hello to all again,
> > > > > >
> > > > > > > Thank you for the important comments and notes below!
> > > > > > >
> > > > > > > Let me answer and continue the discussion.
> > > > > > >
> > > > > > > (I) Overall needs in Lucene indexing
> > > > > > >
> > > > > > > Alexei referred to
> > > > > > > https://issues.apache.org/jira/browse/IGNITE-5371, where the
> > > > > > > absence of index persistence was declared an obstacle to further
> > > > > > > development.
> > > > > >
> > > > > > > a) This ticket is already closed as not valid.
> > > > > > > b) There is a definite need (in our project as well) for purely
> > > > > > > in-memory indexing of selected data.
> > > > > > > We intend to use search capabilities to fetch a limited number of
> > > > > > > records for use in type-ahead search / suggestions.
> > > > > > > Not all of the data will be indexed, and there is no need for the
> > > > > > > Lucene index to be persistent. I hope this is a common pattern of
> > > > > > > text-search usage.
> > > > > >
> > > > > > > (II) Necessary fixes in the current implementation
> > > > > > >
> > > > > > > a) Implementation of a correct *limit* (*offset* does not seem to be
> > > > > > > required in text-search tasks for now)
> > > > > > > I have investigated the data flow for distributed text queries, using
> > > > > > > a simple test prefix query like 'name'='ene*'.
> > > > > > > For now, each server node returns all response records to the client
> > > > > > > node, and the response may contain thousands or hundreds of thousands
> > > > > > > of records, even if we need only the first 10-100. Again, all the
> > > > > > > results are added to the queue in GridCacheQueryFutureAdapter in
> > > > > > > arbitrary order, page by page.
> > > > > > > I did not find any means here to deliver a deterministic result.
> > > > > > > So implementing the limit as part of the query (and of
> > > > > > > GridCacheQueryRequest) will not change the nature of the response, but
> > > > > > > it will limit the load on nodes and networking.
> > > > > > >
> > > > > > > Can we consider opening a ticket for this?
> > > > > >
> > > > > > > (III) Further extension of the Lucene API exposed to Ignite
> > > > > > >
> > > > > > > a) Sorting
> > > > > > > The solution for this could be:
> > > > > > > - Make entities comparable
> > > > > > > - Add a custom comparator to the entity
> > > > > > > - Add annotations to mark sorted fields for Lucene indexing
> > > > > > > - Use comparators when merging responses or reducing to the desired
> > > > > > > limit on the client node.
> > > > > > > This will require the full result set to be loaded into memory, though
> > > > > > > it can be used for relatively small limits.
> > > > > > BR,
> > > > > > Yuriy Shuliha
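
The "reduce to the desired limit with a comparator" step above could be sketched as follows. This is a generic plain-Java example, not Ignite code: topN is a hypothetical helper, and it swaps in a bounded heap so that only "limit" elements are held at any time while the responses are being consumed, instead of collecting the full result set first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNReduceSketch {
    /**
     * Returns the first limit elements of items according to order,
     * keeping at most limit elements in memory while scanning.
     */
    public static <T> List<T> topN(Iterable<T> items, Comparator<T> order, int limit) {
        if (limit <= 0)
            return new ArrayList<>();

        // Reversed comparator: the heap root is the worst element retained so far.
        PriorityQueue<T> heap = new PriorityQueue<>(limit, order.reversed());

        for (T item : items) {
            if (heap.size() < limit)
                heap.add(item);
            else if (order.compare(item, heap.peek()) < 0) {
                // The new item sorts before the current worst one, so it replaces it.
                heap.poll();
                heap.add(item);
            }
        }

        List<T> result = new ArrayList<>(heap);
        result.sort(order);

        return result;
    }

    public static void main(String[] args) {
        List<Integer> unordered = Arrays.asList(5, 1, 4, 2, 3);

        System.out.println(topN(unordered, Comparator.<Integer>naturalOrder(), 3));
    }
}
```

The items can arrive in any order (for example, pages from different nodes interleaved), which matches the arbitrary arrival order described above.
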
> > > > > >
> > > > > > > On Fri, Aug 30, 2019 at 10:37, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > > > >
> > > > > > > Yuriy,
> > > > > > >
> > > > > > > > Note that one of the major blockers for text queries is [1], which
> > > > > > > > makes Lucene indexes unusable with persistence and is the main reason
> > > > > > > > for discontinuation.
> > > > > > > > It should probably be addressed first to make text queries a valid
> > > > > > > > product feature.
> > > > > > > >
> > > > > > > > Distributed sorting and advanced querying are indeed not trivial tasks.
> > > > > > > > Some kind of merging must be implemented on the query-originating node.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > > > > >
> > > > > > > > On Thu, Aug 29, 2019 at 23:38, Denis Magda <dmagda@apache.org> wrote:
> > > > > > >
> > > > > > > > Yuriy,
> > > > > > > >
> > > > > > > > > If you are ready to take over the full-text search indexes, then
> > > > > > > > > please go ahead. The primary reasons why the community wants to
> > > > > > > > > discontinue them first (and, probably, resurrect them later) are the
> > > > > > > > > limitations listed by Andrey and minimal support from the community's
> > > > > > > > > end.
> > > > > > > >
> > > > > > > > -
> > > > > > > > Denis
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <andrey.mashenkov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Yuriy,
> > > > > > > > >
> > > > > > > > > > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > > > > > > > > > The motivation is that text indexes are not persistent, not
> > > > > > > > > > transactional, and can't be used together with SQL or inside SQL,
> > > > > > > > > > and there is a lack of interest from the community side.
> > > > > > > > > > You are welcome to take on these issues and make TextQueries great.
> > > > > > > > >
> > > > > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > > > > Query results are returned from the data node to the client-side
> > > > > > > > > > cursor page by page, and this parameter is designed to control the
> > > > > > > > > > page size. The query is supposed to execute lazily on the server
> > > > > > > > > > side, and the full result set is not expected to be loaded into
> > > > > > > > > > memory on the server side at once, but page by page.
> > > > > > > > > > Do you mean you found that Lucene loads the entire result set into
> > > > > > > > > > memory before the first page is sent to the client?
> > > > > > > > >
> > > > > > > > > > I think a new parameter should be added to limit the result. The
> > > > > > > > > > best solution is to use query language commands for this, e.g.
> > > > > > > > > > "LIMIT/OFFSET" in SQL.
> > > > > > > > >
> > > > > > > > > > This task doesn't look trivial. A query is a distributed operation:
> > > > > > > > > > the same user query will be executed on the data nodes, and then the
> > > > > > > > > > results from all nodes must be correctly merged before being
> > > > > > > > > > returned via the client cursor.
> > > > > > > > > > So LIMIT should be applied on every node and then again in the merge
> > > > > > > > > > phase.
> > > > > > > > >
> > > > > > > > > > Also, this may be non-obvious: limiting results makes no sense
> > > > > > > > > > without sorting, as there is no guarantee that every subsequent
> > > > > > > > > > query run will return the same data, because of page reordering.
> > > > > > > > > > Basically, the merge phase receives results from data nodes
> > > > > > > > > > asynchronously, and messages from different nodes can't be ordered.
> > > > > > > > >
> > > > > > > > > 2.
> > > > > > > > > a. "tokenize" param name (for @QueryTextFiled) looks more
> > > verbose,
> > > > > > > isn't
> > > > > > > > > it.
> > > > > > > > > b,c. What about distributed query? How partial results fr=
om
> > > nodes
> > > > > > will
> > > > > > > be
> > > > > > > > > merged?
> > > > > > > > >  Does Lucene allows to configure comparator for data
> sorting?
> > > > > > > > > What comparator Ignite should choose to sort result on
> merge
> > > phase?
> > > > > > > > >
> > > > > > > > > > 3. For now, the Lucene engine is not configurable at all. E.g. it is
> > > > > > > > > > impossible to configure the Tokenizer.
> > > > > > > > > > I'd think about possible ways to configure the engine first and only
> > > > > > > > > > then go further to discuss/implement complex features that may
> > > > > > > > > > depend on the engine config.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Dear community,
> > > > > > > > > >
> > > > > > > > > > > By starting this thread I'd like to open a discussion that would
> > > > > > > > > > > lead to contributions in the subject area.
> > > > > > > > > > >
> > > > > > > > > > > Ignite has indexing capabilities backed by different mechanisms,
> > > > > > > > > > > including Lucene.
> > > > > > > > > > >
> > > > > > > > > > > Currently, Lucene 7.5.0 is used (last year's release).
> > > > > > > > > > > This is a widespread and mature technology that covers the text
> > > > > > > > > > > search area and beyond (e.g. spatial data indexing).
> > > > > > > > > > >
> > > > > > > > > > > My goal is to *expose more Lucene functionality to Ignite indexing
> > > > > > > > > > > and query mechanisms for text data*.
> > > > > > > > > > >
> > > > > > > > > > > It's quite a simple request at the current stage. It comes from our
> > > > > > > > > > > project's needs but, I believe, will be useful for a lot more people.
> > > > > > > > > > > Let's walk through the items and vote on or discuss Jira tickets for
> > > > > > > > > > > them.
> > > > > > > > > >
> > > > > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit search response
> > > > > > > > > > > items inside GridLuceneIndex.query(). Currently it calls
> > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all
> > > > > > > > > > > scored matches will be returned, which we do not need in most cases.
> > > > > > > > > >
> > > > > > > > > > > 2. [simple] Add sorting. Then a more capable search call can be
> > > > > > > > > > > executed: *IndexSearcher.search(query, count, sort)*
> > > > > > > > > > > Implementation steps:
> > > > > > > > > > > a) Introduce a boolean *sortField* parameter in the
> > > > > > > > > > > *@QueryTextField* annotation. If *true*, the field will be indexed
> > > > > > > > > > > but not tokenized. Numeric types are preferred here.
> > > > > > > > > > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > > > > > > > > > define the desired sort fields used for querying.
> > > > > > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > > > > > > > >
> > > > > > > > > > > 3. [moderate] Build complex queries with *TextQuery*, including
> > > > > > > > > > > term/query boosting.
> > > > > > > > > > > *This section is for voting only, as it requires more detailed work.
> > > > > > > > > > > It should be extended if the community is interested in it.*
> > > > > > > > > >
> > > > > > > > > > Looking forward to your comments!
> > > > > > > > > >
> > > > > > > > > > BR,
> > > > > > > > > > Yuriy Shuliha
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Andrey V. Mashenkov
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Alexei Scherbakov
> > > > > > >
> > > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>

--000000000000ee25c70593cb2c75--