ignite-dev mailing list archives

From Yuriy Shuliga <shul...@gmail.com>
Subject Re: Text queries/indexes (GridLuceneIndex, @QueryTextFiled)
Date Fri, 04 Oct 2019 08:01:02 GMT
Ivan,

Yes, your observation is correct.

This behavior has been there from the very beginning, since Lucene indexing
was first implemented for distributed queries.
Implementing the *limit* solves the problem of redundant response size.
Without it *ALL* of the records are fetched each time, which is not good,
especially for loose patterns.
To solve the relevance issue, correct sorting should be implemented.
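
To illustrate the relevance point, here is a minimal sketch (not code from the
PR). It assumes each node returns its hits already sorted by descending score,
as Lucene does, and contrasts cutting at the limit in arrival order with a
score-ordered merge on the reducing node. ScoreMergeSketch, ScoredHit,
NodeCursor, firstArrived and mergeTopN are hypothetical names made up for this
example; none of them is Ignite or Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustration only: hypothetical names, not Ignite or Lucene API. */
class ScoreMergeSketch {
    /** A hit as a data node might report it: cache key plus relevance score. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's response (assumed sorted by descending score). */
    private static final class NodeCursor {
        final List<ScoredHit> hits;
        int pos;

        NodeCursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }
    }

    /** Simplified arrival-order behavior: take hits as node responses arrive, cut at the limit. */
    static List<ScoredHit> firstArrived(List<List<ScoredHit>> arrivalOrder, int limit) {
        List<ScoredHit> out = new ArrayList<>();
        for (List<ScoredHit> nodeHits : arrivalOrder) {
            for (ScoredHit hit : nodeHits) {
                if (out.size() == limit)
                    return out;
                out.add(hit);
            }
        }
        return out;
    }

    /** K-way merge: always take the best remaining head among all nodes. */
    static List<ScoredHit> mergeTopN(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap of node cursors, ordered by the score of each cursor's current head.
        PriorityQueue<NodeCursor> heap = new PriorityQueue<>(
            (a, b) -> Float.compare(b.head().score, a.head().score));
        for (List<ScoredHit> nodeHits : perNode) {
            if (!nodeHits.isEmpty())
                heap.add(new NodeCursor(nodeHits));
        }
        List<ScoredHit> out = new ArrayList<>();
        while (out.size() < limit && !heap.isEmpty()) {
            NodeCursor best = heap.poll();
            out.add(best.head());
            if (++best.pos < best.hits.size())
                heap.add(best); // re-queue the node under its next-best hit
        }
        return out;
    }
}
```

With node A returning scores 0.9 and 0.2 and node B returning 0.8 and 0.7, a
limit of 2 in arrival order (A first) keeps the 0.9 and 0.2 hits, while the
merge keeps 0.9 and 0.8. The merge only needs each node's top N hits, so the
limit can still be applied on every node before the responses are sent.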

Y.

On Fri, Oct 4, 2019 at 10:45, Ivan Pavlukhin <vololo100@gmail.com> wrote:

> Yuriy,
>
> Am I getting it right that in your PR, if we have a limit N, then the
> returned items (at most N) will not strictly be the most relevant
> ones? E.g. if one node returned N items faster than the others but with
> lower relevance?
>
> On Thu, Oct 3, 2019 at 17:47, Andrey Mashenkov <andrey.mashenkov@gmail.com> wrote:
> >
> > Yuri,
> >
> > I'm done with the review.
> > No crime found, just a trivial compatibility bug.
> >
> > On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> >
> > > Denis,
> > >
> > > Thank you for your attention to this.
> > > As of now, the https://issues.apache.org/jira/browse/IGNITE-12189
> > > ticket is still pending review.
> > > Do we have a chance to move it forward somehow?
> > >
> > > BR,
> > > Yuriy Shuliha
> > >
> > > On Mon, Sep 30, 2019 at 23:35, Denis Magda <dmagda@apache.org> wrote:
> > >
> > > > Yuriy,
> > > >
> > > > I've seen that you opened a pull request with the first changes:
> > > > https://issues.apache.org/jira/browse/IGNITE-12189
> > > >
> > > > Alex Scherbakov and Ivan, are you the right guys to do the review?
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <vololo100@gmail.com> wrote:
> > > >
> > > > > Yuriy,
> > > > >
> > > > > Thank you for providing details! Quite interesting.
> > > > >
> > > > > Yes, we already support distributed limit and merging of sorted
> > > > > subresults for SQL queries. E.g. ReduceIndexSorted and
> > > > > MergeStreamIterator are used for merging sorted streams.
> > > > >
> > > > > Could you please also clarify the score/relevance part? Is it
> > > > > provided by the Lucene engine for each query result? I am thinking
> > > > > about how to do a sorted merge properly in this case.
> > > > >
> > > > > On Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > >
> > > > > > Ivan,
> > > > > >
> > > > > > > Thank you for the interesting question!
> > > > > > >
> > > > > > > Text searches (or full-text searches) are mostly human-oriented,
> > > > > > > and the point of the user's interest is the topmost part of the
> > > > > > > response.
> > > > > > > The user can then read it, evaluate it, and use the given records
> > > > > > > for further purposes.
> > > > > > >
> > > > > > > In our case in particular, we use Ignite for operations on
> > > > > > > financial data, and there is a lot of text there, such as asset
> > > > > > > names, financial instruments, companies, etc.
> > > > > > > To work with this quickly and reliably, users are used to working
> > > > > > > with text search, type-ahead completions, and suggestions.
> > > > > > >
> > > > > > > For these purposes we index particular string data in separate
> > > > > > > caches.
> > > > > > >
> > > > > > > Sorting capabilities and response size limits are very important
> > > > > > > there, as our API has to provide the most relevant information in
> > > > > > > a view of limited size.
> > > > > >
> > > > > > > Now let me comment from the Ignite/Lucene perspective.
> > > > > > > Actually, Ignite queries Lucene, and Lucene returns
> > > > > > > *TopDocs.scoreDocs* already sorted by *score* (relevance), so the
> > > > > > > most relevant documents are on top.
> > > > > > > Currently, the responses of a distributed query from different
> > > > > > > nodes are merged into the final query cursor queue in an arbitrary
> > > > > > > way.
> > > > > > > So in fact the score order is already ruined here. Also, Ignite
> > > > > > > requests all possible documents from Lucene, which is redundant
> > > > > > > and bad for performance.
> > > > > > >
> > > > > > > I'm implementing a *limit* parameter as part of *TextQuery* and
> > > > > > > have to note that we still need to add sorting to text query
> > > > > > > processing in order to get usable results.
> > > > > > >
> > > > > > > The *limit* parameter itself should alleviate some of the issues
> > > > > > > above, but sorting, at least by document score, should definitely
> > > > > > > be implemented along with the limit.
> > > > > >
> > > > > > > This is a pretty short commentary; if you still have any
> > > > > > > questions, please do not hesitate to ask)
> > > > > >
> > > > > > BR,
> > > > > > Yuriy Shuliha
> > > > > >
> > > > > > > On Thu, Sep 19, 2019 at 11:38, Павлухин Иван <vololo100@gmail.com> wrote:
> > > > > >
> > > > > > > Yuriy,
> > > > > > >
> > > > > > > Greatly appreciate your interest.
> > > > > > >
> > > > > > > > Could you please elaborate a little bit on sorting? What tasks
> > > > > > > > does it help to solve, and how? It would be great if you could
> > > > > > > > provide an example.
> > > > > > >
> > > > > > > > On Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > > I like the idea of throwing an exception when text queries are
> > > > > > > > > enabled on persistent caches.
> > > > > > > > >
> > > > > > > > > Also I'm fine with the proposed limit for unsorted searches.
> > > > > > > >
> > > > > > > > Yury, please proceed with ticket creation.
> > > > > > > >
> > > > > > > > > On Tue, Sep 17, 2019 at 22:06, Denis Magda <dmagda@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Igniters,
> > > > > > > > >
> > > > > > > > > > I see nothing wrong with Yury's proposal regarding the
> > > > > > > > > > full-text search API evolution, as long as Yury is ready to
> > > > > > > > > > push it forward.
> > > > > > > > > >
> > > > > > > > > > As for the in-memory-only mode, it makes total sense for
> > > > > > > > > > in-memory data grid deployments where Ignite caches data of
> > > > > > > > > > an underlying DB like Postgres.
> > > > > > > > > > As part of the changes, I would simply throw an exception
> > > > > > > > > > (by default) if one attempts to use text indices with native
> > > > > > > > > > persistence enabled.
> > > > > > > > > > If the person is ready to live with that limitation, then an
> > > > > > > > > > explicit configuration change is needed to get around the
> > > > > > > > > > exception.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -
> > > > > > > > > Denis
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello to all again,
> > > > > > > > > >
> > > > > > > > > > > Thank you for the important comments and notes given below!
> > > > > > > > > >
> > > > > > > > > > Let me answer and continue the discussion.
> > > > > > > > > >
> > > > > > > > > > (I) Overall needs in Lucene indexing
> > > > > > > > > >
> > > > > > > > > > > Alexei has referred to
> > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5371, where
> > > > > > > > > > > the absence of index persistence was declared an obstacle
> > > > > > > > > > > to further development.
> > > > > > > > > > >
> > > > > > > > > > > a) This ticket is already closed as not valid.
> > > > > > > > > > > b) There is a definite need (in our project as well) for
> > > > > > > > > > > in-memory-only indexing of selected data.
> > > > > > > > > > > We intend to use search capabilities for fetching a
> > > > > > > > > > > limited number of records to be used in type-ahead search
> > > > > > > > > > > / suggestions.
> > > > > > > > > > > Not all of the data will be indexed, and there is no need
> > > > > > > > > > > for the Lucene index to be persistent. I hope this is a
> > > > > > > > > > > common pattern of text-search usage.
> > > > > > > > > >
> > > > > > > > > > (II) Necessary fixes in current implementation.
> > > > > > > > > >
> > > > > > > > > > > a) Implementation of a correct *limit* (*offset* seems not
> > > > > > > > > > > to be required in text-search tasks for now)
> > > > > > > > > > > I have investigated the data flow for distributed text
> > > > > > > > > > > queries. It was a simple test prefix query, like
> > > > > > > > > > > 'name'='ene*'.
> > > > > > > > > > > For now each server node returns all response records to
> > > > > > > > > > > the client node, and the response may contain thousands or
> > > > > > > > > > > even hundreds of thousands of records, even if we need
> > > > > > > > > > > only the first 10-100. Again, all the results are added to
> > > > > > > > > > > the queue in GridCacheQueryFutureAdapter in arbitrary
> > > > > > > > > > > order, page by page.
> > > > > > > > > > > I did not find any means here to deliver a deterministic
> > > > > > > > > > > result.
> > > > > > > > > > > So implementing the limit as part of the query (and of
> > > > > > > > > > > GridCacheQueryRequest) will not change the nature of the
> > > > > > > > > > > response but will limit the load on nodes and networking.
> > > > > > > > > > >
> > > > > > > > > > > Can we consider opening a ticket for this?
> > > > > > > > > >
> > > > > > > > > > > (III) Further exposure of the Lucene API in Ignite
> > > > > > > > > > >
> > > > > > > > > > > a) Sorting
> > > > > > > > > > > The solution for this could be:
> > > > > > > > > > > - Make entities comparable
> > > > > > > > > > > - Add a custom comparator to the entity
> > > > > > > > > > > - Add annotations to mark sorted fields for Lucene indexing
> > > > > > > > > > > - Use comparators when merging responses or reducing to
> > > > > > > > > > > the desired limit on the client node.
> > > > > > > > > > > This will require the full result set to be loaded into
> > > > > > > > > > > memory, though it can be used for relatively small limits.
> > > > > > > > > > BR,
> > > > > > > > > > Yuriy Shuliha
> > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 30, 2019 at 10:37, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Yuriy,
> > > > > > > > > > >
> > > > > > > > > > > > Note that one of the major blockers for text queries is
> > > > > > > > > > > > [1], which makes Lucene indexes unusable with
> > > > > > > > > > > > persistence and is the main reason for the
> > > > > > > > > > > > discontinuation.
> > > > > > > > > > > > It probably should be addressed first to make text
> > > > > > > > > > > > queries a valid product feature.
> > > > > > > > > > > >
> > > > > > > > > > > > Distributed sorting and advanced querying is indeed not
> > > > > > > > > > > > a trivial task.
> > > > > > > > > > > > Some kind of merging must be implemented on the
> > > > > > > > > > > > query-originating node.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 29, 2019 at 23:38, Denis Magda <dmagda@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Yuriy,
> > > > > > > > > > > >
> > > > > > > > > > > > > If you are ready to take over the full-text search
> > > > > > > > > > > > > indexes, then please go ahead. The primary reason why
> > > > > > > > > > > > > the community wants to discontinue them first (and,
> > > > > > > > > > > > > probably, resurrect them later) is the limitations
> > > > > > > > > > > > > listed by Andrey and minimal support from the
> > > > > > > > > > > > > community's end.
> > > > > > > > > > > >
> > > > > > > > > > > > -
> > > > > > > > > > > > Denis
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <andrey.mashenkov@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yuriy,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Unfortunately, there is a plan to discontinue
> > > > > > > > > > > > > > TextQueries in Ignite [1].
> > > > > > > > > > > > > > The motivation is that text indexes are not
> > > > > > > > > > > > > > persistent, not transactional, and can't be used
> > > > > > > > > > > > > > together with SQL or inside SQL,
> > > > > > > > > > > > > > and there is a lack of interest from the community
> > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > You are welcome to take on these issues and make
> > > > > > > > > > > > > > TextQueries great.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > > > > > > > > Query results are returned from a data node to the
> > > > > > > > > > > > > > client-side cursor page by page, and this parameter
> > > > > > > > > > > > > > is designed to control the page size. The query is
> > > > > > > > > > > > > > supposed to execute lazily on the server side, and
> > > > > > > > > > > > > > the full result set is not expected to be loaded
> > > > > > > > > > > > > > into memory on the server side at once, but page by
> > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > Do you mean you found that Lucene loads the entire
> > > > > > > > > > > > > > result set into memory before the first page is sent
> > > > > > > > > > > > > > to the client?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think a new parameter should be added to limit the
> > > > > > > > > > > > > > result. The best solution is to use query language
> > > > > > > > > > > > > > commands for this, e.g. "LIMIT/OFFSET" in SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > This task doesn't look trivial. A query is a
> > > > > > > > > > > > > > distributed operation: the same user query is
> > > > > > > > > > > > > > executed on the data nodes, and then the results
> > > > > > > > > > > > > > from all nodes must be correctly merged before being
> > > > > > > > > > > > > > returned via the client cursor.
> > > > > > > > > > > > > > So LIMIT should be applied on every node and then
> > > > > > > > > > > > > > again in the merge phase.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, this may be non-obvious, but limiting results
> > > > > > > > > > > > > > makes no sense without sorting, as there is no
> > > > > > > > > > > > > > guarantee that every next query run will return the
> > > > > > > > > > > > > > same data, because of page reordering.
> > > > > > > > > > > > > > Basically, the merge phase receives results from
> > > > > > > > > > > > > > data nodes asynchronously, and messages from
> > > > > > > > > > > > > > different nodes can't be ordered.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2.
> > > > > > > > > > > > > a. "tokenize" param name
(for @QueryTextFiled)
> looks
> > > more
> > > > > > > verbose,
> > > > > > > > > > > isn't
> > > > > > > > > > > > > it.
> > > > > > > > > > > > > b,c. What about distributed
query? How partial
> results
> > > > from
> > > > > > > nodes
> > > > > > > > > > will
> > > > > > > > > > > be
> > > > > > > > > > > > > merged?
> > > > > > > > > > > > >  Does Lucene allows to configure
comparator for
> data
> > > > > sorting?
> > > > > > > > > > > > > What comparator Ignite should
choose to sort
> result on
> > > > > merge
> > > > > > > phase?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3. For now Lucene engine
is not configurable at
> all.
> > > E.g.
> > > > > it is
> > > > > > > > > > > > impossible
> > > > > > > > > > > > > to configure Tokenizer.
> > > > > > > > > > > > > I'd think about possible
ways to configure engine
> at
> > > > first
> > > > > and
> > > > > > > only
> > > > > > > > > > > then
> > > > > > > > > > > > go
> > > > > > > > > > > > > further to discuss\implement
complex features,
> > > > > > > > > > > > > that may depends on engine
config.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear community,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > By starting this thread I'd like to open a
> > > > > > > > > > > > > > > discussion that would lead to contributions in the
> > > > > > > > > > > > > > > subject area.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ignite has indexing capabilities, backed by
> > > > > > > > > > > > > > > different mechanisms, including Lucene.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (released last
> > > > > > > > > > > > > > > year).
> > > > > > > > > > > > > > > This is a widespread and mature technology that
> > > > > > > > > > > > > > > covers the text search area and beyond (e.g.
> > > > > > > > > > > > > > > spatial data indexing).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to
> > > > > > > > > > > > > > > Ignite indexing and query mechanisms for text
> > > > > > > > > > > > > > > data*.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's quite a simple request at the current stage.
> > > > > > > > > > > > > > > It comes from our project's needs, but I believe
> > > > > > > > > > > > > > > it will be useful for a lot more people.
> > > > > > > > > > > > > > > Let's walk through the items and vote on or
> > > > > > > > > > > > > > > discuss Jira tickets for them.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit
> > > > > > > > > > > > > > > search response items inside
> > > > > > > > > > > > > > > GridLuceneIndex.query(). Currently it calls
> > > > > > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*),
> > > > > > > > > > > > > > > so basically all scored matches will be returned,
> > > > > > > > > > > > > > > which we do not need in most cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. [simple] Add sorting. Then a more capable
> > > > > > > > > > > > > > > search call can be executed:
> > > > > > > > > > > > > > > *IndexSearcher.search(query, count, sort)*
> > > > > > > > > > > > > > > Implementation steps:
> > > > > > > > > > > > > > > a) Introduce a boolean *sortField* parameter in
> > > > > > > > > > > > > > > the *@QueryTextField* annotation. If *true*, the
> > > > > > > > > > > > > > > field will be indexed but not tokenized. Number
> > > > > > > > > > > > > > > types are preferred here.
> > > > > > > > > > > > > > > b) Add a *sort* collection to the *TextQuery*
> > > > > > > > > > > > > > > constructor. It should define the desired sort
> > > > > > > > > > > > > > > fields used for querying.
> > > > > > > > > > > > > > > c) Implement Lucene sort usage in
> > > > > > > > > > > > > > > GridLuceneIndex.query().
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. [moderate] Build complex queries with
> > > > > > > > > > > > > > > *TextQuery*, including terms/queries boosting.
> > > > > > > > > > > > > > > *This section is for voting only, as it requires
> > > > > > > > > > > > > > > more detailed work. It should be extended if the
> > > > > > > > > > > > > > > community is interested in it.*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Looking forward to your comments!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > BR,
> > > > > > > > > > > > > > Yuriy Shuliha
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Andrey V. Mashenkov
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Ivan Pavlukhin
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>
