ignite-dev mailing list archives

From Yuriy Shuliga <shul...@gmail.com>
Subject Re: Text queries/indexes (GridLuceneIndex, @QueryTextFiled)
Date Tue, 17 Sep 2019 14:43:58 GMT
Hello to all again,

Thank you for the important comments and notes given below!

Let me answer and continue the discussion.

(I) Overall needs in Lucene indexing

Alexei has referred to
https://issues.apache.org/jira/browse/IGNITE-5371, where the
absence of index persistence was declared an obstacle to further
development.

a) This ticket is already closed as not valid.
b) There is a definite need (in our project as well) for purely in-memory
indexing of selected data.
We intend to use search capabilities to fetch a limited number of records
for type-ahead search / suggestions.
Not all of the data will be indexed, and there is no need for the Lucene
index to be persistent. I hope this is a common pattern of text-search usage.

(II) Necessary fixes in current implementation.

a) Implementation of a correct *limit* (*offset* does not seem to be required
in text-search tasks for now)
I have investigated the data flow for distributed text queries, using a
simple test prefix query like 'name'='ene*'.
For now, each server node returns all matching records to the client node,
which may amount to thousands or hundreds of thousands of records,
even if we need only the first 10-100. All the results are added to the
queue in GridCacheQueryFutureAdapter page by page, in arbitrary order.
I did not find any means here to deliver a deterministic result.
So implementing limit as part of the query (and of GridCacheQueryRequest)
will not change the nature of the response, but will limit the load on the
nodes and the network.
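
To illustrate the effect of such a limit, here is a minimal sketch in plain Java. It is not Ignite code: the class LimitSketch and both of its methods are hypothetical names invented for this example, and records are modelled as plain strings. It only shows the shape of the idea, assuming each node truncates its local matches and the client stops consuming once it has enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Ignite. */
final class LimitSketch {
    /** Data-node side: return at most 'limit' of the locally matched records. */
    static List<String> nodeResponse(List<String> localMatches, int limit) {
        int n = Math.max(0, Math.min(limit, localMatches.size()));
        return new ArrayList<>(localMatches.subList(0, n));
    }

    /** Client side: consume responses in arrival order and stop at 'limit'. */
    static List<String> clientResult(List<List<String>> responses, int limit) {
        List<String> out = new ArrayList<>();
        for (List<String> response : responses) {
            for (String rec : response) {
                if (out.size() >= limit)
                    return out;
                out.add(rec);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int limit = 2;
        // Local matches for a prefix query such as 'ene*' on two nodes.
        List<String> node1 = Arrays.asList("Enea", "Eneko", "Enes");
        List<String> node2 = Arrays.asList("Enela", "Eneida");

        // Each node sends at most 'limit' records instead of all matches.
        List<List<String>> responses = Arrays.asList(
            nodeResponse(node1, limit), nodeResponse(node2, limit));

        System.out.println(clientResult(responses, limit));
    }
}
```

With N nodes, at most N * limit records cross the network. Which records end up in the final result still depends on the arrival order of the responses, so the result is bounded but not deterministic.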

Can we consider opening a ticket for this?

(III) Further exposure of the Lucene API in Ignite

a) Sorting
The solution for this could be:
- Make entities comparable
- Add a custom comparator to the entity
- Add annotations to mark sorted fields for Lucene indexing
- Use comparators when merging responses or reducing to the desired limit on
the client node.
This will require the full result set to be loaded into memory, though it can
be used for relatively small limits.
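
The last step (merging with a comparator and reducing to the limit) could look roughly like the sketch below. This is plain Java, not Ignite code: SortedMergeSketch and mergeTopK are hypothetical names, and the comparator is assumed to be supplied by the entity or by the query. It uses a bounded heap, which is one possible technique for this step and not something the current implementation does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; none of these names exist in Ignite. */
final class SortedMergeSketch {
    /**
     * Merges per-node responses and keeps the 'limit' smallest items
     * according to 'order', returned in sorted order.
     */
    static <T> List<T> mergeTopK(List<List<T>> nodeResponses,
                                 Comparator<T> order, int limit) {
        if (limit <= 0)
            return new ArrayList<>();

        // Reversed comparator: the head of the heap is the worst item kept so far.
        PriorityQueue<T> kept = new PriorityQueue<>(order.reversed());

        for (List<T> response : nodeResponses) {
            for (T item : response) {
                if (kept.size() < limit) {
                    kept.add(item);
                } else if (order.compare(item, kept.peek()) < 0) {
                    // The new item beats the worst kept one: swap them.
                    kept.poll();
                    kept.add(item);
                }
            }
        }

        List<T> result = new ArrayList<>(kept);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> responses = Arrays.asList(
            Arrays.asList("Eneko", "Enea"),
            Arrays.asList("Enes", "Eneida"));

        System.out.println(
            mergeTopK(responses, Comparator.<String>naturalOrder(), 3));
    }
}
```

The heap never holds more than 'limit' items on the merging side, but every response still has to be received and scanned, so this stays practical only for relatively small limits unless the nodes also sort and truncate their own results.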
BR,
Yuriy Shuliha

On Fri, 30 Aug 2019 at 10:37, Alexei Scherbakov <alexey.scherbakoff@gmail.com>
wrote:

> Yuriy,
>
> Note that one of the major blockers for text queries is [1], which makes
> Lucene indexes unusable with persistence and is the main reason for
> discontinuation. It should probably be addressed first to make text queries
> a valid product feature.
>
> Distributed sorting and advanced querying is indeed not a trivial task.
> Some kind of merging must be implemented on the query originating node.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-5371
>
> On Thu, 29 Aug 2019 at 23:38, Denis Magda <dmagda@apache.org> wrote:
>
> > Yuriy,
> >
> > If you are ready to take over the full-text search indexes then please go
> > ahead. The primary reasons why the community wants to discontinue them
> > first (and, probably, resurrect them later) are the limitations listed by
> > Andrey and minimal support from the community end.
> >
> > -
> > Denis
> >
> >
> > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov
> > <andrey.mashenkov@gmail.com> wrote:
> >
> > > Hi Yuriy,
> > >
> > > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > > The motivation is that text indexes are not persistent, not
> > > transactional, and can't be used together with SQL or inside SQL,
> > > and there is a lack of interest from the community side.
> > > You are welcome to take on these issues and make TextQueries great.
> > >
> > > 1. PageSize can't be used to limit the result set.
> > > Query results are returned from a data node to the client-side cursor
> > > page by page, and this parameter is designed to control the page size.
> > > The query is supposed to execute lazily on the server side, and the full
> > > result set is not expected to be loaded into memory on the server side
> > > at once, but by pages.
> > > Do you mean you found that Lucene loads the entire result set into
> > > memory before the first page is sent to the client?
> > >
> > > I'd think a new parameter should be added to limit the result. The best
> > > solution is to use query language commands for this, e.g.
> > > "LIMIT/OFFSET" in SQL.
> > >
> > > This task doesn't look trivial. A query is a distributed operation: the
> > > same user query will be executed on data nodes,
> > > and then results from all nodes should be correctly merged before being
> > > returned via the client cursor.
> > > So, LIMIT should be applied on every node and then in the merge phase.
> > >
> > > Also, this may be non-obvious: limiting results makes no sense without
> > > sorting,
> > > as there is no guarantee that every next query run will return the same
> > > data, because of page reordering.
> > > Basically, the merge phase receives results from data nodes
> > > asynchronously, and messages from different nodes can't be ordered.
> > >
> > > 2.
> > > a. The "tokenize" param name (for @QueryTextField) looks more verbose,
> > > doesn't it?
> > > b,c. What about distributed queries? How will partial results from nodes
> > > be merged?
> > > Does Lucene allow configuring a comparator for data sorting?
> > > What comparator should Ignite choose to sort results in the merge phase?
> > >
> > > 3. For now the Lucene engine is not configurable at all. E.g. it is
> > > impossible to configure the Tokenizer.
> > > I'd think about possible ways to configure the engine first and only
> > > then go further to discuss/implement complex features
> > > that may depend on the engine config.
> > >
> > >
> > >
> > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > >
> > > > Dear community,
> > > >
> > > > By starting this thread I'd like to open a discussion that would lead
> > > > to contributions in the subject area.
> > > >
> > > > Ignite has indexing capabilities, backed up by different mechanisms,
> > > > including Lucene.
> > > >
> > > > Currently, Lucene 7.5.0 is used (last year's release).
> > > > This is a widespread and mature technology that covers the text search
> > > > area and beyond (e.g. spatial data indexing).
> > > >
> > > > My goal is to *expose more Lucene functionality to Ignite indexing
> and
> > > > query mechanisms for text data*.
> > > >
> > > > It's quite a simple request at the current stage. It comes from our
> > > > project's needs but, I believe, will be useful for a lot more people.
> > > > Let's walk through the items and vote on or discuss Jira tickets for
> > > > them.
> > > >
> > > > 1. [trivial] Use dataQuery.getPageSize() to limit search response
> > > > items inside GridLuceneIndex.query(). Currently it calls
> > > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all
> > > > scored matches will be returned, which we do not need in most cases.
> > > >
> > > > 2. [simple] Add sorting. Then a more capable search call can be
> > > > executed: *IndexSearcher.search(query, count, sort)*
> > > > Implementation steps:
> > > > a) Introduce a boolean *sortField* parameter in the *@QueryTextField*
> > > > annotation. If *true*, the field will be indexed but not tokenized.
> > > > Numeric types are preferred here.
> > > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > > define the desired sort fields used for querying.
> > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > >
> > > > 3. [moderate] Build complex queries with *TextQuery*, including
> > > > terms/queries boosting.
> > > > *This section is for voting only, as it requires more detailed work.
> > > > It should be extended if the community is interested in it.*
> > > >
> > > > Looking forward to your comments!
> > > >
> > > > BR,
> > > > Yuriy Shuliha
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> > >
> >
>
>
> --
>
> Best regards,
> Alexei Scherbakov
>
