lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: Sampled Queries -- Use Cases and Feedback
Date Tue, 11 Jun 2019 11:36:05 GMT
Atri, in the abstract it sounds like a great idea, but in practice it will
only be as good as the data that drives it. I think that to make this work
it would be a good idea to write up a proposal of some sort targeting
different open-source (or commercial, although I doubt you would get much
from those) projects that use Lucene-based search, asking them to contribute
their data.

Also can we learn anything from the previous attempt? What did they try?
How can this effort avoid the same pitfalls?

Even with document and query data, you still need some kind of relevance
ground truth, and this is notoriously difficult to get. Click-through
stats are probably the most generic proxy for that.

So as a thought experiment, maybe contact Wikipedia and ask if they would
be willing to share some sample of queries and logs. Or did you have
another idea how to drive this? Then with one pilot participant, you could
maybe get others to join. I think if you have some commitments, or at least
serious expressions of interest, from data providers, then you can start to
think about what to actually do with the data, but I would start there.

On Mon, Jun 10, 2019, 2:54 AM Atri Sharma wrote:

> Any thoughts on this? I am envisioning applications to machine
> learning systems, where the training dataset might be a small sample
> of the entire dataset, and the user wants scoring to be done only on
> samples of the dataset.
> On Fri, Jun 7, 2019 at 5:45 PM Atri Sharma wrote:
> >
> > Hi All,
> >
> > While working on a new Query type, I was inclined to think of a couple
> > of use cases where the documents being scored need not be all of the
> > data set, but a sample of them. This can be useful for very large
> > datasets, where a query is only interested in getting the "feel" of
> > the data, and other queries where the data is being aggregated over
> > time, so a wide enough sample of the data is good enough for the user
> > at the tradeoff of improved performance. Faceting already has sampling
> > mechanisms, so there are ideas to be borrowed from that part.
> >
> > I have some ideas on introducing a new query type and associated
> > semantics to allow this functionality to be present from ground up.
> > Specifically, a query type which wraps another query and "feeds"
> > offsets to the inner query, along with a limit of collection of hits.
> > I can go in more detail, but wanted to get some thoughts and feedback
> > before delving deeper.
> >
> > Atri
> --
> Regards,
> Atri
> Apache Concerted
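
The wrapping idea in the original proposal (a query that feeds document offsets to an inner query and caps the number of collected hits) can be illustrated with a small sketch. This is plain Python, not Lucene code, and it is only one possible reading of the proposal: the function `sampled_search` and its parameters `matches`, `score`, `sample_rate`, `max_hits`, and `seed` are hypothetical names invented for illustration, and uniform random sampling is an assumption rather than something the thread specifies.

```python
import random
from typing import Callable, Iterable, List, Tuple


def sampled_search(
    doc_ids: Iterable[int],
    matches: Callable[[int], bool],
    score: Callable[[int], float],
    sample_rate: float,
    max_hits: int,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Run an inner query over a random sample of documents.

    The wrapper decides which document offsets the inner query
    (represented here by `matches` and `score`) gets to see, and stops
    once `max_hits` hits have been collected.
    """
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    if max_hits < 0:
        raise ValueError("max_hits must be non-negative")

    rng = random.Random(seed)  # seeded so that runs are repeatable
    hits: List[Tuple[int, float]] = []
    for doc_id in doc_ids:
        if len(hits) >= max_hits:
            break  # hit limit reached, stop feeding offsets
        if rng.random() >= sample_rate:
            continue  # document is not part of the sample
        if matches(doc_id):  # inner query only sees sampled offsets
            hits.append((doc_id, score(doc_id)))
    return hits
```

A call such as `sampled_search(range(1_000_000), is_match, bm25_like, sample_rate=0.01, max_hits=100)` (with `is_match` and `bm25_like` supplied by the caller) would evaluate the inner query on roughly 1% of the documents. The trade-off is the one described in the proposal: less scoring work in exchange for results that only approximate what a full scan would return.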
