lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-10123) Analytics Component 2.0
Date Thu, 29 Jun 2017 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068767#comment-16068767 ]

Hoss Man commented on SOLR-10123:
---------------------------------

(FWIW Houston, attaching patches showing your progress/attempts makes it easier for people
to follow along with exactly what you're doing and offer meaningful ideas/suggestions)

bq. However the randomized doc-values cannot be used since docValues are required for almost
all Analytics Component functionality.

That's fine -- if the feature requires docValues it requires docValues.  The main reasons
the docValue randomization was added were:
* to help catch bugs/assumptions in code related to docValues
* so tests for things like facets (which work with non-dv tries, but require dv's for points)
could do this...{code}
@BeforeClass
public static void beforeClass() throws Exception {
  // we need DVs on point fields to compute stats & facets
  if (Boolean.getBoolean(NUMERIC_POINTS_SYSPROP)) System.setProperty(NUMERIC_DOCVALUES_SYSPROP,"true");
}
{code}

bq. Almost all tests pass now, however there is a difference between SortedSetDocValues (TrieField)
and SortedNumericDocValues (PointField) that might make this impossible. ...

What you're talking about is noted in SOLR-10924.  Personally I consider it a feature of Points
fields.

How we deal with it depends largely on what folks think the "right" behavior is and how it
should be documented.  From an end user standpoint I think it's *great* -- they'll have an
accurate statistical representation of the data they put in, and if they don't want duplicate
values considered they shouldn't put the dups in. (i.e.: document it as a limitation of using
Trie numerics, not a "bug" in Points)
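
To make the difference concrete, here is a minimal plain-Java sketch (no Lucene or Solr classes involved). It assumes, per the discussion above, that the Trie/SortedSetDocValues path collapses duplicate values within a document while the Point/SortedNumericDocValues path keeps them. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class DupValuesSketch {

    // Models set-like storage (assumed Trie behavior): duplicates in one doc collapse.
    static List<Integer> asSetLike(List<Integer> fieldValues) {
        return new ArrayList<>(new TreeSet<>(fieldValues));
    }

    // Models multiset-like storage (assumed Point behavior): sorted, duplicates kept.
    static List<Integer> asMultisetLike(List<Integer> fieldValues) {
        List<Integer> sorted = new ArrayList<>(fieldValues);
        Collections.sort(sorted);
        return sorted;
    }

    // Arithmetic mean over whatever values the storage layer exposes.
    static double mean(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    public static void main(String[] args) {
        // One multi-valued document that contains a duplicate value.
        List<Integer> doc = Arrays.asList(10, 10, 40);
        System.out.println("set-like mean      = " + mean(asSetLike(doc)));      // 25.0
        System.out.println("multiset-like mean = " + mean(asMultisetLike(doc))); // 20.0
    }
}
```

The same input yields different statistics depending on which storage model is in play, which is the discrepancy the tests have to deal with.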

How it affects the tests and what should be done there is a harder question because I have
no idea how much this impacts the existing tests with your current working changes.

One approach is to leave the test data in place, leave the duplicate values in place, and
account for the discrepancy in the assertions -- a la TestExportWriter.testDuplicates().
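
A rough sketch of what "accounting for the discrepancy" could look like, in plain Java rather than the Solr test framework. This is not the code from TestExportWriter; the property name and helper below are hypothetical, and the only idea borrowed from the thread is that the expected number depends on whether points are in use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class DuplicateAwareExpectation {

    // Hypothetical system property name, for illustration only.
    static final String USE_POINTS_PROP = "example.tests.use.points";

    // How many values a stats computation is expected to see for one document:
    // every indexed value when duplicates are kept, distinct values otherwise.
    static int expectedValueCount(List<Integer> indexed, boolean usingPoints) {
        return usingPoints ? indexed.size() : new HashSet<>(indexed).size();
    }

    public static void main(String[] args) {
        boolean usingPoints = Boolean.getBoolean(USE_POINTS_PROP);
        List<Integer> indexed = Arrays.asList(3, 3, 7);
        // A test would compare the component's reported count against this value.
        System.out.println("expected count = " + expectedValueCount(indexed, usingPoints));
    }
}
```

The test data stays untouched; only the expected value branches on the randomized field implementation.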

A different approach would be to change the tests so they don't use duplicates in their test
data, so the numbers are equivalent regardless of the underlying implementation.

A third option is to eliminate the points randomization completely -- I wouldn't advise this
unless the other options are for some reason completely impossible -- and systematically
test both Trie fields and Point fields with different tests that know about the different behavior.

But as things stand right now, this jira claims the new code works with Point fields, but
this claim is not backed up by any new testing, so _something_ needs to change.





> Analytics Component 2.0
> -----------------------
>
>                 Key: SOLR-10123
>                 URL: https://issues.apache.org/jira/browse/SOLR-10123
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Houston Putman
>              Labels: features
>         Attachments: SOLR-10123.patch, SOLR-10123.patch, SOLR-10123.patch
>
>
> A completely redesigned Analytics Component, introducing the following features:
> * Support for distributed collections
> * New JSON request language, and response format that fits JSON better.
> * Faceting over mapping functions in addition to fields (Value Faceting)
> * PivotFaceting with ValueFacets
> * More advanced facet sorting
> * Support for PointField types
> * Expressions over multi-valued fields
> * New types of mapping functions
> ** Logical
> ** Conditional
> ** Comparison
> * Concurrent request execution
> * Custom user functions, defined within the request
> Fully backwards compatible with the original Analytics Component with the following exceptions:
> * All fields used must have doc-values enabled
> * Expression results can no longer be used when defining Range and Query facets
> * The reverse(string) mapping function is no longer a native function



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
