lucene-solr-user mailing list archives

From John Smith <localde...@gmail.com>
Subject Re: statistics in hitlist
Date Mon, 05 Mar 2018 21:11:37 GMT
Thanks Joel for your help on this.

What I've done so far:
- unzip downloaded solr-7.2
- modify the _default "managed-schema" to add the random field type and the
dynamic random field
- start solr7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc

I can now run the random function all by itself, it returns random results
as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
fl="oil_first_90_days_production,oil_last_30_days_production"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

Posted directly into solr admin UI. Run the streaming expression and I get
this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
expected but found type java.lang.String for value
oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the
schema, those 2 fields are defined as ints:


When I run a normal query and choose xml as output format, then it also
puts "int" elements into the hitlist, so the schema appears to be correct
it's just when using this regress function that something goes wrong and
solr thinks the field is string.

Any suggestions?
Thanks!
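
One way to cross-check whatever regress eventually returns is to run the same kind of fit outside Solr on a small export of the two fields. The following is only a minimal Python sketch of an ordinary least squares fit, not Solr's implementation; the function name `simple_linear_regression` and the sample values are made up for illustration. It returns the attributes mentioned further down the thread for the regress output (slope, y intercept, R, RSquared).

```python
import math


def simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns slope, intercept, R (Pearson correlation) and RSquared.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both fields need some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return {"slope": slope, "intercept": intercept, "R": r, "RSquared": r * r}


# Hypothetical sample values standing in for the two production fields.
first_90 = [100, 200, 300, 400]
last_30 = [12, 19, 33, 41]
result = simple_linear_regression(first_90, last_30)
```

Every value passed in has to be a number already; a string such as a field name in place of a value would fail here too, which is essentially what the error message above is complaining about.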


On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joelsolr@gmail.com> wrote:

> The field type will also need to be in the schema:
>
>  <!-- The "RandomSortField" is not used to store or search any
>          data.  You can declare fields of this type it in your schema
>          to generate pseudo-random orderings of your docs for sorting
>          or function purposes.  The ordering is generated based on the field
>          name and the version of the index. As long as the index version
>          remains unchanged, and the same field name is reused,
>          the ordering of the docs will be consistent.
>          If you want different psuedo-random orderings of documents,
>          for the same version of the index, use a dynamicField and
>          change the field name in the request.
>      -->
>
> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
>
> > You'll need to have this field in your schema:
> >
> > <dynamicField name="random_*" type="random" />
> >
> > I'll check to see if the default schema used with solr start -c has this
> > field, if not I'll add it. Thanks for pointing this out.
> >
> > I checked and right now the random expression is only accepting one fq,
> > but I consider this a bug. It should accept multiple. I'll create a ticket
> > for getting this fixed.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localdevjs@gmail.com> wrote:
> >
> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> >> had that (and also just discovered the very interesting sql feature! I will
> >> be sure to investigate that in more detail in the future).
> >>
> >> However I'm having some trouble getting basic streaming functions
> working.
> >> I've already figured out that I had to move to "solr cloud" instead of
> >> "solr standalone" because I was getting errors about "cannot find zk
> >> instance" or whatever which went away when using "solr start -c"
> instead.
> >>
> >> But now I'm trying to use the random function since that was one of the
> >> functions used in your example.
> >>
> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>
> >> I posted that directly in the "stream" section of the solr admin UI.
> This
> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
> case
> >> it was a bug in one)
> >>
> >> I get back an error message:
> >> *sort param could not be parsed as a query, and is not a field that
> exists
> >> in the index: random_-255009774*
> >>
> >> I'm not passing in any sort field anywhere. But the solr logs show these
> >> three log entries:
> >>
> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> >> QTime=19
> >>
> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> Request to collection [tx_header] failed due to (400)
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774, retry? 0
> >>
> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1]
> o.a.s.c.s.i.s.ExceptionStream
> >> java.io.IOException:
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774
> >>
> >>
> >> So basically it looks like solr is injecting the "sort=random_" stuff into
> >> my query and of course that is failing on the search since that
> >> field/column doesn't exist in my schema. Every time I run the random
> >> function, I get a slightly different field name that it injects, but they
> >> all start with "random_".
> >>
> >> I have tried adding my own sort field instead, hoping solr wouldn't
> inject
> >> one for me, but it still injected a random sort fieldname:
> >> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
> >> asc")
> >>
> >>
> >> Assuming I can fix that whole problem, my second question is: can I add
> >> multiple "fq=" parameters to the random function? I build a pretty
> >> complicated query using many fq= fields, and then want to run some stats
> >> on
> >> that hitlist; so somehow I have to pass in the query that made up the
> >> exact
> >> hitlist to these various functions, but when I used multiple "fq="
> values
> >> it only seemed to use the last one I specified and just ignored all the
> >> previous fq's?
> >>
> >> Thanks in advance for any comments/suggestions...!
> >>
> >>
> >>
> >>
> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joelsolr@gmail.com>
> >> wrote:
> >>
> >> > This is going to be a complex answer because Solr actually now has
> >> multiple
> >> > ways of doing regression analysis as part of the Streaming Expression
> >> > statistical programming library. The basic documentation is here:
> >> >
> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >> >
> >> > Here is a sample expression that performs a simple linear regression
> in
> >> > Solr 7.2:
> >> >
> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> >> > fieldB"),
> >> >     b=col(a, fieldA),
> >> >     c=col(a, fieldB),
> >> >     d=regress(b, c))
> >> >
> >> >
> >> > The expression above takes a random sample of 15000 results from
> >> > collection1. The result set will include fieldA and fieldB in each
> >> record.
> >> > The result set is stored in variable "a".
> >> >
> >> > Then the "col" function creates arrays of numbers from the results
> >> stored
> >> > in variable a. The values in fieldA are stored in the variable "b".
> The
> >> > values in fieldB are stored in variable "c".
> >> >
> >> > Then the regress function performs a simple linear regression on
> arrays
> >> > stored in variables "b" and "c".
> >> >
> >> > The output of the regress function is a map containing the regression
> >> > result. This result includes RSquared and other attributes of the
> >> > regression model such as R (correlation), slope, y intercept etc...
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localdevjs@gmail.com>
> >> wrote:
> >> >
> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the
> >> end
> >> > > result of all this is supposed to be obtaining R^2. Is there no way
> of
> >> > > obtaining this value, then (short of iterating over all the results
> in
> >> > the
> >> > > hitlist and calculating it myself)?
> >> > >
> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
> joelsolr@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Typically SSE is the sum of the squared errors of the prediction
> in
> >> a
> >> > > > regression analysis. The stats component doesn't perform
> regression,
> >> > > > although it might be a nice feature.
> >> > > >
> >> > > >
> >> > > >
> >> > > > Joel Bernstein
> >> > > > http://joelsolr.blogspot.com/
> >> > > >
> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
> localdevjs@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > I'm using solr, and enabling stats as per this page:
> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >> > > > >
> >> > > > > I want to get more stat values though. Specifically I'm looking for
> >> > > > > r-squared (coefficient of determination). This value is not present in
> >> > > > > solr, however some of the pieces used to calculate r^2 are in the stats
> >> > > > > element, for example:
> >> > > > >
> >> > > > > <double name="min">0.0</double>
> >> > > > > <double name="max">10.0</double>
> >> > > > > <long name="count">15</long>
> >> > > > > <long name="missing">17</long>
> >> > > > > <double name="sum">85.0</double>
> >> > > > > <double name="sumOfSquares">603.0</double>
> >> > > > > <double name="mean">5.666666666666667</double>
> >> > > > > <double name="stddev">2.943920288775949</double>
> >> > > > >
> >> > > > >
> >> > > > > So I have the sumOfSquares available (SST), and using this calculation, I
> >> > > > > can get R^2:
> >> > > > >
> >> > > > > R^2 = 1 - SSE/SST
> >> > > > >
> >> > > > > All I need then is SSE. Is there any way I can get SSE from those other
> >> > > > > stats in solr?
> >> > > > >
> >> > > > > Thanks in advance!
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
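
A note on the original question at the bottom of the thread: sumOfSquares in the stats output is the raw sum of the squared values, so it is not SST by itself. Assuming the usual definition SST = sum((x - mean)^2), it can be derived from the stats as sumOfSquares - count * mean^2. The Python sketch below (the function name `total_sum_of_squares` is made up for illustration) does that with the numbers quoted in the first message and compares the result with the reported stddev.

```python
import math


def total_sum_of_squares(sum_of_squares, count, mean):
    """SST = sum((x - mean)^2) = sum(x^2) - count * mean^2."""
    return sum_of_squares - count * mean * mean


# Values copied from the stats output quoted in the thread.
sst = total_sum_of_squares(603.0, 15, 5.666666666666667)

# Sample standard deviation implied by SST; it should agree with the
# stddev reported by the stats component (2.943920288775949).
stddev = math.sqrt(sst / (15 - 1))
```

SSE still depends on the paired values of both fields (the residuals of the fit), so it cannot be recovered from single-field stats alone. That matches the answer given in the thread that the stats component does not perform regression.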
