lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr schema design: fitting time-series data
Date Mon, 16 Jan 2017 04:04:26 GMT
bq: I know multivalued fields don't support complex data types

Not sure what you're talking about here. multiValued actually has
nothing to do with data types. You can have text fields which
are analyzed and produce multiple tokens and are multiValued.
You can have primitive types (string, int/long/float/double,
boolean, etc.) that are multiValued, or they can be single-valued.

All "multiValued" means is that the _input_ can have the same field
repeated, i.e.
<doc>
  <field name="mytext">some stuff</field>
  <field name="mytext">more stuff</field>
  <field name="myint">32</field>
  <field name="myint">77</field>
</doc>

This doc would fail if mytext or myint were multiValued=false, but
succeed if multiValued=true, at index time.
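The same repeated-field input can be expressed in Solr's JSON update format as arrays. A minimal sketch (the field names are the example names from the XML above; the update endpoint and collection name would depend on your setup):

```python
import json

# A multiValued field in Solr's JSON update format is simply an array
# of values -- the JSON equivalent of repeating the <field> element.
doc = {
    "mytext": ["some stuff", "more stuff"],  # multiValued text field
    "myint": [32, 77],                       # multiValued int field
}

# This is the kind of body you would POST (as application/json) to the
# collection's update handler.
payload = json.dumps(doc)
print(payload)
```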

There are some subtleties with text (analyzed) multivalued fields having
to do with token offsets, but that's not germane.

Does that change your problem? Your document could have a dozen
timestamps....

However, there isn't a good way to query across multiple multivalued fields
in parallel. That is, a doc like

myint=1
myint=2
myint=3
mylong=4
mylong=5
mylong=6

there's no good way to say "only match this document if myint=1 AND
mylong=4 AND they are both in the same position".

That is, asking for myint=1 AND mylong=6 would match the above. Is
that what you're
wondering about?
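To illustrate the point (a toy model of the matching semantics, not Solr internals): for query purposes each multiValued field behaves like an independent bag of values on the document, so a conjunction across two such fields only checks membership in each bag and never positional alignment.

```python
# Toy model: each multiValued field is a bag of values on the doc;
# positions across fields are not correlated at query time.
doc = {
    "myint": [1, 2, 3],
    "mylong": [4, 5, 6],
}

def matches(doc, field_a, value_a, field_b, value_b):
    # AND across two multiValued fields: independent membership tests.
    return value_a in doc[field_a] and value_b in doc[field_b]

# myint=1 AND mylong=6 matches even though 1 and 6 were never "paired"
# in the input -- there is no shared position to enforce.
print(matches(doc, "myint", 1, "mylong", 6))   # True
print(matches(doc, "myint", 1, "mylong", 99))  # False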

------------------
I expect you're really asking to do the second above, in which case you might
want to look at Streaming Expressions and/or Parallel SQL in Solr 6.x.
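For a flavor of the syntax (a hedged sketch only -- the collection and field names here are made up, and the exact expression syntax for a given 6.x version is in the Solr Reference Guide), a rollup that sums a metric over a field might look like:

```
rollup(
  search(myCollection,
         q="platform:mobile",
         fl="contentID,total",
         sort="contentID asc"),
  over="contentID",
  sum(total))
```

Note that rollup() expects the underlying stream to be sorted by the field(s) named in over.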

Best,
Erick

On Sun, Jan 15, 2017 at 7:31 PM, map reduced <k3t.git.1@gmail.com> wrote:
> Hi,
>
> I am trying to fit the following data in Solr to support flexible queries
> and would like to get your input on the same. I have data about users say:
>
> contentID (assume uuid),
> platform (eg. website, mobile etc),
> softwareVersion (eg. sw1.1, sw2.5, ..etc),
> regionId (eg. us144, uk123, etc..)
> ....
>
> and a few more such fields. This data is partially pre-aggregated (read:
> Hadoop jobs), so let's assume for "contentID = uuid123 and platform =
> mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in the
> format:
>
> timestamp  pre-aggregated data [ uniques, total]
>  Jan 15    [ 12, 4]
>  Jan 14    [ 4, 3]
>  Jan 13    [ 8, 7]
>  ...        ...
>
> And then I also have less granular data, say "contentID = uuid123 and
> platform = mobile and softwareVersion = ANY and regionId = ANY" (these
> values will be larger than in the table above since granularity is reduced):
>
> timestamp : pre-aggregated data [uniques, total]
>  Jan 15    [ 100, 40]
>  Jan 14    [ 45, 30]
>  ...           ...
>
> I'll get queries like "contentID = uuid123 and platform = mobile", give
> sum of 'uniques' for Jan 15 - Jan 13; or for "contentID = uuid123 and
> platform = mobile and softwareVersion = sw1.2", give sum of 'total' for
> Jan 15 - Jan 01.
>
> I was thinking of simple schema where documents will be like (first example
> above):
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "sw1.2",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 12,
>   "total": 4
> }
>
> second example from above:
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "ANY",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 100,
>   "total": 40
> }
>
> Possible optimization:
>
> {
>   "contentID": "uuid12349789",
>   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
>       "unique": 12,
>       "total": 4
>   },
>  "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
>       "unique": 100,
>       "total": 40
>   },
>   "ts" : "2017-01-15T01:01:21Z"
>   }
>
> Challenges: The number of such rows is very large, and it'll grow
> exponentially with every new field. For instance, if I go with the schema
> suggested above, I'll end up storing a new document for each combination
> of contentID, platform, softwareVersion, regionId. Now if we throw in
> another field to this document, the number of combinations increases
> exponentially. I have more than a billion such combination rows already.
>
> I am hoping to find advice from experts on whether:
>
>    1. Multiple such fields can be fit in the same document for different
>    'ts' values such that range queries are possible on them.
>    2. The time range (ts) can be fit in the same document as a list(?) (to
>    reduce the number of rows). I know multivalued fields don't support
>    complex data types, but perhaps something else can be done with the
>    data/schema to reduce query time and the number of rows.
>
> The number of these rows is very large, certainly more than 1 billion (if
> we go with the schema I was suggesting). What schema would you suggest
> that'll fit the query requirements?
>
> FYI: All queries will be exact match on fields (no partial or tokenized),
> so no analysis on fields is necessary. And almost all queries are range
> queries.
>
> Thanks,
>
> KP
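
The queries KP describes (exact-match filters plus a date range on ts, summing a pre-aggregated metric) can be sketched against the first flat schema. A minimal sketch, assuming a hypothetical collection name ("metrics"); the sum() aggregation here is from Solr's JSON Facet API, and the field names are taken from the post:

```python
import json
from urllib.parse import urlencode

# Exact-match filter queries plus a date range on "ts"; rows=0 because
# only the aggregate is wanted, not the matching documents.
params = {
    "q": "*:*",
    "fq": [
        "contentID:uuid123",
        "platform:mobile",
        "softwareVersion:sw1.2",
        "ts:[2017-01-13T00:00:00Z TO 2017-01-15T23:59:59Z]",
    ],
    "rows": 0,
    # JSON Facet API: sum the pre-aggregated "unique" field.
    "json.facet": json.dumps({"sum_uniques": "sum(unique)"}),
}

# doseq=True repeats fq= once per filter, as Solr expects.
query_string = urlencode(params, doseq=True)
print("GET /solr/metrics/select?" + query_string)
```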
