lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-11299) Time partitioned collections (umbrella issue)
Date Sat, 14 Oct 2017 03:59:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204466#comment-16204466 ]

David Smiley commented on SOLR-11299:
-------------------------------------

The timezone bit is for two things:
* the interpretation of the partition time size.  A timezone is useful and in fact necessary
for the same reasons as with facet.range.gap on dates, which supports it.  See SOLR-2690 for
context as to why {{TZ}} exists.
* allowing for shorter, friendlier collection names like mycollection_2017-10-13 instead of
needing to go down to the hour.  This isn't a big deal, granted.  I don't really like millisecond
collection names, sorry.  Hey [~hossman] I recall we both attended an LSR presentation (Rocana?)
that described a time partitioning strategy with the dubious choice of milliseconds in the name
and you were like, oh yeah, ol' collection 1507953042461 -- there's some great data in there
:-)
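
To make the timezone point concrete, here is a rough sketch (not Solr code) of how a day-sized
partition name could be derived from a document timestamp.  The function name
day_partition_name, the "alias_YYYY-MM-DD" naming scheme and the fixed UTC-4 offset are
assumptions for illustration only; a real setup would more likely use a named zone (e.g. via
zoneinfo) so that DST is handled.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def day_partition_name(alias: str, instant: datetime, tz: tzinfo) -> str:
    """Hypothetical helper: name of the day-sized partition that holds `instant`.

    The day boundary is evaluated in `tz`, so the same instant can land in
    different partitions depending on the configured timezone.
    """
    local = instant.astimezone(tz)
    return f"{alias}_{local:%Y-%m-%d}"


# One instant, interpreted in two timezones.
instant = datetime(2017, 10, 14, 3, 59, tzinfo=timezone.utc)
utc_name = day_partition_name("mycollection", instant, timezone.utc)
# Fixed UTC-4 offset used as a stand-in for a real zone with DST rules.
east_name = day_partition_name("mycollection", instant, timezone(timedelta(hours=-4)))

print(utc_name)   # mycollection_2017-10-14
print(east_name)  # mycollection_2017-10-13
```

Without a timezone, "1 day" can only mean a UTC day, and the boundary lands in the middle of
the evening for anyone a few hours west of Greenwich.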

RE alias metadata for storing partition ranges... yeah I suppose that's possible but I admit
I like the lean sufficiency of the series of names being adequate on its own.  The only problem
I can think of with using the names alone is that you must have a complete contiguous series,
with no gaps where a collection hasn't been created.  That doesn't seem like a serious limitation,
I think?  If we wanted metadata on each partition like the start and end range, I'm not inclined
to think the alias is where it goes -- more likely it's metadata on the collection.
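
As a sketch of what "the names alone" would imply, the check below finds gaps in a series of
day-sized partitions purely from collection names.  It is not Solr code: missing_days and the
"alias_YYYY-MM-DD" naming scheme are assumptions made for illustration.

```python
from datetime import date, timedelta


def missing_days(alias, collection_names):
    """Hypothetical check: days between the oldest and newest partition that have no collection."""
    prefix = alias + "_"
    # Parse the date suffix of every collection that belongs to this series.
    days = sorted(
        date.fromisoformat(name[len(prefix):])
        for name in collection_names
        if name.startswith(prefix)
    )
    if not days:
        return []
    present = set(days)
    missing = []
    day = days[0]
    while day < days[-1]:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


names = ["mycollection_2017-10-13", "mycollection_2017-10-11", "other_2017-10-01"]
print(missing_days("mycollection", names))  # [datetime.date(2017, 10, 12)]
```

If the series is contiguous, each partition's range is implied by its own name plus the
partition size, which is why no extra range metadata would be needed.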

> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
>                 Key: SOLR-11299
>                 URL: https://issues.apache.org/jira/browse/SOLR-11299
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think logs or
sensor data / IOT) itself without a lot of manual/external work.  The most naive and painless
approach today is to create a collection with a high numShards with hash routing but this
isn't as good as partitioning the underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change.  (No need to over-provision,
use shard splitting, or re-index with different config)
> * Faster queries: 
>     ** can search fewer shards, reducing overall load
>     ** realtime search is more tractable (since most shards are stable -- good caches)
>     ** "recent" shards (that might be queried more) can be allocated to faster hardware
>     ** aged out data is simply removed, not marked as deleted.  Deleted docs still have
search overhead.
> * Outages of a shard result in a degraded but sometimes still useful system (compare
to a random subset missing)
> Ideally you could set this up once and then simply work with a collection (potentially
actually an alias) in a normal way (search or update), letting Solr handle the addition of
new partitions, removing of old ones, and appropriate routing of requests depending on their
nature.
> This issue is an umbrella issue for the particular tasks that will make it all happen
-- either subtasks or issue linking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

