From: "Michael Sun (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Date: Thu, 1 Dec 2016 19:38:58 +0000 (UTC)
Subject: [jira] [Commented] (SOLR-9764) Design a memory efficient DocSet if a query returns all docs

[ https://issues.apache.org/jira/browse/SOLR-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712868#comment-15712868 ]

Michael Sun commented on SOLR-9764:
-----------------------------------

bq. Lucene does have this implemented already – RoaringDocIdSet

That's cool. Let me check it out to understand the pros and cons. Thanks [~elyograg] for pointing it out.

bq. I do not know how it would perform when actually used as a filterCache entry, compared to the current bitset implementation.

For this particular use case (a query that matches all docs), the approach in the patch should be better than a roaring bitmap. The patch introduces a MatchAllDocSet for this use case, which uses no memory other than storing the size. In addition, MatchAllDocSet would be faster at creating the DocSet, union, intersection, etc., since no real bit manipulation is required.

> Design a memory efficient DocSet if a query returns all docs
> ------------------------------------------------------------
>
>                 Key: SOLR-9764
>                 URL: https://issues.apache.org/jira/browse/SOLR-9764
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Michael Sun
>     Attachments: SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch, SOLR_9764_no_cloneMe.patch
>
>
> In some use cases, particularly those with time series data that use collection aliases and partition data into multiple small collections by timestamp, a filter query can match all documents in a collection. Currently a BitDocSet is used, which contains a large array of long integers with every bit set to 1. After querying, the resulting DocSet saved in the filter cache is large and becomes one of the main memory consumers in these use cases.
> For example, suppose a Solr setup has 14 collections for the last 14 days of data, each collection holding one day. A filter query for the last week of data would result in at least six DocSets in the filter cache, each matching all documents in one of six collections.
> This issue is to design a new DocSet that is memory efficient for such a use case. The new DocSet removes the large array, reducing memory usage and GC pressure without losing the advantages of a large filter cache.
> In particular, for use cases with time series data, collection aliases, and data partitioned into multiple small collections by timestamp, the gain can be large.
> As a further optimization, it may be helpful to design a DocSet with run-length encoding. Thanks [~mmokhtar] for the suggestion.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
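The match-all idea discussed above can be reduced to a single integer: if every document in [0, size) matches, membership is a range check and set algebra needs no bit manipulation. The sketch below is only an illustration of that idea under assumed names (the class and method names here are hypothetical, not the actual SOLR-9764 patch or the Solr DocSet API):

```java
// Hypothetical sketch of a match-all DocSet. The only state is the doc
// count, versus a BitDocSet's long[] with one bit per document.
final class MatchAllDocSet {
    private final int size; // number of docs; every id in [0, size) matches

    MatchAllDocSet(int size) {
        this.size = size;
    }

    int size() {
        return size;
    }

    // Membership is a range check; no bitset lookup needed.
    boolean exists(int docId) {
        return docId >= 0 && docId < size;
    }

    // Intersecting two match-all sets keeps the smaller prefix of doc ids.
    MatchAllDocSet intersection(MatchAllDocSet other) {
        return new MatchAllDocSet(Math.min(this.size, other.size));
    }

    // Union keeps the larger prefix.
    MatchAllDocSet union(MatchAllDocSet other) {
        return new MatchAllDocSet(Math.max(this.size, other.size));
    }
}
```

For scale: a bitset over an index of 100 million documents needs roughly 12.5 MB per cached filter entry, while a match-all representation like this holds a single int, and union/intersection become constant-time arithmetic instead of walks over long arrays.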