From: Mark Bakker <mbakker@stackstate.com>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Subject: Search through versioned data
Sender: Mark Bakker <mbakker@xebia.com>
Date: Wed, 9 Dec 2015 08:17:32 +0000
Message-ID: 
 <DB4PR02MB0461BDFBC5271AC8D8BFE9D4D5E80@DB4PR02MB0461.eurprd02.prod.outlook.com>

Hello,


We need to search our data both in a faceted and in a 'normal' way, and we
currently use Lucene for this. Our data is sharded into equal-sized 'blocks'
(between 256MB and 4GB each), and our Lucene indexes follow these shards, so
each index always covers 256MB to 4GB of data. When the data grows, a block
is split (and its index with it).


Currently this setup works. Our Lucene indexes split when we get new blocks
of data, and then fill up again until the maximum size is reached.


But now we have an extra requirement: we need to search in a versioned way
(our data is versioned). For us, a version is a change in the total dataset,
with a transaction precision of microseconds.


We see a few possibilities:

1. Use the Lucene commit points and keep the data forever with
NoDeletionPolicy. We think this cannot scale to millions of different commit
points. If I read the documentation correctly, each commit gives me an extra
file, and that will not really scale.


2. Save an extra version of the document on each update, using extra fields
in the index and the facet index: from and till (long) for normal search,
plus extra facets for the from and till timestamps (the date split into 5
facets with 300 - 1000 unique values each) to speed up the faceted search.


I hope there is some other possible solution which I don't know of.


Our requirements:

-Search through 4GB of data on documents with 5-100 fields and 3-15 facets
each.

-Have a response time < 100ms.

-Be able to do 20 queries per second

-Be able to search through each 'snapshot', where a snapshot is defined as a
change in the total dataset. Snapshots have a time precision of one millionth
of a second.
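As an illustration of how a microsecond snapshot timestamp could be "split in
5 facets" as mentioned under option 2, the sketch below shows one possible
reading, and only that: it cuts the number of microseconds since an assumed
base epoch into five base-1000 digits, most significant first, so each level
has at most 1000 distinct values. The class name TimestampFacets, the radix,
and the base epoch are hypothetical; five such digits cover 10^15
microseconds, roughly 31.7 years.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: splits a microsecond offset
// into five coarse-to-fine facet values of at most 1000 values each.
final class TimestampFacets {
    static final int LEVELS = 5;
    static final long RADIX = 1000L;
    static final long LIMIT = 1_000_000_000_000_000L; // RADIX^LEVELS microseconds

    // Returns the base-1000 digits of the offset, most significant first.
    static int[] split(long microsSinceBase) {
        if (microsSinceBase < 0 || microsSinceBase >= LIMIT) {
            throw new IllegalArgumentException("offset out of range: " + microsSinceBase);
        }
        int[] digits = new int[LEVELS];
        long rest = microsSinceBase;
        for (int i = LEVELS - 1; i >= 0; i--) {
            digits[i] = (int) (rest % RADIX);
            rest /= RADIX;
        }
        return digits;
    }

    // Inverse of split: rebuilds the microsecond offset from its digits.
    static long join(int[] digits) {
        long value = 0;
        for (int d : digits) {
            value = value * RADIX + d;
        }
        return value;
    }

    public static void main(String[] args) {
        // prints [1, 234, 567, 890, 123]
        System.out.println(Arrays.toString(split(1_234_567_890_123L)));
    }
}
```

Whether such a split actually speeds up faceted search within the 100ms
budget would have to be measured; the sketch only shows the decomposition.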


The questions I have:

1. What would be the most appropriate way to implement such a search: option
1, option 2, or another solution?

2. Is option 1 a solution worth looking into?

3. With option 2, will Lucene be efficient with documents that look almost
the same (different versions of the same document)?

4. Will option 2 meet our requirements?


Kind regards,


Mark Bakker
