From: Mark Bakker
To: "java-user@lucene.apache.org"
Subject: Search trough versioned data
Date: Wed, 9 Dec 2015 08:17:32 +0000

Hello,

We need to search our data in both a faceted and a 'normal' way. For this we currently use Lucene. Our data is sharded into equal-sized 'blocks', and our Lucene indexes follow these shards of data. Each block is between 256MB and 4GB.
So each index always covers between 256MB and 4GB of data. If the data grows, the block is split (and the index with it).

Currently this setup works. Our Lucene indexes split when we get new blocks of data and fill up again until the maximum size is reached.

But now we have an extra requirement. We need to search in a versioned way (our data is versioned). For us a version is a change in the total dataset, with a transaction precision of microseconds.

We see a few possibilities:

1. Use the Lucene commit points and keep the data forever with NoDeletionPolicy. We think this cannot scale to millions of different commit points. If I read the documentation correctly, each commit gives me an extra file, and that will not really scale.

2. Save an extra version of the document on each update, using extra fields in the index and the facet index: from and till (long) for normal search, plus extra facets for the from 'timestamp' and extra facets for the till 'timestamp' (date split into 5 facets with 300 - 1000 unique values each) to speed up the faceted search.

I hope there is some other possible solution which I don't know of.

Our requirements:

- Search through 4GB of data on documents with 5-100 fields and 3-15 facets each.
- Have a response time < 100ms.
- Be able to do 20 queries per second.
- Be able to search through each 'snapshot', where a snapshot is defined as a change in the total dataset. Snapshots have a time precision of 1 millionth of a second.

The questions I have:

1. What is the most appropriate way to implement such a search: 1, 2, or another solution?
2. In case 1, is this a solution worth looking into?
3. In case 2, will Lucene be efficient with documents that look almost the same (different versions of the same document)?
4. In case 2, will this meet our requirements?

Kind regards,

Mark Bakker
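
To make option 2 above more concrete, here is a minimal sketch of the from/till idea in plain Java, without Lucene. Every name in it (VersionedSnapshotSketch, DocVersion, update, snapshot) is hypothetical and only for illustration. It assumes that each stored version is valid for from <= t < till, and that the current version has till = Long.MAX_VALUE until a newer version replaces it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of option 2: one stored entry per document version,
// each carrying a [from, till) validity interval in microseconds.
public class VersionedSnapshotSketch {

    /** One version of a logical document, visible while fromMicros <= t < tillMicros. */
    static final class DocVersion {
        final String docId;
        final String body;
        final long fromMicros;
        long tillMicros = Long.MAX_VALUE; // open-ended until a newer version arrives

        DocVersion(String docId, String body, long fromMicros) {
            this.docId = docId;
            this.body = body;
            this.fromMicros = fromMicros;
        }

        boolean visibleAt(long snapshotMicros) {
            return fromMicros <= snapshotMicros && snapshotMicros < tillMicros;
        }
    }

    private final List<DocVersion> versions = new ArrayList<>();

    /** Stores a new version and closes the currently open version of the same document. */
    public void update(String docId, String body, long txMicros) {
        for (DocVersion v : versions) {
            if (v.docId.equals(docId) && v.tillMicros == Long.MAX_VALUE) {
                v.tillMicros = txMicros; // the old version ends where the new one starts
            }
        }
        versions.add(new DocVersion(docId, body, txMicros));
    }

    /** Returns the bodies of all versions visible in the snapshot at the given time. */
    public List<String> snapshot(long snapshotMicros) {
        return versions.stream()
                .filter(v -> v.visibleAt(snapshotMicros))
                .map(v -> v.body)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        VersionedSnapshotSketch store = new VersionedSnapshotSketch();
        store.update("doc-1", "first", 1_000_000L);
        store.update("doc-2", "other", 1_000_005L);
        store.update("doc-1", "second", 1_000_010L);
        System.out.println(store.snapshot(1_000_007L)); // prints [first, other]
        System.out.println(store.snapshot(1_000_010L)); // prints [other, second]
    }
}
```

In a Lucene index, the visibleAt check would presumably become two numeric range clauses on the from and till fields, added as a filter to every query. Since indexed documents cannot be modified in place, closing the previous version would mean re-indexing it with its new till value, for example with IndexWriter.updateDocument. Both points are assumptions about how the sketch might map onto Lucene, not something tested against the requirements listed above.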