From: manish gupta
Date: Fri, 4 May 2018 18:41:27 +0530
Subject: Query on searchAfter API usage in IndexSearcher
To: dev@lucene.apache.org, general@lucene.apache.org

Hi Team,
I am new to Lucene and am trying to use it for text search in my project to improve query performance.

Initially I was facing a lot of GC issues because I was calling the search API and passing the total document count as the number of hits to return. My data size is around 4 billion, so the number of documents created by Lucene was huge. Internally, the search API uses TopScoreDocCollector, which creates a PriorityQueue sized to the requested document count, and this caused a lot of GC.

To avoid this problem I am now paginating: I query only 10 documents at a time, and then use the searchAfter API to fetch the next page, passing the lastScoreDoc from the previous result. This has resolved the GC problem, but the query time has increased by a huge margin, from 3 sec to 600 sec.

When I debugged, I found that even with the searchAfter API the IO is not avoided: every call reads the data from disk again and only skips the results already returned by the previous search. Is my understanding correct? If so, please let me know if there is a better way to fetch results incrementally that avoids the GC problem with minimal impact on query performance.

Regards
Manish Gupta
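
The behaviour described above can be sketched as a toy model in plain Java. This is not Lucene's code: `PagingSketch`, `Hit`, and `pageAfter` are hypothetical names, and the in-memory list stands in for the hits a query would match. The sketch assumes that each page is produced by scanning all matching hits again, keeping a bounded heap of `pageSize` entries, and skipping anything that sorts at or before the cursor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of cursor-based paging; an illustration only, not Lucene's implementation. */
public class PagingSketch {

    /** A scored hit, loosely analogous to a (doc id, score) pair. Hypothetical type. */
    public static final class Hit {
        public final int doc;
        public final float score;
        public Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Best-first order: higher score first, ties broken by lower doc id. */
    public static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
    };

    /**
     * Returns the best pageSize hits that sort strictly after the cursor.
     * Assumption: every hit is visited again on each call; scanned[0] counts the visits.
     */
    public static List<Hit> pageAfter(Hit after, List<Hit> allHits, int pageSize, int[] scanned) {
        // Heap ordered worst-first, so the head is the weakest hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(pageSize, BEST_FIRST.reversed());
        for (Hit h : allHits) {
            scanned[0]++;
            if (after != null && BEST_FIRST.compare(h, after) <= 0) {
                continue; // already returned on an earlier page
            }
            heap.offer(h);
            if (heap.size() > pageSize) {
                heap.poll(); // evict the weakest hit; the heap never exceeds pageSize + 1
            }
        }
        List<Hit> page = new ArrayList<>(heap);
        page.sort(BEST_FIRST);
        return page;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(new Hit(0, 1.0f), new Hit(1, 3.0f), new Hit(2, 2.0f),
                                 new Hit(3, 3.0f), new Hit(4, 0.5f));
        int[] scanned = {0};
        Hit cursor = null;
        while (true) {
            List<Hit> page = pageAfter(cursor, hits, 2, scanned);
            if (page.isEmpty()) {
                break;
            }
            for (Hit h : page) {
                System.out.println("doc=" + h.doc + " score=" + h.score);
            }
            cursor = page.get(page.size() - 1); // last hit of this page becomes the cursor
        }
        System.out.println("hits scanned in total: " + scanned[0]);
    }
}
```

Under that assumption the heap stays small, which would explain the drop in GC pressure, but the total work grows with the number of pages multiplied by the number of matching hits, which is consistent with the slowdown reported above.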