From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 23 Nov 2009 11:13:39 +0000 (UTC)
Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Message-ID: <1378628439.1258974819578.JavaMail.jira@brutus>
In-Reply-To: <1307591248.1227004424235.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781364#action_12781364 ]

Robert Muir commented on LUCENE-1458:
-------------------------------------

Michael, it is a valid Unicode String though; this is OK, and such things are supported by the Unicode standard.

Also, perhaps it would help convince you if I instead wrote the code as .terms("𩬅".charAt(0));

Previously, naive treatment of text like this would work correctly; now, with bytes, it cannot.

I hope you can start to see how many East Asian applications will break because of this.

http://www.unicode.org/notes/tn12/

> Further steps towards flexible indexing
> ---------------------------------------
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format. This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo). At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas. Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
>     .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>     .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array). It should be faster to init too.
>     .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers. EG there is no more TermInfo used
>     when reading the new format.
>     .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility). EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
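
As a rough illustration of the comment at the top of this message: `"𩬅".charAt(0)` returns the high surrogate of U+29B05, which is a legal Java `char` (and a legal one-char `String`) but has no well-formed UTF-8 encoding. That appears to be the concern with a byte-based term representation. The sketch below uses only JDK classes and no Lucene API; the class name `LoneSurrogateDemo` and the helper `isEncodableAsUtf8` are hypothetical names chosen for this example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    // U+29B05 written as its UTF-16 surrogate pair, so the source file
    // does not depend on the compiler's input encoding.
    static final String SUPPLEMENTARY = "\uD866\uDF05";

    /** Returns true if s can be encoded as well-formed UTF-8 (hypothetical helper). */
    static boolean isEncodableAsUtf8(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // An unpaired surrogate is reported as malformed input.
            return false;
        }
    }

    public static void main(String[] args) {
        String whole = SUPPLEMENTARY;
        // Same shape as the expression in the comment: take only the first UTF-16 code unit.
        String firstUnit = String.valueOf(whole.charAt(0));

        System.out.println("code point:        U+"
                + Integer.toHexString(whole.codePointAt(0)).toUpperCase());
        System.out.println("UTF-16 code units: " + whole.length());
        System.out.println("charAt(0) is a high surrogate: "
                + Character.isHighSurrogate(whole.charAt(0)));
        System.out.println("whole string encodable as UTF-8:  " + isEncodableAsUtf8(whole));
        System.out.println("charAt(0) alone encodable as UTF-8: " + isEncodableAsUtf8(firstUnit));

        // String.getBytes substitutes a replacement for the lone surrogate,
        // so the round trip does not give back the original string.
        String roundTrip = new String(
                firstUnit.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("round trip preserved: " + roundTrip.equals(firstUnit));
    }
}
```

Code that compares or prefix-matches terms as UTF-16 `char` sequences can handle such a half character without noticing, whereas code that must first convert the term to UTF-8 bytes has to reject or replace it.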