Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <31EC32D34DE94D449EF9A82C4CEFCBCE@VEGA>
References: <1307591248.1227004424235.JavaMail.jira@brutus>
	 <418830629.1253798056142.JavaMail.jira@brutus>
	 <31EC32D34DE94D449EF9A82C4CEFCBCE@VEGA>
Date: Thu, 8 Oct 2009 11:41:17 -0700
Message-ID: <8837fb770910081141k36cfa466y5ca9fce9feb9638e@mail.gmail.com>
Subject: Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible
	indexing
From: John Wang <john.wang@gmail.com>
To: java-dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=001636ed77c51654bc047570cf4d

--001636ed77c51654bc047570cf4d
Content-Type: text/plain; charset=ISO-8859-1

Hi guys:

     What are your thoughts about contributing Kamikaze as a Lucene contrib
package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9 API
allows us some more code tuning and optimization improvements.

     We will be releasing Kamikaze, so it might be a good time to add it to
the Lucene contrib package if there is interest.

Thanks

-John

On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
> whether a DocIdSet is easily cacheable or must be copied to an
> OpenBitSetDISI. The default is false, so all custom DocIdSets are copied to
> OpenBitSetDISI by CachingWrapperFilter, even if not needed. If a DocIdSet
> does not do disk IO and has a fast iterator, like e.g. the FieldCache ones
> in FieldCacheRangeFilter, it should return true; see CHANGES.txt. Maybe this
> should also be added to Kamikaze, which is a really nice project!
> Especially filter DocIdSets should pass this method to their delegate (see
> FilterDocIdSet in Lucene).
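
To make the delegation point above concrete, here is a minimal sketch. The
classes are simplified, hypothetical stand-ins (SimpleDocIdSet,
InMemoryDocIdSet, DelegatingDocIdSet, toCacheable) and not the real Lucene
or Kamikaze types; only the isCacheable() idea is taken from the
description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IsCacheableSketch {

    // Hypothetical, simplified stand-in for a DocIdSet (not the Lucene class).
    static abstract class SimpleDocIdSet {
        abstract Iterator<Integer> iterator();

        // Conservative default, as described above: assume not cacheable.
        boolean isCacheable() { return false; }
    }

    // An in-memory set: no disk IO and a cheap iterator, so it opts in.
    static class InMemoryDocIdSet extends SimpleDocIdSet {
        private final List<Integer> ids;
        InMemoryDocIdSet(List<Integer> ids) { this.ids = new ArrayList<>(ids); }
        @Override Iterator<Integer> iterator() { return ids.iterator(); }
        @Override boolean isCacheable() { return true; }
    }

    // A wrapping set forwards the question to its delegate.
    static class DelegatingDocIdSet extends SimpleDocIdSet {
        private final SimpleDocIdSet delegate;
        DelegatingDocIdSet(SimpleDocIdSet delegate) { this.delegate = delegate; }
        @Override Iterator<Integer> iterator() { return delegate.iterator(); }
        @Override boolean isCacheable() { return delegate.isCacheable(); }
    }

    // What a caching layer might do: keep cacheable sets, copy the rest.
    static SimpleDocIdSet toCacheable(SimpleDocIdSet set) {
        if (set.isCacheable()) return set;
        List<Integer> copy = new ArrayList<>();
        for (Iterator<Integer> it = set.iterator(); it.hasNext(); ) copy.add(it.next());
        return new InMemoryDocIdSet(copy);
    }

    public static void main(String[] args) {
        SimpleDocIdSet wrapped =
            new DelegatingDocIdSet(new InMemoryDocIdSet(Arrays.asList(1, 4, 9)));
        // The wrapper reports its delegate's answer, so no copy is made.
        System.out.println(toCacheable(wrapped) == wrapped);
    }
}
```

Without the forwarding override, the wrapper would fall back to the default
and be copied on every cache fill even though its delegate is already cheap
to keep.
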
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: John Wang (JIRA) [mailto:jira@apache.org]
> > Sent: Thursday, September 24, 2009 3:14 PM
> > To: java-dev@lucene.apache.org
> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
> > indexing
> >
> >
> >     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
> >
> > John Wang commented on LUCENE-1458:
> > -----------------------------------
> >
> > Just an FYI: Kamikaze was originally started as our sandbox for Lucene
> > contributions until 2.4 was ready. (We needed the DocIdSet/Iterator
> > abstraction that was migrated from Solr.)
> >
> > It has three components:
> >
> > 1) P4Delta
> > 2) Logical boolean operations on DocIdSet/Iterators (I created a JIRA
> > ticket and a patch for Lucene a while ago with performance numbers. It is
> > significantly faster than DisjunctionScorer.)
> > 3) An algorithm to determine which DocIdSet implementations to use given
> > some parameters, e.g. minId, maxId, id count etc. It learns and adjusts
> > from the application behavior if not all parameters are given.
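
For readers unfamiliar with component 2, the following is a rough sketch of
what boolean operations over sorted doc id streams look like. It is not
Kamikaze's code: the class and method names (DocIdSetOpsSketch, or, and) are
made up for illustration, and plain sorted int arrays stand in for
DocIdSet iterators.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class DocIdSetOpsSketch {

    // OR: k-way merge of sorted id arrays, skipping duplicates.
    static int[] or(int[]... sets) {
        // Heap entries are {currentId, setIndex, positionInSet}, ordered by id.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int total = 0;
        for (int i = 0; i < sets.length; i++) {
            total += sets[i].length;
            if (sets[i].length > 0) heap.add(new int[] {sets[i][0], i, 0});
        }
        int[] out = new int[total];
        int n = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // Emit the smallest pending id unless it was just emitted.
            if (n == 0 || out[n - 1] != top[0]) out[n++] = top[0];
            int next = top[2] + 1;
            int[] src = sets[top[1]];
            if (next < src.length) heap.add(new int[] {src[next], top[1], next});
        }
        return Arrays.copyOf(out, n);
    }

    // AND: walk both sorted arrays, advancing whichever side is behind.
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(or(new int[] {1, 3, 5}, new int[] {2, 3, 8})));
        System.out.println(Arrays.toString(and(new int[] {1, 3, 5, 8}, new int[] {3, 4, 8})));
    }
}
```

A real implementation would work on iterators with a skip operation rather
than materialized arrays; the arrays just keep the sketch short.
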
> >
> > So please feel free to incorporate anything you see fit or move it to
> > contrib.
> >
> >
> > > Further steps towards flexible indexing
> > > ---------------------------------------
> > >
> > >                 Key: LUCENE-1458
> > >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> > >             Project: Lucene - Java
> > >          Issue Type: New Feature
> > >          Components: Index
> > >    Affects Versions: 2.9
> > >            Reporter: Michael McCandless
> > >            Assignee: Michael McCandless
> > >            Priority: Minor
> > >         Attachments: LUCENE-1458-back-compat.patch,
> > > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> > > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> > > LUCENE-1458.tar.bz2
> > >
> > >
> > > I attached a very rough checkpoint of my current patch, to get early
> > > feedback.  All tests pass, though back compat tests don't pass due to
> > > changes to package-private APIs plus certain bugs in tests that
> > > happened to work (eg call TermPositions.nextPosition() too many times,
> > > which the new API asserts against).
> > > [Aside: I think, when we commit changes to package-private APIs such
> > > that back-compat tests don't pass, we could go back, make a branch on
> > > the back-compat tag, commit changes to the tests to use the new
> > > package private APIs on that branch, then fix nightly build to use the
> > > tip of that branch?]
> > > There's still plenty to do before this is committable! This is a
> > > rather large change:
> > >   * Switches to a new more efficient terms dict format.  This still
> > >     uses tii/tis files, but the tii only stores term & long offset
> > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> > >     offsets absolutely instead of with deltas.  Also, tis/tii
> > >     are structured by field, so we don't have to record field number
> > >     in every term.
> > > .
> > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > > .
> > >     RAM usage when loading terms dict index is significantly less
> > >     since we only load an array of offsets and an array of String (no
> > >     more TermInfo array).  It should be faster to init too.
> > > .
> > >     This part is basically done.
> > >   * Introduces modular reader codec that strongly decouples terms dict
> > >     from docs/positions readers.  EG there is no more TermInfo used
> > >     when reading the new format.
> > > .
> > >     There's nice symmetry now between reading & writing in the codec
> > >     chain -- the current docs/prox format is captured in:
> > > {code}
> > > FormatPostingsTermsDictWriter/Reader
> > > FormatPostingsDocsWriter/Reader (.frq file) and
> > > FormatPostingsPositionsWriter/Reader (.prx file).
> > > {code}
> > >     This part is basically done.
> > >   * Introduces a new "flex" API for iterating through the fields,
> > >     terms, docs and positions:
> > > {code}
> > > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > > {code}
> > >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> > >     old API on top of the new API to keep back-compat.
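
On the terms dict bullet above (absolute values at seek points, deltas
elsewhere), a small sketch of the general idea may help. It is an
illustration only, not the tis/tii file format from the patch: the names
(SeekPointEncodingSketch, encode, decodeAt) and the fixed interval are
assumptions.

```java
class SeekPointEncodingSketch {

    // Store every interval-th offset absolutely and the rest as deltas
    // from the previous offset.
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    // Recover one offset by jumping to the nearest earlier seek point and
    // summing deltas from there; nothing before the seek point is read.
    static long decodeAt(long[] encoded, int interval, int index) {
        int seek = (index / interval) * interval;
        long value = encoded[seek];
        for (int i = seek + 1; i <= index; i++) value += encoded[i];
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 175, 260, 300, 301};
        long[] encoded = encode(offsets, 3);
        System.out.println(java.util.Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 3, 4));
    }
}
```

The point of the absolute entries is that a reader can start decoding at any
seek point without state carried over from earlier entries, at the cost of a
few larger values.
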
> > >
> > > Next steps:
> > >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> > >     fix any hidden assumptions.
> > >   * Expose new API out of IndexReader, deprecate old API but emulate
> > >     old API on top of new one, switch all core/contrib users to the
> > >     new API.
> > >   * Maybe switch to AttributeSources as the base class for TermsEnum,
> > >     DocsEnum, PostingsEnum -- this would give readers API flexibility
> > >     (not just index-file-format flexibility).  EG if someone wanted
> > >     to store payload at the term-doc level instead of
> > >     term-doc-position level, you could just add a new attribute.
> > >   * Test performance & iterate.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org

--001636ed77c51654bc047570cf4d--