Subject: Re: Parallel incremental indexing
From: Michael Busch
To: java-dev@lucene.apache.org, yonik@lucidimagination.com
Date: Mon, 31 Aug 2009 00:22:40 -0700
References: <4A9A1AC8.7090805@gmail.com>

On Sun, Aug 30, 2009 at 6:08 AM, Yonik Seeley wrote:

> Cool stuff!

Thanks. It's actually really fun to work on! Once I had parallel
indexing working and no longer had to worry about how to manage
parallel indexes, the fun of implementing cool features on top of it
started. I hope you'll have that fun in Solr too! :)

> We should also think about how to do single document field updates or
> field adds since that is the most common use case - not that it needs

I completely agree that we should solve that problem too.

> to be implemented in the first version, but kept in mind so we don't
> box ourselves in.

This code is currently non-intrusive from Lucene's point of view (it
can't be, because I use it on top of vanilla 2.4.1). But I agree: when
we integrate it more tightly into Lucene to make it more efficient, we
should keep the end goal in mind (e.g. the use cases you mentioned).

> Doug mentioned some ideas he had in passing almost a year ago about
> how to add a field to a single document, and it is similar in that it
> used parallel reader. IndexWriter would be modified to maintain the
> same structure across parallel indexes, as you note. If one wanted to
> add a new field value to document 1000, one would have to index dummy
> documents for docs 0-999... instead of this, the index format should
> support gaps. On a segment merge, the IndexWriter could simply merge
> in this new segment.

Yeah, currently it's kind of inefficient that we have to call
addDocument() 999 times with an empty document to achieve this. The
.frq and .prx files, however, work great as they use delta encoding.
Also, .del files support DGaps now. On the other hand, the stored
fields index (.fdx) in particular doesn't support gaps because it has
to support random access. Norm files and term vectors (though both can
be turned off) don't support gaps either.

> Anyway, updateable documents is fundamental enough, we should also
> consider changes to the index format if it makes it easier.

Yes, I agree. We should make changes to the default index format if
that makes updating documents more efficient. Note that I said "default
index format" :) - I'm already excited about having parallel indexing
and flexible indexing in Lucene. It will be awesome what you can do
with Lucene then!

So I think we should start with the necessary work to keep parallel
indexes in sync. When that's done we should continue with the use cases
we discussed, including the work of changing the index format to
support gaps.

> -Yonik
> http://www.lucidimagination.com
>
>
> On Sun, Aug 30, 2009 at 2:23 AM, Michael Busch wrote:
> > Hi all,
> >
> > I just added a wiki page for a new feature I'd like to add to
> > Lucene. Please take a look at the link. I will add more details and
> > diagrams to the page, but for now it should give a rough idea about
> > how to implement it:
> >
> > http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> >
> > Basically the idea is to allow updating documents partially, e.g. only
> > a subset of the fields, without having to reindex the entire
> > document. This is a feature that is very often asked for.
> >
> > We have implemented the solution at IBM and it's working
> > great. It is a technology that has already allowed us to add really
> > exciting new features to products that weren't easily possible before.
> >
> > The implementation I can currently contribute has some limitations:
> > e.g. multi-threaded indexing is not supported. But let me make clear
> > that this is not a limitation of the design described in the wiki - we
> > have these limitations because we implemented this on top of Lucene's 2.4
> > APIs. If we decide to add this to Lucene's core we should
> > reimplement some parts to overcome those limitations.
> >
> > In my opinion this will be a great addition to Lucene that many
> > people will find very useful. In Solr this is also something users often
> > ask for.
> >
> > In the last weeks I worked on getting internal approval for the
> > contribution to Lucene and the good news is that I already have a signed
> > software grant ready - so if the community likes this feature and
> > decides to add this to Lucene there won't be any delay for legal work
> > from IBM's side.
> >
> > Btw: I will be on vacation from 09/03-09/20 and won't have internet
> > access most of the time, so if I stop responding end of next week you'll
> > know why...
> >
> > Please let me know what you think!
> >
> > Michael
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
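
To make the point about gaps in the reply above a bit more concrete (delta-encoded files tolerate missing doc IDs, while a randomly accessed index does not), here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: it does not use Lucene and does not reproduce the real .frq or .fdx layouts, and the names GapSketch, deltaEncode, deltaDecode and fixedWidthIndexBytes are made up for this example. It assumes that a delta-encoded list stores only the difference between consecutive doc IDs, and that a fixed-width index stores one pointer per doc ID so that entry N can be found by seeking to N * bytesPerEntry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code and not the real file formats.
public class GapSketch {

    // Delta-encode a sorted list of doc IDs: each ID is stored as the
    // difference from the previous one, so absent documents cost nothing.
    public static List<Integer> deltaEncode(int[] sortedDocIds) {
        List<Integer> deltas = new ArrayList<>();
        int previous = 0;
        for (int docId : sortedDocIds) {
            deltas.add(docId - previous);
            previous = docId;
        }
        return deltas;
    }

    // Rebuild the original doc IDs by summing the deltas.
    public static int[] deltaDecode(List<Integer> deltas) {
        int[] docIds = new int[deltas.size()];
        int current = 0;
        for (int i = 0; i < docIds.length; i++) {
            current += deltas.get(i);
            docIds[i] = current;
        }
        return docIds;
    }

    // A fixed-width index needs one entry for every doc ID from 0 to
    // maxDocId, because a lookup seeks directly to docId * bytesPerEntry.
    public static long fixedWidthIndexBytes(int maxDocId, int bytesPerEntry) {
        return (long) (maxDocId + 1) * bytesPerEntry;
    }

    public static void main(String[] args) {
        // Only document 1000 gets a new field value.
        int[] updated = {1000};
        System.out.println("delta-encoded entries: " + deltaEncode(updated).size());
        // Assumed 8-byte pointers, purely for the sake of the example.
        System.out.println("fixed-width index bytes: " + fixedWidthIndexBytes(1000, 8));
    }
}
```

In this toy model the delta-encoded list holds a single entry for document 1000, while the fixed-width index still has to reserve a slot for every earlier document. That is the kind of padding the empty addDocument() calls stand in for today.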
