Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <6e3ae6310706190021q4202fbet2c69726d990fa95f@mail.gmail.com>
Date: Tue, 19 Jun 2007 00:21:38 -0700
From: "Chris Lu" <chris.lu@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: FW: Lucene indexing vs RDBMS insertion.
In-Reply-To: <005501c7b218$78f17dc0$6ad47940$@com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <003d01c7af1a$bc11eb90$3435c2b0$@com>
	 <359a92830706150811h751a208bwa0cc65c7853ee1a1@mail.gmail.com>
	 <002901c7b182$da91edd0$8fb5c970$@com>
	 <6e3ae6310706181118m18f59255o3d05026663dbe197@mail.gmail.com>
	 <005501c7b218$78f17dc0$6ad47940$@com>

Searching an optimized index vs. an un-optimized one is very much like
searching one optimized index vs. running a MultiSearcher over multiple
optimized indexes.

Each segment is like a small index. If you just add segments together,
they behave like multiple indexes. If the number of segments is small,
say 3, there won't be much difference. But with a merge factor of 1000,
you may have 1000 segments. After searching each segment, the results
have to be merged dynamically, which can be slow. If you optimize the
index, searching will be faster, but indexing will be slower.
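
To make the merging cost concrete, here is a minimal sketch in plain Java. It is a simplified model, not Lucene's actual implementation: each segment is represented as a list of hit scores already sorted best-first, and the overall top hits are gathered through a priority queue. The class name `SegmentMergeSketch` and its `topK` method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Simplified model only: each "segment" is a list of hit scores already
// sorted best-first. This is NOT how Lucene is implemented internally.
final class SegmentMergeSketch {

    // Cursor into one segment's sorted result list.
    private static final class Cursor {
        final List<Double> hits;
        int pos;

        Cursor(List<Double> hits) {
            this.hits = hits;
        }

        double current() {
            return hits.get(pos);
        }
    }

    // Merge the per-segment results and return the k best scores overall.
    static List<Double> topK(List<List<Double>> segments, int k) {
        // Orders cursors so the one with the highest current score comes first.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> Double.compare(b.current(), a.current()));
        for (List<Double> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment));
            }
        }
        List<Double> best = new ArrayList<>();
        while (best.size() < k && !queue.isEmpty()) {
            Cursor top = queue.poll();
            best.add(top.current());
            top.pos++;
            if (top.pos < top.hits.size()) {
                queue.add(top); // re-insert the cursor with its next hit
            }
        }
        return best;
    }
}
```

In this model every segment has to be searched on its own before the merge, and each returned hit costs extra queue work that grows with the number of segments. With 3 segments that overhead is negligible; with 1000 it is not.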

So you need to balance your needs. Does your search have to be very
quick, or is ordinarily quick enough? You may have to tune it several
times to find a good balance.
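
One way to approach that tuning is to time the same workload under each setting and compare the results. The following is only a sketch in plain Java: `TuningTimer` and `medianMillis` are hypothetical names, and the `Runnable` stands in for whatever indexing or search run you want to measure.

```java
import java.util.Arrays;

// Hypothetical helper for comparing settings; not part of Lucene.
final class TuningTimer {

    // Run the task `runs` times and return the median wall-clock time in ms.
    static long medianMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        Arrays.sort(elapsed);
        return elapsed[runs / 2]; // upper median for even run counts
    }
}
```

Run it once per candidate setting (for example one merge factor per run, or with and without optimizing) and compare the numbers on your own data and hardware.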

In my project DBSight, indexing is done on a dedicated machine, and
when it's finished, the index is transferred to a dedicated search
machine. That won't work for your demanding requirements, though. But I
know from it that addIndexes is pretty slow, especially when the index
is large. Adding indexes without optimizing would be faster, maybe by
50%, but it could still take hours for millions of documents.
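
For a rough sense of the gap, the figures quoted further down in this thread can be turned into documents per second. This is back-of-the-envelope arithmetic only; the class `ThroughputSketch` is a made-up name, and the numbers are the 3.6 million per hour target and the 3 million in 2 hours DBSight figure from the quoted messages.

```java
// Back-of-the-envelope arithmetic using figures quoted in this thread.
final class ThroughputSketch {

    // Average documents per second for a batch indexed over some hours.
    static double docsPerSecond(long documents, double hours) {
        return documents / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double target = docsPerSecond(3_600_000L, 1.0);   // stated target
        double measured = docsPerSecond(3_000_000L, 2.0); // DBSight figure
        System.out.printf("target:   %.0f docs/s%n", target);
        System.out.printf("measured: %.0f docs/s%n", measured);
    }
}
```

The target works out to 1000 documents per second, while the quoted DBSight run averages roughly 417 per second, database access included.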

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes


On 6/18/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> Thanks for sharing and for the suggestions.
> Yes Chris, the index is to be partitioned by date and time, and old indexes
> will not be accessed very frequently.
>
> I also considered indexing in parallel into different indexes, Erick.
> But I can only put all the indexes on ONE machine, and there is only ONE
> machine to process the job (both searching and indexing).
>
> I haven't tried addIndexes to combine indexes, but I did try
> MultiSearcher on 3 million documents in 3 separate indexes, and it did
> not make much difference compared to a search in 1 index. Would you mind
> sharing your experience with addIndexes? How does it perform? In my
> situation, the indexes may be used for searching at the same time. Another
> worry with indexing is the optimization process. I have tried it out and it
> takes quite some time to optimize my index, e.g. indexing 5000 documents
> takes about 10 seconds (MergeFactor = 1000, recreating the IndexWriter
> every 5000 documents), but optimizing takes around ONE minute. So do you
> have any suggestions on the optimization process as well? When should I run
> it? And is there a big difference between searching an optimized index and
> an un-optimized one?
>
> ------
> eChuang, Chew
>
> -----Original Message-----
> From: Chris Lu [mailto:chris.lu@gmail.com]
> Sent: Tuesday, June 19, 2007 2:19 AM
> To: java-user@lucene.apache.org
> Subject: Re: FW: Lucene indexing vs RDBMS insertion.
>
> Definitely very aggressive.
>
> My current experience is that, including database access, DBSight can
> index 3 million documents in 2 hours on a Pentium D 3.4 GHz. It seems you
> definitely need good hardware and a fast hard drive for this. I feel the
> hard drive is actually the bottleneck for large indexes. Partitioning
> your data and pruning old data should also be considered.
>
> On 6/18/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> > Thanks for your suggestion, Erick. I'm planning to test the indexing soon.
> > For your information, the system currently inserts into the RDBMS at
> > around 1000 records per second. Thus, with Lucene in place, I would
> > expect it to index that many documents per second as well (our target is
> > 3.6 million documents indexed in 1 hour). Besides that, I'm planning to
> > queue the records so Lucene will have enough time to index them. Anyway,
> > thanks for your suggestion; I will come back to you once I have tested
> > the solution.
> >
> > Thanks,
> > eChuang, Chew
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > Sent: Friday, June 15, 2007 11:11 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: FW: Lucene indexing vs RDBMS insertion.
> >
> > From my perspective, this is an irrelevant question. The real question
> > is "is Lucene indexing fast enough for my application?". Which nobody
> > can answer for you, you have to experiment.
> >
> > If you're building an index that's only updated every 6 months,
> > Lucene is certainly "fast enough". If you're recreating the
> > index every 6 seconds, it's a different question.
> >
> > So, I recommend that you create a test application that does
> > nothing except read your source, do whatever parsing you
> > need to do and does NOT index it at all. Record the time it
> > takes.
> >
> > Then try the same thing WITH indexing and record the difference.
> >
> > Then, to get a sense of the dimension of the problem, try
> > substituting inserting into the RDBMS instead of the Lucene
> > index.
> >
> > Once you have numbers, you can make better decisions,
> > and people can give you better advice, especially if you
> > include more detail of your design.
> >
> > Best
> > Erick
> >
> > On 6/15/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> > >
> > > Hi, I'm a new user of Lucene. I have heard that it is a powerful tool
> > > for full-text search, and I'm planning to use it in my project for data
> > > storage. Before the implementation, I would like to know whether there
> > > is a performance issue with the Lucene indexing process. I have no
> > > doubts about the retrieval and search features in Lucene, only about
> > > the indexing process. I tested my current system: inserting 1000
> > > records into RDBMS storage took about 1 second. Thus, if I change my
> > > solution to Lucene, can the Lucene indexing process perform faster than
> > > the RDBMS? I have gone through some articles about the "MergeFactor"
> > > and "MaxMergeDocs" parameters for fine-tuning the indexing process, but
> > > found no comparison between Lucene indexing and RDBMS insertion. Thus,
> > > I hope someone with experience in Lucene can provide this information
> > > or point to an article that compares Lucene and RDBMS.
> > >
> > >
> > >
> > > I really appreciate any help in this. Thanks
> > >
> > >
> > > No virus found in this outgoing message.
> > > Checked by AVG Free Edition.
> > > Version: 7.5.472 / Virus Database: 269.8.16/849 - Release Date: 6/14/2007 12:44 PM
> > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org