lucene-java-user mailing list archives

From "Chew Yee Chuang" <yeechu...@tecforte.com>
Subject RE: FW: Lucene indexing vs RDBMS insertion.
Date Fri, 06 Jul 2007 04:08:49 GMT
Greetings,

I tried the RAM directory and it really sped up the indexing process a lot.
Now my indexing can handle 700 - 800 documents/sec with mergeFactor 10 and
addIndexesNoOptimize. However, when I run indexing and searching
concurrently, the indexing process hits the maximum heap size (512 MB) and
an OutOfMemoryError is thrown. Indexing and searching run against the same
index, and a search is triggered every 5 seconds, with more than 30 queries
per search. I'm running all this on a Pentium 4 2.8 GHz machine with 1 GB
of RAM.
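
As a rough check of these rates against the target mentioned earlier in the thread (3.6 million documents in 1 hour), the arithmetic can be sketched in Java as below. The class and method names are illustrative only and are not part of Lucene.

```java
// Back-of-the-envelope check of the throughput figures quoted in this thread.
class ThroughputCheck {
    static final long SECONDS_PER_HOUR = 3600L;

    // Documents indexed in one hour at a sustained rate.
    static long docsPerHour(long docsPerSecond) {
        return docsPerSecond * SECONDS_PER_HOUR;
    }

    // Sustained rate needed to reach a target within one hour (rounded up).
    static long requiredDocsPerSecond(long targetDocsPerHour) {
        return (targetDocsPerHour + SECONDS_PER_HOUR - 1) / SECONDS_PER_HOUR;
    }

    public static void main(String[] args) {
        long target = 3_600_000L; // target stated earlier in the thread
        System.out.println("required docs/sec: " + requiredDocsPerSecond(target));
        System.out.println("at 700 docs/sec: " + docsPerHour(700) + " docs/hour");
        System.out.println("at 800 docs/sec: " + docsPerHour(800) + " docs/hour");
    }
}
```

At 700 - 800 documents/sec this works out to roughly 2.5 - 2.9 million documents per hour, so the measured rate is still short of the 1000 documents/sec the target implies.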

Is there any way to lighten the indexing process? I had thought of scoring,
because I don't really need the relevancy of a given document to a query,
but I don't know what Lucene does in order to compute the score. I'm just
thinking there might be some processing during indexing so that a search
can return documents ordered by score; please correct me if I'm wrong. Any
suggestion to lighten the indexing is always welcome.

Thank you

 
-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Wednesday, June 20, 2007 8:56 PM
To: java-user@lucene.apache.org
Subject: Re: FW: Lucene indexing vs RDBMS insertion.

That's a tough one. What I still don't get is why your 1,000 records/sec
is important. If you're really inserting records that fast, for very long,
you must have an impressive piece of hardware <G>...

That said, it might be possible to do something like have your base
index out on disk somewhere, and use, say, a RAMdir for some period
of time to collect new insertions. Your searches would have to go
against both indexes. Periodically you'd have to merge the RAMdir
with your filesystem based index and close/reopen the FSDir index.

One problem you'll have is that your application won't be able to see
index updates until you close/reopen your readers. Doing this every
minute for a large index will lead to performance problems. The RAMDir
approach at least lets you close/open on a (hopefully) much smaller/faster
index every minute. Of course you'll have to combine the results from the
different indexes, which may or may not be difficult.
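
The buffering scheme described above can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions: two lists stand in for the RAM-based and filesystem-based indexes, substring matching stands in for a real query, and `TwoTierIndex` and its methods are hypothetical names, not Lucene API. It is single-threaded; a real setup would also need to coordinate readers and writers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the "small in-memory index + large on-disk index" pattern.
// Plain lists stand in for the RAM-based and filesystem-based indexes;
// none of the names below are Lucene API.
class TwoTierIndex {
    private final List<String> base = new ArrayList<>();   // stands in for the on-disk index
    private final List<String> recent = new ArrayList<>(); // stands in for the in-memory index

    // New documents only touch the small in-memory tier.
    void add(String doc) {
        recent.add(doc);
    }

    // A search has to consult both tiers and combine the hits.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : base) {
            if (doc.contains(term)) hits.add(doc);
        }
        for (String doc : recent) {
            if (doc.contains(term)) hits.add(doc);
        }
        return hits;
    }

    // Periodic merge: fold the in-memory tier into the base and start fresh.
    void mergeRecentIntoBase() {
        base.addAll(recent);
        recent.clear();
    }

    int recentSize() { return recent.size(); }

    int baseSize() { return base.size(); }

    public static void main(String[] args) {
        TwoTierIndex index = new TwoTierIndex();
        index.add("login failed for user alice");
        index.add("disk quota exceeded");
        System.out.println(index.search("login")); // visible before any merge
        index.mergeRecentIntoBase();
        index.add("login ok for user bob");
        System.out.println(index.search("login")); // hits come from both tiers
    }
}
```

The point of the split is that only the small tier changes between merges, so only it needs frequent reopening, while the large tier is touched once per merge.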

Hope this helps
Erick

On 6/20/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
>
> Greetings Erick, my index needs to have the latest data (almost real
> time, but a delay of less than 1 minute is acceptable). Thus there is no
> way to schedule the indexing. What I can do is find a solution that
> minimizes the delay so the system can get "almost" real-time data to
> display.
>
> Thanks.
> -------
> eChuang, Chew
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, June 19, 2007 10:03 PM
> To: java-user@lucene.apache.org
> Subject: Re: FW: Lucene indexing vs RDBMS insertion.
>
> You still haven't described how often you need to index and why.
> That's critical. If you have an index that only needs to be updated
> every month, many of your concerns disappear. If it needs to be
> updated every 5 seconds, it's another matter entirely.
>
> So which is it?
>
> Best
> Erick
>
> On 6/18/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> >
> > Thanks for the sharing and suggestions.
> > Yes Chris, the index is to be partitioned by date and time, and old
> > indexes will not be accessed so frequently.
> >
> > I also considered indexing in parallel to different indexes, Erick.
> > But I can only put all the indexes on ONE machine, and there is only
> > ONE machine to process the job (both searching and indexing).
> >
> > I haven't tried addIndexes to combine indexes, but I did try
> > MultiSearcher on 3 million documents in 3 separate indexes and it did
> > not differ much from a search in 1 index. Could you share your
> > experience with addIndexes; how is its performance? In my situation,
> > the indexes may be used for searching at the same time. Another worry
> > about indexing is the optimization process. I have tried it and it
> > takes quite some time to optimize my index, e.g. indexing 5000
> > documents takes about 10 seconds (MergeFactor = 1000, recreating the
> > IndexWriter every 5000 documents), but optimizing takes around ONE
> > minute. So do you guys have any suggestions on the optimization
> > process as well? When should I run it? And is there a big difference
> > between searching an optimized index and an un-optimized one?
> >
> > ------
> > eChuang, Chew
> >
> > -----Original Message-----
> > From: Chris Lu [mailto:chris.lu@gmail.com]
> > Sent: Tuesday, June 19, 2007 2:19 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: FW: Lucene indexing vs RDBMS insertion.
> >
> > Definitely very aggressive.
> >
> > Currently my experience is that, together with database access,
> > DBSight can do 3 million in 2 hours, with Pentium D 3.4GHz. Seems you
> > definitely need some good hardware, and a fast hard drive for this. I
> > feel the hard drive is actually the bottleneck for large indexes.
> > Partitioning your data and pruning the old data should be also
> > considered.
> >
> > --
> > Chris Lu
> > -------------------------
> > Instant Scalable Full-Text Search On Any Database/Application
> > site: http://www.dbsight.net
> > demo: http://search.dbsight.com
> > Lucene Database Search in 3 minutes:
> > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> >
> >
> > On 6/18/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> > > Thanks for your suggestion Erick. I'm planning to test the indexing
> > > soon. For your information, currently the system inserts into the
> > > RDBMS at around 1000 records per second. Thus, if Lucene is put in
> > > place, I would expect it to index that many documents per second as
> > > well (our target is 3.6 million documents indexed in 1 hour). Besides
> > > that, I'm planning to queue the records so Lucene will have enough
> > > time to index them. Anyway, thanks for your suggestion; I will come
> > > back to you once I have tested the solution.
> > >
> > > Thanks,
> > > eChuang, Chew
> > >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > > Sent: Friday, June 15, 2007 11:11 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: FW: Lucene indexing vs RDBMS insertion.
> > >
> > > From my perspective, this is an irrelevant question. The real question
> > > is "is Lucene indexing fast enough for my application?". Which nobody
> > > can answer for you, you have to experiment.
> > >
> > > If you're building an index that's only updated every 6 months,
> > > Lucene is certainly "fast enough". If you're recreating the
> > > index every 6 seconds, it's a different question.
> > >
> > > So, I recommend that you create a test application that does
> > > nothing except read your source, do whatever parsing you
> > > need to do and does NOT index it at all. Record the time it
> > > takes.
> > >
> > > Then try the same thing WITH indexing and record the difference.
> > >
> > > Then, to get a sense of the dimension of the problem, try
> > > substituting inserting into the RDBMS instead of the Lucene
> > > index.
> > >
> > > Once you have numbers, you can make better decisions
> > > And people can give you better advice,  especially if you
> > > include more detail of your design.
> > >
> > > Best
> > > Erick
> > >
> > > On 6/15/07, Chew Yee Chuang <yeechuang@tecforte.com> wrote:
> > > >
> > > > > Hi, I'm a new user of Lucene. I have heard that it is a powerful
> > > > > tool for full-text search, and I'm planning to use it in my
> > > > > project for data storage purposes. Before the implementation, I
> > > > > would like to know whether there are performance issues with the
> > > > > Lucene indexing process. I have no doubts about the retrieval and
> > > > > search features in Lucene, only about the indexing process. I
> > > > > have tested my current system: inserting 1000 records into RDBMS
> > > > > storage took about 1 second. Thus, if I change my solution to
> > > > > Lucene, can the Lucene indexing process perform faster than the
> > > > > RDBMS? I have gone through some articles that discuss the
> > > > > "MergeFactor" and "MaxMergeDocs" parameters for fine-tuning the
> > > > > indexing process, but found no comparison between Lucene indexing
> > > > > and RDBMS insertion. I hope someone with experience in Lucene can
> > > > > provide this information, or point to an article that compares
> > > > > Lucene and RDBMS.
> > > >
> > > >
> > > >
> > > > I really appreciate any help in this. Thanks
> > > >
> > > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> > No virus found in this incoming message.
> > Checked by AVG Free Edition.
> > Version: 7.5.472 / Virus Database: 269.9.0/852 - Release Date: 6/17/2007
> > 8:23 AM
> >
> >
> > No virus found in this outgoing message.
> > Checked by AVG Free Edition.
> > Version: 7.5.472 / Virus Database: 269.9.0/853 - Release Date: 6/18/2007
> > 3:02 PM
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.5.472 / Virus Database: 269.9.0/853 - Release Date: 6/18/2007
> 3:02 PM
>
>
> No virus found in this outgoing message.
> Checked by AVG Free Edition.
> Version: 7.5.472 / Virus Database: 269.9.1/854 - Release Date: 6/19/2007
> 1:12 PM
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
