lucene-java-user mailing list archives

From Chris Lamprecht <clampre...@gmail.com>
Subject Re: Lucene bulk indexing
Date Tue, 19 Apr 2005 21:39:15 GMT
Mufaddal,

First, add some timing code to determine whether the slow part is
your database or your indexing (I think tokenization occurs in the
call to writer.addDocument()).  Assuming your database query is the
bottleneck, read on...
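
As a rough illustration of the timing suggestion above, here is a minimal sketch in plain Java. The class PhaseTimer and the phase names "db" and "index" are hypothetical, not part of Lucene or Hibernate; the class only accumulates wall-clock time per phase so the two can be compared. Because writer.addDocument() throws IOException, that call would need to be wrapped (for example in an UncheckedIOException) to fit the Supplier used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates elapsed wall-clock time per named phase.
final class PhaseTimer {
    private final Map<String, Long> totals = new LinkedHashMap<>();

    // Runs the work, adds its elapsed nanoseconds to the phase total,
    // and returns whatever the work returned.
    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totals.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds recorded for a phase (0 if the phase never ran).
    public long totalNanos(String phase) {
        return totals.getOrDefault(phase, 0L);
    }

    // One line per phase, in first-use order, in milliseconds.
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue() / 1_000_000).append(" ms\n");
        }
        return sb.toString();
    }
}
```

Wrapping the database fetch as timer.time("db", ...) and each addDocument call as timer.time("index", ...), then printing timer.report() at the end of the run, should show which side dominates.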

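Depending on the details of your database (which fields are indexed,
etc.), your query using LIMIT x,y may be slow, since MySQL generally
has to read and discard the first x rows before it returns the next
y, so each successive chunk costs more.  You might try this: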

At the start of your index process, issue a one-time query to your
database that returns a list of ALL primary keys that you want to
index.  Save this list of primary keys in memory.

Then fetch your records in "chunks" of 1000 or so, using a query
like "select * from mytable where primarykey = 1 or primarykey = 2 or
... or primarykey = 1000".  A primary key lookup like this should be
fast.  I also got a small performance increase (with MySQL 4.1) by
replacing the "OR" query above with something like "select * from
mytable where primarykey IN (1,2,3,...,1000)".
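
The chunked lookup described above might be sketched like this with plain JDBC. The table and column names (mytable, primarykey) come from the example queries; the class ChunkedFetch, its methods, and the RowHandler interface are made up for illustration, and the keys are assumed to be numeric. The sketch uses "?" placeholders rather than inlining the values, which is a choice made here, not something the approach requires.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for fetching rows by primary key in fixed-size chunks.
final class ChunkedFetch {

    // Callback invoked once per row; hypothetical, for illustration only.
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    // Splits the full key list (loaded once up front) into chunks of at most 'size'.
    public static List<List<Long>> chunks(List<Long> ids, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<Long>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            out.add(ids.subList(i, Math.min(i + size, ids.size())));
        }
        return out;
    }

    // Builds "select * from mytable where primarykey in (?,?,...)" with n placeholders.
    public static String inQuery(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("need at least one key");
        }
        return "select * from mytable where primarykey in ("
                + String.join(",", Collections.nCopies(n, "?")) + ")";
    }

    // Runs the IN query for one chunk and hands each row to the handler.
    public static void forEachRow(Connection conn, List<Long> chunk, RowHandler handler)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(inQuery(chunk.size()))) {
            for (int i = 0; i < chunk.size(); i++) {
                ps.setLong(i + 1, chunk.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs);
                }
            }
        }
    }
}
```

The key list itself would come from the one-time query (something like "select primarykey from mytable"), and the indexing loop would then call forEachRow once per chunk, building one Lucene Document per row.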


On 4/19/05, Mufaddal Khumri <MKhumri@allegromedical.com> wrote:
> Hi,
> 
> The 20000 products I mentioned are 20000 rows. I get the products in
> bulk by using a limit clause.
> 
> I am using Hibernate with MySQL server on a 2.8 GHz, 1.00 GB RAM machine.
> 
> I am baffled that 1.2 or 1.5 million records are being indexed in 20
> minutes compared to the 20000 records I am indexing in 22 minutes. One
> possibility is that the hardware you guys are using is much better. I do
> not know what else can cause the indexing to be so slow. Where am I
> going wrong with this?
> 
> This is the gist of what I am doing in my indexer for all products:
> 
> Product prod = list.next();
> Document doc = new Document();
> String name = prod.getName();
> Description desc = (Description)prod.getDescriptions().get(Description.LONG);
> String longDesc = null, shortDesc = null, content = null;
> 
> if(desc != null)
>  longDesc = desc.getText();
> else
>  longDesc = "";
> 
> desc = (Description)prod.getDescriptions().get(Description.SHORT);
> if(desc != null)
>  shortDesc = desc.getText();
> else
>  shortDesc = "";
> 
> content = name + " " + shortDesc + " " + longDesc + " ";
> 
> doc.add(Field.UnIndexed("id", prod.getId() + ""));
> doc.add(Field.Keyword("entity","product"));
> doc.add(Field.Text("name", name));
> doc.add(Field.Text("shortDescription", (shortDesc.equals("") ? "" :
> shortDesc)));
> doc.add(Field.Text("longDescription", (longDesc.equals("") ? "" :
> longDesc)));
> doc.add(Field.Text("content", content));
> 
> try
> {
>        writer.addDocument(doc);
> }
> catch (IOException e)
> {
>        log.error("Error adding document: " + e.getMessage());
> }
> 
> Where could I be slowing down the indexing process? Is it because I am
> indexing the longDescription as a Field.Text? (longDescription could
> have a fair amount of text averaging about 4000 words).
> 
> Any ideas?
> 
> Thanks,
> Mufaddal.
> 
> -----Original Message-----
> From: Daniel Herlitz [mailto:daniel@dhlz.com]
> Sent: Tuesday, April 19, 2005 2:16 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene bulk indexing
> 
> Agree. We run an index with about 2.5 million documents and around 30
> fields. The indexing itself of 20000 items should only take a few
> seconds on a reasonably fast machine.
> 
> /D
> 
> Kevin L. Cobb wrote:
> 
> >I think your bottleneck is most likely the DB hit. I assume by 20000
> >products you mean 20000 distinct entries into the Lucene Index, i.e.
> >20000 rows in the DB to select from.
> >
> >I index about 1.5 million rows from a SQL Server 2000 database with
> >several fields for each entry and it finishes in about twenty minutes
> >so you should be able to index 20000 rows in a few seconds.
> >
> >Make sure your database table(s) are indexed appropriately according to
> >your select statements. Indexing correctly will be the biggest
> >performance improvement you will see.
> >
> >Best of luck.
> >
> >KLCobb
> >
> >-----Original Message-----
> >From: Mufaddal Khumri [mailto:MKhumri@allegromedical.com]
> >Sent: Tuesday, April 19, 2005 2:11 PM
> >To: java-user@lucene.apache.org
> >Subject: Lucene bulk indexing
> >
> >Hi,
> >
> >I am sure this question must have been raised before and maybe it has
> >even been answered. I would be grateful if someone could point me in
> >the right direction or give their thoughts on this topic.
> >
> >The problem:
> >
> >I have a little over 20000 products that I need to index. At the
> >moment I get X number of products at a time and index them. This
> >process takes about 26 minutes (I am indexing the database id, product
> >name, and product description).
> >
> >I was thinking of ways to make this indexing faster. For this I was
> >thinking about writing a threaded module that would index X number of
> >products simultaneously. For instance I could spawn (Number of
> >products/X) number of threads and do the indexing. I am guessing this
> >would be faster but by what factor would this be faster? (I understand
> >the writes to the index are synchronized by lucene).
> >
> >Is there any other approach by which I could speed up the indexing?
> >Thoughts? Suggestions?
> >
> >Thanks,
> >Mufaddal.
> >
> >
> >------------------------------------------------------------------------------------------
> >This email and any files transmitted with it are confidential
> >and intended solely for the use of the individual or entity
> >to whom they are addressed. If you have received this
> >email in error please notify the system manager. Please
> >note that any views or opinions presented in this email
> >are solely those of the author and do not necessarily
> >represent those of the company. Finally, the recipient
> >should check this email and any attachments for the
> >presence of viruses. The company accepts no liability for
> >any damage caused by any virus transmitted by this email.
> >Consult your physician prior to the use of any medical
> >supplies or product.
> >------------------------------------------------------------------------------------------
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >For additional commands, e-mail: java-user-help@lucene.apache.org
