lucenenet-user mailing list archives

From "Nic Wise" <Nic.W...@bbc.com>
Subject RE: Website and keeping data fresh
Date Tue, 12 Feb 2008 11:07:01 GMT
If it's any help, we (when I was at Quest) were removing and adding items
on an ongoing basis. We had indexes with 25+ million items, around 15GB+
from memory. We'd add around 100K items a day, and some items were added
more than once (which means remove, then add). We got very good
performance once we changed the MaxDocuments (to 100K; it had been
maxint, about 2.1 billion) and the MergeFactor - merging really large
blocks (I don't know the official term - segments?) made performance
lousy, but you don't NEED to keep them all in one file.
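Nic's tuning can be illustrated with a toy model. The sketch below is NOT Lucene code - the class, method names, and numbers are all invented for illustration - it only mimics a logarithmic merge policy: new documents land in tiny segments, and whenever `mergeFactor` same-sized segments pile up they are merged into one bigger segment. Capping the merge output (the effect of lowering MaxDocuments to 100K) prevents the very large, very slow merges, at the cost of leaving more segments on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Lucene-style logarithmic merging (NOT Lucene API).
public class MergeModel {

    // Returns {size of the biggest single merge, segments left at the end}.
    static long[] simulate(int totalDocs, int mergeFactor, int maxMergeDocs) {
        List<Integer> segs = new ArrayList<>(); // segment sizes, newest last
        long biggestMerge = 0;
        for (int i = 0; i < totalDocs; i++) {
            segs.add(1); // each added doc starts life in its own tiny segment
            while (tailEqual(segs, mergeFactor)) {
                int level = segs.get(segs.size() - 1);
                if ((long) level * mergeFactor > maxMergeDocs) break; // capped
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segs.remove(segs.size() - 1);
                }
                biggestMerge = Math.max(biggestMerge, merged);
                segs.add(merged); // merging rewrites every doc in its inputs
            }
        }
        return new long[] { biggestMerge, segs.size() };
    }

    // True if the newest `n` segments all have the same size.
    static boolean tailEqual(List<Integer> segs, int n) {
        if (segs.size() < n) return false;
        int last = segs.get(segs.size() - 1);
        for (int i = segs.size() - n; i < segs.size(); i++) {
            if (segs.get(i) != last) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] uncapped = simulate(100_000, 10, Integer.MAX_VALUE);
        long[] capped = simulate(100_000, 10, 1_000);
        System.out.println("uncapped: biggest merge=" + uncapped[0]
                + " docs, segments left=" + uncapped[1]);
        System.out.println("capped at 1000: biggest merge=" + capped[0]
                + " docs, segments left=" + capped[1]);
    }
}
```

With 100,000 docs and mergeFactor 10, the uncapped run ends in one giant 100,000-doc merge, while the capped run never merges more than 1,000 docs at a time but leaves 100 segments behind - exactly the trade-off Nic describes.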

Does that make sense?

How much stuff are you putting into the book index? 500MB sounds about
right, but an hour sounds a little high.

-----Original Message-----
From: Gautam Lad [mailto:gautam@rogers.com] 
Sent: 12 February 2008 04:21
To: lucene-net-user@incubator.apache.org
Subject: RE: Website and keeping data fresh

Very good to know.  

Since not a lot of documents update during the course of the day, and
since we already rebuild the index at night, I doubt it would hurt
performance as you say :)

Thanks,
--

Gautam Lad

-----Original Message-----
From: Kurt Mackey [mailto:kurt@mubble.net] 
Sent: February 11, 2008 10:10 PM
To: lucene-net-user@incubator.apache.org
Subject: RE: Website and keeping data fresh

Nope.  For that few writes, I can't see how you'd ever need to optimize
during the day.  You might run a few tests to find out how many writes
cause
search performance to degrade, but I suspect it's a lot. :)

Optimizing is slow because it essentially writes all the index contents
to a
new index file.
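Kurt's point can be sanity-checked against the numbers in this thread. The sketch below derives an effective throughput from the ~500MB index size and the 00:02:46:917 optimize time in the log; the throughput figure is derived, not measured, and is only a rough check that optimize cost tracks total index size rather than the size of the update.

```java
// Back-of-envelope check: Optimize() rewrites the whole index, so its
// cost tracks total index size, not the size of the one-doc update.
public class OptimizeCost {

    static double mbPerSecond(double indexMb, double optimizeSeconds) {
        return indexMb / optimizeSeconds;
    }

    public static void main(String[] args) {
        double indexMb = 500.0;                   // "The book index is about 500MB"
        double optimizeSeconds = 2 * 60 + 46.917; // 00:02:46:917 from the log
        System.out.printf("effective optimize throughput: ~%.1f MB/s%n",
                mbPerSecond(indexMb, optimizeSeconds));
        // Adding one document touched only a tiny new segment (seconds);
        // Optimize() rewrote all ~500MB, hence the nearly three minutes.
    }
}
```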

-Kurt

-----Original Message-----
From: Gautam Lad [mailto:gautam@rogers.com]
Sent: Monday, February 11, 2008 8:40 PM
To: lucene-net-user@incubator.apache.org
Subject: Website and keeping data fresh

Hey all,



I recently moved our company's external website to use dotLucene, and so
far
it's been great and is working flawlessly.



I have several indices that I use to manage our website. Since our
company is in the book industry, they serve various parts of the site.



E.g. our main catalog is searchable, so we have a "Book" index that can
be searched by Title, Description, Author, etc.



We also have an Author table that can be searched by First name, Last
name,
bio, etc.



Finally, we have a BookAuthor relationship table: when a Book is
searched, BookAuthor is searched to find out whether the Book's authors
have other books.



The indices are as follows:

Book (primary key: ISBN) - 160,000+ documents
Author (primary key: AuthorID) - 60,000+ documents
BookAuthor (contains LinkID) - 100,000+ documents





So far things are working great.  The book index is about 500MB and is
not a
big overhead on our system.



Now here's where the problem lies.



To keep things fresh on the site, we have a nightly job that rebuilds the
entire index and then copies the data over to the production index folder
(it takes about an hour to rebuild everything and a minute or two to copy
things over).



However, there will be times when the information will need to be
updated
almost live during the normal day-to-day hours.



Say, for example, a book's description has changed. What I do is delete
the document and then re-add it.
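The delete-then-re-add pattern is itself cheap, because a Lucene-style delete only marks the old document and the fresh copy goes into a new small segment. The sketch below is a toy in-memory model of that pattern - it is not Lucene API, and every name in it is invented - but it shows why the update part of the log takes seconds while the optimize that follows does not:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of update-by-delete-then-add (NOT Lucene API). Deletes are
// tombstones (per-ISBN here, a simplification); adds go to a new small
// segment, so neither operation rewrites existing segments.
public class UpdateByDeleteAdd {
    static final List<Map<String, String>> segments = new ArrayList<>(); // isbn -> description
    static final Set<String> deleted = new HashSet<>();                  // tombstoned ISBNs

    static void add(String isbn, String description) {
        Map<String, String> seg = new HashMap<>();
        seg.put(isbn, description);
        segments.add(seg);    // new doc lands in a fresh small segment
        deleted.remove(isbn); // the re-added copy is live again
    }

    static void delete(String isbn) {
        deleted.add(isbn);    // O(1): no segment is rewritten
    }

    static String search(String isbn) {
        for (int i = segments.size() - 1; i >= 0; i--) { // newest segment wins
            String d = segments.get(i).get(isbn);
            if (d != null && !deleted.contains(isbn)) return d;
        }
        return null;
    }

    public static void main(String[] args) {
        add("1554700310", "old description");
        delete("1554700310"); // the delete-then-add update from the log...
        add("1554700310", "new description");
        System.out.println(search("1554700310")); // prints: new description
    }
}
```

In real Lucene the tombstoned copies only disappear when segments are merged or the index is optimized - which is where the time goes.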



Unfortunately, deleting and re-adding it to the index takes a few
minutes, and this causes issues with information not being available when
someone looks on the site.





Here's the log from our background service that rebuilds documents:



20080211 16:59:32 [Engine] [book] Deleting isbn(1554700310).  Status: 1

20080211 16:59:32 [Engine] [book] [00:00:00:000] Getting table count

20080211 16:59:34 [Engine] [book] [00:00:02:156] Rows loaded 1

20080211 16:59:34 [Engine] [book] [00:00:02:156] Getting table schema

20080211 16:59:34 [Engine] [book] [00:00:02:218] Getting data reader

20080211 16:59:36 [Engine] [book] [16:59:36:000] Index dump started

20080211 16:59:36 [Engine] [book] [00:00:00:078] Total indexed: 1

20080211 16:59:36 [Engine] [book] [00:00:00:078] Optimizing index

20080211 17:02:23 [Engine] [book] [00:02:46:917] Index finished



You can see that from the moment it deleted the ISBN from the "book"
index to when it finally added it back took only 4 seconds. But when
Writer.Optimize() is called, it takes almost three minutes (00:02:46 in
the log) to optimize the index.



Is optimizing the index even necessary at this point?



Any help is greatly appreciated.



--

Gautam Lad 
