lucene-java-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: scalability limit in terms of numbers of large documents
Date Mon, 16 Aug 2010 08:09:38 GMT
On Sat, 2010-08-14 at 03:24 +0200, andynuss wrote:
> Let's say that I am indexing large book documents broken into chapters, the
> kind of typical book that you buy at Amazon.  What would be the approximate
> limit on the number of books that can be indexed slowly and searched quickly?
> The search unit would be a chapter, so assume that a book is divided into
> 15-50 chapters.  Any ideas?

HathiTrust has an excellent blog where they write about indexing
5 million+ scanned books: http://www.hathitrust.org/blogs
They focus on OCR'ed books, where dirty data is a big problem, but most
of their thoughts and solutions apply to clean data too.

Regards,
Toke Eskildsen

