lucene-java-user mailing list archives

From Toke Eskildsen
Subject Re: scalability limit in terms of numbers of large documents
Date Mon, 16 Aug 2010 08:09:38 GMT
On Sat, 2010-08-14 at 03:24 +0200, andynuss wrote:
> Let's say that I am indexing large book documents broken into chapters — a
> typical book that you buy at Amazon. What would be the approximate limit to
> the number of books that can be indexed slowly and searched quickly? The
> search unit would be a chapter, so assume that a book is divided into 15-50
> chapters. Any ideas?

Hathi Trust has an excellent blog where they write about indexing
5 million+ scanned books. They focus on OCR'ed books, where dirty data is
a big problem, but most of their thoughts and solutions apply to clean
data too.

Toke Eskildsen

