From Toke Eskildsen <t...@kb.dk>
Subject Re: Is Solr can do that ?
Date Sat, 22 Jun 2019 09:35:38 GMT
Matheo Software Info <info@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes, assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give exact numbers. What we can do is share our experience.

We are doing web archive indexing and I guess there would be quite an overlap with your content, as we also use Tika. One difference is that the images in a web archive are quite cheap to index, so you'll probably need (relatively) more hardware than we use. Very roughly, we used 40 CPU-years to index 600 (700? I forget) TB of data in one of our runs. Scaling that to your 30TB suggests something like 2 CPU-years, or a couple of months for a 16-core machine.
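
Spelling that back-of-the-envelope calculation out, using the rough figures above (a minimal sketch, not a capacity plan):

# Back-of-the-envelope scaling of our run to a 30 TB corpus.
# All inputs are the approximate figures mentioned above.
indexed_tb = 600        # TB in our run (maybe 700, I forget)
cpu_years_spent = 40    # CPU-years for Tika + indexing in that run
your_tb = 30

cpu_years = cpu_years_spent / indexed_tb * your_tb      # ~2 CPU-years
months_on_16_cores = cpu_years / 16 * 12                # ~1.5 months
print(f"{cpu_years:.1f} CPU-years, ~{months_on_16_cores:.1f} months on 16 cores")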

This is just to get a ballpark: you will do yourself a huge favor by building a test setup and processing 1 TB or so of your data to get _your_ numbers before you design your indexing setup. It is our experience that the analyzing part (Tika) takes much more power than the Solr indexing part: in our last run we had 30-40 CPU-cores doing Tika (and related analysis) feeding into a single Solr instance on a 4-core machine with spinning drives.
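
If it helps for the test setup, here is a minimal sketch of that split, assuming a standalone Tika server for extraction and plain HTTP posts to Solr's /update handler. The host names, collection name and field name are placeholders, not our actual setup:

import requests

# Placeholders: adjust hosts, collection and field names to your installation.
TIKA_URL = "http://tika-host:9998/tika"              # standalone Tika server
SOLR_URL = "http://solr-host:8983/solr/docs/update"  # 'docs' is a made-up collection

def index_file(path, doc_id):
    # Tika does the expensive part: extract plain text from PDF/Word/etc.
    with open(path, "rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()

    # Hand Solr a ready-made document; the indexing side is comparatively cheap.
    doc = {"id": doc_id, "content_txt": resp.text}
    requests.post(SOLR_URL, json=[doc],
                  params={"commitWithin": "10000"}).raise_for_status()

index_file("sample.pdf", "sample-1")

The point of keeping the two halves separate is that you can throw many extraction workers at Tika without touching the Solr machine.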


As for the Solr setup for search, you need to describe your requirements in detail before we can give suggestions. Is the index updated all the time, in batches, or one-off? How many concurrent users? Are the searches interactive or batch jobs? What kinds of aggregations do you need?

In our setup we build separate collections that are merged to single segments and never updated. Our use varies between very few interactive users and a lot of batch jobs. Scaling this specialized setup to your corpus size would require about 3TB of SSD, 64GB of RAM and 4 CPU-cores, divided among 4 shards. You are likely to need quite a lot more than that, so this is just to say that at this scale the use of the index matters _a lot_.
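
For reference, the "merged to single segments" part is nothing magic: one way to do it from the client side is an explicit optimize call against a collection that will never be updated again (host and collection name are placeholders here):

import requests

# Force-merge a finished, never-to-be-updated collection down to one segment.
SOLR = "http://solr-host:8983/solr/webarchive_2019/update"
requests.get(SOLR, params={"optimize": "true", "maxSegments": "1"}).raise_for_status()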

- Toke Eskildsen