lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: What is the best way to index 15 million documents of total size 425 GB?
Date Thu, 03 Mar 2016 20:14:38 GMT
What does a typical document look like - number of columns, data type,
size? How much is text vs. numeric? Are there any large blobs? I mean, 15M
docs in 425GB indicates about 28K per row/document which seems rather large.
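(Editor's note: the ~28K figure is simple division; a quick sanity check, using decimal gigabytes:)

```python
# Back-of-the-envelope check of the per-document size cited above.
index_bytes = 425 * 10**9        # 425 GB on disk
num_docs = 15_000_000            # 15M records
avg_bytes = index_bytes / num_docs
print(f"{avg_bytes / 1000:.1f} KB per document")  # ~28.3 KB
```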

Is the PG data VARCHAR(n) or CHAR(n)? IOW, might the text columns have lots
of trailing blanks?

As always, the very first question in Solr data modeling is how you intend
to query the data - queries determine the data model.

Ultimately, the issue will not be how long it takes to index, but query
latency and query throughput.

30GB of RAM sounds way too small for a 425GB index if you want any decent
odds of low query latency.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 12:54 PM, Aneesh Mon N <aneeshmonn@gmail.com> wrote:

> Hi,
>
> We are facing a huge performance issue while indexing the data to Solr, we
> have around 15 million records in a PostgreSql database which has to be
> indexed to Solr 5.3.1 server.
> It takes around 16 hours to complete the indexing as of now.
>
> To be noted that all the fields are stored so as to support the atomic
> updates.
>
> Current approach:
> We use an ETL tool (Pentaho) to fetch the data from the database in chunks
> of 1000 records, convert them to XML, and push them to Solr. This runs in
> 10 parallel threads.
>
> System params
> Solr Version: 5.3.1
> Size on disk: 425 GB
>
> Database, ETL machine and SOLR are of 16 core and 30 GB RAM
> Database and SOLR Disk: RAID
>
> Any pointers on best approaches to indexing this kind of data would be helpful.
>
> --
> Regards,
> Aneesh Mon N
> Chennai
> +91-8197-188-588
>
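(Editor's note: the chunked XML push described above can be sketched in a few
lines. The endpoint URL, core name, field names, and commitWithin value below
are illustrative assumptions, not details from the original post.)

```python
import xml.etree.ElementTree as ET

def to_solr_xml(batch):
    """Render a batch of dict records as a Solr <add> XML update document."""
    add = ET.Element("add")
    for record in batch:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", attrib={"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def batches(records, size=1000):
    """Yield fixed-size chunks, mirroring the 1000-record chunks described above."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Posting each batch (requires the third-party 'requests' package and a
# running Solr core; URL and core name are hypothetical):
# import requests
# for batch in batches(rows):
#     requests.post(
#         "http://localhost:8983/solr/mycore/update?commitWithin=60000",
#         data=to_solr_xml(batch).encode("utf-8"),
#         headers={"Content-Type": "text/xml"})
```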
