lucene-solr-user mailing list archives

From Vikram Srinivasan <vikram.sriniva...@zettata.com>
Subject Slow Indexing speed for csv files, multi-threaded indexing
Date Tue, 05 Nov 2013 05:45:16 GMT
Hello,

  I know this has been discussed extensively in past posts. I have tried a
bunch of suggestions and I still have a few questions.

 I am running Solr 4.4 under Tomcat 7 with OpenJDK 1.7 and a single Solr
core.
 I am trying to index a set of CSV files (total size 13GB). Each CSV file
contains a long list of bigram-frequency tuples of the form (word1 word2,
frequency), as shown below.

E.g: blue sky, 2500
       green grass, 300
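
For reference, a minimal sketch of how one such row could be split into the
two values that go into the word and frequency fields. The class and method
names (BigramCsv, Row, parseLine) are hypothetical, and splitting on the last
comma is an assumption about the file format, not something I have verified
against every file.

```java
final class BigramCsv {
    /** One parsed row: the bigram text and its frequency. */
    static final class Row {
        final String word;
        final long frequency;

        Row(String word, long frequency) {
            this.word = word;
            this.frequency = frequency;
        }
    }

    /** Splits on the last comma (an assumption) and trims surrounding whitespace. */
    static Row parseLine(String line) {
        int comma = line.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("No comma in row: " + line);
        }
        String word = line.substring(0, comma).trim();
        long frequency = Long.parseLong(line.substring(comma + 1).trim());
        return new Row(word, frequency);
    }

    public static void main(String[] args) {
        Row row = parseLine("blue sky, 2500");
        System.out.println(row.word + " -> " + row.frequency); // blue sky -> 2500
    }
}
```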

My schema.xml is as simple as can be: I index these two fields, of type
string and long, and do not use any tokenizer or analyzer factories, as
shown below.


<fields>
  <field name="_version_" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
  <field name="word" type="string" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
  <field name="frequency" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
</fields>

In my solrconfig.xml:

My ramBufferSizeMB is 100, mergeFactor is 10, and maxIndexingThreads is 8.

I am using SolrJ and ConcurrentUpdateSolrServer (CUSS) to index. I have set
the queue size to 10000 and the number of threads to 10, and I use the
javabin format.

I run my solrj instance by providing the path to the directory where the
csv files are stored.

I start one instance of CUSS and have multiple threads reading from the
various files simultaneously and writing to that shared CUSS instance
concurrently. I commit only after all the records have been indexed. My
autocommit values for number of documents and commit time are also set to
very large numbers.
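
To make the threading layout concrete, here is a minimal sketch of the
pattern using only the JDK. It is a stand-in, not SolrJ: a bounded
BlockingQueue plays the role of the CUSS queue, the "sender" threads play the
role of the CUSS threads, and counting a row stands in for sending it to
Solr. All names (SharedQueueSketch, run, feed, drain) are hypothetical, and
it uses lambdas, so it needs Java 8 or later.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class SharedQueueSketch {
    // Unique sentinel object telling a sender thread to stop.
    private static final String EOF = new String("EOF");

    /** One reader per "file" feeds a bounded queue drained by senderThreads. */
    static long run(List<List<String>> files, int queueSize, int senderThreads) {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueSize);
        final AtomicLong sent = new AtomicLong();
        ExecutorService senders = Executors.newFixedThreadPool(senderThreads);
        ExecutorService readers = Executors.newFixedThreadPool(Math.max(1, files.size()));

        for (int i = 0; i < senderThreads; i++) {
            senders.execute(() -> drain(queue, sent));
        }
        for (final List<String> file : files) {
            readers.execute(() -> feed(file, queue));
        }
        try {
            readers.shutdown();
            readers.awaitTermination(1, TimeUnit.MINUTES);
            for (int i = 0; i < senderThreads; i++) {
                queue.put(EOF); // one stop signal per sender thread
            }
            senders.shutdown();
            senders.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            readers.shutdownNow();
            senders.shutdownNow();
        }
        return sent.get();
    }

    /** Reader side: blocks whenever the queue is full. */
    private static void feed(List<String> rows, BlockingQueue<String> queue) {
        try {
            for (String row : rows) {
                queue.put(row);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Sender side: counting a row stands in for sending it to the server. */
    private static void drain(BlockingQueue<String> queue, AtomicLong sent) {
        try {
            while (true) {
                String row = queue.take();
                if (row == EOF) {
                    return;
                }
                sent.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In this layout, if the readers cannot fill the queue as fast as the senders
empty it, or the receiving side is the slow part, changing the queue size or
the number of sender threads would not be expected to change the total time.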

I have tried indexing a test set of CSV files that contains 1.44M records
(total size 21MB). All my tests have been on different types of Amazon EC2
instances, e.g. m1.xlarge (4 vCPU, 15GB RAM) and m3.2xlarge (8 vCPU, 30GB
RAM).

I have set my JVM heap size large enough and tuned GC parameters as
suggested on various forums.

Observations:

1. Indexing 1.44M records (rows in the CSV files) takes 240s on the
m1.xlarge instance and 160s on the m3.2xlarge instance.
2. The indexing speed is independent of whether I have one large file with
1.44M rows or 2 files with 720K rows each.
3. My indexing speed is independent of the number of threads and queue size
I specify for CUSS. I have set both parameters as low as 1 with no
difference.
4. My indexing speed is independent of mergeFactor, ramBufferSizeMB and the
number of indexing threads. I've tried various settings.
5. It appears that I am not really indexing my files in parallel if I use a
single Solr core. Is this not possible? What exactly does maxIndexingThreads
in solrconfig.xml control?
6. My concern is that my indexing speed is way slower than what I've seen
claimed on various forums (e.g., 29GB wikipedia in 13 minutes, 50GB in 39
minutes etc.) even with a single solr core.
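
To put the numbers above in perspective, here is a small back-of-the-envelope
calculation. It assumes 1GB = 1024MB and that indexing time scales linearly
with input size, neither of which I have verified; the class and method names
are hypothetical. The measured times work out to 6000 records/s on m1.xlarge
and 9000 records/s on m3.2xlarge, and at those rates the full 13GB would take
roughly 42 hours and 28 hours respectively.

```java
final class IndexingThroughput {
    /** Records indexed per second for a measured run. */
    static double recordsPerSecond(long records, double seconds) {
        return records / seconds;
    }

    /** Linear extrapolation (an assumption) from a sample run to the full data set. */
    static double projectedSeconds(double totalMb, double sampleMb, double sampleSeconds) {
        return totalMb * sampleSeconds / sampleMb;
    }

    public static void main(String[] args) {
        long records = 1_440_000L;   // rows in the 21MB test set
        double sampleMb = 21.0;      // size of the test set
        double totalMb = 13 * 1024.0; // full data set, assuming 1GB = 1024MB

        System.out.printf("m1.xlarge:  %.0f records/s, %.1f h for 13GB%n",
                recordsPerSecond(records, 240),
                projectedSeconds(totalMb, sampleMb, 240) / 3600);
        System.out.printf("m3.2xlarge: %.0f records/s, %.1f h for 13GB%n",
                recordsPerSecond(records, 160),
                projectedSeconds(totalMb, sampleMb, 160) / 3600);
    }
}
```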

What am I doing wrong? How do I speed up my indexing? Any suggestions will
be appreciated.

Thanks,
Vikram
