From Fergus McMenemie <>
Subject Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Date Wed, 26 Nov 2008 15:54:20 GMT
Hello Grant, 

I am not much good with Java profilers (yet!), so I thought I
would send a script instead!

Details... details! Once I decided to produce a script to
replicate the 1.2 vs 1.3 speed problem, the required rigor
revealed a lot more.

1) The faster version, which I previously referred to as 1.2,
   was actually a "1.3-dev" I had downloaded as part of the
   Solr bootcamp class at ApacheCon Europe 2008. The ID
   string in the CHANGES.txt document is:-
   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
2) I did actually download and speed test a version of 1.2
   from the internet. Its CHANGES.txt ID is:-
   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
   Speed-wise it was about the same as 1.3, at 64min. It also
   had lots of charset issues and is ignored from now on.
3) The version I was planning to use, till I found this
   speed issue, was the "latest" official version:-
   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
   I also verified the behavior with a nightly build:-
   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
Anyway, the following script indexes the content in 22min
for the 1.3-dev version and takes 68min for the newer releases
of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
release and used it to replace the conf directory from the
official 1.3 release. The 3x slowdown was still there; it is
not a configuration issue!

#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20". Also,
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
# All the following was done as root.

# I have a directory /usr/local/ts which contains four versions of solr: the
# "official" 1.2, two 1.3 releases, and what is either a 1.2 or a 1.3 beta
# that I got while attending a solr bootcamp. I indexed the same content using
# the different versions of solr as follows:
cd /usr/local/ts
if [ "" ]
then
   echo "Starting from a-fresh"
   sleep 5 # allow time for me to interrupt!
   cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
   cp -Rp apache-solr-nightly/example/solr ./solrnightly
   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
   # the gaz is regularly updated and its name keeps changing :-) The page
   # has a link to the latest
   # version.
   curl "" >
   unzip -q
   # delete corrupt blips!
   perl -i -n -e 'print unless
       ($. > 2128495 and $. < 2128505) or
       ($. > 5944254 and $. < 5944260)
       ;' geonames_dd_dms_date_20081118.txt
   #following was used to detect bad short records
   #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);'
   # my set of fields and copyfields for the schema.xml
   # ($fields starts with <fields> and $copyfields with </fields> because
   # the perl substitutions below strip those tags before printing)
   fields='
   <fields>
      <field name="UNI"           type="string" indexed="true"  stored="true" required="true"/>
      <field name="CCODE"         type="string" indexed="true"  stored="true"/>
      <field name="DSG"           type="string" indexed="true"  stored="true"/>
      <field name="CC1"           type="string" indexed="true"  stored="true"/>
      <field name="LAT"           type="sfloat" indexed="true"  stored="true"/>
      <field name="LONG"          type="sfloat" indexed="true"  stored="true"/>
      <field name="MGRS"          type="string" indexed="false" stored="true"/>
      <field name="JOG"           type="string" indexed="false" stored="true"/>
      <field name="FULL_NAME"     type="string" indexed="true"  stored="true"/>
      <field name="FULL_NAME_ND"  type="string" indexed="true"  stored="true"/>
      <!--field name="text"       type="text"   indexed="true"  stored="false" multiValued="true"/-->
      <!--field name="timestamp"  type="date"   indexed="true"  stored="true"  default="NOW"/-->
   '
   copyfields='
   </fields>
      <copyField source="FULL_NAME" dest="text"/>
      <copyField source="FULL_NAME_ND" dest="text"/>
   '
   # add in my fields and copyfields
   perl -i -p -e "print qq($fields) if s/<fields>//;"           solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[</fields>][];"     solr*/conf/schema.xml
   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;"            solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
   # enable remote streaming in solrconfig file
   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
fi

# some constants to keep the curl command shorter

export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"

echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
/usr/local/tomcat/bin/shutdown.sh   # assumed: the standard tomcat stop script
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
   echo "Tomcat would not shutdown"
   exit 1
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrbc solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh    # assumed: the standard tomcat start script
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"

echo "Getting ready to index the data set using solrnightly"
/usr/local/tomcat/bin/shutdown.sh   # assumed: the standard tomcat stop script
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
   echo "Tomcat would not shutdown"
   exit 1
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrnightly solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh    # assumed: the standard tomcat start script
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrnightly"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
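
The two indexing runs above use the same long update URL, so it could be built in one place. A minimal sketch, assuming the same localhost:8080 tomcat as in the script; csv_update_url is a hypothetical helper name and is not part of the script I actually ran:

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): builds the CSV update
# URL used by both indexing runs.
#   $1 = path of the CSV file for stream.file
#   $2 = comma-separated list of columns to skip
csv_update_url() {
   local file=$1 skip=$2
   # %%00 and %%09 print as the literal URL escapes %00 and %09 (tab)
   printf 'http://localhost:8080/solr/update/csv?commit=true&stream.file=%s&escape=%%00&separator=%%09&skip=%s\n' "$file" "$skip"
}
```

Each run would then reduce to: time curl "$(csv_update_url "$file" "$skip")"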

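The commented-out perl check for bad short records in the script could also be done with awk. This is only a sketch of the same idea: report_short_records is a hypothetical name, and awk counts trailing empty fields slightly differently from perl's split, so the two may disagree on some edge cases.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): report every line whose
# tab-separated field count differs from the expected one.
#   $1 = file to check
#   $2 = expected number of fields (26 for the data set in this message)
report_short_records() {
   awk -F'\t' -v want="$2" \
      'NF != want { printf "line %d is bad with %d args\n", NR, NF }' "$1"
}
```

For the gazetteer file this would be run as: report_short_records geonames_dd_dms_date_20081118.txt 26
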
>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>> Hello Grant,
>>> Were you overwriting the existing index or did you also clean out the
>>> Solr data directory, too?  In other words, was it a fresh index, or an
>>> existing one?  And was that also the case for the 22 minute time?
>> No, in each case it was a new index. I store the indexes (the "data" dir)
>> outside the solr home directory. For the moment I rm -rf the index dir
>> after each edit to the solrconfig.xml or schema.xml file and reindex
>> from scratch. The relaunch of tomcat recreates the index dir.
>>> Would it be possible to profile the two instances and see if you notice
>>> anything different?
>> I don't understand this. Do you mean run a profiler against the tomcat
>> image as indexing takes place, or somehow compare the indexes?
>Something like JProfiler or any other Java profiler.
>> I was thinking of making a short script that replicates the results
>> and posting it here. Would that help?
>Very much so.
>>> Thanks,
>>> Grant
>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>> Hello,
>>>> I have a CSV file with 6M records which took 22min to index with
>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>> Indexing the exact same content now takes 69min. My machine has
>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>> Are there any tweaks I can use to get the original index time
>>>> back? I read through the release notes and was expecting a
>>>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>>>> it to 64MB; it had no effect.
>>>> -- 


Fergus McMenemie     
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer

