lucene-solr-user mailing list archives

From Fergus McMenemie <fer...@twig.me.uk>
Subject Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Date Wed, 26 Nov 2008 15:54:20 GMT
Hello Grant, 

Not much good with Java profilers (yet!) so I thought I 
would send a script!

Details... details! Having decided to produce a script to 
replicate the 1.2 vs 1.3 speed problem, I found that the 
required rigor revealed a lot more.

1) The faster version I have previously referred to as 1.2,
   was actually a "1.3-dev" I had downloaded as part of the
   solr bootcamp class at ApacheCon Europe 2008. The ID
   string in the CHANGES.txt document is:-
   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
   
2) I did actually download and speed test a version of 1.2 
   from the internet. Its CHANGES.txt id is:-
   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
   Speed wise it was about the same as 1.3 at 64min. It also
   had lots of char set issues and is ignored from now on.
   
3) The version I was planning to use, till I found this
   speed issue, was the "latest" official version:-
   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
   I also verified the behavior with a nightly build.
   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
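
For anyone wanting to check which build they actually have, the ID strings quoted above can be pulled out with a small helper. This is only a sketch: the function name solr_build_id is made up for illustration, and it assumes CHANGES.txt sits at the top of each unpacked distribution directory.

```shell
# Print the first line of a CHANGES.txt file that carries the
# Subversion "$Id:" keyword.
# Usage: solr_build_id path/to/CHANGES.txt
solr_build_id() {
    # -F treats the pattern as a literal string, -m1 stops at the first match.
    grep -m1 -F '$Id:' "$1"
}

# Example (assumed layout, directory names as used in the script below):
#   for d in apache-solr-bc apache-solr-1.3.0 apache-solr-nightly; do
#       solr_build_id "$d/CHANGES.txt"
#   done
```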
   
Anyway, the following script indexes the content in 22min
for the 1.3-dev version and takes 68min for the newer releases
of 1.3. I took the conf directory from the 1.3-dev (bootcamp) 
release and used it to replace the conf directory from the
official 1.3 release. The 3x slowdown was still there; it is
not a configuration issue!
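
The conf swap described above could be done roughly as follows. This is a sketch rather than the exact commands used: the function name swap_conf and the conf.orig backup name are assumptions, and the two arguments are solr home directories such as the solrbc and solr13 copies made by the script below.

```shell
# Replace the conf directory of one solr home with a copy taken from
# another, keeping the original as conf.orig so the swap can be undone.
# Usage: swap_conf SOURCE_HOME TARGET_HOME
swap_conf() {
    local src="$1" dst="$2"
    [ -d "$src/conf" ] || { echo "no conf directory in $src" >&2; return 1; }
    [ -d "$dst/conf" ] || { echo "no conf directory in $dst" >&2; return 1; }
    # Refuse to clobber an earlier backup.
    [ -e "$dst/conf.orig" ] && { echo "$dst/conf.orig already exists" >&2; return 1; }
    mv "$dst/conf" "$dst/conf.orig" || return 1
    cp -Rp "$src/conf" "$dst/conf"
}

# Example: swap_conf solrbc solr13
```

After the swap the index has to be rebuilt from scratch for the timing to be comparable.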
=================================






#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20" Also 
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml. 
# All the following was done as root.


# I have a directory /usr/local/ts which contains four versions of solr. The
# "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 1.3 beta
# I got while attending a solr bootcamp. I indexed the same content using the
# different versions of solr as follows:
cd /usr/local/ts
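# The one-off setup below is switched off: [ "" ] is always false.
# Put any non-empty string between the quotes to run it.
if [ "" ] 
then 
   echo "Starting afresh"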
   sleep 5 # allow time for me to interrupt!
   cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
   cp -Rp apache-solr-nightly/example/solr ./solrnightly
   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
   
   # the gaz is regularly updated and its name keeps changing :-) The page
   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
   # version.
   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
   unzip -q geonames.zip
   # delete corrupt blips!
   perl -i -n -e 'print unless  
       ($. > 2128495 and $. < 2128505) or
       ($. > 5944254 and $. < 5944260) 
       ;' geonames_dd_dms_date_20081118.txt
   #following was used to detect bad short records
   #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
   
   # my set of fields and copyfields for the schema.xml
   fields='
   <fields>
      <field name="UNI"           type="string" indexed="true"  stored="true" required="true"/>
      <field name="CCODE"         type="string" indexed="true"  stored="true"/>
      <field name="DSG"           type="string" indexed="true"  stored="true"/>
      <field name="CC1"           type="string" indexed="true"  stored="true"/>
      <field name="LAT"           type="sfloat" indexed="true"  stored="true"/>
      <field name="LONG"          type="sfloat" indexed="true"  stored="true"/>
      <field name="MGRS"          type="string" indexed="false" stored="true"/>
      <field name="JOG"           type="string" indexed="false" stored="true"/>
      <field name="FULL_NAME"     type="string" indexed="true"  stored="true"/>
      <field name="FULL_NAME_ND"  type="string" indexed="true"  stored="true"/>
      <!--field name="text"       type="text"   indexed="true"  stored="false" multiValued="true"/-->
      <!--field name="timestamp"  type="date"   indexed="true"  stored="true"  default="NOW" multiValued="false"/-->
   '
   copyfields='
      </fields>
      <copyField source="FULL_NAME" dest="text"/>
      <copyField source="FULL_NAME_ND" dest="text"/>
   '
   
   # add in my fields and copyfields
   perl -i -p -e "print qq($fields) if s/<fields>//;"           solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[</fields>][];"     solr*/conf/schema.xml
   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;"            solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
   # enable remote streaming in solrconfig file
   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
   fi

# some constants to keep the curl command shorter
skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
file=`pwd`"/geonames.txt"

export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"

echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ] 
   then 
   echo "Tomcat would not shut down"
   exit 1
   fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrbc solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"

echo "Getting ready to index the data set using solrnightly"
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ] 
   then 
   echo "Tomcat would not shut down"
   exit 1
   fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrnightly solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrnightly"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
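
The two stanzas above differ only in the distribution directory and the solr home they use, so they could be folded into one function. The following is a sketch of that idea, not what was actually run: the names csv_update_url and index_with are invented here, and index_with relies on $file and $skip being set as in the script above.

```shell
# Build the CSV update URL used for every run.
# Usage: csv_update_url FILE SKIP_LIST
csv_update_url() {
    local file="$1" skip="$2"
    echo "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
}

# Run one timed indexing pass.
# $1 = distribution directory (e.g. apache-solr-bc)
# $2 = matching solr home copy (e.g. solrbc)
index_with() {
    local dist="$1" home="$2"
    echo "Getting ready to index the data set using $home"
    /usr/local/tomcat/bin/shutdown.sh
    sleep 15
    if [ -n "$(ps awxww | grep tomcat | grep -v grep)" ]; then
        echo "Tomcat would not shut down" >&2
        return 1
    fi
    rm -rf /usr/local/tomcat/webapps/solr*
    rm -f  /usr/local/tomcat/logs/*.out
    rm -rf /usr/local/tomcat/work/Catalina/localhost/solr
    cp "$dist/example/webapps/solr.war" /usr/local/tomcat/webapps
    rm -f solr                  # remove the old symbolic link
    ln -s "$home" solr
    rm -rf solr/data
    /usr/local/tomcat/bin/startup.sh
    sleep 10                    # give solr time to launch and set up
    echo "Starting indexing at $(date) with $home"
    time curl "$(csv_update_url "$file" "$skip")"
}

# Example:
#   index_with apache-solr-bc      solrbc
#   index_with apache-solr-nightly solrnightly
```

Keeping the URL in one place also makes sure every version is indexed with exactly the same request parameters.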




>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>
>> Hello Grant,
>>
>>> Were you overwriting the existing index or did you also clean out the
>>> Solr data directory, too?  In other words, was it a fresh index, or an
>>> existing one?  And was that also the case for the 22 minute time?
>>
>> No, in each case it was a new index. I store the indexes (the "data" dir)
>> outside the solr home directory. For the moment I rm -rf the index dir
>> after each edit to the solrconfig.xml or schema.xml file and reindex
>> from scratch. The relaunch of tomcat recreates the index dir.
>>
>>> Would it be possible to profile the two instances and see if you notice
>>> anything different?
>> I don't understand this. Do you mean run a profiler against the tomcat
>> image as indexing takes place, or somehow compare the indexes?
>
>Something like JProfiler or any other Java profiler.
>
>>
>>
>> I was thinking of making a short script that replicates the results,
>> and posting it here, would that help?
>
>
>Very much so.
>
>
>>
>>
>>>
>>> Thanks,
>>> Grant
>>>
>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a CSV file with 6M records which took 22min to index with
>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>
>>>> Indexing the exact same content now takes 69min. My machine has
>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>
>>>> Are there any tweaks I can use to get the original index time
>>>> back? I read through the release notes and was expecting a
>>>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>>>> it to 64MB; it had no effect.
>>>> -- 

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
