lucene-dev mailing list archives

From "Mikhail Khludnev (Commented) (JIRA)" <>
Subject [jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded
Date Mon, 16 Apr 2012 18:58:19 GMT


Mikhail Khludnev commented on SOLR-3360:


Thank you for the feedback. I have several comments on your issue:

# To be honest, I didn't pay much attention to these counters when fixing the threads, and I
didn't assert on them. So there might be a bug in the counters. But the main question is whether
your index is correct. Does it have the expected number of docs? Were all master entities properly
connected to their detail entities? Please let us know what you observe.

# Even if the DIH code were correct, you are adding too many threads. The reason for adding
threads is to get high CPU utilization; if you exceed your I/O limits, you waste CPU time on
contention. Could you start from 2?

# I suppose significant time was spent obtaining JDBC connections. By the way, how many of them
are available in parallel? If you are not happy with how DIH scales, you can check what it spends
its time on. Logs with debug level enabled for DIH are appreciated. You can also take samples
with jconsole, or even run jstack <JVMPID> manually.
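
To make the second point concrete, here is a minimal sketch of the root entity from the quoted db-data-config.xml with threads lowered from 10 to 2. Only the threads value changes; the delta queries, field mappings and nested entities are omitted for brevity and are assumed to stay exactly as in the original config.

```xml
<!-- Sketch only: root entity from the quoted config, threads lowered from 10 to 2. -->
<entity name="indice" rootEntity="true" threads="2"
        transformer="RegexTransformer,TemplateTransformer"
        query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'">
   <!-- delta queries, field mappings and nested entities unchanged from the original config -->
</entity>
```

Comparing the status counters and the resulting document count for 1 and 2 threads should show whether the problem is in the counters only or in the index itself.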

> Problem with DataImportHandler multi-threaded
> ---------------------------------------------
>                 Key: SOLR-3360
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 3.6
>         Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
>            Reporter: Claudio R
> Hi,
> If I use dataimport with 1 thread, I got:
> <lst name="statusMessages">
>    <str name="Total Requests made to DataSource">5001</str>
>    <str name="Total Rows Fetched">1000</str>
>    <str name="Total Documents Skipped">0</str>
>    <str name="Full Dump Started">2012-04-16 11:21:57</str>
>    <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
>    <str name="Committed">2012-04-16 11:23:19</str>
>    <str name="Total Documents Processed">1000</str>
>    <str name="Time taken">0:1:22.390</str>
> </lst>
> If I use dataimport with 10 threads, I got:
> <lst name="statusMessages">
>    <str name="Total Requests made to DataSource">0</str>
>    <str name="Total Rows Fetched">10000</str>
>    <str name="Total Documents Skipped">0</str>
>    <str name="Full Dump Started">2012-04-16 11:31:43</str>
>    <str name="">Indexing completed. Added/Updated: 10000 documents. Deleted 0 documents.</str>
>    <str name="Committed">2012-04-16 11:41:50</str>
>    <str name="Total Documents Processed">10000</str>
>    <str name="Time taken">0:10:7.586</str>
> </lst>
> The configuration with 10 threads took 10 times longer than the configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
>    <dataSource driver="" url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" user="user" password="pass"/>
>       <document>
>          <entity name="indice" rootEntity="true" threads="10" transformer="RegexTransformer,TemplateTransformer"
>                  query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'"
>                  deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in ('${}')"
>                  deltaQuery="select id_indice from indice where status='I' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)"
>                  deletedPkQuery="select id_indice from indice where status='D' and data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
>             <field column="id_indice" name="id_indice" />
>             <field column="a" name="a" />
>             <field column="b" name="b" />
>             <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" query="select categoria, sub_categoria from filtro where indice_id_indice = '${indice.id_indice}'">
>                <field name="filtro_categoria" column="categoria" />
>                <field name="filtro_sub_categoria" column="sub_categoria" />
>                <field name="nv_sub_categoria" column="nv_sub_categoria" template="${filtro.categoria}|${filtro.sub_categoria}" />
>             </entity>
>             <entity name="pagina_relacionada" query="select url from pagina_relacionada where indice_id_indice = '${indice.id_indice}'">
>                <field name="pagina_relacionada_url" column="url" />
>             </entity>
>             <entity name="veja_mais" query="select chamada, url from veja_mais where indice_id_indice = '${indice.id_indice}'">
>                <field name="veja_mais_chamada" column="chamada" />
>                <field name="veja_mais_url" column="url" />
>             </entity>
>             <entity name="video" query="select url from video where indice_id_indice = '${indice.id_indice}'">
>                <field name="video_url" column="url" />
>             </entity>
>             <entity name="galeria" query="select url from galeria where indice_id_indice = '${indice.id_indice}'">
>                <field name="galeria_url" column="url" />
>             </entity>
>          </entity>
>       </document>
> </dataConfig>
> Thanks.


