lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
Date Mon, 08 Dec 2014 14:09:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237881#comment-14237881 ]

Erick Erickson commented on SOLR-6666:
--------------------------------------

And he breaks his promise again. Sorry, came down with something this weekend and got swamped.
This week fer sure (he says again).

> Dynamic copy fields are considering all dynamic fields, causing a significant performance
impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6666
>                 URL: https://issues.apache.org/jira/browse/SOLR-6666
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, update
>         Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 specific
CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive
is not)
>            Reporter: Liram Vardi
>            Assignee: Erick Erickson
>         Attachments: SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, the tests we conducted show an improvement of more than 40 percent in our insertion performance.
> Explanation:
> Using a JVM profiler, we found a CPU bottleneck during the Solr indexing process. The bottleneck is in org.apache.solr.schema.IndexSchema, in the method "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
>     for (DynamicCopy dynamicCopy : dynamicCopyFields) {
>       if (dynamicCopy.matches(sourceField)) {
>         result.add(new CopyField(getField(sourceField), dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
>       }
>     }
>     List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
>     if (null != fixedCopyFields) {
>       result.addAll(fixedCopyFields);
>     }
> {code}
> For a given source field, this function looks up all of its copyFields (that is, all the destinations to which Solr needs to copy this field).
> As you can see, the first part of the procedure is the most “expensive” step (it takes O(n) time, where n is the size of the "dynamicCopyFields" group).
> The next part is just a simple hash lookup, which takes O(1) time.
> Our schema contains more than 500 copyFields, but only 70 of them are "indexed" fields.
> We also have one dynamic field with a wildcard ( * ), which "catches" the rest of the document fields.
> It follows that more than 400 of our copyFields are based on this dynamicField, yet all but one of them are fixed (i.e., they do not contain any wildcard).
> For some reason, the copyField registration procedure treats those 400 fields as "DynamicCopyField" and stores them in the “dynamicCopyFields” array.
> This makes getCopyFieldsList() very expensive (in CPU terms) without any justification: none of those 400 copyFields is a glob, so none of them needs any complex pattern matching against the input field. They can all be stored in "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment, and they are (especially in our case) pretty rare.
> Therefore, we created a patch that fixes this problem by changing the registerCopyField() procedure.
> The tests we conducted show no change in the indexing results. Moreover, the fix still passes the class unit tests (i.e. IndexSchemaTest.java).
>        
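
For illustration only, the idea described in the issue can be sketched roughly as follows. This is not the attached SOLR-6666.patch and not Solr's actual code: the class CopyFieldRegistrySketch and all of its members are hypothetical, destinations are assumed to be literal field names, and glob matching is reduced to a single leading or trailing asterisk. The point is only that sources without a wildcard go into a hash map at registration time, so the per-document scan covers just the (rare) glob sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a copy-field registry (not Solr's IndexSchema). */
class CopyFieldRegistrySketch {
    /** Sources without a wildcard: resolved with one hash lookup. */
    private final Map<String, List<String>> fixedCopyFields = new HashMap<>();
    /** Sources containing '*', stored as {pattern, destination}: scanned on every lookup. */
    private final List<String[]> globCopyFields = new ArrayList<>();

    /** Decide once, at registration time, whether the source needs pattern matching. */
    void registerCopyField(String source, String dest) {
        if (source.contains("*")) {
            globCopyFields.add(new String[] {source, dest});
        } else {
            fixedCopyFields.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
    }

    /** Returns the destinations for a source field: glob matches first, then fixed entries. */
    List<String> getCopyDestinations(String sourceField) {
        List<String> result = new ArrayList<>();
        for (String[] glob : globCopyFields) {          // O(number of glob sources)
            if (matches(glob[0], sourceField)) {
                result.add(glob[1]);
            }
        }
        List<String> fixed = fixedCopyFields.get(sourceField);  // O(1) expected
        if (fixed != null) {
            result.addAll(fixed);
        }
        return result;
    }

    /** Number of entries that still require a scan on each lookup. */
    int globCount() {
        return globCopyFields.size();
    }

    /** Simplified glob (assumption): "*", "*suffix", or "prefix*" only. */
    static boolean matches(String pattern, String name) {
        if (pattern.equals("*")) {
            return true;
        }
        if (pattern.startsWith("*")) {
            return name.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return name.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(name);
    }
}
```

With a registry like this, a schema holding several hundred wildcard-free copy directives and a handful of glob ones would scan only the handful per source field; how closely that mirrors the real patch is an assumption.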



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

