lucene-dev mailing list archives

From "Thomas Champagne (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-5855) Increasing solr highlight performance with caching
Date Wed, 01 Apr 2015 14:17:53 GMT

     [ https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Champagne updated SOLR-5855:
-----------------------------------
    Attachment: SOLR-5855-without-cache.patch

I created a patch with the two optimizations, based on branch_5x.

This patch doesn't use a cache. Instead, I moved the call to searcher.getIndexReader().getTermVectors(docId) before the loop over the fields, so the term vectors are fetched only once per document.

The FastVectorHighlighter path doesn't benefit from the change yet, but I think it will be possible to change this.
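The effect of hoisting the lookup can be illustrated with a toy model. This is only a sketch of the access pattern; the class, method, and field names below are illustrative stand-ins, not Solr's actual API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the optimization: hoist the expensive per-document
// lookup out of the per-field loop.
public class HoistTermVectors {
    static final AtomicInteger lookups = new AtomicInteger();

    // Stand-in for searcher.getIndexReader().getTermVectors(docId)
    static String getTermVectors(int docId) {
        lookups.incrementAndGet();
        return "vectors-for-doc-" + docId;
    }

    public static void main(String[] args) {
        List<Integer> docIds = List.of(1, 2, 3);
        List<String> fields = List.of("title", "body", "summary");

        // Before: one lookup per (doc, field) pair -> docs * fields calls
        lookups.set(0);
        for (int docId : docIds)
            for (String field : fields)
                getTermVectors(docId);          // fetched redundantly per field
        System.out.println("per-field lookups: " + lookups.get()); // 9

        // After: one lookup per document, reused for every field -> docs calls
        lookups.set(0);
        for (int docId : docIds) {
            String vectors = getTermVectors(docId); // fetched once per doc
            for (String field : fields) {
                // highlighting of (vectors, field) would happen here
            }
        }
        System.out.println("per-doc lookups: " + lookups.get()); // 3
    }
}
```

With 20 docs and 80 fields, as in the unit test below, this drops the number of getTermVectors calls from 1600 to 20 per query.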

The patch also adds a unit test for this feature: it creates 20 docs with 80 fields (half null, half with a value) and runs 10 queries with hl.fl=*.
Running the test without the patch: ~6 sec
Running the test with the patch: ~3 sec

Tell me your opinion about this small patch. I think it is simpler than the patch with caching.

> Increasing solr highlight performance with caching
> --------------------------------------------------
>
>                 Key: SOLR-5855
>                 URL: https://issues.apache.org/jira/browse/SOLR-5855
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: Trunk
>            Reporter: Daniel Debray
>             Fix For: Trunk
>
>         Attachments: SOLR-5855-without-cache.patch, highlight.patch
>
>
> Hi folks,
> while investigating possible performance bottlenecks in the highlight component I discovered two places where we can save some CPU cycles.
> Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
> First, in method doHighlighting (lines 411-417):
> In the loop we try to highlight, on each document, every field that has been resolved from the params. OK, but why not skip the fields that are not present on the current document?

> So I changed the code from:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if( useFastVectorHighlighter( params, schema, fieldName ) )
>     doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>   else
>     doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
> }
> to:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if (doc.get(fieldName) != null) {
>     if( useFastVectorHighlighter( params, schema, fieldName ) )
>       doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>     else
>       doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>   }
> }
> The second place is where we try to retrieve the TokenStream from the document for a specific field.
> Line 472:
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
> where:
> public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
>   Fields vectors = reader.getTermVectors(docId);
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> Keep in mind that we currently hit the IndexReader n times, where n = requested rows (documents) * requested number of highlight fields.
> In my use case, reader.getTermVectors(docId) takes around 150,000-250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.
> If we store the returned Fields vectors in a cache, these lookups only take ~25,000 ns.
> I would suggest something like the following code in the doHighlightingByHighlighter method in the DefaultSolrHighlighter class (line 472):
> Fields vectors = null;
> SolrCache termVectorCache = searcher.getCache("termVectorCache");
> if (termVectorCache != null) {
>   vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
>   if (vectors == null) {
>     vectors = searcher.getIndexReader().getTermVectors(docId);
>     if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
>   } 
> } else {
>   vectors = searcher.getIndexReader().getTermVectors(docId);
> }
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);
> and in the TokenSources class:
> public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> 4000 ms on 1000 docs without cache
> 639 ms on 1000 docs with cache
> 102 ms on 30 docs without cache
> 22 ms on 30 docs with cache
> measured on an index with 190,000 docs, a numFound of 32,000, and 80 different highlight fields.
> I think queries that highlight only one field per document do not benefit that much from a cache like this; that's why I think an optional cache would be the best solution here.
> As far as I can see, the FastVectorHighlighter uses more or less the same approach and could also benefit from this cache.
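Since the suggested code looks the cache up by name via searcher.getCache("termVectorCache"), making it optional would presumably mean declaring it as a user cache in solrconfig.xml. A hypothetical entry might look like the following; the cache name must match the lookup, and the class and size values are illustrative, not part of any patch:

```xml
<query>
  <!-- Hypothetical optional cache for term vectors; omit it to disable.
       The name must match searcher.getCache("termVectorCache"). -->
  <cache name="termVectorCache"
         class="solr.LRUCache"
         size="4096"
         initialSize="1024"
         autowarmCount="0"/>
</query>
```

With this shape, deployments that highlight only one field per document could simply leave the cache undeclared, and the code above falls through to the direct getTermVectors call.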



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
