lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5244) Full Search Result Export
Date Tue, 24 Dec 2013 14:41:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856335#comment-13856335 ]

Joel Bernstein commented on SOLR-5244:
--------------------------------------

More testing of this feature shows the real challenge will be the performance of exporting string
fields. Right now the docId->BytesRef lookup is way too slow to be interesting on a large
scale, even with in-memory docValues. This must be due to the compression on the docValues.

To get this working we'll need to have faster memory caches in place. I think we can build
segment-level caches at commit time by caching the top X terms in a particular field based
on docFrequency. The cache would be a read-only ord-to-BytesRef map (hppc IntObjectOpenHashMap),
which should be able to perform in the neighborhood of 10 million lookups per second. The in-memory
docId->BytesRef lookup performs at less than 1 million records per second.
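
A minimal sketch of the cache idea above, not code from the patch. It assumes the terms
and document frequencies of one segment are already available as arrays indexed by ord,
and it uses java.util.HashMap and byte[] as stand-ins for hppc IntObjectOpenHashMap and
BytesRef so the example needs no external jars. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: read-only cache of the top-X terms of one segment, keyed by ord. */
final class TopTermCache {
    // Stand-in for hppc IntObjectOpenHashMap<BytesRef>; never modified after build().
    private final Map<Integer, byte[]> ordToBytes;

    private TopTermCache(Map<Integer, byte[]> ordToBytes) {
        this.ordToBytes = ordToBytes;
    }

    /**
     * Builds the cache from per-segment data (assumed inputs):
     * terms[ord] is the term for that ord, docFreqs[ord] its document frequency.
     */
    static TopTermCache build(String[] terms, int[] docFreqs, int topX) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < terms.length; ord++) {
            ords.add(ord);
        }
        // Highest document frequency first; the sort is stable, so ties keep ord order.
        ords.sort(Comparator.comparingInt((Integer ord) -> docFreqs[ord]).reversed());

        Map<Integer, byte[]> cache = new HashMap<>();
        int limit = Math.min(topX, ords.size());
        for (int i = 0; i < limit; i++) {
            int ord = ords.get(i);
            cache.put(ord, terms[ord].getBytes(StandardCharsets.UTF_8));
        }
        return new TopTermCache(cache);
    }

    /** Returns the cached bytes, or null on a miss (caller falls back to the docValues lookup). */
    byte[] get(int ord) {
        return ordToBytes.get(ord);
    }

    int size() {
        return ordToBytes.size();
    }
}
```

In this sketch a miss returns null, so the exporter would still need the slower
docValues path for terms outside the top X.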

I think if we also move to a threaded approach we'll be able to increase throughput.

I'm shooting to achieve an export rate of 5+ million small records per second from a single
server. This would scale linearly with the number of servers, so a cluster of 100 servers could
export 500+ million small records per second.

 

> Full Search Result Export
> -------------------------
>
>                 Key: SOLR-5244
>                 URL: https://issues.apache.org/jira/browse/SOLR-5244
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 5.0
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 5.0, 4.7
>
>         Attachments: SOLR-5244.patch
>
>
> It would be great if Solr could efficiently export entire search result sets without
> scoring or ranking documents. This would allow external systems to perform rapid bulk imports
> from Solr. It also provides a possible platform for exporting results to support distributed
> join scenarios within Solr.
> This ticket provides a patch that has two pluggable components:
> 1) ExportQParserPlugin: a post filter that gathers a BitSet with document results
> and does not delegate to ranking collectors. Instead it puts the BitSet on the request context.
> 2) BinaryExportWriter: an output writer that iterates the BitSet and prints the entire
> result as a binary stream. A header is provided at the beginning of the stream so external
> clients can self-configure.
> Note:
> These two components will be sufficient for a non-distributed environment. 
> For distributed export a new Request handler will need to be developed.
> After applying the patch and building the dist or example, you can register the components
> through the following changes to solrconfig.xml:
> Register export contrib libraries:
> <lib dir="../../../dist/" regex="solr-export-\d.*\.jar" />
>  
> Register the "export" queryParser with the following line:
>  
> <queryParser name="export" class="org.apache.solr.export.ExportQParserPlugin"/>
>  
> Register the "xbin" writer:
>  
> <queryResponseWriter name="xbin" class="org.apache.solr.export.BinaryExportWriter"/>
>  
> The following query will perform the export:
> {code}
> http://localhost:8983/solr/collection1/select?q=*:*&fq={!export}&wt=xbin&fl=join_i
> {code}
> Initial patch supports export of four data-types:
> 1) Single value trie int, long and float
> 2) Binary doc values.
> The numerics are currently exported from the FieldCache and the Binary doc values can
> be in memory or on disk.
> Since this is designed to export very large result sets efficiently, stored fields are
> not used for the export.
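
The writer step in the quoted description (iterate the BitSet, emit a header, then the
values) can be illustrated roughly as follows. This is only a sketch under assumptions,
not the BinaryExportWriter from the patch: java.util.BitSet stands in for the BitSet on
the request context, a plain int[] indexed by docId stands in for the FieldCache values,
and the header layout (field name followed by a match count) is invented for illustration,
since the actual header format is not described here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

/** Hypothetical sketch of exporting one int field for every document set in a BitSet. */
final class BitSetExportSketch {

    /**
     * Writes an assumed header (field name, number of matches) and then one int per
     * matching document, in docId order. No scoring or ranking is involved.
     */
    static byte[] export(BitSet matches, int[] valuesByDocId, String field) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(field);                  // header: field name (assumed layout)
            out.writeInt(matches.cardinality());  // header: number of records that follow
            for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
                out.writeInt(valuesByDocId[doc]); // value looked up by docId, not from stored fields
            }
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked to keep the sketch simple.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A client reading such a stream would parse the header first and then know how many
fixed-width records to expect, which is one way to read "self-configure" above.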



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


