lucene-dev mailing list archives

From "Mikhail Khludnev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5244) Full Search Result Export
Date Tue, 24 Dec 2013 21:27:50 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856461#comment-13856461 ]

Mikhail Khludnev commented on SOLR-5244:
----------------------------------------

bq. Does it cause any issues with the normal response writer flow?
I don't think so. It hits dedicated handlers, so it's well separated from the regular flow.
bq. More testing of this feature shows
I wonder if you could post the numbers and a profiler stack trace.
How many fields are dumped in your test case?
I have one thought: _BinaryDocValuesImpl.get(int, BytesRef)_ hits _docToOffset_ and then _bytes_
for every given docnum. Assuming that sequential reads are faster than random ones, it makes
sense to buffer an array of offsets first and then walk through it to read _bytes_.
Also, looping over _binaryFieldWriters_ for every doc seems like a columnar performance killer.
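
To illustrate the offset-buffering idea, here is a minimal sketch. It does not use the real Lucene classes: _BatchedBinaryReader_, _fromValues_ and _getBatch_ are hypothetical names, and the two arrays only stand in for _docToOffset_ and _bytes_. It assumes that doc _d_ occupies the byte range from _docToOffset[d]_ to _docToOffset[d + 1]_, which may not match the actual on-disk layout.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a binary doc values reader; not Lucene code. */
final class BatchedBinaryReader {
    private final int[] docToOffset; // assumed layout: length is docCount + 1
    private final byte[] bytes;      // all values concatenated in doc order

    private BatchedBinaryReader(int[] docToOffset, byte[] bytes) {
        this.docToOffset = docToOffset;
        this.bytes = bytes;
    }

    /** Builds the two structures from per-document string values (test helper). */
    static BatchedBinaryReader fromValues(String... values) {
        int[] offsets = new int[values.length + 1];
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            offsets[i + 1] = offsets[i] + encoded[i].length;
        }
        byte[] all = new byte[offsets[values.length]];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(encoded[i], 0, all, offsets[i], encoded[i].length);
        }
        return new BatchedBinaryReader(offsets, all);
    }

    /** Per-document access: every call touches the offsets, then the bytes. */
    byte[] get(int doc) {
        return Arrays.copyOfRange(bytes, docToOffset[doc], docToOffset[doc + 1]);
    }

    /** Batched access: pass 1 reads only offsets, pass 2 reads only bytes. */
    List<byte[]> getBatch(int[] docs) {
        int[] starts = new int[docs.length];
        int[] ends = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {   // pass 1: offsets only
            starts[i] = docToOffset[docs[i]];
            ends[i] = docToOffset[docs[i] + 1];
        }
        List<byte[]> out = new ArrayList<>(docs.length);
        for (int i = 0; i < docs.length; i++) {   // pass 2: bytes only
            out.add(Arrays.copyOfRange(bytes, starts[i], ends[i]));
        }
        return out;
    }
}
```

With docnums supplied in increasing order, both passes move forward through their structure instead of alternating between the two. Whether that actually pays off for disk-based doc values would have to be measured.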

bq. I think we can build segment level caches..
Can you highlight how that differs from the good old FieldCaches (I mean what's produced by FieldCacheImpl.BinaryDocValuesCache)?
bq. I'm shooting to achieve an export rate of 5+ million small records 
It sounds really ambitious to me. My expectation for the average IO rate is 100-200 MB/sec (and
I might be wrong here), so a few million might hit the ceiling.
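
A back-of-envelope sketch of that ceiling follows. It assumes the quoted target means records per second (the quote does not say) and that 1 MB is 10^6 bytes; _ExportBudget_ and its methods are hypothetical names used only for this illustration.

```java
/** Hypothetical helper for a rough IO budget estimate; not part of the patch. */
final class ExportBudget {
    private static final double BYTES_PER_MB = 1_000_000.0; // assumption: decimal megabytes

    /** Bytes that can be spent on each record at the given IO rate and record rate. */
    static double bytesPerRecord(double ioMegabytesPerSec, double recordsPerSec) {
        return ioMegabytesPerSec * BYTES_PER_MB / recordsPerSec;
    }

    /** Highest record rate the given IO rate can sustain for records of this size. */
    static double maxRecordsPerSec(double ioMegabytesPerSec, double bytesPerRecord) {
        return ioMegabytesPerSec * BYTES_PER_MB / bytesPerRecord;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerRecord(100, 5_000_000)); // 20.0
        System.out.println(bytesPerRecord(200, 5_000_000)); // 40.0
    }
}
```

Under these assumptions, 5 million records per second leaves only 20-40 bytes per record, so the target is reachable only if the records really are that small or the data is served from memory rather than disk.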


> Full Search Result Export
> -------------------------
>
>                 Key: SOLR-5244
>                 URL: https://issues.apache.org/jira/browse/SOLR-5244
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 5.0
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: SOLR-5244.patch
>
>
> It would be great if Solr could efficiently export entire search result sets without
> scoring or ranking documents. This would allow external systems to perform rapid bulk imports
> from Solr. It also provides a possible platform for exporting results to support distributed
> join scenarios within Solr.
> This ticket provides a patch that has two pluggable components:
> 1) ExportQParserPlugin: a post filter that gathers a BitSet with document results
> and does not delegate to ranking collectors. Instead it puts the BitSet on the request context.
> 2) BinaryExportWriter: an output writer that iterates over the BitSet and prints the entire
> result as a binary stream. A header is provided at the beginning of the stream so external
> clients can self-configure.
> Note:
> These two components will be sufficient for a non-distributed environment. 
> For distributed export a new Request handler will need to be developed.
> After applying the patch and building the dist or example, you can register the components
> through the following changes to solrconfig.xml:
> Register export contrib libraries:
> <lib dir="../../../dist/" regex="solr-export-\d.*\.jar" />
>  
> Register the "export" queryParser with the following line:
>  
> <queryParser name="export" class="org.apache.solr.export.ExportQParserPlugin"/>
>  
> Register the "xbin" writer:
>  
> <queryResponseWriter name="xbin" class="org.apache.solr.export.BinaryExportWriter"/>
>  
> The following query will perform the export:
> {code}
> http://localhost:8983/solr/collection1/select?q=*:*&fq={!export}&wt=xbin&fl=join_i
> {code}
> Initial patch supports export of four data-types:
> 1) Single value trie int, long and float
> 2) Binary doc values.
> The numerics are currently exported from the FieldCache and the Binary doc values can
> be in memory or on disk.
> Since this is designed to export very large result sets efficiently, stored fields are
> not used for the export.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
