lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2700) transaction logging
Date Thu, 25 Aug 2011 09:30:29 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090886#comment-13090886 ]

Simon Willnauer commented on SOLR-2700:
---------------------------------------

{quote}
Just to get a rough idea of performance, I uploaded one of my CSV test files (765MB, 100M
docs, 7 small string fields per doc).
Time to complete indexing was 42% longer, and the transaction log grew to 1.8GB. The lucene
index was 1.2GB. The log was on the same device, so the main impact may have been disk IO.
{quote}

I think this is far from what we can really do here. I didn't look too closely at the code yet,
but it seems you are doing blocking writes, which might not be ideal here at all. I think what
you can do is allocate the space you need per record and write concurrently on a Channel
(see FileChannel#write(ByteBuffer src, long position)); the same is true for reads
(FileChannel#read(ByteBuffer dst, long position)). All we need to keep in main memory is the
offset and the length of each record to do the realtime get.
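
A rough sketch of the positional write/read idea (the class and its members are made up, not
from the patch): reserving the region per record is a single atomic add, so writers never block
each other, and the (offset, length) pair is all the realtime get has to keep in memory.

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only -- hypothetical class, not from the patch.
public class PositionalLog {
  private final FileChannel channel;
  // next free offset in the file; reserving space is a single CAS, no lock
  private final AtomicLong filePointer = new AtomicLong();

  public PositionalLog(String path) throws IOException {
    this.channel = new RandomAccessFile(path, "rw").getChannel();
  }

  /** Appends a record and returns its start offset for the in-memory map. */
  public long append(ByteBuffer record) throws IOException {
    final int length = record.remaining();
    final long start = filePointer.getAndAdd(length); // reserve the region
    long pos = start;
    while (record.hasRemaining()) { // positional write, safe for concurrent use
      pos += channel.write(record, pos);
    }
    return start;
  }

  /** Reads a record back given its stored offset and length (realtime get). */
  public ByteBuffer read(long offset, int length) throws IOException {
    final ByteBuffer dst = ByteBuffer.allocate(length);
    while (dst.hasRemaining()) {
      final int n = channel.read(dst, offset + dst.position());
      if (n < 0) throw new IOException("unexpected end of log");
    }
    dst.flip();
    return dst;
  }
}
{code}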

To take that one step further, it might be good to keep the already serialized data around if
possible: if the binary update format is used, can we piggyback the bytes on the SolrInputDocument
somehow? If not, I think we should use a faster hand-written serialization instead of Java
serialization, which is proven to be freaking slow.
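
For the hand-written serialization, even something as dumb as this (the record layout is invented
for the example) avoids the class descriptors and reflection that make ObjectOutputStream slow:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;

// Sketch only: a fixed, hand-rolled record layout instead of Java serialization.
// Hypothetical layout: [numFields:int] then per field [name:UTF][value:UTF].
public class RecordSerializer {
  public static byte[] serialize(Map<String, String> fields) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(fields.size());
    for (Map.Entry<String, String> field : fields.entrySet()) {
      out.writeUTF(field.getKey());   // no class descriptors, no reflection,
      out.writeUTF(field.getValue()); // unlike java.io.ObjectOutputStream
    }
    out.flush();
    return bytes.toByteArray();
  }
}
{code}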

Another, totally different idea for the RT get is to spend more time on a RAM reader that is
capable of doing exact seeks on the BytesRefHash we use anyway. I don't think this is too far
away, since the biggest problem here is providing an efficiently sorted dictionary. Maybe this
should be a long-term goal for the RT get feature.
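
To illustrate why the exact-seek half is the easy part (a sketch against the BytesRefHash API;
only the exact-match lookup is shown, the sorted dictionary is the missing piece):

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

public class ExactSeekSketch {
  public static void main(String[] args) {
    BytesRefHash terms = new BytesRefHash();
    terms.add(new BytesRef("apache"));
    terms.add(new BytesRef("lucene"));

    // exact seek: a hash lookup, which is all the realtime get needs
    int id = terms.find(new BytesRef("lucene"));
    if (id >= 0) {
      BytesRef spare = new BytesRef();
      terms.get(id, spare); // read the term bytes back by id
    }
    // ordered iteration (a TermEnum-style seek-ceil) would need a sorted
    // view of the hash -- that's the "efficiently sorted dictionary" problem
  }
}
{code}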

Since we are already doing write-behind here, we could also try to use some compression,
especially if the source data is large. I'm not sure that will pay off though, since we are not
keeping the logs around forever.
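
A per-record deflate (just java.util.zip, nothing fancy) would be trivial to try; whether the
CPU cost pays off for short-lived logs is the open question:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class LogCompression {
  // Sketch: compress a serialized record before appending it to the log.
  public static byte[] compress(byte[] record) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // BEST_SPEED since the logs are short-lived and IO, not size, is the target
    DeflaterOutputStream out =
        new DeflaterOutputStream(bytes, new Deflater(Deflater.BEST_SPEED));
    out.write(record);
    out.finish(); // flush the remaining compressed bits
    return bytes.toByteArray();
  }
}
{code}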

Eventually I think this should be a feature that lives outside of Solr, since many Lucene
applications could make use of it. ElasticSearch, for instance, has pretty similar features,
which could be adapted into something like a DurableIndexWriter wrapper.
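
Roughly what I have in mind (DurableIndexWriter and the TransactionLog interface are made up
here; only IndexWriter is real):

{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch of the wrapper idea -- log first, then index, so that after a crash
// the log can be replayed from the last successful commit.
public class DurableIndexWriter {

  // hypothetical append-only log abstraction, for the sketch only
  public interface TransactionLog {
    void append(Document doc) throws IOException;
    void truncate() throws IOException; // drop entries covered by a commit
  }

  private final IndexWriter writer;
  private final TransactionLog log;

  public DurableIndexWriter(IndexWriter writer, TransactionLog log) {
    this.writer = writer;
    this.log = log;
  }

  public void addDocument(Document doc) throws IOException {
    log.append(doc);         // durable first...
    writer.addDocument(doc); // ...then the index
  }

  public void commit() throws IOException {
    writer.commit();
    log.truncate(); // everything up to this commit is safe in the index
  }
}
{code}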

> transaction logging
> -------------------
>
>                 Key: SOLR-2700
>                 URL: https://issues.apache.org/jira/browse/SOLR-2700
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>         Attachments: SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch
>
>
> A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
