incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Block Cache V2 Status?
Date Wed, 02 Oct 2013 00:50:02 GMT
Sorry for my slow response.  I wanted to wait until I had time to write a
full response (and not say that I would and forget again).

Let me tell you about the use case that doesn't work in the current block
cache implementation, which is what sparked the v2 work.

The cluster that I run Blur on has a lot of servers, and the servers have a
lot of RAM.  We of course use the block cache to make searching faster, but
we also use filters for a lot of the searches.  We run a custom filter
cache (extending o.a.b.manager.BlurFilterCache) that uses an off heap
bitset filter.  It is off heap for the same reason the block cache is off
heap: to reduce GC times.  The problem is that the bitset filter sizes
change with the number of records (Lucene documents), and we regularly add
large amounts of records/rows.  The block cache is a fixed size, so if I
allow for 80GB of off heap memory, I might give 75GB to the block cache and
leave 5GB for the bitset filters.  That means I have to manage how many
bitsets (by size) can reside in that 5GB against an ever-increasing number
of records/rows.  While this is possible, I was thinking it would be nice
to just have the block cache handle it.  Plus, I could then write each
bitset to the directory as a file (one per segment) and only reload it when
needed (after a restart, shard movement, etc).
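
To give a rough idea of what I mean by an off heap bitset, here is a
minimal sketch (the class is illustrative only; the real filter extends
BlurFilterCache and is more involved):

import java.nio.ByteBuffer;

// Illustrative sketch only -- the bits live in a direct buffer,
// outside the Java heap and out of the GC's way.
public class OffHeapBitSet {

  private final ByteBuffer bits;

  public OffHeapBitSet(int numBits) {
    bits = ByteBuffer.allocateDirect((numBits + 7) / 8);
  }

  public void set(int index) {
    int b = index >>> 3;
    bits.put(b, (byte) (bits.get(b) | (1 << (index & 7))));
  }

  public boolean get(int index) {
    return (bits.get(index >>> 3) & (1 << (index & 7))) != 0;
  }
}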

The problem is that the current block cache uses a fixed block size (8K by
default).  That means each file is logically broken up into 8K blocks, and
those fixed-size blocks are what get loaded into and evicted from the
cache.  While this works well for the random access nature of Lucene
searches, it does not work well for sequentially sweeping through the
bitset file (or any other file, for that matter).  In all of my micro
benchmarks, a "next" call on a bitset with 10% cardinality, or having to
"advance" in the bitset, was over 400% slower than a normal OpenBitSet
implementation.  I tried adding links in the LRU value objects pointing to
the next block, to get a linked-list effect, but that had almost no
effect.  Still very slow.
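
To make the per-read cost concrete, here is roughly what every byte read
does against a fixed-block cache (all names here are made up, and v1
actually keeps the block data off heap in slabs):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a fixed 8K block cache read path.
class BlockCachedReader {
  private static final int BLOCK_SHIFT = 13;               // 2^13 = 8192, the default block size
  private static final long BLOCK_MASK = (1L << BLOCK_SHIFT) - 1;

  private final Map<Long, byte[]> cache = new ConcurrentHashMap<Long, byte[]>();

  byte readByte(long position) {
    long blockId = position >>> BLOCK_SHIFT;               // which 8K block holds this byte
    int offset = (int) (position & BLOCK_MASK);            // offset inside that block
    byte[] block = cache.get(blockId);                     // a cache probe on every block crossing
    if (block == null) {
      block = loadBlock(blockId);                          // miss: fetch 8K from the directory
      cache.put(blockId, block);
    }
    return block[offset];
  }

  private byte[] loadBlock(long blockId) {
    return new byte[1 << BLOCK_SHIFT];                     // stand-in for the real HDFS read
  }
}

A sequential sweep through a large bitset pays that map probe (and
potential miss) on every 8K boundary, while OpenBitSet just walks a plain
long[] that is already in memory.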

Plus, I want to be able to support sorting at some point, and in Lucene 4.x
there is a file-based sort that would fit very nicely into the block cache
as well, if the access pattern worked out.

So in short, we need to support variable-length block sizes based on the
file or the usage pattern.  For some files maybe 4K is better, perhaps with
smaller segments; for other files, maybe 256K.  For files like the bitset
files, or the files used for sorting, we could map the whole file in as a
single block (or up to some max size in the 100s of MB?).  I also think it
would be really awesome to run a tuning program against an HDFS cluster to
see which block size is best.  The program could create a large file in the
GB range and perform random accesses on it with buffers ranging from 1K to
512K, to see which size provides the most throughput at the lowest
latency.  We could then use that size to configure the block cache to be
most effective.
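
A rough sketch of that tuning program against HDFS (the path, file size,
and read count are just placeholders):

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeTuner {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/tmp/block-size-tuning.dat");
    long fileLength = 1L * 1024 * 1024 * 1024;             // 1GB test file

    // Write the test file once.
    FSDataOutputStream out = fs.create(path, true);
    byte[] chunk = new byte[64 * 1024];
    for (long written = 0; written < fileLength; written += chunk.length) {
      out.write(chunk);
    }
    out.close();

    // Time random positioned reads at each candidate block size, 1K to 512K.
    Random random = new Random(1);
    for (int blockSize = 1024; blockSize <= 512 * 1024; blockSize *= 2) {
      byte[] buffer = new byte[blockSize];
      FSDataInputStream in = fs.open(path);
      int reads = 1000;
      long start = System.nanoTime();
      for (int i = 0; i < reads; i++) {
        long pos = (long) (random.nextDouble() * (fileLength - blockSize));
        in.readFully(pos, buffer);                         // positioned read of one block
      }
      long elapsed = System.nanoTime() - start;
      in.close();
      double mbPerSec = reads * (double) blockSize / (1024 * 1024) / (elapsed / 1e9);
      System.out.println(blockSize + " byte blocks: " + (elapsed / reads / 1000)
          + " us/read, " + mbPerSec + " MB/s");
    }
  }
}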

So the design that's a work in progress in the v2 code uses the same
ConcurrentLinkedHashMap (LRU).  However, this time around there are no
slabs of memory, just the key and value in the LRU.  The LRU is now
weighted (
https://code.google.com/p/concurrentlinkedhashmap/wiki/Design -> Weighted
Values) so that each block counts for its own size.  This means we can tell
the LRU that it has an 80GB capacity and that each 8K block has a weight of
8192.  That makes configuration a lot easier, and it also gives us
variable-length block sizes within the same LRU.
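
The setup looks roughly like this (CacheValue stands in for the real v2
value type):

import com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap;
import com.googlecode.concurrentlinkedhashmap.Weigher;

public class WeightedCacheSetup {

  // Placeholder for the real v2 value type.
  interface CacheValue {
    int size(); // length of the block in bytes
  }

  public static void main(String[] args) {
    ConcurrentLinkedHashMap<String, CacheValue> cache =
        new ConcurrentLinkedHashMap.Builder<String, CacheValue>()
            .maximumWeightedCapacity(80L * 1024 * 1024 * 1024) // the whole 80GB budget
            .weigher(new Weigher<CacheValue>() {
              @Override
              public int weightOf(CacheValue value) {
                return value.size(); // an 8K block weighs 8192; a whole-file bitset weighs its full size
              }
            })
            .build();
    // Eviction now triggers when the summed byte weights exceed the
    // capacity, no matter how many entries there are or how sizes vary.
  }
}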

The biggest issue with this design is that each value in the LRU (the block
data) is its own memory allocation.  I started out trying to use a
ByteBuffer like in the current design, but as the number of entries
increased, the performance overhead became too great.  The problem is that
a direct ByteBuffer carries a couple of internal objects, one of which is a
Cleaner object used to deallocate the native memory when the ByteBuffer is
GCed.  With millions of these objects hanging around on the heap, GC was a
real issue.  Also, the ByteBuffer allocates one extra page of memory beyond
what is asked for, so for an 8K block and the normal page size of 4K that's
50% waste in every entry.  Not good.

So, for ease of testing, the value in the LRU is an interface with some
basic methods: read, write, size, etc.  I did this so that I could test an
on heap implementation that is byte[] based and an off heap implementation
that uses Unsafe for direct memory access.  One interesting performance
gain with the v2 approach is that there is no memory copying during a cache
hit.  In the current implementation (v1), when a hit is made the block is
copied from the off heap memory into the local buffer in the IndexInput.
That is very expensive when all you need to read is a single byte, or an
int, etc.  So in v2 the cached value is referenced directly by the
IndexInput as its current buffer.  It can do this safely by implementing
reference counting and a deallocation method (only used in the Unsafe
impl).
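
Something along these lines (a sketch of the idea, not the committed code;
the interface and class names here are made up):

import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicInteger;
import sun.misc.Unsafe;

// Hypothetical names; the real v2 interface may differ.
interface CacheBlock {
  byte read(int position);
  void write(int position, byte b);
  int size();
  void incRef();
  void decRef(); // frees the native memory when the count hits zero (Unsafe impl only)
}

class UnsafeCacheBlock implements CacheBlock {
  private static final Unsafe UNSAFE = getUnsafe();

  private final long address; // raw native pointer, invisible to the GC
  private final int size;
  private final AtomicInteger refCount = new AtomicInteger(1);

  UnsafeCacheBlock(int size) {
    this.size = size;
    this.address = UNSAFE.allocateMemory(size); // no Cleaner object, no extra page
  }

  public byte read(int position) { return UNSAFE.getByte(address + position); }
  public void write(int position, byte b) { UNSAFE.putByte(address + position, b); }
  public int size() { return size; }
  public void incRef() { refCount.incrementAndGet(); }

  public void decRef() {
    if (refCount.decrementAndGet() == 0) {
      UNSAFE.freeMemory(address); // explicit deallocation instead of waiting on the GC
    }
  }

  private static Unsafe getUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

An IndexInput can incRef a block while using it as its buffer and decRef
when it moves on, so the memory is freed as soon as the LRU has evicted it
and no reader still holds it.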

The next question is the state of the committed code.  All the tests pass,
but a large part of the code is disabled because of a bug I haven't tracked
down yet.

https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-store/src/test/java/org/apache/blur/store/CacheDirectoryTestSuite.java;h=2085bd0b044470b4e1122c5949e634429c58012c;hb=apache-blur-0.2#l55

Line 55 needs to return true and all the tests need to pass.

Once they do, I will perform a burn-in test on a cluster to validate that
there are no issues.  After that I would like to move to make it the
default implementation.

I hope this helps explain the reasons behind the new block cache and how
it's being designed.  Sorry for not posting about it earlier; I got
distracted by the release of 0.2.0.

Aaron

On Tue, Oct 1, 2013 at 12:24 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi Aaron,
>
> What is the status of the new implementation? I couldn't find any
> discussion on this new implementation and its design. If I missed a
> particular email thread which discussed the goals of V2 can you please
> point me to that thread, if not can you elaborate your thoughts, goals and
> what needs to be done? Thank You.
>
> - Rahul
>
