incubator-blur-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Block-Cache and usage
Date Sun, 23 Mar 2014 14:54:51 GMT
>
> No.  Typically the hit to miss ratio is very high; it's a metric that is
> recorded in Blur


This is such a handy feature. Thanks for providing such detailed metrics.

Just to add to the benefits of the block-cache: I just found out that
readFully, and the synchronized seek+read path, on FSDataInputStream run
entirely inside a synchronized method in Hadoop, which could limit
throughput/QPS when multiple IndexInputs are open on the same Lucene file.

Block-cache should shine in such scenarios...
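A minimal sketch of the contention described above, with hypothetical classes (this is not Hadoop's actual implementation): when every positioned read goes through one synchronized method on a shared stream, readers serialize, so at most one is ever inside the read at a time:

```java
// Sketch: why a single synchronized seek+read path serializes readers.
// SharedStream is a stand-in for one FSDataInputStream shared by many
// IndexInputs; it is NOT Hadoop's real code.
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedStreamDemo {
    static class SharedStream {
        private long pos;
        private final AtomicInteger current = new AtomicInteger();
        private final AtomicInteger maxConcurrent = new AtomicInteger();

        // All readers contend on this one lock, as with a shared
        // synchronized seek+read method.
        synchronized void readFully(long offset, byte[] buf) {
            int now = current.incrementAndGet();
            maxConcurrent.accumulateAndGet(now, Math::max);
            pos = offset;                         // "seek"
            Arrays.fill(buf, (byte) 1);           // "read"
            try { Thread.sleep(2); } catch (InterruptedException ignored) {}
            current.decrementAndGet();
        }

        int observedMaxConcurrency() { return maxConcurrent.get(); }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedStream stream = new SharedStream();
        Thread[] readers = new Thread[4];
        for (int i = 0; i < readers.length; i++) {
            final long base = i * 8192L;
            readers[i] = new Thread(() -> {
                for (int j = 0; j < 10; j++) {
                    stream.readFully(base + j, new byte[8192]);
                }
            });
            readers[i].start();
        }
        for (Thread t : readers) t.join();
        // The synchronized method admits one reader at a time, so the
        // observed concurrency never exceeds 1 regardless of thread count.
        System.out.println("max concurrent readers: " + stream.observedMaxConcurrency());
    }
}
```

A block-cache sidesteps this lock entirely for hot blocks, which is the scenario above.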

Thanks a lot for your inputs.

--
Ravi


On Thu, Mar 20, 2014 at 5:45 PM, Aaron McCurry <amccurry@gmail.com> wrote:

> On Wed, Mar 19, 2014 at 1:57 PM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
> > One obvious case is a cache-hit scenario, where instead of using the
> > block-cache, there is a fairly heavy round-trip to the data-node. It is
> > also highly likely that the data-node might have evicted the hot pages
> > due to other active reads.
>
>
> Or writes.  The normal behavior of the Linux filesystem cache is to cache
> newly written data and evict the oldest data from memory.  So during merges
> (or any other writes from other Hadoop processes) the Linux filesystem cache
> will evict pages that you might be using.
>
>
> >
>
>
> > How many cache hits happen in Blur? Would I be correct in saying that
> > only repeated terms occurring in searches will benefit from the block-cache?
> >
>
> No.  Typically the hit to miss ratio is very high; it's a metric that is
> recorded in Blur (you can access it via the Blur shell by running the top
> command).  It's not unusual to see hits in the 5000-10000/s range with a
> block size of 64KB and misses occurring at the same time between 10-20/s.
> This has a lot to do with how Lucene stores its indexes; they are highly
> compressed files (although not compressed with a generic compression
> scheme).
>
>
> Let me know if you have any other questions.
>
> Aaron
>
> >
> > --
> > Ravi
> >
> >
> > On Wed, Mar 19, 2014 at 11:06 PM, Ravikumar Govindarajan <
> > ravikumar.govindarajan@gmail.com> wrote:
> >
> > > I was looking at the block-cache code and trying to understand why we
> > > need it.
> > >
> > > We divide the file into blocks of 8KB and write to Hadoop. While
> > > reading, we only read in batches of 8KB and store them in the block-cache.
> > >
> > > This is a form of read-ahead caching on the client side [shard-server].
> > > Am I correct in my understanding?
> > >
> > > Recent releases of Hadoop have a notion of read-ahead caching in the
> > > data-node itself. The default value is 4MB, but I believe it can also
> > > be configured to whatever is needed.
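For reference, the data-node read-ahead being described is, to my understanding of recent Hadoop releases, controlled by the `dfs.datanode.readahead.bytes` property in hdfs-site.xml; the snippet below shows the documented 4 MB default:

```xml
<property>
  <name>dfs.datanode.readahead.bytes</name>
  <!-- Default is 4194304 (4 MB); adjust to the workload's access pattern -->
  <value>4194304</value>
</property>
```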
> > >
> > > What are the advantages of a block-cache vis-a-vis data-node read-ahead
> > > cache?
> > >
> > > I am also not familiar enough with the Hadoop IO sub-system to know
> > > whether it's correct and performant to do read-aheads in data-nodes
> > > for a use-case like Lucene.
> > >
> > > Can someone help me?
> > >
> > > --
> > > Ravi
> > >
> > >
> > >
> >
>
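The block-cache read path discussed in this thread can be sketched roughly as follows. This is a hypothetical, simplified model for illustration only (Blur's actual cache uses off-heap slabs and real eviction): file positions map to fixed 8KB blocks, hot blocks are served from memory, misses fall through to the backing store, and hits/misses are counted the way Blur's metrics report them:

```java
// Simplified model of a client-side (shard-server) block cache:
// fixed 8KB blocks, in-memory hits, fall-through on miss.
// Hypothetical sketch -- NOT Blur's real implementation.
import java.util.HashMap;
import java.util.Map;

public class BlockCacheSketch {
    static final int BLOCK_SIZE = 8 * 1024;

    interface BlockSource {                 // stand-in for the HDFS stream
        byte[] readBlock(long blockId);
    }

    private final Map<Long, byte[]> cache = new HashMap<>();
    private final BlockSource source;
    long hits, misses;

    BlockCacheSketch(BlockSource source) { this.source = source; }

    byte read(long filePos) {
        long blockId = filePos / BLOCK_SIZE;        // which 8KB block
        int offset = (int) (filePos % BLOCK_SIZE);  // offset inside it
        byte[] block = cache.get(blockId);
        if (block == null) {
            misses++;
            block = source.readBlock(blockId);      // round-trip to data-node
            cache.put(blockId, block);
        } else {
            hits++;                                 // no lock, no network
        }
        return block[offset];
    }

    public static void main(String[] args) {
        BlockCacheSketch c = new BlockCacheSketch(id -> new byte[BLOCK_SIZE]);
        for (int i = 0; i < 10000; i++) {
            c.read(i);                 // sequential scan over two 8KB blocks
        }
        // Only the first touch of each block misses; everything else hits,
        // which matches the very high hit-to-miss ratios quoted above.
        System.out.println("hits=" + c.hits + " misses=" + c.misses);
    }
}
```

Because hits never touch the underlying FSDataInputStream, they also avoid the synchronized seek+read path mentioned earlier in the thread.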
