incubator-blur-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Index Warmup in Blur
Date Thu, 03 Oct 2013 17:44:05 GMT
The mutate APIs are a good fit for individual column updates. A BlurCodec
would be cool and would solve a lot of problems.
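
For illustration, a single-column update through the mutate API might look
roughly like the sketch below (the Thrift classes and setters follow the 0.2
API as I understand it, so treat the exact names as assumptions and check the
generated client):

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;
import org.apache.blur.thrift.generated.Column;
import org.apache.blur.thrift.generated.Record;
import org.apache.blur.thrift.generated.RecordMutation;
import org.apache.blur.thrift.generated.RecordMutationType;
import org.apache.blur.thrift.generated.RowMutation;
import org.apache.blur.thrift.generated.RowMutationType;

public class SingleColumnUpdate {
  public static void main(String[] args) throws Exception {
    Iface client = BlurClient.getClient("controller1:40010");

    // Flip a single column on one record without re-sending the rest of the row.
    Record record = new Record();
    record.setRecordId("record-1");
    record.setFamily("fam");
    record.addToColumns(new Column("col2", "1"));

    RecordMutation recordMutation = new RecordMutation();
    recordMutation.setRecordMutationType(RecordMutationType.REPLACE_COLUMNS);
    recordMutation.setRecord(record);

    RowMutation rowMutation = new RowMutation();
    rowMutation.setTable("table1");
    rowMutation.setRowId("row-x");
    rowMutation.setRowMutationType(RowMutationType.UPDATE_ROW);
    rowMutation.addToRecordMutations(recordMutation);

    client.mutate(rowMutation);
  }
}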

There are 3 caveats for such a codec:

1. Scores for affected queries will be wrong until the next segment merge.

2. Responsibility for ordering updates rests with the client.

3. Repeated updates to the same document can either take a generational
approach [Lucene-4258] or keep a single version in external storage
[Redis/TokyoCabinet etc.], pushing the ordering onus onto the client,
depending on how the Codec shapes up.

The former is semantically correct but really sluggish, while the latter is
faster at search time (a toy sketch of both shapes follows below).
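
The sketch, purely illustrative Java with made-up classes (nothing here is
tied to Lucene-4258 or to any real codec):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the two strategies for repeated updates to one field.
public class UpdateStorageSketch {

  // Generational (Lucene-4258 style): every update keeps a generation stamp
  // and a reader resolves the value as of a given generation. Semantically
  // correct, but a lookup may have to walk several generations.
  static class GenerationalStore {
    private final Map<Integer, Deque<long[]>> updates = new HashMap<>(); // docId -> (gen, value), newest first

    void update(int docId, long gen, long value) {
      updates.computeIfAbsent(docId, k -> new ArrayDeque<>()).push(new long[] { gen, value });
    }

    Long lookup(int docId, long asOfGen) {
      Deque<long[]> gens = updates.get(docId);
      if (gens == null) return null;
      for (long[] g : gens) {
        if (g[0] <= asOfGen) return g[1];
      }
      return null;
    }
  }

  // Single-version (Redis/TokyoCabinet style): only the latest value is kept,
  // so lookups are O(1), but correct ordering of writes is the client's job.
  static class LatestValueStore {
    private final Map<Integer, Long> latest = new HashMap<>();

    void update(int docId, long value) { latest.put(docId, value); }

    Long lookup(int docId) { return latest.get(docId); }
  }
}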



On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <amccurry@gmail.com> wrote:

> On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
> > Yeah, you are correct. A BDB file will probably never be ported to HDFS.
> >
> > Our daily update frequency comes to about 20% of insertion rate.
> >
> > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
> >
> > This update could potentially span across tens of thousands of SQL rows
> > in our case, where COL2 is just a boolean flip.
> >
> > The problem is not with Lucene's ability to handle the load. Instead it
> > is with the consistent load it puts on our content servers to read and
> > re-tokenize such huge rows just for a boolean flip. Another big winner is
> > that none of our updatable fields are involved in scoring at all. Just
> > matching will do.
> >
> > The changes also sit in BDB only till the next segment merge, after which
> > it is cleaned out. There is very little perf hit here for us, as users
> > don't immediately search after a change.
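
A rough sketch of that search-time lookup order; the classes here are
entirely made up (this is not the proprietary codec, only the overlay idea of
consulting the sidecar store before the segment's stored data):

import java.util.Map;

// Illustrative only: a value lookup that consults a key-value overlay (the
// "BDB" sidecar) first and falls back to the segment's stored fields.
// Overlay entries disappear once a segment merge rewrites the data.
class OverlayValueReader {
  private final Map<String, String> overlay;      // recordId#column -> updated value
  private final StoredValueReader segmentReader;  // hypothetical reader over the Lucene segment

  OverlayValueReader(Map<String, String> overlay, StoredValueReader segmentReader) {
    this.overlay = overlay;
    this.segmentReader = segmentReader;
  }

  String value(String recordId, String column) {
    String updated = overlay.get(recordId + "#" + column);
    return updated != null ? updated : segmentReader.value(recordId, column);
  }

  interface StoredValueReader {
    String value(String recordId, String column);
  }
}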
> >
> > I am afraid there is no documentation/code/numbers on this currently in
> > public, as it is still proprietary, but it is remarkably similar to the
> > popular RedisCodec.
> >
> > "If you really need partial document updates, there would need to be
> > changes
> > throughout the entire stack"
> >
> > You mean, the entire stack of Blur? In case this is possible, can you
> give
> > me 10000-ft overview of what you have in mind?
> >
>
> Interesting, now that I think about it.  The situation that you describe is
> very interesting. I'm wondering whether, if we came up with something like
> this in Blur, it would fix our large Row issue, or at the very least help
> the problem.
>
> https://issues.apache.org/jira/browse/BLUR-220
>
> Plus, the more I think about it, the mutate methods are probably the right
> implementation for modifying single columns.  So the API of Blur probably
> wouldn't need to be changed, maybe just the way it goes about dealing with
> changes.  I'm thinking maybe we need our own BlurCodec to handle large Rows
> as well as Record (Document) updates.
>
> As an aside, I constantly have to refer to Records as Documents, which is
> why I think we need a rename.
>
> Aaron
>
>
>
>
>
>
> >
> > --
> > Ravi
> >
> >
> > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <amccurry@gmail.com> wrote:
> >
> > > The biggest issue with this is that the shards (the indexes) inside of
> > > Blur actually move from one server to another.  So to support this
> > > behavior all the indexes are stored in HDFS.  Due to the differences
> > > between HDFS and a normal POSIX file system, I highly doubt that the BDB
> > > file format in TokyoCabinet can ever be supported.
> > >
> > > If you really need partial document updates, there would need to be
> > > changes throughout the entire stack.  I am curious why you need this
> > > feature.  Do you have that many updates to the index?  What is the
> > > update frequency?  I'm just curious what kind of performance you get out
> > > of a setup like that, since I haven't ever run such a setup and have no
> > > idea how to compare that kind of system to a base Lucene setup.
> > >
> > > Could you point me to some code or documentation?  I would like to go
> > > and take a look.
> > >
> > > Thanks,
> > > Aaron
> > >
> > >
> > >
> > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <
> > > ravikumar.govindarajan@gmail.com> wrote:
> > >
> > > > One more request for help.
> > > >
> > > > We also maintain a file named "BDB", just like the "Sample" file used
> > > > by Blur for tracing.
> > > >
> > > > This "BDB" file pertains to TokyoCabinet and is used purely for
> > > > supporting partial updates to a document.
> > > > All operations on this file rely on local file-paths only, through the
> > > > use of native code.
> > > > Currently, all update requests are local to the index files and it
> > > > becomes trivial to support.
> > > >
> > > > Any pointers on how to take this forward in a Blur set-up of
> > > > shard-servers & controllers?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <amccurry@gmail.com> wrote:
> > > >
> > > > > You can control the fields to warmup via:
> > > > >
> > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
> > > > >
> > > > > The preCacheCols field.  The comment is wrong however, so I will
> > > > > create a task to correct it.  The use of the field is:
> > > > > "family.column" just like you would search.
> > > > >
> > > > > Aaron
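
For illustration, restricting warmup to a couple of columns when defining a
table might look like this (the TableDescriptor setters are taken from the
0.2 Thrift API and may differ slightly; the table settings are made up):

import java.util.Arrays;

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;
import org.apache.blur.thrift.generated.TableDescriptor;

public class PreCacheColsExample {
  public static void main(String[] args) throws Exception {
    Iface client = BlurClient.getClient("controller1:40010");

    TableDescriptor td = new TableDescriptor();
    td.setName("table1");
    td.setShardCount(8);
    td.setTableUri("hdfs://namenode/blur/tables/table1");

    // Only these columns get warmed into the block cache; large columns such
    // as "doc.attachment-data" are simply left out of the list.
    td.setPreCacheCols(Arrays.asList("fam.col1", "fam.col2"));

    client.createTable(td);
  }
}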
> > > > >
> > > > >
> > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <
> > > > > ravikumar.govindarajan@gmail.com> wrote:
> > > > >
> > > > > > Thanks Aaron
> > > > > >
> > > > > > General sampling and warming is fine and the code is really
> > > > > > concise and clear.
> > > > > >
> > > > > > "The act of reading brings the data into the block cache and the
> > > > > > result is that the index is 'hot'."
> > > > > >
> > > > > > Will all the terms of a field be read and brought into the cache?
> > > > > > If so, then there is an obvious implication: avoid warming up
> > > > > > fields like, say, attachment-data, provided queries don't often
> > > > > > include such fields.
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <amccurry@gmail.com> wrote:
> > > > > >
> > > > > > > Take a look at this package.
> > > > > > >
> > > > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
> > > > > > >
> > > > > > > Basically when the warmup process starts (which is asynchronous
> > > > > > > to the rest of the application) it flips a thread-local switch
> > > > > > > to allow for tracing of the file accesses.  The sampler will
> > > > > > > sample each of the fields in each segment and create a sample
> > > > > > > file that attempts to detect the boundaries of each field within
> > > > > > > each file within each segment.  Then it stores the sample info
> > > > > > > into the directory beside each segment (so that way it doesn't
> > > > > > > have to re-sample the segment).  After the sampling is complete
> > > > > > > or loaded, the warmup just reads the binary data from each file.
> > > > > > > The act of reading brings the data into the block cache and the
> > > > > > > result is that the index is "hot".
> > > > > > >
> > > > > > > Hope this helps.
> > > > > > >
> > > > > > > Aaron
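
A stripped-down sketch of that last step, reading the sampled byte ranges so
the blocks land in the block cache (FieldSample and the sample map are made
up for illustration; the real code lives in the warmup package linked above):

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// For each field, read the byte range recorded by the sampler; the read
// itself is what pulls the underlying blocks into the block cache.
class WarmupSketch {

  static class FieldSample {
    final String fileName;
    final long startOffset;
    final long length;

    FieldSample(String fileName, long startOffset, long length) {
      this.fileName = fileName;
      this.startOffset = startOffset;
      this.length = length;
    }
  }

  static void warm(Directory dir, Map<String, List<FieldSample>> samplesByField) throws IOException {
    byte[] buffer = new byte[8192];
    for (List<FieldSample> samples : samplesByField.values()) {
      for (FieldSample sample : samples) {
        IndexInput input = dir.openInput(sample.fileName, IOContext.READ);
        try {
          input.seek(sample.startOffset);
          long remaining = sample.length;
          while (remaining > 0) {
            int chunk = (int) Math.min(buffer.length, remaining);
            input.readBytes(buffer, 0, chunk);
            remaining -= chunk;
          }
        } finally {
          input.close();
        }
      }
    }
  }
}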
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan <
> > > > > > > ravikumar.govindarajan@gmail.com> wrote:
> > > > > > >
> > > > > > > > As I understand,
> > > > > > > >
> > > > > > > > Lucene will store the files in the following way per segment:
> > > > > > > >
> > > > > > > > TIM file
> > > > > > > >      Field1 ---> Some byte[]
> > > > > > > >      Field2 ---> Some byte[]
> > > > > > > >
> > > > > > > > TIP file
> > > > > > > >      Field1 ---> Some byte[]
> > > > > > > >      Field2 ---> Some byte[]
> > > > > > > >
> > > > > > > >
> > > > > > > > Blur will "sample" this lucene-file in the following
way
> > > > > > > >
> > > > > > > > Field1 --> <TIM, start-offset>, <TIP,
start-offset>, ...
> > > > > > > >
> > > > > > > > Field 2 --> <TIM, start-offset>, <TIP,
start-offset>, ...
> > > > > > > >
> > > > > > > > Is my understanding correct?
> > > > > > > >
> > > > > > > > How does Blur warm-up the fields, when it does not
know the
> > > > > > "end-offset"
> > > > > > > or
> > > > > > > > the "length" for each field to warm.
> > > > > > > >
> > > > > > > > Will it by default read all Terms of a field?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ravi
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
