incubator-blur-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Index Warmup in Blur
Date Wed, 09 Oct 2013 05:57:44 GMT
Yes, I think bringing a mutable file into the lucene-index brings its own
set of problems to handle. Filters, caches, scoring, snapshots/commits
etc. will all be affected.

There is a JIRA on writing generations of updatable files, just like
doc-deletes, instead of over-writing a single file
[https://issues.apache.org/jira/browse/LUCENE-4258]. But that is still
in-progress and, from what I understand, it could slow searches
considerably.
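
For context, the constraint behind this whole thread: stock Lucene (4.x
era) has no partial update, so the only "update" is delete-and-reinsert of
the whole document, and a one-field change means re-supplying and
re-tokenizing every field. A minimal sketch of that constraint (the "id"
and "col2" field names are made up):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class WholeDocUpdate {
      // Flip one column by rewriting the entire document (delete + reinsert).
      static void flipFlag(IndexWriter writer, String rowId) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", rowId, Field.Store.YES));
        doc.add(new StringField("col2", "1", Field.Store.YES)); // the changed field
        // ...every unchanged field must be re-added here as well...
        writer.updateDocument(new Term("id", rowId), doc);
      }
    }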

BTW, is it possible to extend BlurPartitioner and load it during start-up?

Also, it would be awesome if Blur supports a per-row auto-complete feature.

--
Ravi


On Sat, Oct 5, 2013 at 2:01 AM, Aaron McCurry <amccurry@gmail.com> wrote:

> I have thought of one possible problem with this approach.  To date the
> mindset I have used in all of the Blur internals is that segments are
> immutable.  This is a fundamental principle that Blur uses, and I don't
> really have any ideas on where to begin checking for when this is a
> problem.  I know filters are going to be an issue; not sure where else.
>
> Not saying that it can't be done, it's just not going to be as clean as I
> originally thought.
>
> Aaron
>
>
> On Fri, Oct 4, 2013 at 4:26 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>
> >
> >
> > On Fri, Oct 4, 2013 at 7:15 AM, Ravikumar Govindarajan <
> > ravikumar.govindarajan@gmail.com> wrote:
> >
> >> On a related note, do you think such an approach will fit in Blur?
> >>
> >> 1. Store the BDB file on the shard-server itself.
> >>
> >
> > Probably not, this would pin the BDB (or whatever the solution would be)
> > to a specific server.  We will have to sync to HDFS.
> >
> >
> >>
> >> 2. Apply all incoming partial doc-updates to the local BDB file as
> >>    well as an update-transaction log.
> >>
> >
> > Blur already has a write-ahead log as part of its internals.  It's
> > written and synced to HDFS.
> >
> >
> >>
> >> 3. Periodically sync dirty BDB files to HDFS and roll over the
> >>    update-transaction log.
> >
> >
> >> Whenever a shard-server goes down, the take-over server can initially
> >> sync the BDB file from HDFS to local, replay the update-transaction
> >> log and then start serving data.
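
A hedged sketch of the proposed take-over flow (steps 2-3 plus failover):
restore the last-synced BDB snapshot from HDFS, then replay the
update-transaction log on top. TransactionLog, Update and applyToBdb are
hypothetical placeholders, not Blur or TokyoCabinet APIs; the HDFS call is
the standard Hadoop FileSystem one.

    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class TakeOverSketch {
      // TransactionLog, Update and applyToBdb are hypothetical placeholders.
      static void takeOver(FileSystem hdfs, Path snapshot, File localBdb,
          TransactionLog log) throws IOException {
        // 1. Restore the last dirty-synced BDB file from HDFS to local disk.
        hdfs.copyToLocalFile(snapshot, new Path(localBdb.getAbsolutePath()));
        // 2. Re-apply every update recorded after that sync point.
        for (Iterator<Update> it = log.replay(); it.hasNext();) {
          applyToBdb(localBdb, it.next());
        }
        // 3. Only now start serving search traffic.
      }
    }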
> >>
> >
> > Blur already does this internally; it records the mutates and replays
> > them if a failure happens before a commit.
> >
> > Aaron
> >
> >
> >>
> >> --
> >> Ravi
> >>
> >>
> >> On Thu, Oct 3, 2013 at 11:14 PM, Ravikumar Govindarajan <
> >> ravikumar.govindarajan@gmail.com> wrote:
> >>
> >> > The mutate APIs are a good fit for individual column updates. A
> >> > BlurCodec would be cool and solve a lot of problems.
> >> >
> >> > There are 3 caveats for such a codec:
> >> >
> >> > 1. Scores for affected queries will be wrong until segment-merge.
> >> >
> >> > 2. Responsibility for ordering updates must be on the client.
> >> >
> >> > 3. Repeated updates for the same document can either take a
> >> > generational approach [LUCENE-4258] or use a single version of
> >> > storage [Redis/TC etc.], pushing the onus onto the client, depending
> >> > on how the Codec shapes up.
> >> >
> >> > The former will be semantically correct but really sluggish, while
> >> > the latter will be faster during search.
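
A hypothetical sketch of caveat 2: the client stamps each partial update
with a monotonically increasing version so whatever store the codec uses
can order updates and drop stale ones. None of these types are Blur APIs.

    import java.util.Map;

    class VersionedUpdate {
      final String recordId;
      final long version;              // client-assigned, strictly increasing
      final Map<String, String> changedColumns;

      VersionedUpdate(String recordId, long version, Map<String, String> cols) {
        this.recordId = recordId;
        this.version = version;
        this.changedColumns = cols;
      }

      // Apply only if this update is newer than the last one seen for the record.
      static synchronized boolean applyIfNewer(Map<String, Long> lastApplied,
          VersionedUpdate u) {
        Long prev = lastApplied.get(u.recordId);
        if (prev != null && prev >= u.version) {
          return false;                // stale or duplicate; drop it
        }
        lastApplied.put(u.recordId, u.version);
        return true;
      }
    }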
> >> >
> >> >
> >> >
> >> > On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <amccurry@gmail.com>
> >> > wrote:
> >> >
> >> >> On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <
> >> >> ravikumar.govindarajan@gmail.com> wrote:
> >> >>
> >> >> > Yeah, you are correct. A BDB file will probably never be ported
> >> >> > to HDFS.
> >> >> >
> >> >> > Our daily update frequency comes to about 20% of the insertion
> >> >> > rate.
> >> >> >
> >> >> > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
> >> >> >
> >> >> > This update could potentially span tens of thousands of SQL rows
> >> >> > in our case, where COL2 is just a boolean flip.
> >> >> >
> >> >> > The problem is not with Lucene's ability to handle load. Instead
> >> >> > it is the consistent load it puts on our content servers to read
> >> >> > and re-tokenize such huge rows just for a boolean flip. Another
> >> >> > big win is that all our updatable fields are not involved in
> >> >> > scoring at all. Just matching will do.
> >> >> >
> >> >> > The changes also sit in BDB only till the next segment merge,
> >> >> > after which they are cleaned out. There is very little perf hit
> >> >> > here for us, as users don't immediately search after a change.
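
A hypothetical sketch of the sidecar pattern being described: match-only,
updatable columns live in a local key-value store keyed by record, so a
boolean flip never touches the inverted index. A concurrent map stands in
for the BDB/TokyoCabinet file here; nothing below is a Blur API.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class SidecarFlags {
      private final ConcurrentMap<String, Boolean> store =
          new ConcurrentHashMap<String, Boolean>();

      // O(1) update: flip the flag without re-reading or re-tokenizing the row.
      void flip(String recordId, boolean value) {
        store.put(recordId, value);
      }

      // Consulted at query time for matching only; the field never scores.
      boolean matches(String recordId) {
        return Boolean.TRUE.equals(store.get(recordId));
      }

      // After a segment merge folds the values into the index, clean them out.
      void cleanOut(Iterable<String> mergedRecordIds) {
        for (String id : mergedRecordIds) {
          store.remove(id);
        }
      }
    }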
> >> >> >
> >> >> > I am afraid there is no documentation/code/numbers on this
> >> >> > currently in public, as it is still proprietary, but it is
> >> >> > remarkably similar to the popular RedisCodec.
> >> >> >
> >> >> > "If you really need partial document updates, there would need
to
> be
> >> >> > changes
> >> >> > throughout the entire stack"
> >> >> >
> >> >> > You mean, the entire stack of Blur? In case this is possible,
can
> you
> >> >> give
> >> >> > me 10000-ft overview of what you have in mind?
> >> >> >
> >> >>
> >> >> Interesting, now that I think about it.  The situation that you
> >> >> describe is very interesting; I'm wondering, if we came up with
> >> >> something like this in Blur, whether it would fix our large Row
> >> >> issue.  Or at the very least help the problem.
> >> >>
> >> >> https://issues.apache.org/jira/browse/BLUR-220
> >> >>
> >> >> Plus the more I think about it, the mutate methods are probably
> >> >> the right implementation for modifying single columns.  So the API
> >> >> of Blur probably wouldn't need to be changed; maybe just the way it
> >> >> goes about dealing with changes.  I'm thinking maybe we need our
> >> >> own BlurCodec to handle large Rows as well as Record (Document)
> >> >> updates.
> >> >>
> >> >> As an aside, I constantly am having to refer to Records as
> >> >> Documents; this is why I think we need a rename.
> >> >>
> >> >> Aaron
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> > --
> >> >> > Ravi
> >> >> >
> >> >> >
> >> >> > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <amccurry@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> > > The biggest issue with this is that the shards (the indexes)
> >> >> > > inside of Blur actually move from one server to another.  So to
> >> >> > > support this behavior all the indexes are stored in HDFS.  Due
> >> >> > > to the differences between HDFS and a normal POSIX file system,
> >> >> > > I highly doubt that the BDB file format in TokyoCabinet can
> >> >> > > ever be supported.
> >> >> > >
> >> >> > > If you really need partial document updates, there would need
> >> >> > > to be changes throughout the entire stack.  I am curious why
> >> >> > > you need this feature.  Do you have that many updates to the
> >> >> > > index?  What is the update frequency?  I'm just curious what
> >> >> > > kind of performance you get out of a setup like that.  Since I
> >> >> > > haven't ever run such a setup, I have no idea how to compare
> >> >> > > that kind of system to a base Lucene setup.
> >> >> > >
> >> >> > > Could you point me to some code or documentation?  I would
> >> >> > > like to go and take a look.
> >> >> > >
> >> >> > > Thanks,
> >> >> > > Aaron
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <
> >> >> > > ravikumar.govindarajan@gmail.com> wrote:
> >> >> > >
> >> >> > > > One more thing we need help with.
> >> >> > > >
> >> >> > > > We also maintain a file by the name "BDB", just like the
> >> >> > > > "Sample" file for tracing used by Blur.
> >> >> > > >
> >> >> > > > This "BDB" file pertains to TokyoCabinet and is used purely
> >> >> > > > for supporting partial updates to a document.  All operations
> >> >> > > > on this file rely on local file-paths only, through the use
> >> >> > > > of native code.  Currently, all update requests are local to
> >> >> > > > the index files, and it becomes trivial to support.
> >> >> > > >
> >> >> > > > Any pointers on how to take this forward in the Blur set-up
> >> >> > > > of shard-servers & controllers?
> >> >> > > >
> >> >> > > > --
> >> >> > > > Ravi
> >> >> > > >
> >> >> > > >
> >> >> > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <
> >> >> > > > amccurry@gmail.com> wrote:
> >> >> > > >
> >> >> > > > > You can control the fields to warm up via:
> >> >> > > > >
> >> >> > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
> >> >> > > > >
> >> >> > > > > The preCacheCols field.  The comment is wrong however, so I
> >> >> > > > > will create a task to correct it.  The use of the field is
> >> >> > > > > "family.column", just like you would search.
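
A hedged sketch of setting this from the Thrift client side, assuming the
standard Thrift-generated setters on TableDescriptor (the table and column
names below are made up):

    import java.util.Arrays;
    import org.apache.blur.thrift.generated.TableDescriptor;

    class WarmupConfig {
      static TableDescriptor describe() {
        TableDescriptor td = new TableDescriptor();
        td.setName("mytable");                 // hypothetical table name
        td.setPreCacheCols(Arrays.asList(
            "docs.title",                      // "family.column", just like a search
            "docs.body"));
        return td;
      }
    }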
> >> >> > > > >
> >> >> > > > > Aaron
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <
> >> >> > > > > ravikumar.govindarajan@gmail.com> wrote:
> >> >> > > > >
> >> >> > > > > > Thanks Aaron.
> >> >> > > > > >
> >> >> > > > > > General sampling and warming is fine, and the code is
> >> >> > > > > > really concise and clear.
> >> >> > > > > >
> >> >> > > > > >  The act of reading
> >> >> > > > > > brings the data into the block cache and the result is
> >> >> > > > > > that the index is "hot".
> >> >> > > > > >
> >> >> > > > > > Will all the terms of a field be read and brought into
> >> >> > > > > > the cache?  If so, then it has an obvious implication:
> >> >> > > > > > avoid warming up fields like, say, attachment-data,
> >> >> > > > > > provided queries don't often include such fields.
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <
> >> >> > > > > > amccurry@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Take a look at this package.
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
> >> >> > > > > > >
> >> >> > > > > > > Basically when the warmup process starts (which is
> >> >> > > > > > > asynchronous to the rest of the application) it flips a
> >> >> > > > > > > thread-local switch to allow for tracing of the file
> >> >> > > > > > > accesses.  The sampler will sample each of the fields
> >> >> > > > > > > in each segment and create a sample file that attempts
> >> >> > > > > > > to detect the boundaries of each field within each file
> >> >> > > > > > > within each segment.  Then it stores the sample info
> >> >> > > > > > > into the directory beside each segment (so that way it
> >> >> > > > > > > doesn't have to re-sample the segment).  After the
> >> >> > > > > > > sampling is complete or loaded, the warmup just reads
> >> >> > > > > > > the binary data from each file.  The act of reading
> >> >> > > > > > > brings the data into the block cache and the result is
> >> >> > > > > > > that the index is "hot".
> >> >> > > > > > >
> >> >> > > > > > > Hope this helps.
> >> >> > > > > > >
> >> >> > > > > > > Aaron
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan <
> >> >> > > > > > > ravikumar.govindarajan@gmail.com> wrote:
> >> >> > > > > > >
> >> >> > > > > > > > As I understand,
> >> >> > > > > > > >
> >> >> > > > > > > > Lucene will store the files in the following way
> >> >> > > > > > > > per-segment:
> >> >> > > > > > > >
> >> >> > > > > > > > TIM file
> >> >> > > > > > > >      Field1 ---> Some byte[]
> >> >> > > > > > > >      Field2 ---> Some byte[]
> >> >> > > > > > > >
> >> >> > > > > > > > TIP file
> >> >> > > > > > > >      Field1 ---> Some byte[]
> >> >> > > > > > > >      Field2 ---> Some byte[]
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Blur will "sample" this lucene-file in the following
> >> >> > > > > > > > way:
> >> >> > > > > > > >
> >> >> > > > > > > > Field1 --> <TIM, start-offset>, <TIP, start-offset>, ...
> >> >> > > > > > > >
> >> >> > > > > > > > Field2 --> <TIM, start-offset>, <TIP, start-offset>, ...
> >> >> > > > > > > >
> >> >> > > > > > > > Is my understanding correct?
> >> >> > > > > > > >
> >> >> > > > > > > > How does Blur warm up the fields when it does not
> >> >> > > > > > > > know the "end-offset" or the "length" for each field
> >> >> > > > > > > > to warm?
> >> >> > > > > > > >
> >> >> > > > > > > > Will it by default read all Terms of a field?
> >> >> > > > > > > >
> >> >> > > > > > > > --
> >> >> > > > > > > > Ravi
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
>
