incubator-blur-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Index Warmup in Blur
Date Fri, 04 Oct 2013 11:15:18 GMT
On a related note, do you think such an approach would fit in Blur?

1. Store the BDB file on the shard-server itself.

2. Apply all incoming partial doc-updates to the local BDB file as well as
   to an update-transaction log.

3. Periodically sync dirty BDB files to HDFS and roll over the
   update-transaction log.

Whenever a shard-server goes down, the take-over server can first sync the
BDB file from HDFS to local disk, replay the update-transaction log and
then start serving data.
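
A minimal sketch of that take-over sequence, assuming hypothetical helpers
for the HDFS copy, the transaction-log replay and the BDB apply (none of
these names exist in Blur today; paths are placeholders):

import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical take-over sequence for a shard whose server died.  Only
// the ordering of the steps is the point; every helper is a placeholder.
public abstract class ShardTakeover {

  static class Update {
    String recordId;
    String column;
    String value;
  }

  public void recover(String shard) throws IOException {
    File localBdb = new File("/local/blur/" + shard + "/BDB");
    // 1. Pull the last synced copy of the BDB file down from HDFS.
    copyFromHdfs("/blur/tables/" + shard + "/BDB", localBdb);
    // 2. Re-apply every partial update logged since that sync point.
    for (Update update : readTransactionLog("/blur/tables/" + shard + "/txnlog")) {
      applyToBdb(localBdb, update);
    }
    // 3. Only now is the shard consistent enough to serve queries.
    openForServing(shard);
  }

  protected abstract void copyFromHdfs(String hdfsPath, File local) throws IOException;
  protected abstract List<Update> readTransactionLog(String hdfsPath) throws IOException;
  protected abstract void applyToBdb(File bdb, Update update) throws IOException;
  protected abstract void openForServing(String shard) throws IOException;
}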

--
Ravi


On Thu, Oct 3, 2013 at 11:14 PM, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> The mutate APIs are a good fit for individual column updates. A BlurCodec
> would be cool and would solve a lot of problems.
>
> There are three caveats for such a codec:
>
> 1. Scores for affected queries will be wrong, until segment-merge
>
> 2. Responsibility of ordering updates must be on the client.
>
> 3. Repeated updates for the same document can either take a generational
> approach [LUCENE-4258] or use a single version of storage [Redis/TC etc.],
> pushing the onus onto the client, depending on how the codec shapes up.
>
> The former is semantically correct but really sluggish, while the latter
> is faster during search (see the sketch below).
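
To make caveat 3 concrete, here is a toy contrast of the two storage
shapes; the key layouts are made up for illustration and are not anything
in Lucene, Redis or TokyoCabinet:

import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key/value layouts for repeated updates to one document.
public class UpdateStorageShapes {

  // Generational: every update appends under (docId, generation), so
  // ordering is preserved server-side, but a read has to walk the
  // generations (semantically correct, sluggish).
  private final Map<String, TreeMap<Long, String>> generational = new ConcurrentHashMap<>();

  public void writeGenerational(String docId, long generation, String value) {
    generational.computeIfAbsent(docId, k -> new TreeMap<>()).put(generation, value);
  }

  public String readLatestGenerational(String docId) {
    TreeMap<Long, String> generations = generational.get(docId);
    return generations == null ? null : generations.lastEntry().getValue();
  }

  // Single-version: last write wins, so the client must order its own
  // updates, but a read is a single lookup (faster during search).
  private final Map<String, String> singleVersion = new ConcurrentHashMap<>();

  public void writeSingleVersion(String docId, String value) {
    singleVersion.put(docId, value);
  }

  public String readSingleVersion(String docId) {
    return singleVersion.get(docId);
  }
}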
>
>
>
> On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>
>> On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <
>> ravikumar.govindarajan@gmail.com> wrote:
>>
>> > Yeah, you are correct. A BDB file will probably never be ported to
>> > HDFS.
>> >
>> > Our daily update frequency comes to about 20% of the insertion rate.
>> >
>> > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
>> >
>> > This update could potentially span tens of thousands of SQL rows in
>> > our case, where COL2 is just a boolean flip.
>> >
>> > The problem is not with Lucene's ability to handle load. Instead, it
>> > is the constant load it puts on our content servers to read and
>> > re-tokenize such huge rows just for a boolean flip. Another big win is
>> > that none of our updatable fields are involved in scoring at all;
>> > matching alone will do.
>> >
>> > The changes also sit in BDB only until the next segment merge, after
>> > which they are cleaned out. There is very little perf hit here for us,
>> > as users don't search immediately after a change.
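
A toy sketch of that lifecycle, assuming a simple in-memory map in place
of BDB (the key layout and cleanup hook are invented for illustration):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical overlay: partial updates live in a local key/value store
// (BDB in this thread) until a segment merge rewrites the segment, at
// which point the entries for that segment are dropped.
public class PartialUpdateOverlay {

  // key = segment + "/" + docId + "/" + field
  private final Map<String, String> sideStore = new ConcurrentHashMap<>();

  public void applyUpdate(String segment, int docId, String field, String value) {
    sideStore.put(segment + "/" + docId + "/" + field, value);
  }

  // Called on a stored-field read: prefer the overlay value, fall back to
  // the value baked into the segment.
  public String read(String segment, int docId, String field, String baseValue) {
    String updated = sideStore.get(segment + "/" + docId + "/" + field);
    return updated != null ? updated : baseValue;
  }

  // After a segment merge the base values are rewritten, so the overlay
  // entries for the old segment can be cleaned out.
  public void onSegmentMerged(String segment) {
    sideStore.keySet().removeIf(key -> key.startsWith(segment + "/"));
  }
}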
>> >
>> > I am afraid there is no documentation/code/numbers on this currently
>> > in public, as it is still proprietary, but it is remarkably similar to
>> > the popular RedisCodec.
>> >
>> > "If you really need partial document updates, there would need to be
>> > changes
>> > throughout the entire stack"
>> >
>> > You mean the entire stack of Blur? In case this is possible, can you
>> > give me a 10,000-ft overview of what you have in mind?
>> >
>>
>> Interesting, now that I think about it.  The situation that you describe
>> is very interesting; I'm wondering whether, if we came up with something
>> like this in Blur, it would fix our large Row issue.  Or at the very
>> least help the problem.
>>
>> https://issues.apache.org/jira/browse/BLUR-220
>>
>> Plus, the more I think about it, the mutate methods are probably the
>> right implementation for modifying single columns.  So the API of Blur
>> probably wouldn't need to be changed, maybe just the way it goes about
>> dealing with changes.  I'm thinking maybe we need our own BlurCodec to
>> handle large Rows as well as Record (Document) updates.
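
For example, a single-column flip through the existing mutate API might
look roughly like this (a sketch against the 0.2 Thrift classes as I read
them; the connection string and names are placeholders, and exact
signatures may differ).  Today Blur still re-indexes the Row underneath;
the point here is that only the machinery below this API would change:

import java.util.Arrays;

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.Column;
import org.apache.blur.thrift.generated.Record;
import org.apache.blur.thrift.generated.RecordMutation;
import org.apache.blur.thrift.generated.RecordMutationType;
import org.apache.blur.thrift.generated.RowMutation;
import org.apache.blur.thrift.generated.RowMutationType;

public class SingleColumnMutate {
  public static void main(String[] args) throws Exception {
    // Placeholder controller connection string.
    Blur.Iface client = BlurClient.getClient("controller1:40010");

    // The "boolean flip" from earlier in the thread: replace one column
    // of one record instead of re-feeding the whole row.
    Record record = new Record();
    record.setRecordId("record-1");
    record.setFamily("fam");
    record.setColumns(Arrays.asList(new Column("col2", "1")));

    RecordMutation recordMutation = new RecordMutation();
    recordMutation.setRecordMutationType(RecordMutationType.REPLACE_COLUMNS);
    recordMutation.setRecord(record);

    RowMutation rowMutation = new RowMutation();
    rowMutation.setTable("table");
    rowMutation.setRowId("row-1");
    rowMutation.setRowMutationType(RowMutationType.UPDATE_ROW);
    rowMutation.setRecordMutations(Arrays.asList(recordMutation));

    client.mutate(rowMutation);
  }
}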
>>
>> As an aside, I constantly have to refer to Records as Documents, which
>> is why I think we need a rename.
>>
>> Aaron
>>
>> >
>> > --
>> > Ravi
>> >
>> >
>> > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <amccurry@gmail.com>
>> > wrote:
>> >
>> > > The biggest issue with this is that the shards (the indexes) inside
>> > > of Blur actually move from one server to another.  So to support
>> > > this behavior all the indexes are stored in HDFS.  Due to the
>> > > differences between HDFS and a normal POSIX file system, I highly
>> > > doubt that the BDB file format from TokyoCabinet can ever be
>> > > supported.
>> > >
>> > > If you really need partial document updates, there would need to be
>> > > changes throughout the entire stack.  I am curious why you need this
>> > > feature.  Do you have that many updates to the index?  What is the
>> > > update frequency?  I'm just curious what kind of performance you get
>> > > out of a setup like that.  Since I haven't ever run such a setup, I
>> > > have no idea how to compare that kind of system to a base Lucene
>> > > setup.
>> > >
>> > > Could you point me to some code or documentation?  I would like to
>> > > go and take a look.
>> > >
>> > > Thanks,
>> > > Aaron
>> > >
>> > >
>> > >
>> > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <
>> > > ravikumar.govindarajan@gmail.com> wrote:
>> > >
>> > > > One more request for help.
>> > > >
>> > > > We also maintain a file named "BDB", just like the "Sample" file
>> > > > that Blur uses for tracing.
>> > > >
>> > > > This "BDB" file pertains to TokyoCabinet and is used purely for
>> > > > supporting partial updates to a document.  All operations on this
>> > > > file rely on local file-paths only, through the use of native code.
>> > > > Currently, all update requests are local to the index files, so
>> > > > this becomes trivial to support.
>> > > >
>> > > > Any pointers on how to take this forward in Blur's set-up of
>> > > > shard-servers & controllers?
>> > > >
>> > > > --
>> > > > Ravi
>> > > >
>> > > >
>> > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <amccurry@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > You can control the fields to warm up via:
>> > > > >
>> > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
>> > > > >
>> > > > > The preCacheCols field.  The comment there is wrong however, so I
>> > > > > will create a task to correct it.  The use of the field is
>> > > > > "family.column", just like you would search.
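
For example (a sketch against the 0.2 Thrift client; the connection
string, table settings and column names are placeholders, and the setter
names follow the generated Thrift beans as I read them):

import java.util.Arrays;

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.TableDescriptor;

public class PreCacheColsExample {
  public static void main(String[] args) throws Exception {
    Blur.Iface client = BlurClient.getClient("controller1:40010");

    TableDescriptor tableDescriptor = new TableDescriptor();
    tableDescriptor.setName("table");
    tableDescriptor.setShardCount(16);
    tableDescriptor.setTableUri("hdfs://namenode/blur/tables/table");
    // "family.column", just like you would search; these columns are
    // read up front so they land in the block cache on shard open.
    tableDescriptor.setPreCacheCols(Arrays.asList("fam.col1", "fam.col2"));

    client.createTable(tableDescriptor);
  }
}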
>> > > > >
>> > > > > Aaron
>> > > > >
>> > > > >
>> > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <
>> > > > > ravikumar.govindarajan@gmail.com> wrote:
>> > > > >
>> > > > > > Thanks Aaron
>> > > > > >
>> > > > > > General sampling and warming is fine, and the code is really
>> > > > > > concise and clear.
>> > > > > >
>> > > > > > "The act of reading brings the data into the block cache and
>> > > > > > the result is that the index is 'hot'."
>> > > > > >
>> > > > > > Will all the terms of a field be read and brought into the
>> > > > > > cache?  If so, there is an obvious implication: avoid warming
>> > > > > > up fields like, say, attachment-data, provided queries don't
>> > > > > > often include such fields.
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <amccurry@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Take a look at this package:
>> > > > > > >
>> > > > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
>> > > > > > >
>> > > > > > > Basically when the warmup process starts (which is
>> > > > > > > asynchronous to the rest of the application) it flips a
>> > > > > > > thread-local switch to allow for tracing of the file
>> > > > > > > accesses.  The sampler will sample each of the fields in each
>> > > > > > > segment and create a sample file that attempts to detect the
>> > > > > > > boundaries of each field within each file within each
>> > > > > > > segment.  Then it stores the sample info into the directory
>> > > > > > > beside each segment (so that it doesn't have to re-sample the
>> > > > > > > segment).  After the sampling is complete or loaded, the
>> > > > > > > warmup just reads the binary data from each file.  The act of
>> > > > > > > reading brings the data into the block cache and the result
>> > > > > > > is that the index is "hot".
>> > > > > > >
>> > > > > > > Hope this helps.
>> > > > > > >
>> > > > > > > Aaron
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan <
>> > > > > > > ravikumar.govindarajan@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > As I understand it,
>> > > > > > > >
>> > > > > > > > Lucene will store the files in the following way per
>> > > > > > > > segment:
>> > > > > > > >
>> > > > > > > > TIM file
>> > > > > > > >      Field1 ---> Some byte[]
>> > > > > > > >      Field2 ---> Some byte[]
>> > > > > > > >
>> > > > > > > > TIP file
>> > > > > > > >      Field1 ---> Some byte[]
>> > > > > > > >      Field2 ---> Some byte[]
>> > > > > > > >
>> > > > > > > > Blur will "sample" this Lucene file in the following way:
>> > > > > > > >
>> > > > > > > > Field1 --> <TIM, start-offset>, <TIP, start-offset>, ...
>> > > > > > > > Field2 --> <TIM, start-offset>, <TIP, start-offset>, ...
>> > > > > > > >
>> > > > > > > > Is my understanding correct?
>> > > > > > > >
>> > > > > > > > How does Blur warm up the fields when it does not know the
>> > > > > > > > "end-offset" or the "length" for each field to warm?
>> > > > > > > >
>> > > > > > > > Will it by default read all Terms of a field?
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Ravi
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
