incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Optimizing for indexing vs searching
Date Tue, 08 Apr 2014 02:09:56 GMT
On Mon, Apr 7, 2014 at 8:35 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi,
>
> I want to refresh my understanding, so just a few imaginary situations.
> Lets say I have a data ingestion of 25000 docs per second (average size
> 10k). There could be situations where I want to optimize for indexing and
> in some cases I want to optimize for speed while searching. How do I
> control these individually?
>

This is the biggest challenge of any search solution.  I would say that
Blur handles this by utilizing the hardware it's given to the best of
its ability.  That being said, 25K docs per second at 10K in size as a
constant input, day after day, is a big problem to deal with on the
search side of the house.  That's north of 2 billion docs a day at over
20 TB a day.  Some questions I would ask: Are you expecting to keep all of
the data online forever?  What are the time-to-search requirements
(visibility latency)?  What kind of hardware are you expecting to run this
on?  All of this assumes that the 25K/second is a constant average.  If the
25K/second is a burst from time to time throughout the day, that is likely
a much easier question to answer.
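Just to put numbers on that, here's the back-of-the-envelope math (plain
Java, nothing Blur-specific):

```java
// Back-of-the-envelope check of the ingest numbers above:
// 25K docs/sec at ~10 KB each, sustained around the clock.
public class IngestMath {
    static long docsPerDay(long docsPerSecond) {
        return docsPerSecond * 86_400L;          // seconds in a day
    }
    static double terabytesPerDay(long docsPerSecond, long bytesPerDoc) {
        return docsPerDay(docsPerSecond) * (double) bytesPerDoc / 1e12;
    }
    public static void main(String[] args) {
        System.out.println(docsPerDay(25_000));              // 2,160,000,000 docs/day
        System.out.println(terabytesPerDay(25_000, 10_000)); // ~21.6 TB/day
    }
}
```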

In either case, if you need near real time access to the documents (meaning
you can't wait for a MapReduce job to run), then I would use the
enqueueMutate call.  It is similar to the NRT features of most search
engines: it indexes as fast as it can without imposing a large
latency on the client.
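The pattern behind an enqueue-style mutate can be sketched in plain Java
(the class and method names here are invented for illustration -- this is
not the Blur client API): the caller hands a mutation to a queue and
returns immediately, while a background thread drains the queue and applies
mutations as fast as it can.

```java
import java.util.concurrent.*;

// Toy illustration of the enqueue-mutate pattern.  Not Blur internals.
public class EnqueueSketch {
    private static final String STOP = "__stop__";            // shutdown sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final ConcurrentLinkedQueue<String> indexed = new ConcurrentLinkedQueue<>();
    private final Thread indexer = new Thread(() -> {
        try {
            while (true) {
                String m = queue.take();       // block until a mutation arrives
                if (m.equals(STOP)) break;
                indexed.add(m);                // stand-in for the actual index write
            }
        } catch (InterruptedException ignored) { }
    });

    public void start() { indexer.start(); }

    // Returns as soon as the mutation is queued -- low client latency.
    public void enqueueMutate(String mutation) throws InterruptedException {
        queue.put(mutation);
    }

    public void stop() throws InterruptedException {
        queue.put(STOP);                       // FIFO: drains everything queued first
        indexer.join();
    }

    public int indexedCount() { return indexed.size(); }
}
```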


> My understanding is that having fewer but bigger shards improves search
> performance. Is this right?
>

In general yes, but fewer is relative.  I have run a table in Blur with over
1000 shards on more than 100 shard servers, with segments in the 4K-5K
(total) range, and the search performance was very acceptable.  Of course,
the fewer the segments, the faster the search executes.  However, as the
segments grow in size, the merges will take longer to complete.  Take a look
at the TieredMergePolicy in Lucene; there are a few videos on YouTube that
show how merges occur.
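To see why merging keeps the segment count low while individual merges get
bigger over time, here is a toy log-style merge simulation -- a deliberate
simplification, not Lucene's actual TieredMergePolicy: every flush creates a
size-1 segment, and whenever mergeFactor segments of the same size
accumulate they are merged into one larger segment.

```java
import java.util.*;

// Toy merge simulation.  After n flushes the number of live segments is
// the digit sum of n in base mergeFactor, so it stays small even as the
// total data (and the size of the biggest merges) keeps growing.
public class MergeSim {
    static List<Long> flushAndMerge(int flushes, int mergeFactor) {
        List<Long> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(1L);                          // flush one small segment
            boolean merged = true;
            while (merged) {                           // cascade merges upward
                merged = false;
                Map<Long, Integer> bySize = new HashMap<>();
                for (long s : segments) bySize.merge(s, 1, Integer::sum);
                for (Map.Entry<Long, Integer> e : bySize.entrySet()) {
                    if (e.getValue() >= mergeFactor) {
                        long size = e.getKey();
                        for (int k = 0; k < mergeFactor; k++) segments.remove(size);
                        segments.add(size * mergeFactor);
                        merged = true;
                        break;
                    }
                }
            }
        }
        return segments;
    }
}
```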


> Also does each shard correspond to one segment file (ignoring snapshots)? I
>

No, each shard is a Lucene index, which will contain one or more segments.


> am trying to understand what happens when a shard is being searched and
> someone tries to write to the same shard. Would a new segment be created?
>

Yes.  However, Blur controls the way Lucene manages the segments within a
given index.  Basically, Blur creates a lightweight snapshot of the index,
then executes the query and fetches the results using that lightweight
snapshot.
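The snapshot idea can be sketched with an immutable segment list behind an
atomic reference (invented names, not Blur's internals; this also assumes a
single writer): a search captures the current reference once and works
against that fixed view, so a concurrent write that publishes a new list
never changes what an in-flight search sees.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;

// Toy version of the lightweight-snapshot-for-search pattern.
public class SnapshotSketch {
    private final AtomicReference<List<String>> live =
            new AtomicReference<>(List.of());

    // Writer (assumed single-threaded): publish a new immutable list
    // containing one more segment.  Readers are never blocked.
    public void addSegment(String name) {
        List<String> next = new ArrayList<>(live.get());
        next.add(name);
        live.set(Collections.unmodifiableList(next));
    }

    // Reader: capture the current view; later writes don't change it.
    public List<String> snapshot() {
        return live.get();
    }
}
```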


> (if so how do we control merging of segments within a shard?)
>

All merges are handled by Lucene, but Blur implements a merge policy
globally per shard server so that the resources merging consumes can
be managed per process instead of per index/shard.  There is also merge
throttling built in, so you can control how much bandwidth each server
uses during merging.  Of course, this means that merging can fall behind
the ingest process.  That is OK for a while, but if the situation persists
indefinitely, the index will become slower and slower to search.  Merging
also uses the BlockCache for performance, but does not affect the contents
of the BlockCache.
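The throttling idea is just a rate limiter on merge writes.  Here is a
minimal sketch (invented names, not Blur's actual implementation): given a
bytes-per-second budget, compute how long the merge thread should pause
after each chunk so the long-run rate stays on budget.  The math is kept
deterministic; a real throttle would sleep for the returned duration.

```java
// Toy merge throttle: tracks "time owed" at the configured bandwidth and
// returns the pause needed after each write to stay at or under budget.
public class MergeThrottle {
    private final double bytesPerSecond;
    private double secondsOwed;        // time the writer still has to spend

    public MergeThrottle(double bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
    }

    // Record a write and return the pause (in seconds) needed to stay on
    // budget, given the wall-clock time elapsed since the previous call.
    public double pauseFor(long bytesWritten, double elapsedSeconds) {
        secondsOwed += bytesWritten / bytesPerSecond;
        secondsOwed -= elapsedSeconds;
        if (secondsOwed < 0) secondsOwed = 0;  // credit doesn't accumulate
        return secondsOwed;
    }
}
```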


>
> My apologies if this doesn't make a whole lot of sense.
>

All good questions, let me know if you have more questions.

Aaron


> Thank You.
>
> - Rahul
>
