incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Optimizing for indexing vs searching
Date Tue, 08 Apr 2014 15:56:17 GMT
On Tue, Apr 8, 2014 at 10:54 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Thanks Aaron for your prompt and elaborate responses. You have really been
> a pleasure to collaborate with, always encouraging me to ask more questions
> (however dumb they are). Below are a few comments.
>
>
You are welcome!  :-)


>
> On Mon, Apr 7, 2014 at 7:09 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>
> > On Mon, Apr 7, 2014 at 8:35 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I want to refresh my understanding, so just a few imaginary situations.
> > > Let's say I have a data ingestion rate of 25,000 docs per second
> > > (average size 10k). There could be situations where I want to optimize
> > > for indexing and in some cases I want to optimize for search speed. How
> > > do I control these individually?
> > >
> >
> > This is the biggest challenge of any search solution.  I would say that
> > Blur handles this by utilizing the hardware it's given to the best of its
> > ability.  That being said, 25K docs per second at 10K in size as a
> > constant input, day after day, is a big problem to deal with on the
> > search side of the house.  That's north of 2 billion docs a day at over
> > 20 TB a day.  Some questions I would ask: Are you expecting to keep all
> > of the data online forever?  What are the time-to-search requirements
> > (visibility latency)?  What kind of hardware are you expecting to run
> > this on?  All of this assumes that the 25K/second is a constant average.
> > If the 25K/second is a burst from time to time throughout the day, that
> > is likely a much easier question to answer.
> >
> > In either case, if you need near-real-time access to the documents
> > (meaning you can't wait for a MapReduce job to run), then I would use the
> > enqueueMutate call.  It is similar to the NRT features of most search
> > engines: it will basically index as fast as it can without causing large
> > latency on the client.
> >
>
>    As I told you, it's just an imaginary situation; however, my intention
> was that we would have data ingestion bursts lasting for an hour and
> happening 3 times a day. I just wanted to understand how quickly the new
> data will be available for search, and how search performs while the data
> ingestion is taking place. I did not really consider the hardware (maybe a
> 10-gig network and each node having 128 GB of memory?).
>

Ok, with that kind of hardware (assuming you have enough nodes) you should
have plenty of headroom for indexing and search.  With small document
(record) sizes of around 1K worth of information, a single cluster of 32
nodes can easily keep up with 10 to 20K records per second for hours at a
time (with a single client writing the updates).  The indexing speed greatly
depends on the information you are indexing, so this is something you will
likely have to try out to really see what it's capable of achieving.  Let
me know if I can help in any way.
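
For reference, here is a rough sketch of what the enqueueMutate call I
mentioned above looks like from a Java client.  The class names are the
Thrift-generated ones from memory (BlurClient, RowMutation, RecordMutation,
Record, Column), and the controller address, table, row, and column values
are just placeholders, so treat this as a sketch rather than copy-paste
code:

  import org.apache.blur.thrift.BlurClient;
  import org.apache.blur.thrift.generated.Blur;
  import org.apache.blur.thrift.generated.Column;
  import org.apache.blur.thrift.generated.Record;
  import org.apache.blur.thrift.generated.RecordMutation;
  import org.apache.blur.thrift.generated.RecordMutationType;
  import org.apache.blur.thrift.generated.RowMutation;
  import org.apache.blur.thrift.generated.RowMutationType;

  public class EnqueueMutateSketch {
    public static void main(String[] args) throws Exception {
      // Connect to a controller (placeholder address).
      Blur.Iface client = BlurClient.getClient("controller1:40010");

      // One record with a single column in family "fam0".
      Record record = new Record();
      record.setRecordId("record-1");
      record.setFamily("fam0");
      record.addToColumns(new Column("col0", "some value"));

      RecordMutation recordMutation = new RecordMutation();
      recordMutation.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD);
      recordMutation.setRecord(record);

      // Wrap the record mutation in a row mutation for the target table/row.
      RowMutation mutation = new RowMutation();
      mutation.setTable("test_table");
      mutation.setRowId("row-1");
      mutation.setRowMutationType(RowMutationType.REPLACE_ROW);
      mutation.addToRecordMutations(recordMutation);

      // enqueueMutate returns quickly; the shard server indexes in the
      // background, so the client does not block on NRT indexing latency.
      client.enqueueMutate(mutation);
    }
  }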


>
> >
> >
> > > My understanding is that having fewer but bigger shards improves search
> > > performance. Is this right?
> > >
> >
> > In general yes, but fewer is relative.  I have run tables in Blur with
> > over 1000 shards on more than 100 shard servers, with segments in the
> > 4K-5K (total) range, and the search performance is very acceptable.  Of
> > course, the fewer the segments, the faster the search executes.  However,
> > as the segments grow in size, the merges will take longer to complete.
> > Take a look at the TieredMergePolicy in Lucene; there are a few videos on
> > YouTube that show how merges occur.
> >
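
As a quick illustration of the knobs involved (this is the plain Lucene 4.x
API, not Blur's internal setup, and the numbers are placeholders you would
have to tune), a TieredMergePolicy is typically configured on the
IndexWriterConfig like this:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.TieredMergePolicy;
  import org.apache.lucene.util.Version;

  static IndexWriterConfig newWriterConfig(Analyzer analyzer) {
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setMaxMergedSegmentMB(5 * 1024);  // cap merged segments at ~5 GB
    mergePolicy.setSegmentsPerTier(10.0);         // segments allowed per tier before merging
    mergePolicy.setMaxMergeAtOnce(10);            // how many segments one merge combines
    conf.setMergePolicy(mergePolicy);
    return conf;
  }
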
> >
> > > Also does each shard correspond to one segment file (ignoring
> > > snapshots)? I
> > >
> >
> > No, each shard equals a Lucene index, which will contain 1 or more
> > segments.
> >
> >
> > > am trying to understand what happens when a shard is being searched and
> > > someone tries to write to the same shard. Would a new segment be
> > > created?
> > >
> >
> > Yes; however, Blur controls the way Lucene manages the segments within a
> > given index.  Basically, Blur creates a lightweight snapshot of the
> > index, then executes the query and fetches the results using this
> > lightweight snapshot.
> >
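
The snapshotting code itself is internal to Blur, but the idea is the same
pattern as Lucene's NRT search: acquire a point-in-time searcher, run the
query and fetches against it, then release it, while writers keep adding new
segments that only become visible after a refresh.  A minimal plain-Lucene
sketch of that pattern (the field name and term are placeholders):

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.SearcherManager;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;

  // A long-lived SearcherManager sits next to the shard's IndexWriter.
  static TopDocs searchSnapshot(SearcherManager manager) throws IOException {
    IndexSearcher searcher = manager.acquire();  // lightweight point-in-time view
    try {
      // Concurrent writes create new segments that this searcher never sees;
      // they only become visible after manager.maybeRefresh() is called.
      return searcher.search(new TermQuery(new Term("fam0.col0", "value")), 10);
    } finally {
      manager.release(searcher);
    }
  }
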
> >
> > > (if so how do we control merging of segments within a shard?)
> > >
> >
> > All merges are handled by Lucene, but Blur implements a shard merge
> > policy globally per shard server so that the resources that merging
> > consumes can be managed per process instead of per index/shard.  There is
> > also merge throttling built in so that you can control how much bandwidth
> > each server takes up during merging.  Of course this means that merging
> > can fall behind the ingest process.  This is ok; however, if the
> > situation persists indefinitely, the index will become slower and slower
> > to search.  Merging also uses the BlockCache for performance, but does
> > not affect the contents of the BlockCache.
> >
>
>   Can you point me to the code that is taking care of this?
> (SharedMergeScheduler?)
>

That's the one.  You can configure the number of work threads per shard
process.
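
For comparison, the equivalent knob in plain Lucene lives on the
ConcurrentMergeScheduler; the SharedMergeScheduler does the same kind of
work but pools the merge threads across all the shards in the server
process.  A sketch using the stock Lucene API (the thread counts are
placeholders):

  import org.apache.lucene.index.ConcurrentMergeScheduler;
  import org.apache.lucene.index.IndexWriterConfig;

  static void limitMergeThreads(IndexWriterConfig conf) {
    ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
    // Allow up to 4 pending merges with at most 2 running at once.
    scheduler.setMaxMergesAndThreads(4, 2);
    conf.setMergeScheduler(scheduler);
  }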

Aaron


>
> >
> >
> > >
> > > My apologies if this doesn't make a whole lot of sense.
> > >
> >
> > All good questions.  Let me know if you have more.
> >
> > Aaron
> >
> >
> > > Thank You.
> > >
> > > - Rahul
> > >
> >
>
