lucenenet-user mailing list archives

From Robert Oschler <robert.osch...@gmail.com>
Subject Re: Does the latest version of lucene.net contain the Lucene 4.0 speed improvements for fuzzy queries?
Date Thu, 09 Jul 2015 13:50:47 GMT
Thanks for the tips Bart.

>> I have gotten individual Lucene 3.0.3 indexes as high as 30 gig (over
100 million entities) to work on Azure and they perform well.

I assume you mean you did that without using the AzureDirectory approach?
Just confirming this.

Thanks,
Robert

On Thu, Jul 9, 2015 at 5:49 AM, Bart Czernicki <
Bartosz.Czernicki@microsoft.com> wrote:

> I have used and deployed AzureDirectory into a couple production systems.
> The current version out there does not scale properly for very large
> indexes.  You will need to tweak the code a bit. For Azure, there are some
> additional tweaks I recommend:
> - using the local temp drive for indexes, larger VMs can use the temp SSDs
> or attached SSDs
> - changing the way indexes are synchronized to leverage blob storage faster
> - additional fail safe mechanisms
>
> I have gotten individual Lucene 3.0.3 indexes as high as 30 gig (over 100
> million entities) to work on Azure and they perform well.
>
> Thanks,
> Bart
>
>
>
> > On Jul 9, 2015, at 05:39, Simon Svensson <sisve@devhost.se> wrote:
> >
> > Hi,
> >
> > I believe your mail client is messing up the quotations/indentations.
> >
> > Anyhow, the 4.0 branch does refer to Lucene 4.0, but it's an incomplete
> > port. Work is currently focused on 4.8 and is present in the master
> > branch. This is expected to be the next release. I'm guessing that this
> > is the branch where you found the "using Lucene.Net.Codecs.Lucene40;"
> > line.
> >
> > I've not used Azure myself, and have only limited knowledge of the usage
> > of blobs. It sounds like it comes down to locally stored indexes vs.
> > indexes shared over the network. I just made up a few points below, off
> > the top of my head, while dodging my real work tasks... ;)
> >
> > Locally stored indexes:
> > + Easy setup.
> > + Low response times.
> > + One corrupted index will only bring down one searcher.
> > - One index per worker (duplicated indexes == wasted disk space)
> > - Every worker needs to build the index and keep it up-to-date.
> > - Every worker has two roles; both searcher and indexer.
> > - Slower scale-out; a new worker needs to rebuild the index.
> >
> > Network-based indexes:
> > + One index shared between all workers.
> > + Every worker has a dedicated role; either searcher or indexer (You can
> assign resources to match)
> > + One dedicated worker takes care of building and keeping the index
> up-to-date.
> > + Faster scale-out; a new worker just grabs the data from the network.
> > - Higher response times (due to network traffic). This is often
> mitigated by locally caching the segments.
> > - Single-point-of-failure. A corrupted index will bring down all
> searchers.
> >
> > I would go with the Azure blobs. While it may mean extra maintenance and
> > documentation as an introductory cost, you can sleep soundly at night
> > knowing that once your service is hit by Slashdot/Reddit you can press a
> > button and scale out in a very short time. (That's the theory at least, if
> > you configured your web workers correctly...)
> >
> > // Simon
> >
> >> On 09/07/15 11:13, Robert Oschler wrote:
> >> Hello Simon,
> >>
> >> Ok.  I got excited when I saw the following using statement in the 3.0.3
> >> build:
> >>
> >> using Lucene.Net.Codecs.Lucene40;
> >>
> >> But from what you are saying I take it the 4.0 label does not refer to
> >> Lucene 4.0.
> >>
> >>>> You could probably take the FuzzyQuery class from Lucene 4.0 and port
> it
> >> to the Lucene 3.0.3 code base. That's the only way you can get those
> >> improvements while still using a stable version of Lucene.net.
> >>
> >> Thanks.  I'll try grabbing the FuzzyQuery class and converting the Java
> >> code to C#.  Hopefully there aren't too many dependencies in that unit
> that
> >> I'll have to drag in.
> >>
> >>>> Regarding the Azure library... are you using Azure? If so, are you
> using
> >> several worker machines that need to share an index? You've not
> mentioned
> >> anything about your current setup to help you evaluate if switching to
> >> AzureDirectory will be an improvement or not.
> >>
> >> Yes I am using Azure, but I am just getting started so I have not set up
> >> Lucene yet.  I'm trying to decide my current setup right now.  Given my
> >> expected usage profile (The index will probably receive a few hundred
> new
> >> updates over the course of the day.  Over time, there could well be a
> >> hundred thousand sentences or so.), do you have any suggestions?  I want
> >> the lowest search latency I can give my users.  Note, I am more concerned
> >> with the performance of index lookups.
> >> Update/modifications can take a few seconds if needed since they will be
> >> much less frequent.
> >>
> >> Thanks,
> >> Robert
> >>
> >>
> >>
> >>> On Thu, Jul 9, 2015 at 5:01 AM, Simon Svensson <sisve@devhost.se>
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> The latest stable version of Lucene.net is v3.0.3, and does not contain
> >>> any code changes from the higher versioned Java code. There's currently
> >>> a 4.8 port in progress, but it's not stable enough yet.
> >>>
> >>> You could probably take the FuzzyQuery class from Lucene 4.0 and port
> it
> >>> to the Lucene 3.0.3 code base. That's the only way you can get those
> >>> improvements while still using a stable version of Lucene.net.
> >>>
> >>> Regarding the Azure library... are you using Azure? If so, are you
> using
> >>> several worker machines that need to share an index? You've not
> mentioned
> >>> anything about your current setup to help you evaluate if switching to
> >>> AzureDirectory will be an improvement or not.
> >>>
> >>> // Simon
> >>>
> >>>
> >>>
> >>>> On 09/07/15 10:51, Robert Oschler wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Does the latest version of lucene.net contain the Lucene 4.0 speed
> >>>> improvements for fuzzy queries?  If not, is there any way to get those
> >>>> improvements?  I saw this experimental Lucene.net 4.0 branch on
> >>>> Apache.org,
> >>>> but it seems to be inactive now and I don't know how stable it is:
> >>>>
> >>>> https://svn.apache.org/repos/asf/lucene.net/branches/Lucene.Net_4e/
> >>>>
> >>>> If so, what is the exact version of lucene.net I should be using to
> have
> >>>> those improvements?
> >>>>
> >>>> Also, I saw this post on using Azure blobs to speed up server side
> >>>> processing:
> >>>>
> >>>>
> https://code.msdn.microsoft.com/windowsazure/Azure-Library-for-83562538
> >>>>
> >>>> Are the improvements using this technique substantial enough to
> warrant
> >>>> using it?  My "documents" are average sized sentences.  The index will
> >>>> probably receive a few hundred new updates over the course of the day.
> >>>> Over time, there could well be a hundred thousand sentences or so.
> >
>
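For reference, the "locally caching the segments" idea from the network-based
list above can be sketched roughly as follows. This is not AzureDirectory's
code and not what Bart deployed; it is a minimal illustration in Python that
assumes the shared store can be treated as a plain directory (a stand-in for
blob storage) and that a file with the same name and size is already up to
date. The function name sync_index and both directory arguments are
hypothetical.

```python
import shutil
from pathlib import Path


def sync_index(remote_dir: Path, local_dir: Path) -> list:
    """Mirror index files from a shared store into a local cache directory.

    Assumption: remote_dir is an ordinary directory standing in for blob
    storage, and name + size is a good enough "unchanged" check.
    Returns the names of the files that were copied.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    remote = {p.name: p for p in remote_dir.iterdir() if p.is_file()}

    copied = []
    for name, src in sorted(remote.items()):
        dst = local_dir / name
        # Copy only files that are missing locally or differ in size.
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)
            copied.append(name)

    # Drop local files that no longer exist in the shared store.
    for p in list(local_dir.iterdir()):
        if p.is_file() and p.name not in remote:
            p.unlink()

    return copied
```

A searcher would call something like this before reopening its local index,
so that only new or changed files cross the network. A real implementation
would also need locking and a stronger change check than file size.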


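As background for the fuzzy query question at the bottom of the thread: a
fuzzy query matches index terms that lie within a small edit distance of the
query term. The sketch below is a plain bounded Levenshtein check in Python,
written only to illustrate that idea. It is not the Lucene 4.0 FuzzyQuery
implementation (which, as far as I know, is built on Levenshtein automata
rather than comparing terms one by one), and the names within_edits and
fuzzy_terms are hypothetical.

```python
def within_edits(a: str, b: str, max_edits: int = 2) -> bool:
    """Return True if a and b are within max_edits single-character edits."""
    # A length gap larger than the budget can never be closed.
    if abs(len(a) - len(b)) > max_edits:
        return False

    # Classic dynamic-programming rows: prev[j] is the distance between
    # the processed prefix of a and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # Row minima never decrease, so we can stop once over budget.
        if min(curr) > max_edits:
            return False
        prev = curr
    return prev[-1] <= max_edits


def fuzzy_terms(query: str, terms, max_edits: int = 2) -> list:
    """Brute-force scan: keep every term within max_edits of the query."""
    return [t for t in terms if within_edits(query, t, max_edits)]
```

For example, fuzzy_terms("lucene", ["lucene", "lucent", "azure"]) keeps the
first two terms and drops "azure". Scanning every term this way is what makes
naive fuzzy matching slow on large term dictionaries.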

-- 
Thanks,
Robert Oschler
Twitter -> http://twitter.com/roschler
http://www.RobotsRule.com/
http://www.Robodance.com/
