lucenenet-user mailing list archives

From Bart Czernicki <Bartosz.Czerni...@microsoft.com>
Subject Re: Does the latest version of lucene.net contain the Lucene 4.0 speed improvements for fuzzy queries?
Date Thu, 09 Jul 2015 09:49:27 GMT
I have used and deployed AzureDirectory in a couple of production systems.  The current version
out there does not scale properly for very large indexes.  You will need to tweak the code
a bit. For Azure, there are some additional tweaks I recommend:
- using the local temp drive for indexes; larger VMs can use the temp SSDs or attached SSDs
- changing the way indexes are synchronized so that blob storage is used faster
- additional fail-safe mechanisms
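
A minimal sketch of the first two points, for illustration only. It is not AzureDirectory's code: it uses plain directories, with `remote_dir` standing in for a blob container, and the function name `sync_index` and the name-and-size comparison are assumptions made for this example. It is written in Python to stay self-contained; the same steps would apply in C#.

```python
import os
import shutil


def sync_index(remote_dir, local_dir):
    """Mirror remote_dir into local_dir, copying only new or changed files.

    remote_dir is a plain directory standing in for a blob container
    (an assumption for this sketch; no Azure SDK is used).
    Returns the sorted list of file names that were copied.
    """
    os.makedirs(local_dir, exist_ok=True)

    # Name -> size for every regular file in the shared copy.
    remote_files = {
        name: os.path.getsize(os.path.join(remote_dir, name))
        for name in os.listdir(remote_dir)
        if os.path.isfile(os.path.join(remote_dir, name))
    }

    copied = []
    for name, size in remote_files.items():
        target = os.path.join(local_dir, name)
        # Assumption: a local file with the same name and size is unchanged.
        if not os.path.isfile(target) or os.path.getsize(target) != size:
            # Copy under a temporary name, then rename, so a searcher
            # never opens a half-written file.
            partial = target + ".partial"
            shutil.copyfile(os.path.join(remote_dir, name), partial)
            os.replace(partial, target)
            copied.append(name)

    # Remove local files that the shared copy no longer contains.
    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        if name not in remote_files and os.path.isfile(path):
            os.remove(path)

    return sorted(copied)
```

A searcher would open the index from `local_dir` (for example on the VM's temp drive) and run the sync again whenever the indexer publishes a new commit, so only files that are new since the last sync cross the network.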

I have gotten individual Lucene 3.0.3 indexes as large as 30 GB (over 100 million entities)
to work on Azure and they perform well.

Thanks,
Bart



> On Jul 9, 2015, at 05:39, Simon Svensson <sisve@devhost.se> wrote:
> 
> Hi,
> 
> I believe your mail client is messing up the quotations/indentations.
> 
> Anyhow, the 4.0 branch does refer to Lucene 4.0, but it's an incomplete port. Current
> work is focused on 4.8 and is present in the master branch. This is expected to
> be the next release. I'm guessing that this is the branch where you found the
> "using Lucene.Net.Codecs.Lucene40;" line.
> 
> I've not used Azure myself, and have only limited knowledge of the usage of blobs. It
> sounds like it comes down to locally stored indexes vs. indexes shared over the network.
> I just made up a few points below, off the top of my head, while dodging my real work
> tasks... ;)
> 
> Locally stored indexes:
> + Easy setup.
> + Low response times.
> + One corrupted index will only bring down one searcher.
> - One index per worker (duplicated indexes == wasted disk space)
> - Every worker needs to build the index and keep it up-to-date.
> - Every worker has two roles; both searcher and indexer.
> - Slower scale-out; a new worker needs to rebuild the index.
> 
> Network-based indexes:
> + One index shared between all workers.
> + Every worker has a dedicated role; either searcher or indexer (you can assign
>   resources to match).
> + One dedicated worker takes care of building and keeping the index up-to-date.
> + Faster scale-out; a new worker just grabs the data from the network.
> - Higher response times (due to network traffic). This is often mitigated by locally
>   caching the segments.
> - Single-point-of-failure. A corrupted index will bring down all searchers.
> 
> I would go with the Azure blobs. While it may mean extra maintenance and documentation
> as an introductory cost, you may sleep soundly at night knowing that once your service is
> hit by Slashdot/Reddit you can press a button and scale out in a very short time. (That's
> the theory at least, if you configured your web workers correctly...)
> 
> // Simon
> 
>> On 09/07/15 11:13, Robert Oschler wrote:
>> Hello Simon,
>> 
>> Ok.  I got excited when I saw the following using statement in the 3.0.3
>> build:
>> 
>> using Lucene.Net.Codecs.Lucene40;
>> 
>> But from what you are saying I take it the 4.0 label does not refer to
>> Lucene 4.0.
>> 
>>>> You could probably take the FuzzyQuery class from Lucene 4.0 and port it
>> to the Lucene 3.0.3 code base. That's the only way you can get those
>> improvements while still using a stable version of Lucene.net.
>> 
>> Thanks.  I'll try grabbing the FuzzyQuery class and converting the Java
>> code to C#.  Hopefully there aren't too many dependencies in that unit that
>> I'll have to drag in.
>> 
>>>> Regarding the Azure library... are you using Azure? If so, are you using
>> several worker machines that need to share an index? You've not mentioned
>> anything about your current setup to help you evaluate if switching to
>> AzureDirectory will be an improvement or not.
>> 
>> Yes, I am using Azure, but I am just getting started so I have not set up
>> Lucene yet.  I'm trying to decide on my setup right now.  Given my
>> expected usage profile (the index will probably receive a few hundred new
>> updates over the course of the day, and over time there could well be a
>> hundred thousand sentences or so), do you have any suggestions?  I want
>> the lowest search latency I can give my users.  Note that I am more
>> concerned with the performance of index lookups; updates and
>> modifications can take a few seconds if needed since they will be
>> much less frequent.
>> 
>> Thanks,
>> Robert
>> 
>> 
>> 
>>> On Thu, Jul 9, 2015 at 5:01 AM, Simon Svensson <sisve@devhost.se> wrote:
>>> 
>>> Hi,
>>> 
>>> The latest stable version of Lucene.net is v3.0.3, and does not contain
>>> any code changes from the higher-versioned Java code. There's currently
>>> a 4.8 port in progress, but it's not stable enough yet.
>>> 
>>> You could probably take the FuzzyQuery class from Lucene 4.0 and port it
>>> to the Lucene 3.0.3 code base. That's the only way you can get those
>>> improvements while still using a stable version of Lucene.net.
>>> 
>>> Regarding the Azure library... are you using Azure? If so, are you using
>>> several worker machines that need to share an index? You've not mentioned
>>> anything about your current setup to help you evaluate if switching to
>>> AzureDirectory will be an improvement or not.
>>> 
>>> // Simon
>>> 
>>> 
>>> 
>>>> On 09/07/15 10:51, Robert Oschler wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> Does the latest version of lucene.net contain the Lucene 4.0 speed
>>>> improvements for fuzzy queries?  If not, is there any way to get those
>>>> improvements?  I saw this experimental Lucene.net 4.0 branch on
>>>> Apache.org,
>>>> but it seems to be inactive now and I don't know how stable it is:
>>>> 
>>>> https://svn.apache.org/repos/asf/lucene.net/branches/Lucene.Net_4e/
>>>> 
>>>> If so, what is the exact version of lucene.net I should be using to have
>>>> those improvements?
>>>> 
>>>> Also, I saw this post on using Azure blobs to speed up server side
>>>> processing:
>>>> 
>>>> https://code.msdn.microsoft.com/windowsazure/Azure-Library-for-83562538
>>>> 
>>>> Are the improvements using this technique substantial enough to warrant
>>>> using it?  My "documents" are average sized sentences.  The index will
>>>> probably receive a few hundred new updates over the course of the day.
>>>> Over time, there could well be a hundred thousand sentences or so.
> 
