lucene-java-user mailing list archives

From Anton Zenkov <azen...@brandwatch.com>
Subject Re: Lucene Index Cloud Replication
Date Thu, 11 Jul 2019 02:11:19 GMT
Another +1. We are also big S3 + Lucene users, and it is very interesting to
see what other people have come up with. We have an S3 Lucene directory that
allows immediate read-only use of Lucene indexes stored on S3, with
simultaneous local caching, and a prototype of segment-based index replication
built on a custom deletion policy. As Michael McCandless put it very well,
neither Solr nor Elasticsearch supports segment-based index distribution, and
for large-scale indexing this is a very nice way of distributing Lucene
indexes.
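
To make "segment-based" concrete: because Lucene index files are written once
and never modified, a replica only has to copy the files of a commit that it
does not already hold. Below is a minimal sketch of that comparison. The class
name SegmentDiff, the method filesToFetch, and the use of a per-file checksum
as the identity check are assumptions for illustration, not code from our
prototype or from Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: decides which files of a remote commit still need copying. */
final class SegmentDiff {

    /**
     * Both maps go from file name to a checksum of that file (an assumed
     * identity check). Returns the remote files that are missing locally or
     * whose checksum differs, in sorted order.
     */
    static List<String> filesToFetch(Map<String, Long> remote, Map<String, Long> local) {
        List<String> toFetch = new ArrayList<>();
        // A TreeMap copy gives a deterministic, sorted iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(remote).entrySet()) {
            Long have = local.get(e.getKey());
            if (have == null || !have.equals(e.getValue())) {
                toFetch.add(e.getKey());
            }
        }
        return toFetch;
    }

    public static void main(String[] args) {
        // Made-up file names and checksums, shaped like Lucene's naming scheme.
        Map<String, Long> local = new HashMap<>();
        local.put("_0.cfs", 0xCAFEL);
        local.put("segments_1", 0x1L);

        Map<String, Long> remote = new HashMap<>();
        remote.put("_0.cfs", 0xCAFEL);   // already present locally, skipped
        remote.put("_1.cfs", 0xBEEFL);   // new segment
        remote.put("segments_2", 0x2L);  // new commit point

        System.out.println(filesToFetch(remote, local));
    }
}
```

Only the new segment file and the new commit point are selected; the unchanged
segment is never transferred again, which is where the savings over copying
whole indexes come from.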

On Tue, Jul 9, 2019 at 8:51 AM Michael McCandless <lucene@mikemccandless.com>
wrote:

> +1 to share code for doing 1) and 3) both of which are tricky!
>
> Safely moving / copying bytes around is a notoriously difficult problem ...
> but Lucene's "end to end checksums" and per-segment-file-GUID make this
> safer.
>
> I think Lucene's replicator module is a good place for this?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Jul 3, 2019 at 4:15 PM Michael Froh <msfroh@gmail.com> wrote:
>
> > Hi there,
> >
> > I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
> > storing and retrieving Lucene indexes in S3, and realized that "uploading a
> > Lucene directory to the cloud and downloading it on other machines" is a
> > pretty common problem and one that's surprisingly easy to do poorly. In my
> > current job, I'm on my third team that needed to do this.
> >
> > In my experience, there are three main pieces that need to be implemented:
> >
> > 1. Uploading/downloading individual files (i.e. the blob store), which can
> > be eventually consistent if you write once.
> > 2. Describing the metadata for a specific commit point (basically what the
> > Replicator module does with the "Revision" class). In particular, we want a
> > downloader to reliably be able to know if they already have specific files
> > (and don't need to download them again).
> > 3. Sharing metadata with some degree of consistency, so that multiple
> > writers don't clobber each other's metadata, and so readers can discover
> > the metadata for the latest commit/revision and trust that they'll
> > (eventually) be able to download the relevant files.
> >
> > I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
> > I'd like to do it with interfaces that lend themselves to other
> > implementations for blob and metadata storage.
> >
> > Is it worth opening a Jira issue for this? Is this something that would
> > benefit the Lucene community?
> >
> > Thanks,
> > Michael Froh
> >
>
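
For anyone skimming the thread, the three pieces in the quoted message could be
expressed as a pair of small interfaces along these lines. This is only a
sketch under assumptions: the names BlobStore, MetadataStore, Revision,
putIfAbsent and publish are hypothetical, the version-number scheme is invented
for illustration, and the in-memory classes merely stand in for S3 and
DynamoDB. It is not the code Michael offered to share and not the Replicator
module's API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Piece 1 (hypothetical): write-once blob storage. */
interface BlobStore {
    /** Stores data under key unless the key already exists; true if written. */
    boolean putIfAbsent(String key, byte[] data);

    Optional<byte[]> get(String key);
}

/** Pieces 2 and 3 (hypothetical): a consistent pointer to the latest commit. */
interface MetadataStore {
    /** Describes one commit: a version number plus file name -> blob key. */
    final class Revision {
        public final long version;
        public final Map<String, String> files;

        public Revision(long version, Map<String, String> files) {
            this.version = version;
            this.files = Collections.unmodifiableMap(new HashMap<>(files));
        }
    }

    Optional<Revision> latest(String index);

    /**
     * Compare-and-set: publishes next only if the stored version still equals
     * expectedVersion (0 means "nothing published yet"), so a stale writer
     * cannot clobber a newer revision.
     */
    boolean publish(String index, long expectedVersion, Revision next);
}

/** In-memory stand-in for a blob store, for tests and illustration only. */
final class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public boolean putIfAbsent(String key, byte[] data) {
        return blobs.putIfAbsent(key, data.clone()) == null;
    }

    @Override
    public Optional<byte[]> get(String key) {
        byte[] data = blobs.get(key);
        return data == null ? Optional.empty() : Optional.of(data.clone());
    }
}

/** In-memory stand-in for a store that supports conditional writes. */
final class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, Revision> latest = new HashMap<>();

    @Override
    public synchronized Optional<Revision> latest(String index) {
        return Optional.ofNullable(latest.get(index));
    }

    @Override
    public synchronized boolean publish(String index, long expectedVersion, Revision next) {
        Revision current = latest.get(index);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion || next.version != expectedVersion + 1) {
            return false; // someone else published first, or the version skips ahead
        }
        latest.put(index, next);
        return true;
    }
}
```

A writer would upload every file with putIfAbsent first and call publish last,
so that a reader who sees a revision can expect its files to exist (or to
become visible eventually, if the blob store is eventually consistent).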


-- 

Anton Zenkov    |    Director Of Engineering

azenkov@brandwatch.com


