lucene-dev mailing list archives

From Kiko Aumond <k...@alum.mit.edu>
Subject Re: Stand-alone Index updating using EmbeddedSolrServer
Date Fri, 22 Apr 2011 00:15:27 GMT
Yes, this is a CSV loader.  This looks like one of those cases where there
are many ways to handle 90% of the requirements but none that solves 100% of
the problem, which is why the CSV loader almost solves the problem, but not
quite.

We're not using Solr as a web app, just the embedded server, which is why
we can't use curl and hence CSVLoader.  So this is a purely command-line
driven application that runs against an embedded Solr server, with no web
containers, for performance reasons.
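
For illustration only, the file-selection step of such a utility (a path
plus a regular expression naming the files to load) could be sketched as
below.  This is not the actual loader code: the class name CsvFileSelector
and the method select are hypothetical, the sketch assumes Java 7 or later
for java.nio.file, and it makes no Solr calls at all.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: picks the files that a bulk loader would hand to Solr.
public class CsvFileSelector {

    /**
     * Returns the regular files directly under dir whose file name matches
     * the whole regex, sorted by path so the load order is repeatable.
     */
    public static List<Path> select(Path dir, String regex) throws IOException {
        Pattern pattern = Pattern.compile(regex);
        List<Path> matches = new ArrayList<>();
        // DirectoryStream iterates lazily and is closed by try-with-resources.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && pattern.matcher(name).matches()) {
                    matches.add(entry);
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }
}
```

Each returned path would then be wrapped in a stream and passed to the
embedded server one file at a time, so no file has to be read into memory
as a whole String.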

On Thu, Apr 21, 2011 at 4:47 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:

> On Thu, Apr 21, 2011 at 7:27 PM, Kiko Aumond <kiko@alum.mit.edu> wrote:
> > Yes, I've seen that page, but I went a bit beyond the material there, as
> > the code I wrote is able to set parameters such as separators,
> > encapsulators and the index columns, whether to split parameters,
> > auto-commit, as well as the ability to do incremental or full index
> > reloads.
>
> Is this a CSV loader?
> If so, did you know the CSV loader (and other data loaders) have the
> option to bypass HTTP also and stream directly from a local file (or
> other URL)?
>
> > Also, from what I've seen in DirectSolrConnection (version 1.4.1), you
> > have to supply the document body as a String.  We want to avoid having to
> > load the entire document into memory, which is why we load the files into
> > ContentStream objects and pass them to the embedded Solr server (I am
> > assuming ContentStream actually streams the file as its name suggests
> > instead of trying to load it into memory).  The utility I wrote gets a
> > path, a regular expression for all the files to be loaded, as well as the
> > parameters mentioned above, and it does either a full or incremental
> > upload of multiple files with a single command.
> >
> > We run a very high load application with Solr in the back end that
> > requires that we use the embedded Solr server to eliminate the network
> > round-trip.  Even a small incremental gain in performance is important
> > for us.
>
> Eliminating the network round-trip is certainly important for good
> bulk indexing performance.  Luckily you don't have to embed to do that.
> You can use multiple threads (say 16 for a 4 core server) that
> essentially cover up any round-trip latency (use persistent connections
> though!  or use SolrJ which does by default), or you can use the
> StreamingUpdateSolrServer that eliminates round-trip network delays by
> streaming documents over multiple already open connections.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
