lucene-solr-user mailing list archives

From P Williams <williams.tricia.l...@gmail.com>
Subject Re: Using data-config.xml from DIH in SolrJ
Date Thu, 14 Nov 2013 18:21:10 GMT
Hi,

I just discovered UpdateProcessorFactory
<http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/package-summary.html>
in a big way.  How did this completely slip by me?

I'm working on two ideas:
1. I have used the DIH in a local EmbeddedSolrServer previously.  I could
write a ForwardingUpdateProcessorFactory to take that local update and send
it to an HttpSolrServer.
2. I have code which walks the file system to compose rough documents but
haven't yet written the part that handles the templated fields and
cross-walking of the source(s) to the schema.  I could configure the update
handler on the Solr server side to do this with the RegexReplace
<http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html>
and DefaultValue
<http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html>
UpdateProcessorFactories.
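
To make idea 2 concrete, here is a minimal plain-Java sketch of the two transformations I have in mind: a regex replace on one field and a default value on another. It is only a client-side stand-in, not the Solr processors themselves. The class name, the field names ("id", "source"), and the sample path are all made up for illustration, and a plain Map stands in for a SolrInputDocument.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the server-side processors: a plain Map plays
// the role of a document, and the field names used here are assumptions.
class FieldCrosswalkSketch {

    // Replace every match of regex in the named field, if the field holds a String.
    static void regexReplace(Map<String, Object> doc, String field,
                             String regex, String replacement) {
        Object value = doc.get(field);
        if (value instanceof String) {
            String replaced = Pattern.compile(regex)
                                     .matcher((String) value)
                                     .replaceAll(replacement);
            doc.put(field, replaced);
        }
    }

    // Set the field only when the document has no value for it yet.
    static void defaultValue(Map<String, Object> doc, String field, Object value) {
        if (!doc.containsKey(field)) {
            doc.put(field, value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "/data/records/rec-0001.xml");   // made-up sample path

        // Strip the leading directories and the .xml suffix from the id.
        regexReplace(doc, "id", "^.*/|\\.xml$", "");
        // Fill in a source field because the rough document does not have one.
        defaultValue(doc, "source", "filesystem");

        System.out.println(doc);   // prints {id=rec-0001, source=filesystem}
    }
}
```

If the server-side chain works the way I expect, the same pattern, replacement, and default value would move into the processor configuration, and the client would only send the rough documents.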

Any thoughts on the advantages/disadvantages of these approaches?

Thanks,
Tricia



On Thu, Nov 14, 2013 at 7:49 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> There's nothing that I know of that takes a DIH configuration and
> uses it through SolrJ. You can use Tika directly in SolrJ if you
> need to parse structured documents though, see:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Yep, you're going to be kind of reinventing the wheel a bit I'm
> afraid.
>
> Best,
> Erick
>
>
> On Wed, Nov 13, 2013 at 1:55 PM, P Williams
> <williams.tricia.list@gmail.com> wrote:
>
> > Hi All,
> >
> > I'm building a utility (Java jar) to create SolrInputDocuments and send
> > them to an HttpSolrServer using the SolrJ API.  The intention is to find an
> > efficient way to create documents from a large directory of files (where
> > multiple files make one Solr document) and send them to a remote Solr
> > instance for update and commit.
> >
> > I've already solved the problem using the DataImportHandler (DIH) so I have
> > a data-config.xml that describes the templated fields and cross-walking of
> > the source(s) to the schema.  The original data won't always be able to be
> > co-located with the Solr server which is why I'm looking for another
> > option.
> >
> > I've also already solved the problem using ant and xslt to create a
> > temporary (and unfortunately a potentially large) document which the
> > UpdateHandler will accept.  I couldn't think of a solution that took
> > advantage of the XSLT support in the UpdateHandler because each document is
> > created from multiple files.  Our current, dated Java-based solution
> > significantly outperforms the ant and xslt solution in terms of disk and
> > time.  I've rejected the ant and xslt solution based on that and gone back
> > to the drawing board.
> >
> > Does anyone have any suggestions on how I might be able to reuse my DIH
> > configuration in the SolrJ context without re-inventing the wheel (or DIH
> > in this case)?  If I'm doing something ridiculous I hope you'll point that
> > out too.
> >
> > Thanks,
> > Tricia
> >
>
