lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: How to "chain" import handlers: import from DB and from file system
Date Mon, 10 Jul 2017 20:06:51 GMT
I did this at Netflix with Solr 1.3: read stuff out of various databases and sent it all to
Solr. I’m not sure DIH even existed then.

At Chegg, we have a slightly more elaborate system because we have so many collections and data
sources. Each content owner writes an “extractor” that makes a JSONL feed with the documents
to index. We validate those, then have a common “loader” that reads the JSONL and sends
it to Solr with multiple connections. Solr-specific stuff is done in update request processors.
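
For illustration, a minimal SolrJ sketch of that loader pattern: read the JSONL feed, build Solr documents, and push them to Solr over multiple connections with ConcurrentUpdateSolrClient. The collection URL, file name, queue size, and thread count below are placeholders, not our production values.

// Minimal JSONL -> Solr loader sketch (placeholder names, not the actual loader).
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Stream;

public class JsonlLoader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Queue size and thread count control how many parallel update
        // connections the client keeps open to Solr.
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/books")
                     .withQueueSize(100)
                     .withThreadCount(4)
                     .build();
             Stream<String> lines = Files.lines(Paths.get("feed.jsonl"))) {

            lines.forEach(line -> {
                try {
                    // One JSON object per line; copy every key/value into a Solr document.
                    Map<String, Object> fields =
                        mapper.readValue(line, new TypeReference<Map<String, Object>>() {});
                    SolrInputDocument doc = new SolrInputDocument();
                    fields.forEach(doc::addField);
                    solr.add(doc);
                } catch (Exception e) {
                    throw new RuntimeException("Bad JSONL line: " + line, e);
                }
            });
            solr.commit();
        }
    }
}

Feed validation happens before this step, so the loader only has to map fields and commit.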

Document parsing is always in a separate process. I’ve implemented it that way three times
with three different parser packages on two engines. Never on Solr, though.
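
As a rough sketch of that separate-process approach (not any of those three implementations), the indexing side can hand document bytes to an out-of-process parser such as Apache Tika Server over HTTP. The server URL and file name below are placeholders; the server is started separately, e.g. "java -jar tika-server.jar" (it listens on port 9998 by default).

// Send a file to an out-of-process Tika Server and get back extracted plain text.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class ExternalParseClient {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9998/tika"))
            .header("Accept", "text/plain")
            .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("report.pdf")))
            .build();
        // The parse runs in the Tika Server JVM, so a crash or hang on a bad
        // document does not take down the indexing process.
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // extracted text, ready to put in a Solr field
    }
}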

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
> 
>> 4. Write an external program that fetches the file, fetches the metadata, combines them, and sends them to Solr.
> 
> I've done this with some custom crawls. Thanks to Erick Erickson, this is a snap:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> With the caveat that Tika should really be in a separate VM in production [1].
> 
> [1] http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf


