lucene-solr-user mailing list archives

From Bjørn Axelsen <bjorn.axel...@fagkommunikation.dk>
Subject Indexing data from multiple sites
Date Mon, 23 Mar 2015 14:13:28 GMT
Hello Solr users!

I need suggestions on the best and most bullet-proof way to index data from
multiple websites.

- different websites,
- running on different CMS systems (Drupal, Plone, SharePoint, WordPress,
etc.),
- different site owners (somebody else is in control of each of the sites).

Currently we have a setup where:

1) some of the websites push new and updated content directly to Solr and,
2) other websites are crawled by Nutch and the content is pushed from Nutch
to Solr.
(50 websites / 250,000 pages)
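
For illustration, a direct push as in 1) is roughly an HTTP POST of JSON
documents to Solr's update handler. A minimal Python sketch follows; the
host, the core name "sites", and the field names are assumptions made up
for the example, not the actual setup:

```python
import json
import urllib.request

# Hypothetical values: the Solr host, the core name "sites", and the
# field names used below are assumptions for illustration only.
SOLR_UPDATE_URL = "http://localhost:8983/solr/sites/update?commit=true"


def build_push_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST that adds or replaces docs in Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push(docs, url=SOLR_UPDATE_URL):
    """Send the documents; raises urllib.error.URLError if Solr is unreachable."""
    with urllib.request.urlopen(build_push_request(docs, url), timeout=10) as resp:
        return resp.status


# One document per page; prefixing the id with the site name keeps ids
# from different websites from colliding in a shared index.
example_docs = [
    {"id": "site-a:/news/42", "site": "site-a", "title": "Hello", "content": "Body text"},
]
request = build_push_request(example_docs)
```

Calling push(example_docs) would send the request. Nothing in this sketch
retries or records failed pushes, so a site that cannot reach Solr simply
drops the update.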

This works out pretty well, but only because the first group of sites is
under my own control. So if something goes wrong or I need to upgrade Solr,
I have easy access to log in to these sites, make technical changes,
and reindex the full content.

What if I want to give other sites the same option of pushing
content directly to my Solr index?

This would be nice because:
- some of the websites contain restricted content that is somewhat tricky
to expose to the crawler,
- many CMSes have existing Solr modules that can push content to Solr
out-of-the-box,
- content can be pushed instantly to the Solr index.

But what if something goes wrong in this process and I do not have access
to log in, make changes to the CMS, start a reindex, etc.? In that case the
content from a site will be missing until a CMS coder has time to help me
out.

Can anybody give advice on how to handle this in an easy way? Should I
stick to the model of having a crawler between the websites and Solr? Or
some other kind of proxy service?

Kind regards,
Bjørn Axelsen


 Fagkommunikation   Webbureau som formidler viden
Schillerhuset  ·  Nannasgade 28  ·  2200 København N  ·  +45 60660669  ·
info@fagkommunikation.dk  ·  fagkommunikation.dk
