lucene-solr-user mailing list archives

From Alan Burlison <Alan.Burli...@sun.com>
Subject Re: Handling disparate data sources in Solr
Date Mon, 08 Jan 2007 22:46:46 GMT
Walter Underwood wrote:

> Cracking documents and spidering URLs are both big, big problems.
> PDF is a horrid mess, as are old versions of MS Office. Proxies,
> logins, cookies, all sorts of issues show up with fetching URLs,
> along with a fun variety of misbehaving servers.
> 
> I remember crashing one server with 25 GET requests before we
> implemented session cookies in our spider. That used all the
> DB connections and killed the server.
> 
> If you need to do a lot of spidering and parse lots of kinds of
> documents, I don't know of an open source solution for that.
> Products like Ultraseek and the Googlebox are about your only
> choice.

I'm not suggesting that Solr be extended to become a spider; I'm just
suggesting we provide a mechanism for direct access to source documents
if they are accessible.  For example, if the document being indexed were
on the same machine as Solr, the href would usually start with
"file://", not "http://".

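As a rough sketch of the kind of mechanism I mean (Python purely for
illustration; the function name read_source_document is hypothetical and
is not part of Solr), the indexer could pick an access method from the
scheme of the href:

```python
import urllib.request
from urllib.parse import urlparse
from urllib.request import url2pathname


def read_source_document(href: str) -> bytes:
    """Return the raw bytes of the document that href points to.

    Hypothetical helper for illustration only; not a Solr API.
    """
    parts = urlparse(href)
    if parts.scheme == "file":
        # Document lives on the same machine: read it straight from disk.
        with open(url2pathname(parts.path), "rb") as f:
            return f.read()
    if parts.scheme in ("http", "https"):
        # Remote document: a single plain GET, no spidering or link following.
        with urllib.request.urlopen(href, timeout=30) as resp:
            return resp.read()
    # Anything else is treated as not directly accessible.
    raise ValueError(f"unsupported href scheme: {parts.scheme!r}")
```

This only fetches the one document named by the href. It assumes the
file is readable by the indexing process and does nothing about logins,
proxies or cookies.
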
BTW, this discussion is also occurring on solr-dev; it might be better
to move all of it over there ;-)

-- 
Alan Burlison