httpd-docs mailing list archives

From chris <>
Subject Re: Using Solr to index and search the Apache HTTPD Documents
Date Tue, 09 Oct 2007 04:31:54 GMT
Thank you for the responses fellas.

Solr is very fast, extremely flexible, can be deployed in a highly 
available manner, does not have complex requirements, is easy to 
configure and maintain, and you guys own it.  I am surprised that you 
are not already using it for all of the ASF projects' documentation, 
and I doubt Google's CSE would ever do the job as well as Solr could 
with a little elbow grease behind it.  You can make it whatever you 
want; it is, after all, open source, unlike the Google option.  Why tie 
yourself to a vendor when you can keep complete control yourself, do a 
better job with very few resources, and never have to worry about the 
issues that come along with building a dependency on something outside 
of your control?

You have requirements outlined below for "critical services"; some of 
them would obviously not apply if the function were outsourced, but 
some would, or at least should.  Does Google provide an SLA for their 
free CSE?  Yes, I know how silly that sounds, but work with me here.  
It's tough to compete with Google; I'm throwing out everything I have. ;)

I'll take a stab at approximate answers to the questions below.  If it 
ever comes to the point where you need the information with greater 
precision, I will be glad to help with that.

Justin Erenkrantz wrote:
> On Oct 8, 2007 11:51 AM, Vincent Bray <> wrote:
>> I'm very much in favour of seeing how far we can take Solr as the
>> search mechanism for the httpd docs.
> What are the production requirements for Solr?  IOW, what do we need
> to run on to make this happen?  How much disk space?
> How much RAM?  We do not currently run Java on our main web servers,
> so running and maintaining it would have to be sorted out.  I don't
> know if the Solr guys are even interested in helping us maintain a
> local search engine.  (Previously, the Perl guys tried and gave up.)
Solr does not have to run on the web server hosting the search page.  It 
can live on any server reachable by your web server that meets the 
requirements to run it.  The import/transform scripts also do not have to 
run on the same server as Solr, since they submit documents to be 
indexed via a web request to Solr.
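
For illustration, here is roughly what that web request looks like.  The 
URL, port, and field names are assumptions (the real schema would define 
the fields), and the commands are echoed rather than executed so the 
sketch runs without a live Solr instance:

```shell
# Hypothetical Solr update URL; Jetty's example config listens on 8983.
SOLR_URL="http://localhost:8983/solr/update"

# A minimal <add> payload for one transformed document (illustrative fields).
DOC='<add><doc><field name="id">mod/mod_rewrite.html</field><field name="title">Apache Module mod_rewrite</field></doc></add>'

# Echoed dry run of the two requests a real import script would make:
echo curl -s -H "Content-Type: text/xml" --data-binary "$DOC" "$SOLR_URL"
echo curl -s --data-binary "<commit/>" "$SOLR_URL"
```

The `<commit/>` at the end is what makes newly added documents visible 
to searches.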

Solr needs Java 1.5 and an application server that supports the Servlet 
2.4 specification.  I used Jetty for my demo.

The import/transform script needs Perl with XML::XPath and 
XML::XPath::XMLParser, an XSLT tool such as Xalan or xsltproc, curl or 
the curl Perl modules, and Subversion to check out a copy of the httpd 
docs and the build stuff.  There needs to be enough space to check out 
the docs and build files, plus temporary space for the transforms: ~80 MB.

The current Solr index with only the English version of the httpd 
documents is 1.7 MB.  Extrapolate that across the supported languages 
(5 or 6?) and we will call it 15 MB.

The Solr application itself comes in under 12 MB, including the source 
and Jetty.  I am not sure what Tomcat or other options would require, 
but I will find out.

Sum that up, and disk-space-wise we have around 30 MB for a full-language 
Solr install and an additional 80 MB, either local or decoupled, for the 
documents, build files, and temporary space for the transforms.  110 MB 
is not so bad.  We'll say 200 MB just to be safe and allow for some growth.

Currently, running in Jetty with a nice, full query cache but otherwise 
idle, it looks like this:
 8947 arreyder  25   0  830m 102m  18m S    0  5.1   0:07.11 java

1 gig of ram should be comfortable, but the more the better for the sake 
of query caching.

I have not loaded it up yet to see how it behaves under concurrent 
connections, but I am working on a test script to do just that. 
The test script will also be a great tool for preloading the cache.  I 
will do this and report the results if anyone is still interested.

> The ASF infrastructure team has a checklist of things that must be
> satisfied before adding any new 'critical services' (which this falls
> under).  See below for the current list.
> So, I sort of think that just filling out a special account for a
> 'custom search engine' would be a *lot* less work.  =)  -- justin

You may be confusing work with fun, and most of the fun has already been 
had by me getting it this far.  Perhaps this "work" you speak of is a 
hint that you would be the first to volunteer to help get my Solr 
implementation formally going? ;)   I doubt you got to where you are by 
passing on good things because they required a little work, and aren't 
you guys under some sort of eat-your-own-dog-food directive?  If you do 
not use it, who will?

> ---
> This provides a list of requirements and doctrines for web applications
> that wish to be deployed on the Apache infrastructure.  It is intended to
> help address many of the recurring issues we see with deployment and
> maintenance of applications.
> Definition of 'system': Any web application or site which will receive
> traffic from public users in any manner.
> Definition of 'critical systems': Any web application or site which runs
> under, or is expected to receive a significant portion of
> traffic.
> 1) All systems must be generally secure and robust. In cases of failure,
> they should not damage the entire machine.

Since Solr is a service that is typically only called by another service, 
it enjoys the security advantage of being at least once removed from 
the end user and never directly accessed by them.  You could certainly 
add rate limiting and other methods to keep load from ever reaching 
the point where it could impact other co-located services.  No real 
security or load-management challenges here.

> 2) All systems must provide reliable backups, at least once a day, with
> preference to incremental, real time or <1 hour snapshots.

Solr provides an easy method for off-site replication via snapshots, 
which could be used for backups.  It should also be mentioned that on 
my low-end Core 2 Duo with 2 GB of RAM, it only takes around 70 seconds 
to transform and index the complete English httpd documents from scratch.  
As long as you have the documents available for checkout and the 
scripts to do it, you are never far from a freshly created index.
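
A from-scratch rebuild can be sketched as follows.  The repository path, 
the `to-solr.xsl` stylesheet name, and the directories are assumptions, 
and each step is echoed rather than executed so the sketch is safe to run:

```shell
WORKDIR=/tmp/httpd-docs-index     # scratch space for checkout + transforms
SOLR_URL="http://localhost:8983/solr/update"
DOCS_REPO="http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/manual"  # assumed path

# The transform-and-post loop; to-solr.xsl is a hypothetical stylesheet name.
CMD="for f in $WORKDIR/manual/*.xml; do xsltproc to-solr.xsl \$f | curl -s $SOLR_URL --data-binary @-; done"

# Echoed dry run of a full re-index:
echo svn checkout "$DOCS_REPO" "$WORKDIR/manual"
echo "$CMD"
echo curl -s "$SOLR_URL" --data-binary "<commit/>"
```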

> 3) All systems must be maintainable by multiple active members of the
> infrastructure team.

I am not a member, but I am still happy to help.  Anyone else like to 
give me a hand? :)

> 4) All systems must come with a 'runbook' describing what to do in event
> of failures, reboots, etc.  (If someone who has root needs to reboot the
> box, what do they need to pay attention to?)

Again, no real challenge here; I'd be happy to throw this together.

> 5) All systems must provide at least minimal monitoring via Nagios.

I'll write a plugin to do this, or we can just use the check_http plugin 
already there.  It depends on how deep you want the service check to go.
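
A minimal check with the stock plugin might look like this.  The host 
and port are assumptions about the eventual deployment, and the command 
is echoed as a dry run:

```shell
# Hypothetical host and port; /solr/admin/ping is Solr's built-in health
# handler, so this catches a hung webapp, not just a dead port.
PING_CHECK='check_http -H solr-host.apache.org -p 8983 -u /solr/admin/ping'

# Echoed dry run; the real line would go behind a Nagios command definition.
echo "$PING_CHECK"
```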

> 6) All systems must be restorable and relocatable to other machines
> without significant pain.

Replicating this configuration and packaging it is trivial.  As I 
said before, even if we have to re-index the docs from scratch, it only 
takes about a minute.  I'll build a package and a deployment script.

> 7) All systems must have some kind of critical mass.  In general we do
> not want to host one offs of any system.

"If you build it they will come."  Did I mention I am from Iowa?   We 
have this baseball diamond in a cornfield that you really should come see.

> 8) All system configuration files must be checked into Subversion.

Delighted to check in all 5 configuration files/scripts.

> 9) All system source must either be checked into Subversion, be at a
> well-known public location, or is provided by the base OS.  (Hosting
> binary-only webapps is a non-starter.)

Since Solr is an Apache project, I am guessing you already have this 
part under control.

> 10) All systems, prior to consideration of deployment, must provide a
> detailed performance impact analysis (bandwidth and CPU).  How are
> techniques like HTTP caching used?  Lack of HTTP caching was MoinMoin's
> initial PITA.

It does cache queries, and with mod_deflate out front, bandwidth should 
be minimal; it's just text.  I still need to get the details on CPU 
load and see how well it scales on a single machine.  I'm working on it.
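
As a sketch of the front end, assuming httpd proxies search traffic to 
an internal Solr instance (the hostname, port, and paths here are all 
hypothetical):

```apache
# mod_proxy: forward search requests to the internal Solr instance.
ProxyPass        /search http://solr-host.internal:8983/solr/select
ProxyPassReverse /search http://solr-host.internal:8983/solr/select
# mod_deflate: compress the text responses on the way out.
AddOutputFilterByType DEFLATE text/xml text/html text/plain
```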

> 11) All systems must have clearly articulated, defined, and recorded
> dependencies.

This is a very short list that I have for the most part already 
covered.  Perl with XML::Xpath and XML::Xpath::XMLParser, an XSLT tool 
such as Xalan or Xsltproc,  curl, or the curl perl modules, and 
subversion to check out a copy of http-docs and the build stuff.  For 
Solr itself, Java 1.5 and Application Server that Supports 2.4 servelet 

> 12) All critical systems must be replicated across multiple machines,
> with preference to cross-atlantic replication.

Not a problem.  Solr has a multi-server replication method using 
snapshots, rsync, and hard links.
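
The technique can be sketched in two commands.  The paths and the 
replica hostname are hypothetical, and the commands are echoed as a dry 
run:

```shell
INDEX=/opt/solr/data/index              # assumed index location
SNAP=/opt/solr/data/snapshot.20071009   # timestamped snapshot directory

# Hard-link the index files into a snapshot: near-instant, no extra disk.
echo cp -lr "$INDEX" "$SNAP"
# Ship the snapshot to a replica or backup host with rsync.
echo rsync -a "$SNAP/" solr-replica.apache.org:"$SNAP/"
```

Because hard links share the underlying files, the snapshot costs almost 
nothing locally, and rsync only transfers index segments the replica 
does not already have.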

> 13) All systems must have single command operations to start, restart
> and stop the system.  Support for init scripts used by the base
> operating system is preferred.

You mean I need to do more than "nohup java -jar solr.jar &"?  Sheesh!  
Seriously, since you are probably not planning to run this in Jetty, 
Tomcat (or whatever it lands on) probably already meets that 
requirement.  If not, I'm on it.
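
If we do end up needing one, a minimal init-style wrapper might look 
like the sketch below, assuming the Jetty "java -jar start.jar" launcher 
from the Solr example distribution; the paths are hypothetical, and a 
Tomcat deployment would wrap catalina.sh instead:

```shell
#!/bin/sh
SOLR_HOME=/opt/solr        # assumed install location
PIDFILE=/var/run/solr.pid

solr_ctl() {
    case "$1" in
        start)
            # Launch the servlet container in the background and record its PID.
            cd "$SOLR_HOME" && nohup java -jar start.jar >/dev/null 2>&1 &
            echo $! > "$PIDFILE"
            ;;
        stop)
            [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
            ;;
        restart)
            solr_ctl stop
            sleep 2
            solr_ctl start
            ;;
        *)
            echo "Usage: solr_ctl {start|stop|restart}" >&2
            return 1
            ;;
    esac
}
```

An init script would just call `solr_ctl "$1"`.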

I left out a few requirements, as they are yet to be determined.  I am 
not sure what kind of web front end you guys might want for the query 
and the results, so I cannot speak to the requirements on that end.  
Updating documents in the Solr index is currently a manual process.  It 
could be adjusted to run at an interval via crontab, be initiated by 
the formal document builds, or be configured to do svn diffs and import 
when it sees a change in a document it has been told to index.
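
For the crontab route, a single entry would do; the script name and 
paths here are illustrative:

```crontab
# Refresh the httpd-docs index at the top of every hour.
0 * * * *  /opt/solr/bin/update-httpd-docs-index.sh >> /var/log/solr-import.log 2>&1
```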

Lastly, I'm doing this Solr apache documents thing with or without you, 
you may as well take advantage of it. :)

chris rhodes
