lucene-ruby-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: [acts_as_solr] Few question on usage
Date Sun, 22 Apr 2007 01:42:19 GMT

On Apr 20, 2007, at 2:30 PM, solruser wrote:
> For pure Ruby access to Solr without a database, use solr-ruby.  The
> 0.01 gem is available as "gem install solr-ruby", but if you can I'd
> recommend you tinker with the trunk codebase too.
>
>>>>
> Well I say, considering use of solr with rails application. Whats  
> the ideal
> approach?.

"rails application" is a pretty broad category of applications at  
this point.  If we're talking about a database-backed application  
being searchable by Solr, I'd go for the RubyForge acts_as_solr  
first.  However, I suspect that it needs work in terms of  
facilitating access to facets, highlighting, and other types of  
custom query handlers.

If your application is backed by other datastores (in my case a  
bunch of MARC records in binary format, a flat delimited file, a  
ZIP file full of RDF/XML files, or, more interestingly, another  
Solr instance that we wanted to repurpose in another Solr-based  
application), then go with solr-ruby.

It's my intention to bridge this gap in the near future somehow; I  
just haven't formulated an exact plan.  acts_as_solr fits nicely and  
very easily on top of solr-ruby.  I envision acts_as_solr simply  
being part of solr-ruby, hooking in only if you have  
ActiveRecord installed; otherwise it'd be transparent, only taking up  
a few tens of lines of code in an un-required .rb file.
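
A minimal sketch of the "only hook in if ActiveRecord is installed" idea, in plain Ruby.  The module and method names (SolrSketch, hook_into_active_record, solr_enabled?) are hypothetical and invented for illustration; this is not how solr-ruby or acts_as_solr is actually wired up, just one way the conditional hook could look:

```ruby
# Hypothetical sketch; module and method names are illustrative only.
module SolrSketch
  module ActsAsSolr
    # Class-level macro a model would call to opt in to indexing.
    def acts_as_solr
      @solr_enabled = true
    end

    def solr_enabled?
      @solr_enabled == true
    end
  end

  # Extend ActiveRecord::Base only when ActiveRecord is already loaded.
  # Returns true if the hook was installed, false otherwise.
  def self.hook_into_active_record
    return false unless defined?(ActiveRecord::Base)
    ActiveRecord::Base.extend(ActsAsSolr)
    true
  end
end

# Without ActiveRecord loaded this is a no-op and nothing else is touched.
hooked = SolrSketch.hook_into_active_record
```

With this shape, an application that never loads ActiveRecord pays only for the module definition, and one that does gets the acts_as_solr macro on its models.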

The first step could be to patch the RubyForge acts_as_solr to use  
solr-ruby to kick start collaboration.  As for where my effort fits  
into a calendar, within the next few weeks I'll be delving into it  
deeply and can speak more definitively.


>>>>
> Since there are many flavors floating around which is most sought  
> after and
> supported. And I agree that definitive version will help ROR  
> community to
> accept solr with much larger level of confidence.
>  And since ROR application are addressing
> web2.0 the need for search and collaborate information is much  
> higher. So I
> personally believe addressing this will definately go long way.

That's the plan!   No question about it.  I personally am running on  
all cylinders, and will make progress on these technologies as my  
real-world needs require them, which is increasing all the time.  All  
savvy SolRubyists are invited to jump in!

I've not documented this stuff on the wiki to the standards set by  
the Solr engine itself, but there is some pretty amazing power going  
on with solr-ruby right now.  For example, the data mapping / indexer  
framework makes it easy to import a dataset into Solr using Ruby:

source = DataSource.new

mapping = {
   :id => :isbn,
   :name => :author,
   :source => "BOOKS",
   :year => Proc.new {|record| record.date[0,4] },
}

Solr::Indexer.index(source, mapping) do |orig_data, solr_document|
   solr_document[:timestamp] = Time.now
end

This showcases the simplistic data source facility (*quack* -  
anything that has an #each method) [with a contrived, bogus DataSource  
class], and the mapping capabilities.  The mapping is a hash of Solr  
field names to value mappings.  A value mapping can be a String  
("BOOKS") or a Symbol (:isbn, :author), which looks up that field from  
(uh, #)each of the objects yielded to the each block.  This lookup  
simply means, again *quack*, that the data object needs a [] method  
defined.  The Proc example is a bit more advanced Ruby voodoo for  
embedding a bit of code into the mapping to be executed later with  
the actual record passed into it; in the example it extracts the  
first four characters of the record's date property.  And one more bit  
of Ruby coolness is the do ... end block for the indexer method.  The  
indexer takes a data source and a mapper, melds them together as  
described, and gives you one final chance to affect the  
solr_document before it gets indexed, with the original data object  
also provided.
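
To make the String / Symbol / Proc rules concrete, here is a small self-contained sketch of how such a mapping could be applied to records.  It does not use solr-ruby at all; map_value and map_record are hypothetical helper names, the record is made-up sample data, and the real mapper may well behave differently:

```ruby
# Hypothetical sketch; not the solr-ruby implementation.
# Resolves one value mapping against one record.
def map_value(rule, record)
  case rule
  when String then rule               # literal value, used as-is
  when Symbol then record[rule]       # field lookup via the record's [] method
  when Proc   then rule.call(record)  # arbitrary code run against the record
  else raise ArgumentError, "unsupported mapping: #{rule.inspect}"
  end
end

# Builds a Solr-style document (a plain Hash here) from one record.
def map_record(mapping, record)
  mapping.each_with_object({}) do |(solr_field, rule), doc|
    doc[solr_field] = map_value(rule, record)
  end
end

mapping = {
  :id     => :isbn,
  :name   => :author,
  :source => "BOOKS",
  :year   => Proc.new { |record| record[:date][0, 4] }
}

# Any object with #each can act as a data source; an Array of Hashes
# (made-up sample data) is the simplest case.
source = [
  { :isbn => "0000000001", :author => "Smith", :date => "1918-11-11" }
]

documents = source.map { |record| map_record(mapping, record) }
```

Because the records here are Hashes, the Proc reads the date with record[:date] rather than record.date; either works as long as the record responds to whatever the Proc calls on it.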

We now already have a simple mapper, an XPath mapper, and an Hpricot  
mapper available.  We also have some handy data sources including a  
tab-delimited file source (obsoleted in my play book by the CSV  
importer now built in).  I'm also using a simple custom MARC binary  
data source and mapper specific to ruby-marc objects, and I just put  
together a SolrSource that takes a query (and filters) for one Solr  
instance and, with configurable paging, successively feeds out the  
documents returned from that query.  Apply a mapper to that data source  
and you can pipe data from one Solr to another like this:

# Source: query one Solr instance, with two filter queries applied.
solr_source = Solr::Importer::SolrSource.new("http://localhost:8420/solr",
  "*:*", ["year:[1776 TO 1918]", 'author:smith'])
count = 0
# Destination: index the mapped documents into a second Solr instance.
Solr::Indexer.index(solr_source, mapper,
    {:debug => false, :timeout => 120,
     :solr_url => "http://localhost:8983/solr"}) do |orig_data, solr_document|
   count = count + 1
   if count % 100 == 0
     puts "#{count}"
   end
end

The count junk is just to see console progress on how many records  
have been indexed.
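
For the curious, the paging idea behind a source like that can be sketched in plain Ruby.  PagedSource and its fetch block are hypothetical names, and the in-memory fake_index stands in for a real Solr query; this is only an illustration of a #each that walks result pages, not the actual SolrSource code:

```ruby
# Hypothetical sketch of a paging data source; the real SolrSource may differ.
class PagedSource
  include Enumerable

  # The block takes (start, rows) and returns an Array of documents.
  # In real use it would issue a query against Solr.
  def initialize(page_size, &fetch_page)
    @page_size = page_size
    @fetch_page = fetch_page
  end

  # Yields documents one at a time, fetching a new page when needed.
  def each
    start = 0
    loop do
      page = @fetch_page.call(start, @page_size)
      break if page.empty?
      page.each { |doc| yield doc }
      start += page.size
    end
  end
end

# Stand-in for a Solr instance: 250 fake documents held in memory.
fake_index = (1..250).map { |n| { :id => n } }
source = PagedSource.new(100) { |start, rows| fake_index[start, rows] || [] }

count = 0
source.each do |doc|
  count += 1
  puts count if count % 100 == 0
end
```

Since the class only needs #each, it would plug into anything that expects a duck-typed data source, and the page size stays a tuning knob rather than something the consumer has to know about.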

So I'm working the Ruby/Solr thing as much as possible right now.   
There is something to what we've got there, but it's not packaged as  
nicely as needed for a community to flourish, and for that I  
apologize.  But there is also enough goodness there now to lure folks  
in to want to get involved.

Right now in RoR with the Flare plugin installed, you can have a  
controller that looks like this:

    class SearchController < ApplicationController
      flare
    end

And with some copy/pasting of templates (that we can build in as  
defaults somehow, I'm sure) you have a faceted browsing, Ajax tricked  
out (well, in-place editor and Ajax suggest) experience with how many  
lines of code?   (The devil is in the details though, and that is why  
I don't yet recommend Flare to folks who want it to just work  
and also be configurable.)  Flare cuts a lot of corners by hard-coding  
some things that need to be made configurable, etc.  Typical  
prototyping approach: tinker, tinker, tinker, distill.  I'm still in  
the first tinker phase with Flare right now.  But folks interested in  
rolling up their sleeves who don't mind getting a little grubby with  
code are more than invited to delve into Flare now, with the  
forewarning that the Flare you see today will not be at all near the  
Flare that spawns from the ashes.  Pioneering spirit required.

>> : 3. performance benchmark for acts_as_solr plugin available if any
>
> What kind of numbers are you after?  acts_as_solr searches Solr, and
> then will fetch the records from the database to bring back model
> objects, so you have to account for the database access in the
> picture as well as Solr.
>
>>>>
> Well to be specific I am keen to know about creation and update of  
> indexes
> when you run into large number of documents. Since database is used to
> populate the models and definately it will be the commulative  
> effect of
> retrieval of document from solr with lucene, network issues (since  
> its a web
> service) and locally on database (depends on configuration).

Again we need to be clear about "large".  I've got indexes of nearly 4M  
documents under my belt now, but many others have gone to 10M+.  Lucene  
and Solr both scale very well into the tens of millions and, I've heard,  
even further up into the hundreds of millions.

Certainly those other latencies you mention are valid questions, but  
in my experience they've not been show-stopping concerns.  Performance  
with Solr + Ruby has been more than acceptable... it's been just  
fine, even with several spots for improvement in all those areas in  
my applications.  First rule of optimization: Don't.  Second rule of  
optimization: Don't optimize yet.

	Erik


