I guess I've been called to the chalkboard...
I haven't looked specifically at putting the taxonomy in Lucene/Solr,
but it is an interesting idea. The paper you mentioned has some
interesting ideas in it, and I think Solr could obviously be used
just as easily as Lucene.
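
To make the idea in Neal's message (quoted below) a bit more concrete,
here is a rough sketch of turning a parsed document's weighted terms
into a boosted Solr query against the indexed taxonomy. The field
names "tags" and "label" and the example weights are my assumptions,
not anything from the paper or from Neal's setup, and it only builds
the request parameters rather than talking to a server:

```python
from urllib.parse import urlencode


def build_classification_query(term_weights, field="tags", rows=5):
    """Turn {term: weight} into /select parameters with per-term boosts.

    `field` and the returned "label" field are hypothetical names for
    the indexed taxonomy entries.
    """
    clauses = []
    for term, weight in sorted(term_weights.items()):
        # Quote each term and escape backslashes and double quotes so
        # it is treated as a literal by the Lucene query parser.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append('{}:"{}"^{:g}'.format(field, escaped, weight))
    return {
        "q": " OR ".join(clauses),  # any matching tag contributes to the score
        "fl": "label,score",        # return the taxonomy label and its score
        "rows": rows,               # top-N candidate categories
    }


# Example weights are made up for illustration.
params = build_classification_query({"lucene": 2.0, "search": 1.5})
query_string = urlencode(params)
```

The top-scoring taxonomy entries would then be the candidate labels
for the document; how the weights are chosen is the interesting part
and is left open here.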
One of the things I am interested in is the marriage of Solr and
Mahout (which has some Genetic Algorithms support) and other ML (Weka,
etc.) tools. For instance, the paper uses multiple indexes, one each
for the negative and positive sets. That could be done with Solr cores
or just through intelligent filtering. Then you could have Mahout do
its training/clustering/whatever in the background as needed, just by
sending commands to a RequestHandler, and have it output its model so
that it can be shared on the "output" side. That way you can nicely
serve up your results as part of search results or even standalone,
so either as a SearchComponent or from the RequestHandler. Of course,
the tricky part is in the implementation and managing the memory,
threading, etc.
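
As a minimal sketch of the "intelligent filtering" option: keep the
positive and negative sets in one index and select between them with a
filter query (fq). The field name "train_set" and the local URL are
assumptions for illustration only, and this just builds the request
URL:

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust for your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def training_set_url(query, polarity, field="train_set"):
    """Build a /select URL restricted to the positive or negative set.

    `field` is a hypothetical indexed field holding "positive" or
    "negative" for each training document.
    """
    if polarity not in ("positive", "negative"):
        raise ValueError("polarity must be 'positive' or 'negative'")
    params = [
        ("q", query),
        ("fq", "{}:{}".format(field, polarity)),  # filter, does not affect scoring
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


positive_url = training_set_url("genetic algorithms", "positive")
negative_url = training_set_url("genetic algorithms", "negative")
```

Filter queries are cached separately from the main query, so switching
between the two sets this way should be cheap. Separate cores would
give harder isolation at the cost of more to manage.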
Things that can help with all this: LukeRequestHandler,
TermVectorComponent, TermsComponent, and others.
As for Hannes' question about "Why Solr": I think you can still get
as close to the metal w/ Solr as with Lucene, but now you have the
built-in framework that makes experimentation so much easier, IMO,
plus you have all the features that Solr has to offer. For instance, a
reasonable thing to do with the output from the classification is, of
course, to facet on it.
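
For example, if the classifier's label were written to a field on each
document, faceting on it is just a couple of request parameters. A
small sketch, assuming a hypothetical field called "category" (the
sample counts below are made up, only there to show the shape of the
response):

```python
from urllib.parse import urlencode


def facet_params(query, category_field="category", min_count=1):
    """Parameters for a query that only returns facet counts.

    `category_field` is a hypothetical field holding the classifier's
    output label for each document.
    """
    return {
        "q": query,
        "rows": 0,                      # no documents, just the counts
        "facet": "true",
        "facet.field": category_field,
        "facet.mincount": min_count,    # hide categories with no hits
    }


def pair_facet_counts(flat):
    """Solr's default JSON output lists facet values as
    [value, count, value, count, ...]; pair them up."""
    return list(zip(flat[0::2], flat[1::2]))


params = facet_params("solr")
query_string = urlencode(params)

# Made-up sample in the flat layout described above.
sample = ["Computers", 12, "Science", 7]
counts = pair_facet_counts(sample)
```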
Neal, what did you have in mind for a JIRA issue? I'd love to see a
patch.
On Jan 26, 2009, at 12:29 PM, Neal Richter wrote:
> Hey all,
>
> I'm in the process of implementing a system to do 'text
> classification' with Solr. The basic idea is to take an
> ontology/taxonomy like dmoz of {label: "X", tags: "a,b,c,d,e"}, index
> it and then classify documents into the taxonomy by pushing parsed
> document into the Solr search API. Why? Lucene/Solr's ability to do
> weighted term boosting at both search and index time has lots of
> obvious uses here.
>
> Has anyone worked on this or a similar project yet? I've seen some
> talk on the list about this area but it's pretty thin... December
> thread "Taxonomy Support on Solr". I'm assuming Grant Ingersoll is
> looking at similar things with his 'taming text' project.
>
> I store the 'documents' in another repository and they are far too
> dynamic (write intensive) for direct indexing in Solr... so the
> previously suggested procedure of 1) store document 2) execute
> more-like-this and 3) delete document would be too slow.
>
> If people are interested I could start a JIRA issue on this (I do not
> see anything there at the moment).
>
> Thanks - Neal Richter
> http://aicoder.blogspot.com
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ