lucene-java-user mailing list archives

From Owen Densmore <o...@backspaces.net>
Subject PHP-Lucene Integration
Date Tue, 22 Mar 2005 18:21:14 GMT
[Sorry if this is received twice .. I tried earlier but didn't see it 
in the list!]

A while back I asked folks how they deployed Lucene in a PHP 
environment.  This summarizes how we proceeded with doing so.

The response to the initial question was quite helpful. Kelvin Tan 
mentioned "How about XML-RPC/SOAP, or REST?" while pedja did a great 
job of presenting the use of the "PHP-Java-Bridge".  Maurits suggested 
a way to use a proxy approach.  Great example of how useful this list 
is!

The solution we (http://redfish.com) chose was REST, i.e. build a 
servlet which provides access to the index with a few bells and 
whistles unique to our application.  This servlet then is accessed via 
PHP using the enhanced fopen(url,'r') which allows the filename to be a 
url.  The PHP code then just reads in the result of searches line by 
line and makes them available as dynamic web pages.  The PHP is used in 
two ways: one as a fairly standard text search capability, and more 
creatively, as the feed to a Flash graphical interface which lets you 
"fly" through the collection. The servlet itself emits only plain text, 
no HTML.  It should probably be converted to emit XML.
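
As a rough illustration of the client side described above, here is a
minimal sketch in Python (standing in for the PHP page).  The
tab-separated "score, id, title" line format and the names Hit and
parse_results are assumptions made for this example; the message does
not describe the servlet's actual output format.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    score: float
    doc_id: str
    title: str


def parse_results(lines: Iterable[str]) -> List[Hit]:
    """Parse plain-text search results, one hit per line.

    Assumed (hypothetical) line format: score<TAB>doc_id<TAB>title
    """
    hits = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines between results
        # Split on the first two tabs only, so a title may contain tabs.
        score, doc_id, title = line.split("\t", 2)
        hits.append(Hit(float(score), doc_id, title))
    return hits
```

In a real page the lines would come from the HTTP response of the
servlet (for instance via urllib.request.urlopen, decoding each line
from bytes first) rather than from an in-memory list.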

We chose this approach because it fits into a broader 
desire of the client to form a general "institutional repository".  
Each group could have such a servlet exporting their data as a "web 
service" that others can listen in on.  An example of a study this 
would enable is studying co-authorship (collaboration) in relation to 
the "events" group -- folks putting on workshops and conferences.  It 
would be interesting to see whether or not the event attendees do 
eventually increase their collaboration with others due to the event.  
So we would link the events data with the working papers data to see 
whether or not there are increased collaborations.  Loosely coupled, 
tightly aligned.

The collection we're providing access to is a very innovative 
scientific set: 1200 working papers of the Santa Fe Institute.  
"Similarity" searching has proven very useful.  A user looks at a 
document and can then ask for similar ones.  Another extremely useful 
secondary search is for co-authors: search for all of the documents by 
a given author, collect all their collaborators, and provide that as a 
result.

These secondary searches are done with a general interface which uses 
two searches: a primary search which is then used as input to the 
second batch.  So for co-author searches, we perform a primary search 
for an author.  We collect all their documents, stripping out the 
authors for each document.  This list of authors forms a secondary 
search which in effect returns all the documents with authors who have 
co-authored with the initial search.
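
The two-stage co-author search can be sketched roughly as follows.
This is a toy version in Python over an in-memory dictionary, not the
servlet's code: the DOCS data and the function names are invented for
illustration, and a real implementation would run both stages as
Lucene queries.

```python
from typing import Dict, List, Set

# Toy stand-in for the index: document id -> list of author names.
DOCS: Dict[str, List[str]] = {
    "p1": ["Adams", "Baker"],
    "p2": ["Adams", "Chen"],
    "p3": ["Baker", "Diaz"],
    "p4": ["Evans"],
}


def search_by_authors(docs: Dict[str, List[str]], authors: Set[str]) -> Set[str]:
    """Return ids of documents that have at least one of the given authors."""
    return {doc_id for doc_id, names in docs.items() if authors & set(names)}


def coauthor_search(docs: Dict[str, List[str]], author: str) -> Set[str]:
    # Primary search: every document by the given author.
    primary = search_by_authors(docs, {author})
    # Strip out the authors of those documents (this includes the
    # original author as well as all collaborators).
    collaborators = {name for doc_id in primary for name in docs[doc_id]}
    # Secondary search: every document by any of those authors.
    return search_by_authors(docs, collaborators)
```

With this toy data, coauthor_search(DOCS, "Adams") returns p1, p2 and
p3: the first two are Adams's own papers, and p3 is pulled in through
the collaborator Baker.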

This is extremely general and lets us perform a poor man's clustering.  
We find the documents most representative of a set of documents our 
client wants to use as a cluster.  We use the similarity searching 
above, with the primary search being the documents representative of a 
cluster.  The secondary search is much like in the Lucene book's 
example: give the authors of the retrieved documents a boost of 2, and 
then tack on a search of all the relevant text terms.
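
One way to picture the secondary search is as a query string in
Lucene's standard query syntax, where ^2 boosts a clause.  The sketch
below, in Python, only assembles such a string.  The field names
"author" and "contents" are assumptions (the message does not name the
index fields), and the terms are assumed to be plain words with no
query-syntax special characters.  The servlet itself may well build
the query programmatically instead.

```python
from typing import Iterable


def build_secondary_query(authors: Iterable[str],
                          terms: Iterable[str],
                          author_boost: int = 2) -> str:
    """Assemble a Lucene-style query string for the secondary search."""
    clauses = []
    for name in authors:
        # Quote the name as a phrase, escaping backslashes and quotes.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'author:"{escaped}"^{author_boost}')
    for term in terms:
        # Unboosted text terms taken from the primary documents.
        clauses.append(f"contents:{term}")
    return " ".join(clauses)
```

The boost makes a shared author count for more than a shared text
term when the results are ranked.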

We wanted to provide additional examples of clustering, so we got some 
earlier work done by the institute's library and information technology 
experts, and created a second set of similarity searches.  These worked 
quite well, and the similarity technique helped bridge a two-year gap 
caused by dropping the professional classification project.  Indeed, it 
may breathe life back into that project due to our showing how useful 
it was.

For comic relief we provided a third classification built upon 
astrological signs!  We captured the 12 signs' descriptions, and used 
them as our primary search.  Then we took the documents returned by 
these searches and used their terms to find similar documents.  It was 
great fun and naturally enough helped make the technique understandable 
by the clients.  We can't wait to find out which authors are "aries" 
and so on.

The improvement over the traditional searching used by the institute is 
quite dramatic.  My partner and I find ourselves getting lost for tens 
of minutes tracking down papers we simply didn't know were there.

We are hoping the institute can afford to have us work on true 
clustering techniques such as Carrot2 uses. (Thanks to Dawid and all 
the Poznan University folks whose papers were so stimulating!)  We did 
do a quick LSA SVD on a random set of the papers to see what the 
performance (both CPU cost and clustering quality) would be like.  Our results 
are encouraging, and I think the frequent phrases approach would be 
best for this collection.  This collection is quite a clustering 
challenge due to its extreme cross-discipline nature.

BTW: My partner uncovered an interesting solution which allows us to 
mix the "keyword" and "text" world nicely.  The papers use key-phrases 
which are entirely author derived.  One may use "evolution" and another 
"human evolution".  We liked the looseness of letting them be text.  
But we also need to search for exact phrases as used by the authors.  A 
simple solution was to create a set of relations in the RDB sense 
during the indexing phase.  Then evolution might have a keyphrase index 
of 22, say.  We can then use that for unambiguous keyphrase searching 
when we want.  (Note that phrase quoting does not work for evolution: 
searching for "evolution" will still hit on "human evolution".  The 
relations remove that problem.)
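
A minimal sketch of the keyphrase relation idea, in Python over toy
data.  The function names, the sample documents and the choice to
lower-case phrases before assigning ids are all assumptions made for
illustration; the message only says that each keyphrase gets an index
number at indexing time.

```python
from typing import Dict, List, Set


def normalize(phrase: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return phrase.strip().lower()


def build_keyphrase_ids(docs: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign each distinct author keyphrase an integer id."""
    ids: Dict[str, int] = {}
    for doc_id in sorted(docs):  # sorted so the ids are reproducible
        for phrase in docs[doc_id]:
            key = normalize(phrase)
            if key not in ids:
                ids[key] = len(ids) + 1
    return ids


def docs_with_keyphrase(docs: Dict[str, List[str]],
                        ids: Dict[str, int],
                        phrase: str) -> Set[str]:
    """Exact keyphrase search: match on the id, not on the words."""
    wanted = ids.get(normalize(phrase))
    if wanted is None:
        return set()
    return {doc_id for doc_id, phrases in docs.items()
            if wanted in {ids[normalize(p)] for p in phrases}}
```

Because the match is on the whole-phrase id, a search for "evolution"
no longer hits a paper whose only keyphrase is "human evolution",
while the same phrases can still be indexed as ordinary text for loose
searching.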

I just want to take this as an opportunity to thank everyone for all 
the help.  Thanks!

Owen

