lucene-pylucene-dev mailing list archives

From Caleb Burns <ca...@ridersdiscount.com>
Subject Re: PyLucene use JCC shared object by default
Date Wed, 18 Apr 2012 20:16:48 GMT
Hi Thomas,

Our primary motivation was performance; a "pythonic" API came second.
Our needs were simpler than what the whole lucene.facet package covers.
In Lucene terms, it looks like we have something similar to CategoryPath
(statically two levels deep: "/Field/Value") and FacetRequest (searching
is only allowed at the root level, optionally restricted to a filtered
set of docs and fields). Specifically, we implemented an index/cache of
all documents and their terms. As far as I know, Solr also caches the
Lucene index to perform faceting.

Our implementation is based on
http://lucene.apache.org/solr/api/org/apache/solr/request/UnInvertedField.html
and the interface in Python is almost identical. You pass our object an
IndexReader, and by default all Terms with TermVectors are indexed. You can
then selectively retrieve fields. Here's an example of use:
http://pastebin.com/Lq3LZKMp. The whole module is ~2000 lines (Python
interface, C++ implementation, comments). In initial tests, the algorithm
is about 100 times faster in C++ than when implemented in Python.
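
As a rough illustration of the "index/cache of all documents and their
terms" idea, here is a minimal pure-Python sketch. It is not the module's
actual code or interface: the names FacetCache and counts are hypothetical,
and a plain dict stands in for the IndexReader and its term vectors.

```python
from collections import Counter, defaultdict


class FacetCache:
    """Hypothetical doc -> terms cache; not the module described above."""

    def __init__(self, docs):
        # docs: mapping of doc id -> {field name: iterable of term strings}.
        # Assumption: a real implementation would fill this from an
        # IndexReader's term vectors instead of a plain dict.
        self._terms = defaultdict(dict)
        for doc_id, fields in docs.items():
            for field, terms in fields.items():
                self._terms[field][doc_id] = frozenset(terms)

    def counts(self, field, doc_ids=None):
        """Count how many documents carry each term of `field`.

        If doc_ids is given, only those documents are counted, which
        mirrors faceting over a filtered doc set.
        """
        per_doc = self._terms.get(field, {})
        if doc_ids is None:
            selected = per_doc.values()
        else:
            selected = (per_doc[d] for d in doc_ids if d in per_doc)
        tally = Counter()
        for terms in selected:
            tally.update(terms)
        return tally
```

Calling counts("color") on such a cache tallies every document, while
counts("color", [0, 1]) only tallies the given doc ids. The work per query
is proportional to the number of selected documents, which is the kind of
inner loop that benefits from being moved to C++.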

On Wed, Apr 18, 2012 at 9:31 AM, Thomas Koch <koch@orbiteam.de> wrote:

> Hi,
> sounds like an interesting project – may I ask what you actually
> implemented and what’s the motivation (e.g. performance?)?
>
> I’ve started to experiment with the Facet support in Lucene (actually in
> PyLucene – ported an example to Python) and found that faceted search
> support in Lucene looks powerful (though the API is still said to be
> ‘experimental’ and I can’t say anything about performance yet).  I’m
> talking about the org.apache.lucene.facet.* packages – part of the contrib
> part of Lucene and available as JARs that are accessible in PyLucene as well.
> I’m not that familiar with Solr but AFAIK it’s based on Lucene (Java) and
> should (hopefully) use the same Java code for its facet search support. Of
> course Solr adds some nice configuration support and web GUI to Lucene, but
> the ‘core’ search is built on Lucene (to my knowledge). So did you
> re-implement the Lucene facet search/index code (like
> TaxonomyReader/Writer, FacetRequest stuff etc.) in C++, or which part of
> Solr?
>
> Regarding Facet support in PyLucene I can share the samples I’ve ‘ported’
> to Python so far. There’s still a patch pending for JavaList (required by
> facet features) which I will come back to later on this list (still some open
> issues). Hopefully this can be included in the PyLucene 3.6 version …
>
> Regards
> Thomas
> --
> OrbiTeam Software GmbH & Co. KG
> Germany  http://www.orbiteam.de
>
>
> From: Caleb Burns [mailto:caleb@ridersdiscount.com]
> Sent: Tuesday, 17 April 2012 21:16
> To: pylucene-dev@lucene.apache.org
> Subject: PyLucene use JCC shared object by default
>
> Hi,
>
> I've finished the process at my organization of re-implementing SOLR's
> faceting algorithm (in C++).
>
> We would like the public at large to have access to the work we've done
> and plan to do. In order for this to be a real possibility the code needs
> to be built against and use the same JVM as the PyLucene installation does.
> We feel the most logical way to accomplish this is to have PyLucene's
> default installation use JCC as a shared object.
>
> We have further plans to extend and provide utilities that work with
> PyLucene, but this all hinges on having the shared object. The only
> alternative would be to bundle our source with the PyLucene project
> itself, as a fork.
>
> We are eager to start open sourcing our work, so please let us know what
> would be the best way to integrate our work.
>



-- 
Caleb Burns
Developer | Riders Discount
