lucene-general mailing list archives

From "patrick o'leary" <pj...@pjaol.com>
Subject Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
Date Tue, 02 Mar 2010 08:26:30 GMT
Here's my view on it..

Developing GIS support for Lucene took a little bit of time and patience and
a couple of iterations, from a basic concept to get buy-in to spend more time
working on it, to an "OMG this does what we need, build more more more..."

The Lucene version of this was easy enough to support; however, Solr support
was a different kettle of fish.
It went from really crude duplication of query handlers and writing templates
to inject distance features, to today, where it's a little more componentised
but still a little crude unless you want to cut back on scalability and
functionality by using function queries.

I guess my point is that Solr has always required more effort, and solutions
that constantly drove me further away from the initial Lucene
implementation.

In my mind, if I make something work in Lucene, it should be easy to just
'plug in' to Solr, but that is definitely not the case: leaf index readers,
NumericUtils, and Trie all came at major development costs that were not
present in Lucene development.

Who knows whether the spatial efforts going on in Solr will make it back to
Lucene? At the same time, has the gap between the two systems grown to the
point that porting is not a worthwhile effort?

I honestly don't want to maintain both systems, but I find that to allow for
Solr support I have to do a lot more "hacking".




On Mon, Mar 1, 2010 at 10:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> This looks great!
>
> But, the goal is to make a standalone toolkit exposing GIS functions,
> right?
>
> My original question (integrating this into Lucene/Solr) remains.
>
> EG there's a lot of good work happening now in Solr to make spatial
> search available.  How will that find its way back to Lucene?  Lucene
> has its own (now duplicate) spatial package that was already
> developed.  Users will now be confused about the two, each having
> different bugs/features, etc.
>
> If we had shared development then the ongoing effort would result in a
> spatial package that direct Lucene users and Solr users would be able
> to use.
>
> Mike
>
> On Mon, Mar 1, 2010 at 1:28 PM, Mattmann, Chris A (388J)
> <chris.a.mattmann@jpl.nasa.gov> wrote:
> > I'm glad that you brought that up! :)
> >
> > Check out:
> >
> > http://incubator.apache.org/projects/sis.html
> >
> > We're just starting to tackle that very issue right
> > now...patches/ideas/contributions welcome.
> >
> > Cheers,
> > Chris
> >
> >
> >
> > On 3/1/10 11:25 AM, "Michael McCandless" <lucene@mikemccandless.com> wrote:
> >
> > Because the code dup with analyzers is only one of the problems to
> > solve.  In fact, it's the easiest of the problems to solve (that's why
> > I proposed it, only, first).
> >
> > A more differentiating example is a much less mature module....
> >
> > EG take spatial -- if Solr were its own TLP, how could spatial be
> > built out in a way that we don't waste effort, and so that both direct
> > Lucene and Solr users could use it when it's released?
> >
> > Mike
> >
> > On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J)
> > <chris.a.mattmann@jpl.nasa.gov> wrote:
> >> Hi Mike,
> >>
> >> I'm not sure I follow this line of thinking: how would Solr being a TLP
> >> affect the creation of a separate project/module for Analyzers any more so
> >> than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend
> >> on the newly created refactored Analysis project.
> >>
> >> Chris
> >>
> >>
> >>
> >> On 3/1/10 10:44 AM, "Michael McCandless" <lucene@mikemccandless.com> wrote:
> >>
> >> If we don't somehow first address the code duplication across the 2
> >> projects, making Solr a TLP will make things worse.
> >>
> >> I started here with analysis because I think that's the biggest pain
> >> point: it seemed like an obvious first step to fixing the code
> >> duplication and thus the most likely to reach some consensus.  And
> >> it's also very timely: Robert is right now making all kinds of great
> >> fixes to our collective analyzers (in between bouts of fuzzy DFA
> >> debugging).
> >>
> >> But it goes beyond analyzers: I'd like to see other modules, now in
> >> Solr, eventually moved to Lucene, because they really are "core"
> >> functionality (eg facets, function (and other?) queries, spatial,
> >> maybe improvements to spellchecker/highlighter).  How can we do this?
> >>
> >> And how can we do this so that it "lasts" over time?  If new cool
> >> "core" things are born in Solr-land (which of course happens a lot --
> >> lots of good healthy usage), how will they find their way back to
> >> Lucene?
> >>
> >> Yonik's proposal (merging development of Solr/Lucene, but keeping all
> >> else separate) would achieve this.
> >>
> >> If we do the opposite (Solr -> TLP), how could we possibly achieve
> >> this?
> >>
> >> I guess one possibility is to just suck it up and duplicate the code.
> >> Meaning, each project will have to manually merge fixes in from the
> >> other project (so long as there's someone around with the itch to do
> >> so).  Lucene would copy in all of Solr's analysis, and vice-versa (and
> >> likewise other dup'd functionality).  I really dislike this
> >> solution... it will confuse the daylights out of users, it's
> >> error-prone, it's a waste of dev effort, there will always be little
> >> differences... but maybe it is in fact the lesser evil?
> >>
> >> I would much prefer merging Solr/Lucene development...
> >>
> >> Mike
> >>
> >> On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
> >> <chris.a.mattmann@jpl.nasa.gov> wrote:
> >>> Hi Grant,
> >>>
> >>>> On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:
> >>>>
> >>>>> Hi Robert,
> >>>>>
> >>>>> I think my proposal (Solr->TLP) is sort of orthogonal to the whole analyzers
> >>>>> issue - I was in favor, at the very least, of having a separate
> >>>>> module/project/whatever that both Solr/Lucene (and whatever project) can
> >>>>> depend on for the shared analyzer code...
> >>>>
> >>>> Not really.  They are intimately linked.
> >>>
> >>> Ummm, how so? Making project A called "Apache Super Analyzers" and then
> >>> making Lucene(-java) and Solr depend on Apache Super Analyzers is separate
> >>> of whether or not Lucene(-java) and Solr are TLPs or not...
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Chris
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/1/10 9:12 AM, "Robert Muir" <rcmuir@gmail.com> wrote:
> >>>>>
> >>>>> this will make the analyzers duplication problem even worse
> >>>>>
> >>>>> On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) <
> >>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
> >>>>>
> >>>>>> Hi Mark,
> >>>>>>
> >>>>>> Thanks for your message. I respect your viewpoint, but I respectfully
> >>>>>> disagree. It just seems (to me at least based on the discussion) like a TLP
> >>>>>> for Solr is the way to go.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 3/1/10 8:54 AM, "Mark Miller" <markrmiller@gmail.com> wrote:
> >>>>>>
> >>>>>> On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:
> >>>>>>> Hi Mark,
> >>>>>>>
> >>>>>>>
> >>>>>>>> That would really be no real world change from how things work today.
> >>>>>>>> The fact is, today, Solr already operates essentially as an independent
> >>>>>>>> project.
> >>>>>>>>
> >>>>>>> Well if that's the case, then it would lead me to think that it's more of a
> >>>>>>> TLP more than anything else per best practices.
> >>>>>>>
> >>>>>> That depends. It could be argued it should be a top level project or
> >>>>>> that it should be closer to the Lucene project. Some people are arguing
> >>>>>> for both approaches right now. There are two directions we could move in.
> >>>>>>>
> >>>>>>>> The only real difference is that it shares the same PMC with Lucene now and
> >>>>>>>> wouldn't with this change. This would address none of the issues that
> >>>>>>>> triggered the idea for a possible merge.
> >>>>>>>>
> >>>>>>> I don't agree -- you're looking to bring together two communities that are
> >>>>>>> "fairly separate" as you put it. The separation likely didn't spring up
> >>>>>>> overnight and has been this way for a while (at least to my knowledge). This is
> >>>>>>> exactly the type of situation that typically leads to TLP creation from what
> >>>>>>> I've seen.
> >>>>>>>
> >>>>>> It also causes negatives between Solr/Lucene that some are looking to
> >>>>>> address. Hence the birth of this proposal. Going TLP with Solr will only
> >>>>>> aggravate those negatives, not help them.
> >>>>>>
> >>>>>> While the communities operate fairly separately at the moment, the
> >>>>>> people in the communities are not so separate. The committer list has
> >>>>>> huge overlap. Many committers on one project but not the other do a lot
> >>>>>> of work on both projects.
> >>>>>>
> >>>>>> There is already a strong link with the personnel - merging the
> >>>>>> management of the projects addresses many of the concerns that have
> >>>>>> prompted this discussion. TLP'ing Solr only makes those concerns
> >>>>>> multiply. They would diverge further, and incompatible overlap between
> >>>>>> them would increase.
> >>>>>>
> >>>>>>> Cheers,
> >>>>>>> Chris
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 03/01/2010 10:04 AM, Mattmann, Chris A (388J) wrote:
> >>>>>>>>
> >>>>>>>>> Hey Grant,
> >>>>>>>>>
> >>>>>>>>> I'd like to explore this -- does this imply that the Lucene sub-projects will
> >>>>>>>>> go away and Lucene will turn into Lucene-java and maintain its Apache TLP,
> >>>>>>>>> and then you'd have say, solr.apache.org, tika.apache.org, mahout.apache.org
> >>>>>>>>> (already started), etc. etc.? If so, that may be the best of all worlds,
> >>>>>>>>> allowing project independence, but also not following the Apache
> >>>>>>>>> "antipattern" as Doug put it...
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Chris
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 3/1/10 7:28 AM, "Grant Ingersoll" <gsingers@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Also, as Doug alluded to, the Board is likely to ask us to consider fewer
> >>>>>>>>>> subprojects in the future, so we may be consolidating and spinning off
> >>>>>>>>>> anyway.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>> Chris Mattmann, Ph.D.
> >>>>>>>>> Senior Computer Scientist
> >>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>>>>>>> Office: 171-266B, Mailstop: 171-246
> >>>>>>>>> Email: Chris.Mattmann@jpl.nasa.gov
> >>>>>>>>> Phone: +1 (818) 354-8810
> >>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>> Adjunct Assistant Professor, Computer Science Department
> >>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> - Mark
> >>>>>>
> >>>>>> http://www.lucidimagination.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Robert Muir
> >>>>> rcmuir@gmail.com
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
