From: "Mattmann, Chris A (388J)"
To: "general@lucene.apache.org"
Date: Mon, 1 Mar 2010 10:28:25 -0800
Subject: Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
In-Reply-To: <9ac0c6aa1003011025q34a6e0e2p36e862c5df336cc7@mail.gmail.com>

I'm glad that you brought that up! :)

Check out:

http://incubator.apache.org/projects/sis.html

We're just starting to tackle that very issue right now...patches/ideas/contributions welcome.

Cheers,
Chris

On 3/1/10 11:25 AM, "Michael McCandless" wrote:

Because the code dup with analyzers is only one of the problems to
solve. In fact, it's the easiest of the problems to solve (that's why
I proposed it, only, first).

A more differentiating example is a much less mature module....

EG take spatial -- if Solr were its own TLP, how could spatial be
built out in a way that we don't waste effort, and so that both direct
Lucene and Solr users could use it when it's released?

Mike

On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J) wrote:
> Hi Mike,
>
> I'm not sure I follow this line of thinking: how would Solr being a TLP affect the creation of a separate project/module for Analyzers any more so than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend on the newly created refactored Analysis project.
>
> Chris
>
>
> On 3/1/10 10:44 AM, "Michael McCandless" wrote:
>
> If we don't somehow first address the code duplication across the 2
> projects, making Solr a TLP will make things worse.
>
> I started here with analysis because I think that's the biggest pain
> point: it seemed like an obvious first step to fixing the code
> duplication and thus the most likely to reach some consensus. And
> it's also very timely: Robert is right now making all kinds of great
> fixes to our collective analyzers (in between bouts of fuzzy DFA
> debugging).
>
> But it goes beyond analyzers: I'd like to see other modules, now in
> Solr, eventually moved to Lucene, because they really are "core"
> functionality (eg facets, function (and other?) queries, spatial,
> maybe improvements to spellchecker/highlighter). How can we do this?
>
> And how can we do this so that it "lasts" over time? If new cool
> "core" things are born in Solr-land (which of course happens a lot --
> lots of good healthy usage), how will they find their way back to
> Lucene?
>
> Yonik's proposal (merging development of Solr/Lucene, but keeping all
> else separate) would achieve this.
>
> If we do the opposite (Solr -> TLP), how could we possibly achieve
> this?
>
> I guess one possibility is to just suck it up and duplicate the code.
> Meaning, each project will have to manually merge fixes in from the
> other project (so long as there's someone around with the itch to do
> so). Lucene would copy in all of Solr's analysis, and vice-versa (and
> likewise other dup'd functionality). I really dislike this
> solution... it will confuse the daylights out of users, it's
> error-prone, it's a waste of dev effort, there will always be little
> differences... but maybe it is in fact the lesser evil?
>
> I would much prefer merging Solr/Lucene development...
>
> Mike
>
> On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
> wrote:
>> Hi Grant,
>>
>>> On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> I think my proposal (Solr->TLP) is sort of orthogonal to the whole analyzers
>>>> issue - I was in favor, at the very least, of having a separate
>>>> module/project/whatever that both Solr/Lucene (and whatever project) can
>>>> depend on for the shared analyzer code...
>>>
>>> Not really. They are intimately linked.
>>
>> Ummm, how so? Making project A called "Apache Super Analyzers" and then
>> making Lucene(-java) and Solr depend on Apache Super Analyzers is separate
>> of whether or not Lucene(-java) and Solr are TLPs or not...
>>
>> Cheers,
>> Chris
>>
>>>
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>> On 3/1/10 9:12 AM, "Robert Muir" wrote:
>>>>
>>>> this will make the analyzers duplication problem even worse
>>>>
>>>> On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your message. I respect your viewpoint, but I respectfully
>>>>> disagree. It just seems (to me at least based on the discussion) like a TLP
>>>>> for Solr is the way to go.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>
>>>>> On 3/1/10 8:54 AM, "Mark Miller" wrote:
>>>>>
>>>>> On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:
>>>>>> Hi Mark,
>>>>>>
>>>>>>> That would really be no real world change from how things work today. The fact
>>>>>>> is, today, Solr already operates essentially as an independent project.
>>>>>>>
>>>>>> Well if that's the case, then it would lead me to think that it's more of a
>>>>>> TLP more than anything else per best practices.
>>>>>>
>>>>> That depends. It could be argued it should be a top level project or
>>>>> that it should be closer to the Lucene project. Some people are arguing
>>>>> for both approaches right now.
>>>>> There are two directions we could move in.
>>>>>>
>>>>>>> The only real difference is that it shares the same PMC with Lucene now and
>>>>>>> wouldn't with this change. This would address none of the issues that
>>>>>>> triggered the idea for a possible merge.
>>>>>>>
>>>>>> I don't agree -- you're looking to bring together two communities that are
>>>>>> "fairly separate" as you put it. The separation likely didn't spring up
>>>>>> overnight and has been this way for a while (at least to my knowledge). This is
>>>>>> exactly the type of situation that typically leads to TLP creation from what
>>>>>> I've seen.
>>>>>>
>>>>> It also causes negatives between Solr/Lucene that some are looking to
>>>>> address. Hence the birth of this proposal. Going TLP with Solr will only
>>>>> aggravate those negatives, not help them.
>>>>>
>>>>> While the communities operate fairly separately at the moment, the
>>>>> people in the communities are not so separate. The committer list has
>>>>> huge overlap. Many committers on one project but not the other do a lot
>>>>> of work on both projects.
>>>>>
>>>>> There is already a strong link with the personnel - merging the
>>>>> management of the projects addresses many of the concerns that have
>>>>> prompted this discussion. TLP'ing Solr only makes those concerns
>>>>> multiply. They would diverge further, and incompatible overlap between
>>>>> them would increase.
>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>>
>>>>>>> On 03/01/2010 10:04 AM, Mattmann, Chris A (388J) wrote:
>>>>>>>
>>>>>>>> Hey Grant,
>>>>>>>>
>>>>>>>> I'd like to explore this -- does this imply that the Lucene sub-projects will
>>>>>>>> go away and Lucene will turn into Lucene-java and maintain its Apache TLP,
>>>>>>>> and then you'd have say, solr.apache.org, tika.apache.org, mahout.apache.org
>>>>>>>> (already started), etc. etc.?
>>>>>>>> If so, that may be the best of all worlds,
>>>>>>>> allowing project independence, but also not following the Apache
>>>>>>>> "antipattern" as Doug put it...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/1/10 7:28 AM, "Grant Ingersoll" wrote:
>>>>>>>>
>>>>>>>>> Also, as Doug alluded to, the Board is likely to ask us to consider less
>>>>>>>>> subprojects in the future, so we may be consolidating and spinning off
>>>>>>>>> anyway.
>>>>>>>>>
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>> Senior Computer Scientist
>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>>> Email: Chris.Mattmann@jpl.nasa.gov
>>>>>>>> Phone: +1 (818) 354-8810
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> - Mark
>>>>>>>
>>>>>>> http://www.lucidimagination.com
>>>>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++