From: "hui"
To: "Lucene Users List"
Subject: Re: Proposition :adding minMergeDoc to IndexWriter
Date: Tue, 23 Sep 2003 09:48:59 -0400

It is great, Julien. Thanks.
Next time I am going to post the requests to the developer groups.

Regards,
Hui

----- Original Message -----
From: "Julien Nioche"
To: "Lucene Users List"
Sent: Tuesday, September 23, 2003 5:38 AM
Subject: Proposition :adding minMergeDoc to IndexWriter

> Hui,
>
> Concerning another point of your request list, I proposed a patch this
> weekend on the lucene-dev list and I totally forgot that this feature
> was requested on the user list.
>
> This new feature should help you set the number of Documents to be
> merged in memory independently of the mergeFactor.
>
> Any comments would be appreciated.
>
> Best regards
>
> Julien Nioche
> http://www.lingway.com
>
> ---------- Start of original message ----------
>
> From: "fp235-5"
> To: "lucene-dev"
> Cc:
> Date: Sat, 20 Sep 2003 16:06:06 +0200
> Subject: [PATCH] IndexWriter : controling the number of Docs merged
>
> Hello,
>
> Someone made a suggestion yesterday about adding a variable to
> IndexWriter in order to control the number of Documents merged in
> RAMDirectory independently of the mergeFactor. (I'm sorry, I don't
> remember who exactly, and the mail arrived at my office.)
> I'm proposing a tiny modification of IndexWriter to add this
> functionality. A variable minMergeDocs specifies the number of
> Documents to be merged in memory before starting a new Segment. The
> mergeFactor still controls the number of Segments created in the
> Directory, and thus it's possible to avoid the file number limitation
> problem.
>
> The diff file is attached.
>
> As noticed by Dmitry and Erik, there are no true JUnit tests. I'd be OK
> to write a JUnit test for this feature. The problem is that the
> SegmentInfos field is private in IndexWriter and can't be used to check
> the number and size of the Segments. I ran a test using the infoStream
> variable of IndexWriter - everything seems to be OK.
>
> Any comments / suggestions are welcome.
>
> Regards
>
> Julien
>
>
> ----- Original Message -----
> From: "hui"
> To: "Lucene Users List"
> Sent: Monday, September 22, 2003 3:40 PM
> Subject: Re: per-field Analyzer (was Re: some requests)
>
>
> > Good work, Erik.
> >
> > Hui
> >
> > ----- Original Message -----
> > From: "Erik Hatcher"
> > To: "Lucene Users List"
> > Sent: Saturday, September 20, 2003 4:13 AM
> > Subject: per-field Analyzer (was Re: some requests)
> >
> >
> > > On Friday, September 19, 2003, at 07:45 PM, Erik Hatcher wrote:
> > > > On Friday, September 19, 2003, at 11:15 AM, hui wrote:
> > > >> 1. Move the Analyzer down to field level from document level so
> > > >> that some fields can be given a special analyzer. Other fields
> > > >> still use the default analyzer from the document level.
> > > >> For example, I do not need to index the numbers for the
> > > >> "content" field. It helps me reduce the index size a lot when I
> > > >> have some Excel files. But I always need the "created_date" to
> > > >> be indexed though it is a number field.
> > > >>
> > > >> I know there are some workarounds posted in the group, but I
> > > >> think it would be a good feature to have.
> > > >
> > > > The "workaround" is to write a custom analyzer and have it do the
> > > > desired thing per-field.
> > > >
> > > > Hmmm.... just thinking out loud here without knowing if this is
> > > > possible, but could a generic "wrapper" Analyzer be written that
> > > > allows other analyzers to be used under the covers based on a
> > > > field name/analyzer mapping? If so, that would be quite cool and
> > > > save folks from having to write custom analyzers as much to
> > > > handle this pretty typical use-case. I'll look into this more in
> > > > the very near future personally, but feel free to have a look at
> > > > this yourself and see what you can come up with.
> > >
> > > What about something like this?
> > >
> > > import java.io.Reader;
> > > import java.util.HashMap;
> > > import java.util.Map;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.TokenStream;
> > >
> > > public class PerFieldWrapperAnalyzer extends Analyzer {
> > >   // Used for any field without its own entry in analyzerMap.
> > >   private Analyzer defaultAnalyzer;
> > >   // Maps a field name to the Analyzer registered for it.
> > >   private Map analyzerMap = new HashMap();
> > >
> > >   public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
> > >     this.defaultAnalyzer = defaultAnalyzer;
> > >   }
> > >
> > >   public void addAnalyzer(String fieldName, Analyzer analyzer) {
> > >     analyzerMap.put(fieldName, analyzer);
> > >   }
> > >
> > >   // Delegates to the field's analyzer, or to the default one.
> > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
> > >     if (analyzer == null) {
> > >       analyzer = defaultAnalyzer;
> > >     }
> > >
> > >     return analyzer.tokenStream(fieldName, reader);
> > >   }
> > > }
> > >
> > > This would allow you to construct a single analyzer out of others,
> > > on a per-field basis, including a default one for any fields that
> > > do not have a special one. Whether the constructor should take the
> > > map or the addAnalyzer method is implemented is debatable, but I
> > > prefer the addAnalyzer way. Maybe addAnalyzer could return 'this'
> > > so you could chain: new PerFieldWrapperAnalyzer(new
> > > StandardAnalyzer()).addAnalyzer("field1", new
> > > WhitespaceAnalyzer()).addAnalyzer(.....). And I'm more inclined to
> > > call this thing PerFieldAnalyzerWrapper instead. Any naming
> > > suggestions?
> > >
> > > This simple little class would seem to be the answer to a very
> > > commonly asked question.
> > >
> > > Thoughts? Should this be made part of the core?
> > >
> > > Erik
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
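
The chaining variant Erik floats in this thread (addAnalyzer returning 'this') could look roughly like the sketch below. It is not Lucene code. SimpleAnalyzer is a hypothetical stand-in for Lucene's Analyzer, reduced to one method so the example runs without Lucene on the classpath, and the whitespace() and dropNumbers() helpers are invented for illustration. The class name follows the one Erik says he is inclined to use, and the scenario mirrors hui's request: skip numbers in "content" but keep them in "created_date".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class PerFieldDemo {
    // Hypothetical helper: splits on whitespace and keeps every token.
    static SimpleAnalyzer whitespace() {
        return (fieldName, text) -> text.trim().split("\\s+");
    }

    // Hypothetical helper: splits on whitespace and drops all-digit tokens.
    static SimpleAnalyzer dropNumbers() {
        return (fieldName, text) -> Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.matches("\\d+"))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Numbers are dropped by default, but kept for "created_date".
        SimpleAnalyzer analyzer = new PerFieldAnalyzerWrapper(dropNumbers())
                .addAnalyzer("created_date", whitespace());

        System.out.println(String.join(" ", analyzer.tokens("content", "total 42 items")));
        System.out.println(String.join(" ", analyzer.tokens("created_date", "20030923")));
    }
}

// Assumption: a simplified stand-in for Lucene's Analyzer, not the real API.
interface SimpleAnalyzer {
    String[] tokens(String fieldName, String text);
}

// Delegates to a per-field analyzer, falling back to a default one.
class PerFieldAnalyzerWrapper implements SimpleAnalyzer {
    private final SimpleAnalyzer defaultAnalyzer;
    private final Map<String, SimpleAnalyzer> analyzerMap = new HashMap<>();

    PerFieldAnalyzerWrapper(SimpleAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Returns 'this' so that registrations can be chained.
    PerFieldAnalyzerWrapper addAnalyzer(String fieldName, SimpleAnalyzer analyzer) {
        analyzerMap.put(fieldName, analyzer);
        return this;
    }

    @Override
    public String[] tokens(String fieldName, String text) {
        SimpleAnalyzer analyzer = analyzerMap.getOrDefault(fieldName, defaultAnalyzer);
        return analyzer.tokens(fieldName, text);
    }
}
```

Returning 'this' keeps the constructor small while still allowing the whole field-to-analyzer mapping to be set up in one expression, which is the trade-off Erik weighs against passing a map to the constructor.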