From: Carsten Schnober
Organization: Institut für Deutsche Sprache
Date: Tue, 20 Nov 2012 10:14:54 +0100
To: java-user@lucene.apache.org
Subject: Re: TokenStreamComponents in Lucene 4.0
Message-ID: <50AB4A0E.80204@ids-mannheim.de>
In-Reply-To: <50AA61F6.20504@ids-mannheim.de>

On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,

> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This
seems to be the problem: the other documents are simply not analyzed.
Here's the loop that creates the fields and supposedly calls the analyzer.
Does anyone have a hint why this happens only for the first document?
The loop itself runs once for every document:

---------------------------------------------------------------
List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
    luceneDocument = new Document();

    /* Store document name/ID */
    Field idField = new StringField(titleFieldName, doc.getDocid(), Field.Store.YES);

    /* Store tokens */
    String layerFile = layer.getFile();
    Field textFieldAnalyzed = new TextField(textFieldName, layerFile, Field.Store.YES);

    luceneDocument.add(textFieldAnalyzed);
    luceneDocument.add(idField);

    try {
        writer.addDocument(luceneDocument);
    } catch (IOException e) {
        jlog.error("Error adding document "+doc.getDocid()+":\n"+e.getLocalizedMessage());
    }
}
[...]
writer.close();
-------------------------------------------------------------------

The class de.ids_mannheim.korap.main.Document defines our own document
objects, from which the relevant information can be read as shown in the
loop. The list 'documents' is filled in by a method that is called in
between.

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform