Message-ID: <435E8073.8000701@syr.edu>
Date: Tue, 25 Oct 2005 14:58:59 -0400
From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: Lucene and SAX
References: <20051025161350.45387.qmail@web26006.mail.ukl.yahoo.com> <00e001c5d988$7d0f6df0$0301a8c0@MALCOLM>
In-Reply-To: <00e001c5d988$7d0f6df0$0301a8c0@MALCOLM>

I am not familiar with the INEX collection. Could you post a sample?

Malcolm Clark wrote:

> Hi again,
> I am desperately asking for aid!!
>
> I have used the sandbox demo to parse the INEX collection. The problem
> is that it points to a volume file which references 50 other XML
> articles. Lucene treats this as only one document. Is there any method
> I'm overlooking that halts after each reference?
> Could somebody please help, and I won't post again until I submit
> something useful.
>
> The code is:
> public class XMLDocumentHandlerSAX
>     extends HandlerBase
> {
>     /** A buffer for each XML element */
>     private StringBuffer elementBuffer = new StringBuffer();
>
>     private Document mDocument;
>
>     // constructor
>     public XMLDocumentHandlerSAX(File xmlFile)
>         throws ParserConfigurationException, SAXException, IOException
>     {
>         SAXParserFactory spf = SAXParserFactory.newInstance();
>
>         SAXParser parser = spf.newSAXParser();
>         parser.parse(xmlFile, this);
>     }
>
>     // call at document start
>     public void startDocument()
>     {
>         mDocument = new Document();
>         //mDocument = new Document();
>         elementBuffer.setLength(0);
>     }
>
>     // call at element start
>     public void startElement(String localName, AttributeList atts)
>         throws SAXException
>     {
>         if (localName.equals("article")) {
>             elementBuffer.setLength(0);
>         }
>     }
>
>     // call when cdata found
>     public void characters(char[] text, int start, int length)
>     {
>         elementBuffer.append(text, start, length);
>     }
>
>     // call at element end
>     public void endElement(String localName)
>         throws SAXException
>     {
>         if (localName.equals("article")) {
>             System.out.println("Article: "+elementBuffer.length());
>             elementBuffer.setLength(0);
>         }
>
>         mDocument.add(Field.Text(localName,elementBuffer.toString()));
>         System.out.println("EB: "+elementBuffer);
>         elementBuffer.setLength(0);
>     }
>
>     public Document getDocument()
>     {
>         return mDocument;
>     }
>
>     public static void main(String[] args)
>         throws Exception
>     {
>         try
>         {
>             Date start = new Date();
>             String indexDir = "C:\\LuceneDemo\\index";
>             IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
>             indexDocs(writer, new File("C:\\1995\\volume.xml"));
>
>             writer.optimize();
>             writer.close();
>
>             Date end = new Date();
>         }
>         catch (Exception e)
>         {
>             System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
>             throw e;
>         }
>     }
>
>     public static void indexDocs(IndexWriter writer, File file)
>         throws Exception
>     {
>         if (file.isDirectory())
>         {
>             String[] files = file.list();
>             for (int i = 0; i < files.length; i++)
>                 indexDocs(writer, new File(file, files[i]));
>         }
>         else
>         {
>             System.out.println("adding " + file);
>
>             XMLDocumentHandlerSAX hdlr = new XMLDocumentHandlerSAX(file);
>             StandardAnalyzer anal = new StandardAnalyzer();
>             writer.addDocument(hdlr.getDocument(),anal);
>             System.out.println("Documents added to Index: "+writer.docCount());
>         }
>     }
> }
> Thanks very much again.
> MC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
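
For illustration only, here is a minimal sketch of one way to get one document per article rather than one per file. It assumes (this is a guess, since no INEX sample has been posted) that the volume file pulls the articles in so that the parser sees a single stream in which each article is wrapped in an <article> element. The quoted handler creates its Document once in startDocument(), so everything in that stream ends up in a single Document. The sketch starts a fresh record at each <article> start tag and hands it off at the matching end tag. To stay runnable without Lucene on the classpath, it collects plain field maps; the class name ArticleSplitter and its parse() helper are hypothetical, and with Lucene the hand-off point is where a Document would be built and passed to writer.addDocument().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects one field map per article element instead of one per file. */
public class ArticleSplitter extends DefaultHandler {
    // Assumed element name; adjust to whatever wraps a single article.
    private static final String ARTICLE = "article";

    private final List<Map<String, String>> articles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> current; // null while outside an article

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (ARTICLE.equals(qName)) {
            current = new LinkedHashMap<>(); // a new article starts a new record
        }
        text.setLength(0); // simplification: text preceding a child element is dropped
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (ARTICLE.equals(qName)) {
            // Hand-off point: with Lucene, build a Document from the fields
            // and call writer.addDocument(...) here.
            articles.add(current);
            current = null;
        } else if (current != null) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                // Repeated elements are concatenated into one field value.
                current.merge(qName, value, (a, b) -> a + " " + b);
            }
        }
        text.setLength(0);
    }

    public List<Map<String, String>> getArticles() {
        return articles;
    }

    /** Hypothetical convenience helper: parse an XML string and return the records. */
    public static List<Map<String, String>> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ArticleSplitter handler = new ArticleSplitter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getArticles();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><article><title>First</title></article>"
                   + "<article><title>Second</title></article></books>";
        System.out.println(parse(xml));
    }
}
```

The sketch uses DefaultHandler (SAX2) rather than the deprecated HandlerBase from the quoted code, and matches on qName because the default SAXParserFactory is not namespace-aware. Whether this fits the INEX files depends on how the volume file actually references its articles.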