From: Grant Ingersoll <grant.ingersoll@gmail.com>
Subject: Re: Issue while parsing XML files due to control characters, help appreciated.
Date: Sun, 18 Mar 2007 13:34:29 -0400
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Message-Id: <5BD5D3FF-1A33-465F-BFD4-B312731EFDB7@gmail.com>
In-Reply-To: <9540232.post@talk.nabble.com>

Move index writer creation, optimization and closure outside of your
loop. I would also use a SAX parser. Take a look at the demo code to
see an example of indexing.

Cheers,
Grant

On Mar 18, 2007, at 12:31 PM, Lokeya wrote:

>
> Erick Erickson wrote:
>>
>> Grant:
>>
>> I think that "Parsing 70 files totally takes 80 minutes" really
>> means parsing 70 metadata files containing 10,000 XML files each.....
>>
>> One Metadata File is split into 10,000 XML files which looks as below:
>>
>> <record>
>>   <header>
>>     <identifier>oai:CiteSeerPSU:1</identifier>
>>     <datestamp>1993-08-11</datestamp>
>>     <setSpec>CiteSeerPSUset</setSpec>
>>   </header>
>>   <metadata>
>>     <oai_citeseer:oai_citeseer
>>       xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>       xmlns:dc="http://purl.org/dc/elements/1.1/"
>>       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>       xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>>       http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>       <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>       <oai_citeseer:author name="Gabriele Scheler">
>>         <address>80290 Munchen, Germany</address>
>>         <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
>>       </oai_citeseer:author>
>>       <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
>>       <dc:description>This paper presents a collection of problems for
>>       natural language analysis derived mainly from theoretical
>>       linguistics. Most of these problems present major obstacles for
>>       computational systems of language interpretation. The set of given
>>       sentences can easily be scaled up by introducing more examples per
>>       problem. The construction of computational systems could benefit
>>       from such a collection, either using it directly for training and
>>       testing or as a set of benchmarks to qualify the performance of a
>>       NLP system. 1 Introduction The main part of this paper consists of
>>       a collection of problems for semantic analysis of natural
>>       language. The problems are arranged in the following way: example
>>       sentences; concise description of the problem; keyword for the
>>       type of problem. The sources (first appearance in print) of the
>>       sentences have been left out, because they are sometimes hard to
>>       track and will usually not be of much use, as they indicate a
>>       starting-point of discussion only. The keywords howeve...</dc:description>
>>       <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
>>       <dc:publisher>unknown</dc:publisher>
>>       <dc:date>1993-08-11</dc:date>
>>       <dc:format>ps</dc:format>
>>       <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>       <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>       <dc:language>en</dc:language>
>>       <dc:rights>unrestricted</dc:rights>
>>     </oai_citeseer:oai_citeseer>
>>   </metadata>
>> </record>
>>
>> From the above I will extract the Title and the Description tags to index.
>>
>> Code to do this:
>>
>> 1. I have 70 directories with names like oai_citeseerXYZ/
>> 2. Under each of the above directories, I have 10,000 XML files, each
>>    having the above XML data.
>> 3. The program does the following:
>>
>> File dir = new File(dirName);
>> String[] children = dir.list();
>> if (children == null) {
>>     // Either dir does not exist or is not a directory
>> } else {
>>     for (int ii = 0; ii < children.length; ii++) {
>>         // Get filename of file or directory
>>         String file = children[ii];
>>         //System.out.println("The name of file parsed now ==> " + file);
>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>         if (nl == null) {
>>             //System.out.println("Error shouldn't be thrown ...");
>>         }
>>         // Get the metadata element tags from the xml file
>>         ReadDump rd = new ReadDump();
>>
>>         // Get the extracted tags Title, Identifier and Description
>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>
>>         // Create an index under DIR
>>         IndexWriter writer = new IndexWriter("./FINAL/",
>>                 new StopStemmingAnalyzer(), false);
>>         Document doc = new Document();
>>
>>         // Get array list elements and add them as fields to doc
>>         for (int k = 0; k < alist_Title.size(); k++) {
>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>         }
>>
>>         for (int k = 0; k < alist_Descr.size(); k++) {
>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>         }
>>
>>         // Add the document created out of those fields to the
>>         // IndexWriter, which will create an index
>>         writer.addDocument(doc);
>>         writer.optimize();
>>         writer.close();
>>     }
>> }
>>
>> This is the main file which does indexing.
>>
>> Hope this will give you an idea.
>>
>>
>> Lokeya:
>> Can you confirm my supposition? And I'd still post the code
>> Grant requested if you can.....
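Grant's advice at the top of the thread, applied to this snippet, would look roughly as follows. This is a sketch only: ReadDump and StopStemmingAnalyzer are the poster's own classes, `NodeList` is assumed as the type of `nl`, and the `IndexWriter(String, Analyzer, boolean)` constructor is the Lucene form used in the original code. The key change is that the writer is created once, and optimize() and close() run once after all files are added, instead of per file:

```
// Sketch (assumptions noted above): hoist writer creation out of the loop.
IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);

File dir = new File(dirName);
String[] children = dir.list();
if (children != null) {
    for (int ii = 0; ii < children.length; ii++) {
        NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii], "metadata");
        ReadDump rd = new ReadDump();
        ArrayList alist_Title = rd.getElements(nl, "dc:title");
        ArrayList alist_Descr = rd.getElements(nl, "dc:description");

        Document doc = new Document();
        for (int k = 0; k < alist_Title.size(); k++) {
            doc.add(new Field("Title", alist_Title.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        for (int k = 0; k < alist_Descr.size(); k++) {
            doc.add(new Field("Description", alist_Descr.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        writer.addDocument(doc);   // adding stays inside the loop...
    }
}
writer.optimize();                 // ...optimize and close happen once
writer.close();
```

optimize() merges the index into a single segment, which is expensive; calling it (and re-opening the writer) once per document is a large part of why indexing is so much slower than parsing alone.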
>>
>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>> 8 minutes or so of which is spent reading/parsing, right? It'll be
>> important to know how much data you're indexing and how, so
>> the code snippet is doubly important....
>>
>> Erick
>>
>> On 3/18/07, Grant Ingersoll wrote:
>>>
>>> Can you post the relevant indexing code? Are you doing things like
>>> optimizing after every file? Both the parsing and the indexing
>>> sound really long. How big are these files?
>>>
>>> Also, I assume your machine is at least somewhat current, right?
>>>
>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>
>>>>
>>>> Thanks for your reply. I tried to check the I/O and parsing time
>>>> separately from the indexing time. I observed that I/O and parsing
>>>> of 70 files altogether takes 80 minutes, whereas when I combine
>>>> this with indexing, a single metadata file takes nearly 2 to 3
>>>> hours. So it looks like the IndexWriter takes the time, especially
>>>> when we are appending to the index.
>>>>
>>>> So what is the best approach to handle this?
>>>>
>>>> Thanks in Advance.
>>>>
>>>>
>>>> Erick Erickson wrote:
>>>>>
>>>>> See below...
>>>>>
>>>>> On 3/17/07, Lokeya wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to index the content from XML files which are
>>>>>> basically the metadata collected from a website which has a huge
>>>>>> collection of documents. This metadata XML has control characters
>>>>>> which cause errors while trying to parse using the DOM parser. I
>>>>>> tried to use encoding = UTF-8, but it looks like it doesn't cover
>>>>>> all the Unicode characters and I get an error. Also when I tried
>>>>>> to use UTF-16, I am getting "Prolog content not allowed here". So
>>>>>> my guess is there is no encoding which is going to cover almost
>>>>>> all Unicode characters.
>>>>>> So I tried to split my metadata files into small files and
>>>>>> process the records which don't throw parsing errors.
>>>>>>
>>>>>> But by breaking each metadata file into smaller files I get
>>>>>> 10,000 XML files per metadata file. I have 70 metadata files, so
>>>>>> altogether it becomes 700,000 files. Processing them individually
>>>>>> takes a really long time using Lucene; my guess is the I/O is
>>>>>> time consuming: opening every small XML file, loading it into a
>>>>>> DOM, extracting the required data and processing it.
>>>>>
>>>>>
>>>>> So why don't you measure and find out before trying to make the
>>>>> indexing step more efficient? You simply cannot optimize without
>>>>> knowing where you're spending your time. I can't tell you how
>>>>> often I've been wrong about "why my program was slow".
>>>>>
>>>>> In this case, it should be really simple. Just comment out the
>>>>> part where you index the data and run, say, one of your metadata
>>>>> files. I suspect that Cheolgoo Kang's response is cogent, and you
>>>>> indeed are spending your time parsing the XML. I further suspect
>>>>> that the problem is not disk IO, but the time spent parsing. But
>>>>> until you measure, you have no clue whether you should mess around
>>>>> with the Lucene parameters, or find another parser, or just live
>>>>> with it. Assuming that you comment out Lucene and things are still
>>>>> slow, the next step would be to just read in each file and NOT
>>>>> parse it, to figure out whether it's the IO or the parsing.
>>>>>
>>>>> Then you can worry about how to fix it.
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>>
>>>>>> Qn 1: Any suggestion to get this indexing time reduced? It would
>>>>>> be really great.
>>>>>>
>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>> indexing?
>>>>>>
>>>>>> Right now 12 metadata files take nearly 10 hrs, which is really
>>>>>> a long time.
>>>>>>
>>>>>> Help Appreciated.
>>>>>>
>>>>>> Much Thanks.
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
>>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9536099
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> Center for Natural Language Processing
>>> http://www.cnlp.org
>>>
>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9540232
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
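Grant's other suggestion, a SAX parser, avoids building a DOM for each of the 700,000 small files. A minimal sketch using only the JDK's javax.xml.parsers, assuming the element names dc:title and dc:description seen in the records above (the DcHandler class name and the inline sample XML are illustrative, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text of dc:title and dc:description elements as they stream by.
// The default SAXParserFactory is not namespace-aware, so we match on qName.
public class DcHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<String>();
    final List<String> descriptions = new ArrayList<String>();
    private StringBuilder buf = null;

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("dc:title") || qName.equals("dc:description")) {
            buf = new StringBuilder();   // start capturing character data
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (buf != null) buf.append(ch, start, length);  // may arrive in chunks
    }

    public void endElement(String uri, String local, String qName) {
        if (buf == null) return;
        if (qName.equals("dc:title")) titles.add(buf.toString());
        else if (qName.equals("dc:description")) descriptions.add(buf.toString());
        buf = null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<metadata>"
                + "<dc:title>36 Problems for Semantic Interpretation</dc:title>"
                + "<dc:description>A collection of problems.</dc:description>"
                + "</metadata>";
        DcHandler h = new DcHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("US-ASCII")), h);
        System.out.println(h.titles.get(0));
        System.out.println(h.descriptions.get(0));
    }
}
```

The extracted strings can then be added as Lucene fields exactly as in the posted code; the saving is that the file is streamed once instead of materialized as a DOM tree.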
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
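On the original control-character problem: no choice of encoding can help, because characters such as U+0001 through U+0008 are illegal in XML 1.0 no matter how the file is encoded (the spec permits only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, and #x10000-#x10FFFF). A hedged sketch of a pre-parse filter (the XmlCleaner class name is illustrative, not from the thread):

```java
// Strips characters that are not legal in an XML 1.0 document, so that the
// cleaned text can be handed to a DOM or SAX parser without a fatal error.
public class XmlCleaner {
    public static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // XML 1.0 Char production, restricted to the BMP here;
            // surrogate pairs for #x10000+ pass through as two chars in
            // the 0xD800-0xDFFF range and would need extra handling.
            boolean ok = c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD);
            if (ok) out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("ok\u0001text"));  // -> oktext
    }
}
```

Filtering once, before parsing, would also remove the need to split each metadata file into 10,000 pieces just to skip the records that fail to parse.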