lucene-java-user mailing list archives

From: Grant Ingersoll <grant.ingers...@gmail.com>
Subject: Re: Issue while parsing XML files due to control characters, help appreciated.
Date: Sun, 18 Mar 2007 17:34:29 GMT
Move the index writer creation, optimization, and closing outside of
your loop.  I would also use a SAX parser.  Take a look at the demo
code to see an example of indexing.
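
For example, a rough sketch of that restructuring (StopStemmingAnalyzer
is the analyzer from your snippet below; buildDocument is a made-up
stand-in for your parsing and field-extraction code):

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        public static void index(String dirName) throws Exception {
            // Create the writer once for the whole run.
            IndexWriter writer =
                new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), true);
            String[] children = new File(dirName).list();
            for (int i = 0; children != null && i < children.length; i++) {
                // Parse one file and turn it into a Document (your code).
                Document doc = buildDocument(dirName + "/" + children[i]);
                writer.addDocument(doc);
            }
            writer.optimize(); // once, after the last document
            writer.close();    // once
        }
    }

addDocument() is cheap to call repeatedly on one writer; opening,
optimizing and closing per file rewrites the whole index for every
document.  For the parsing side, a SAX handler along these lines (a
sketch; the class name is made up, the element names come from the XML
below) pulls out just the two tags without building a DOM per file:

    import java.util.ArrayList;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class TitleDescriptionHandler extends DefaultHandler {
        private final StringBuffer buf = new StringBuffer();
        private boolean collecting;
        public final ArrayList titles = new ArrayList();
        public final ArrayList descriptions = new ArrayList();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            // Only buffer text inside the two elements we care about.
            collecting = "dc:title".equals(qName)
                      || "dc:description".equals(qName);
            buf.setLength(0);
        }
        public void characters(char[] ch, int start, int len) {
            if (collecting) buf.append(ch, start, len);
        }
        public void endElement(String uri, String local, String qName) {
            if ("dc:title".equals(qName)) titles.add(buf.toString());
            else if ("dc:description".equals(qName)) descriptions.add(buf.toString());
            collecting = false;
        }
    }

You would feed it each file with something like
SAXParserFactory.newInstance().newSAXParser().parse(file, handler).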

Cheers,
Grant

On Mar 18, 2007, at 12:31 PM, Lokeya wrote:

>
>
> Erick Erickson wrote:
>>
>> Grant:
>>
>>  I think that "Parsing 70 files totally takes 80 minutes" really
>> means parsing 70 metadata files containing 10,000 XML
>> files each.....
>>
>> One metadata file is split into 10,000 XML files, each of which
>> looks like the one below:
>>
>> <root>
>> 	<record>
>> 	<header>
>> 	<identifier>oai:CiteSeerPSU:1</identifier>
>> 	<datestamp>1993-08-11</datestamp>
>> 	<setSpec>CiteSeerPSUset</setSpec>
>> 	</header>
>> 		<metadata>
>> 		<oai_citeseer:oai_citeseer
>> xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"  
>> xmlns:dc
>> ="http://purl.org/dc/elements/1.1/"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>> http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>> 		<dc:title>36 Problems for Semantic Interpretation</dc:title>
>> 		<oai_citeseer:author name="Gabriele Scheler">
>> 			<address>80290 Munchen , Germany</address>
>> 			<affiliation>Institut fur Informatik; Technische Universitat
>> Munchen</affiliation>
>> 		</oai_citeseer:author>
>> 		<dc:subject>Gabriele Scheler 36 Problems for Semantic
>> Interpretation</dc:subject>
>> 		<dc:description>This paper presents a collection of problems for
>> natural language analysis derived mainly from theoretical linguistics.
>> Most of these problems present major obstacles for computational
>> systems of language interpretation. The set of given sentences can
>> easily be scaled up by introducing more examples per problem. The
>> construction of computational systems could benefit from such a
>> collection, either using it directly for training and testing or as a
>> set of benchmarks to qualify the performance of a NLP system. 1
>> Introduction The main part of this paper consists of a collection of
>> problems for semantic analysis of natural language. The problems are
>> arranged in the following way: example sentences, concise description
>> of the problem, keyword for the type of problem. The sources (first
>> appearance in print) of the sentences have been left out, because
>> they are sometimes hard to track and will usually not be of much use,
>> as they indicate a starting-point of discussion only. The keywords
>> howeve...
>> 		</dc:description>
>> 		<dc:contributor>The Pennsylvania State University CiteSeer
>> Archives</dc:contributor>
>> 		<dc:publisher>unknown</dc:publisher>
>> 		<dc:date>1993-08-11</dc:date>
>> 		<dc:format>ps</dc:format>
>> 		<dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>> 		<dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>> 		<dc:language>en</dc:language>
>> 		<dc:rights>unrestricted</dc:rights>
>> 		</oai_citeseer:oai_citeseer>
>> 		</metadata>
>> 	</record>
>> </root>
>>
>>
>> From the above I will extract the Title and the Description tags  
>> to index.
>>
>> Code to do this:
>>
>> 1. I have 70 directories with names like oai_citeseerXYZ/
>> 2. Under each of the above directories, I have 10,000 xml files, each
>> containing the xml data above.
>> 3. The program does the following:
>>
>> File dir = new File(dirName);
>> String[] children = dir.list();
>> if (children == null) {
>>     // Either dir does not exist or is not a directory.
>> } else {
>>     for (int ii = 0; ii < children.length; ii++) {
>>         // Get the filename of the file or directory.
>>         String file = children[ii];
>>
>>         // Get the metadata element nodes from the xml file.
>>         nl = ReadDump.getNodeList(dirName + "/" + file, "metadata");
>>         if (nl == null) {
>>             // An error shouldn't be thrown here...
>>         }
>>         ReadDump rd = new ReadDump();
>>
>>         // Get the extracted tags: Title and Description.
>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>
>>         // Create an index under ./FINAL/ (false = append to an existing index).
>>         IndexWriter writer = new IndexWriter("./FINAL/",
>>             new StopStemmingAnalyzer(), false);
>>         Document doc = new Document();
>>
>>         // Add the array list elements as fields to the doc.
>>         for (int k = 0; k < alist_Title.size(); k++) {
>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>                 Field.Store.YES, Field.Index.UN_TOKENIZED));
>>         }
>>         for (int k = 0; k < alist_Descr.size(); k++) {
>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>                 Field.Store.YES, Field.Index.UN_TOKENIZED));
>>         }
>>
>>         // Add the document built from those fields to the IndexWriter,
>>         // which will create the index.
>>         writer.addDocument(doc);
>>         writer.optimize();
>>         writer.close();
>>     }
>> }
>> 							
>>
>> This is the main file that does the indexing.
>>
>> I hope this gives you an idea.
>>
>>
>> Lokeya:
>> Can you confirm my supposition? And I'd still post the code
>> Grant requested if you can.....
>>
>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>> with 8 minutes or so of that spent reading/parsing, right? It'll be
>> important to know how much data you're indexing and how, so the
>> code snippet is doubly important....
>>
>> Erick
>>
>> On 3/18/07, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>> Can you post the relevant indexing code?  Are you doing things like
>>> optimizing after every file?  Both the parsing and the indexing
>>> times sound really long.  How big are these files?
>>>
>>> Also, I assume your machine is at least somewhat current, right?
>>>
>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>
>>>>
>>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>>> separately from the indexing time. I observed that I/O and parsing
>>>> for all 70 files takes 80 minutes in total, whereas when I combine
>>>> this with indexing, a single metadata file takes nearly 2 to 3
>>>> hours. So it looks like the IndexWriter is what takes the time,
>>>> especially when appending to an existing index.
>>>>
>>>> So what is the best approach to handle this?
>>>>
>>>> Thanks in Advance.
>>>>
>>>>
>>>> Erick Erickson wrote:
>>>>>
>>>>> See below...
>>>>>
>>>>> On 3/17/07, Lokeya <lokeya@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to index the content of XML files, which are
>>>>>> basically metadata collected from a website that hosts a huge
>>>>>> collection of documents. This metadata xml contains control
>>>>>> characters, which cause errors when I try to parse it with the
>>>>>> DOM parser. I tried to use encoding = UTF-8, but it doesn't seem
>>>>>> to cover all the unicode characters and I get an error. When I
>>>>>> tried UTF-16, I got "Prolog content not allowed here". So my
>>>>>> guess is that there is no encoding which covers almost all
>>>>>> unicode characters. So I tried splitting my metadata files into
>>>>>> small files and processing only the records which don't throw a
>>>>>> parsing error.
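>>>>>>
>>>>>> (An alternative to splitting would be to strip the characters
>>>>>> XML 1.0 forbids before parsing; a rough sketch, with a made-up
>>>>>> helper name:)
>>>>>>
>>>>>> static String stripInvalidXmlChars(String in) {
>>>>>>     StringBuffer out = new StringBuffer(in.length());
>>>>>>     for (int i = 0; i < in.length(); i++) {
>>>>>>         char c = in.charAt(i);
>>>>>>         // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
>>>>>>         if (c == 0x9 || c == 0xA || c == 0xD
>>>>>>                 || (c >= 0x20 && c <= 0xD7FF)
>>>>>>                 || (c >= 0xE000 && c <= 0xFFFD)) {
>>>>>>             out.append(c);
>>>>>>         }
>>>>>>     }
>>>>>>     return out.toString();
>>>>>> }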
>>>>>>
>>>>>> But by breaking each metadata file into smaller files, I get
>>>>>> 10,000 xml files per metadata file. I have 70 metadata files, so
>>>>>> altogether that becomes 700,000 files. Processing them
>>>>>> individually takes a really long time with Lucene; my guess is
>>>>>> that the I/O is time consuming: opening every small xml file,
>>>>>> loading it into a DOM, extracting the required data, and
>>>>>> processing it.
>>>>>
>>>>>
>>>>>
>>>>> So why don't you measure and find out before trying to make the
>>>>> indexing
>>>>> step more efficient? You simply cannot optimize without knowing  
>>>>> where
>>>>> you're spending your time. I can't tell you how often I've been  
>>>>> wrong
>>>>> about
>>>>> "why my program was slow" <G>.
>>>>>
>>>>> In this case, it should be really simple. Just comment out the
>>>>> part where you index the data and run, say, one of your metadata
>>>>> files. I suspect that Cheolgoo Kang's response is cogent, and you
>>>>> indeed are spending your time parsing the XML. I further suspect
>>>>> that the problem is not disk IO, but the time spent parsing. But
>>>>> until you measure, you have no clue whether you should mess
>>>>> around with the Lucene parameters, or find another parser, or
>>>>> just live with it. Assuming that you comment out Lucene and
>>>>> things are still slow, the next step would be to just read in
>>>>> each file and NOT parse it, to figure out whether it's the IO or
>>>>> the parsing.
>>>>>
>>>>> Then you can worry about how to fix it.
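>>>>>
>>>>> Something as crude as this around each phase would already tell
>>>>> you where the time goes (just a sketch; the two method names are
>>>>> placeholders for your code):
>>>>>
>>>>> long t0 = System.currentTimeMillis();
>>>>> parseOneMetadataFile();   // your XML/DOM work
>>>>> long t1 = System.currentTimeMillis();
>>>>> indexParsedRecords();     // your Lucene work
>>>>> long t2 = System.currentTimeMillis();
>>>>> System.out.println("parse: " + (t1 - t0)
>>>>>     + " ms, index: " + (t2 - t1) + " ms");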
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>>
>>>>>> Qn 1: Any suggestions to get this indexing time reduced? It
>>>>>> would be really great.
>>>>>>
>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>> indexing?
>>>>>>
>>>>>> Right now 12 metadata files take nearly 10 hrs, which is a
>>>>>> really long time.
>>>>>>
>>>>>> Help Appreciated.
>>>>>>
>>>>>> Much Thanks.
>>>>>
>>>>>
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> Center for Natural Language Processing
>>> http://www.cnlp.org
>>>
>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>>>
>>>
>>>
>>>
>>
>>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

