lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lokeya <lok...@gmail.com>
Subject Re: Issue while parsing XML files due to control characters, help appreciated.
Date Sun, 18 Mar 2007 05:00:57 GMT

Thanks for your reply. I tried to check if the I/O and Parsing is taking time
separately and Indexing time also. I observed that I/O and Parsing 70 files
totally takes 80 minutes where as when I combine this with Indexing for a
single Metadata file it nearly 2 to 3 hours. So looks like IndexWriter takes
time that too when we are appending to the Index file this happens. 

So what is the best approach to handle this?

Thanks in Advance.


Erick Erickson wrote:
> 
> See below...
> 
> On 3/17/07, Lokeya <lokeya@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> I am trying to index the content from XML files which are basically the
>> metadata collected from a website which have a huge collection of
>> documents.
>> This metadata xml has control characters which causes errors while trying
>> to
>> parse using the DOM parser. I tried to use encoding = UTF-8 but looks
>> like
>> it doesn't cover all the unicode characters and I get error. Also when I
>> tried to use UTF-16, I am getting Prolog content not allowed here. So my
>> guess is there is no enoding which is going to cover almost all unicode
>> characters. So I tried to split my metadata files into small files and
>> processing records which doesnt throw parsing error.
>>
>> But by breaking metadata file into smaller files I get, 10,000 xml files
>> per
>> metadata file. I have 70 metadata files, so altogether it becomes
>> 7,00,000
>> files. Processing them individually takes really long time using Lucene,
>> my
>> guess is I/O is time consuing, like opening every small xml file loading
>> in
>> DOM extracting required data and processing.
> 
> 
> 
> So why don't you measure and find out before trying to make the indexing
> step more efficient? You simply cannot optimize without knowing where
> you're spending your time. I can't tell you how often I've been wrong
> about
> "why my program was slow" <G>.
> 
> In this case, it should be really simple. Just comment out the part where
> you index the data and run, say, one of your metadata files.. I suspect
> that
> Cheolgoo Kang's response is cogent, and you indeed are spending your
> time parsing the XML. I further suspect that the problem is not disk IO,
> but the time spent parsing. But until you measure, you have no clue
> whether you should mess around with the Lucene parameters, or find
> another parser, or just live with it.. Assuming that you comment out
> Lucene and things are still slow, the next step would be to just read in
> each file and NOT parse it to figure out whether it's the IO or the
> parsing.
> 
> Then you can worry about how to fix it..
> 
> Best
> Erick
> 
> 
> Qn  1: Any suggestion to get this indexing time reduced? It would be
> really
>> great.
>>
>> Qn 2 : Am I overlooking something in Lucene with respect to indexing?
>>
>> Right now 12 metadata files take 10 hrs nearly which is really a long
>> time.
>>
>> Help Appreciated.
>>
>> Much Thanks.
>> --
>> View this message in context:
>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9536099
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message