lucene-java-user mailing list archives

From "Cheolgoo Kang" <app...@gmail.com>
Subject Re: Issue while parsing XML files due to control characters, help appreciated.
Date Sat, 17 Mar 2007 05:25:19 GMT
On 3/17/07, Lokeya <lokeya@gmail.com> wrote:
>
> Hi,
>
> I am trying to index content from XML files that hold metadata collected
> from a website with a huge collection of documents. The metadata XML
> contains control characters, which cause errors when parsing with the DOM
> parser. I tried encoding = UTF-8, but it looks like it doesn't cover all
> the Unicode characters, and I get an error. When I tried UTF-16, I got
> "Prolog content not allowed here". So my guess is that no encoding covers
> almost all Unicode characters. I therefore split my metadata files into
> small files and process only the records that don't throw a parsing error.

Are you using CDATA sections in your XML documents?
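
Note that CDATA would not help with the control characters themselves: XML 1.0 forbids most characters below U+0020 everywhere in a document, CDATA sections included, so the parser rejects them whatever encoding is declared. One option, if dropping those characters is acceptable for your data, is to filter the text before it reaches the parser. A minimal sketch follows; the class and method names (XmlCharFilter, stripInvalidXmlChars) are made up for illustration and are not part of Lucene or the JDK.

```java
/** Hypothetical helper: removes characters that XML 1.0 does not allow. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, silently dropping every character XML 1.0 forbids. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);          // handles surrogate pairs
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);        // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

Tab, line feed, and carriage return survive the filter; other control characters and unpaired surrogates are removed. For large files you would probably apply the same check in a streaming Reader wrapper rather than on a whole String.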

>
> But by breaking each metadata file into smaller files I get 10,000 XML files
> per metadata file. I have 70 metadata files, so altogether that is 700,000
> files. Processing them individually with Lucene takes a really long time. My
> guess is that the I/O is time-consuming: opening every small XML file, loading
> it into DOM, extracting the required data, and processing it.

I don't think DOM is appropriate for a batch job like yours. Why
don't you try SAX or XPP? Also, try adjusting IndexWriter
parameters such as mergeFactor and minMergeDocs.
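
To illustrate the SAX suggestion, here is a rough sketch that streams through one large metadata file and collects a field per record without building a DOM tree, so the file does not need to be split. The element name "title", the class name RecordTitleHandler, and the method parseTitles are assumptions for the example; substitute whatever your metadata actually uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of every <title> element. */
final class RecordTitleHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<String>();
    private StringBuilder text;   // non-null only while inside <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            // In an indexer, this is where a Lucene Document would be built
            // and handed to the IndexWriter instead of stored in a list.
            titles.add(text.toString());
            text = null;
        }
    }

    /** Parses the XML from the reader and returns all <title> values in order. */
    static List<String> parseTitles(Reader xml) {
        RecordTitleHandler handler = new RecordTitleHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(xml), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.titles;
    }
}
```

Because the handler only keeps the current element's text in memory, one pass over each of the 70 metadata files should be enough, which avoids opening hundreds of thousands of small files.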

>
> Qn 1: Any suggestions for reducing this indexing time? That would be really
> great.
>
> Qn 2: Am I overlooking something in Lucene with respect to indexing?
>
> Right now 12 metadata files take nearly 10 hours, which is a really long time.
>
> Help Appreciated.
>
> Much Thanks.
> --
> View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Cheolgoo


