Hi,
I am trying to index the content of XML files that hold metadata collected
from a website with a huge collection of documents. The metadata XML contains
control characters, which cause errors when I parse it with the DOM parser. I
tried declaring encoding="UTF-8", but it looks like it does not cover all the
Unicode characters in the files, and I still get errors. When I tried UTF-16
instead, I got a "Content is not allowed in prolog" error. So my guess is that
no encoding is going to cover almost all Unicode characters. As a workaround,
I split my metadata files into small files and process only the records that
do not throw a parsing error.
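
For illustration, here is a minimal sketch of filtering the offending characters out before the DOM parser sees them. It assumes the errors come from control characters that XML 1.0 forbids (anything below U+0020 other than tab, line feed and carriage return), not from the encoding itself, and that the caller opens the file with the correct charset. The class name XmlCharFilterReader and the helper parseFiltered are hypothetical, not part of any library.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: silently drops characters that XML 1.0 does not allow. */
final class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | supplementary. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c) // assumption: surrogates arrive in valid pairs
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // -1 signals end of stream
            }
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact the legal characters in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was illegal; read again instead of returning 0.
        }
    }

    /** Parses the filtered character stream into a DOM tree. */
    static Document parseFiltered(Reader raw)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new XmlCharFilterReader(raw)));
    }
}
```

If this assumption holds, the filter could be wrapped around something like an InputStreamReader opened on the original metadata file with the UTF-8 charset, which might make the splitting step unnecessary. Dropping characters changes the text, so replacing them with a space may be preferable in some fields.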
But breaking a metadata file into smaller files gives me about 10,000 XML
files per metadata file. I have 70 metadata files, so altogether that becomes
700,000 files. Processing them individually with Lucene takes a really long
time. My guess is that I/O is the time-consuming part: opening every small XML
file, loading it into a DOM, extracting the required data, and processing it.
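
To make the I/O concern concrete, below is a minimal sketch of the alternative to one DOM per small file: streaming through a single large metadata file with StAX (javax.xml.stream) and handing each record to a callback. The element name "record", the flat text-only fields, and the class name RecordStreamer are all assumptions for illustration; the real metadata layout may differ.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: reads records one at a time without building a DOM. */
final class RecordStreamer {

    // Assumed name of the element that wraps one metadata record.
    private static final String RECORD = "record";

    /**
     * Streams through the input and calls the handler once per record with a
     * map of child element name to text. Returns the number of records seen.
     * Assumes every child of a record contains text only.
     */
    static int stream(Reader in, Consumer<Map<String, String>> handler)
            throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Map<String, String> current = null;
        int count = 0;
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (RECORD.equals(name)) {
                        current = new LinkedHashMap<>();
                    } else if (current != null) {
                        // getElementText() consumes up to the field's end tag.
                        current.put(name, xml.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && RECORD.equals(xml.getLocalName())) {
                    handler.accept(current);
                    current = null;
                    count++;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }
}
```

In an indexer, the handler is where a Lucene Document would presumably be built from the map and passed to addDocument on a single IndexWriter that stays open for the whole run. Only one record is held in memory at a time, and each metadata file is opened once instead of 10,000 times.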
Qn 1: Any suggestions for reducing this indexing time? It would be really
great.
Qn 2: Am I overlooking something in Lucene with respect to indexing?
Right now 12 metadata files take nearly 10 hours, which works out to roughly
0.3 seconds per small file and is a really long time.
Help appreciated.
Many thanks.