lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Issue while parsing XML files due to control characters, help appreciated.
Date Sun, 18 Mar 2007 13:39:20 GMT
Grant:

I think that "Parsing 70 files totally takes 80 minutes" really
means parsing 70 metadata files containing 10,000 XML
files each.....

Lokeya:
Can you confirm my supposition? And I'd still post the code
Grant requested if you can.....

So, you're talking about indexing 10,000 xml files in 2-3 hours,
8 minutes or so which is spent reading/parsing, right? It'll be
important to know how much data you're indexing and now, so
the code snippet is doubly important....

Erick

On 3/18/07, Grant Ingersoll <gsingers@apache.org> wrote:
>
> Can you post the relevant indexing code?  Are you doing things like
> optimizing after every file?  Both the parsing and the indexing sound
> really long.  How big are these files?
>
> Also, I assume you machine is at least somewhat current, right?
>
> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>
> >
> > Thanks for your reply. I tried to check if the I/O and Parsing is
> > taking time
> > separately and Indexing time also. I observed that I/O and Parsing
> > 70 files
> > totally takes 80 minutes where as when I combine this with Indexing
> > for a
> > single Metadata file it nearly 2 to 3 hours. So looks like
> > IndexWriter takes
> > time that too when we are appending to the Index file this happens.
> >
> > So what is the best approach to handle this?
> >
> > Thanks in Advance.
> >
> >
> > Erick Erickson wrote:
> >>
> >> See below...
> >>
> >> On 3/17/07, Lokeya <lokeya@gmail.com> wrote:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I am trying to index the content from XML files which are
> >>> basically the
> >>> metadata collected from a website which have a huge collection of
> >>> documents.
> >>> This metadata xml has control characters which causes errors
> >>> while trying
> >>> to
> >>> parse using the DOM parser. I tried to use encoding = UTF-8 but
> >>> looks
> >>> like
> >>> it doesn't cover all the unicode characters and I get error. Also
> >>> when I
> >>> tried to use UTF-16, I am getting Prolog content not allowed
> >>> here. So my
> >>> guess is there is no enoding which is going to cover almost all
> >>> unicode
> >>> characters. So I tried to split my metadata files into small
> >>> files and
> >>> processing records which doesnt throw parsing error.
> >>>
> >>> But by breaking metadata file into smaller files I get, 10,000
> >>> xml files
> >>> per
> >>> metadata file. I have 70 metadata files, so altogether it becomes
> >>> 7,00,000
> >>> files. Processing them individually takes really long time using
> >>> Lucene,
> >>> my
> >>> guess is I/O is time consuing, like opening every small xml file
> >>> loading
> >>> in
> >>> DOM extracting required data and processing.
> >>
> >>
> >>
> >> So why don't you measure and find out before trying to make the
> >> indexing
> >> step more efficient? You simply cannot optimize without knowing where
> >> you're spending your time. I can't tell you how often I've been wrong
> >> about
> >> "why my program was slow" <G>.
> >>
> >> In this case, it should be really simple. Just comment out the
> >> part where
> >> you index the data and run, say, one of your metadata files.. I
> >> suspect
> >> that
> >> Cheolgoo Kang's response is cogent, and you indeed are spending your
> >> time parsing the XML. I further suspect that the problem is not
> >> disk IO,
> >> but the time spent parsing. But until you measure, you have no clue
> >> whether you should mess around with the Lucene parameters, or find
> >> another parser, or just live with it.. Assuming that you comment out
> >> Lucene and things are still slow, the next step would be to just
> >> read in
> >> each file and NOT parse it to figure out whether it's the IO or the
> >> parsing.
> >>
> >> Then you can worry about how to fix it..
> >>
> >> Best
> >> Erick
> >>
> >>
> >> Qn  1: Any suggestion to get this indexing time reduced? It would be
> >> really
> >>> great.
> >>>
> >>> Qn 2 : Am I overlooking something in Lucene with respect to
> >>> indexing?
> >>>
> >>> Right now 12 metadata files take 10 hrs nearly which is really a
> >>> long
> >>> time.
> >>>
> >>> Help Appreciated.
> >>>
> >>> Much Thanks.
> >>> --
> >>> View this message in context:
> >>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-
> >>> control-characters%2C-help-appreciated.-tf3418085.html#a9526527
> >>> Sent from the Lucene - Java Users mailing list archive at
> >>> Nabble.com.
> >>>
> >>>
> >>> --------------------------------------------------------------------
> >>> -
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>>
> >>
> >>
> >
> > --
> > View this message in context: http://www.nabble.com/Issue-while-
> > parsing-XML-files-due-to-control-characters%2C-help-appreciated.-
> > tf3418085.html#a9536099
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
> LuceneFAQ
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message