lucene-java-user mailing list archives

From aslam bari <iamasla...@yahoo.co.in>
Subject Re: Big size xml file indexing
Date Mon, 22 Jan 2007 05:45:49 GMT
Hi Saikrishna,
Unfortunately my XML structure is not uniform like that: sometimes a node is very long and
sometimes very short. A single element may span the whole document, or there may be many
elements of different types. So I need your help on how to parse it efficiently, with less
memory use and faster processing.

I read that SAXBuilder is slow and memory-hungry. How can I replace it with something else?
My code is:

SAXBuilder builder = new SAXBuilder();

// content is a ByteArrayInputStream; can I change it to something better?
Document doc = builder.build(content);  // org.jdom.Document

for (...)
{
    // get the nodes for each of the many XPath queries
    XPath xpath = XPath.newInstance(...);
    xpath.selectNodes(doc);
    ...
}
 
Thanks...
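
One way to avoid holding the whole tree in memory is a streaming parser. The following is a minimal sketch, not a drop-in replacement for the JDOM/XPath code above: it uses the JDK's StAX API (javax.xml.stream) and assumes the values of interest can be selected by element name and that those elements contain text only. The class name StreamingExtractor and the method extract are hypothetical, and name-based selection is far weaker than real XPath.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the text of selected elements in one pass,
// without building a document tree.
final class StreamingExtractor {

    // Returns, for each wanted element name, the text of every occurrence in
    // document order. Assumption: wanted elements contain text only;
    // getElementText() fails on elements that have child elements.
    static Map<String, List<String>> extract(Reader in, Set<String> wanted) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && wanted.contains(reader.getLocalName())) {
                    String name = reader.getLocalName();
                    // Reads up to the matching end tag and returns the text.
                    String text = reader.getElementText();
                    result.computeIfAbsent(name, k -> new ArrayList<>()).add(text);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return result;
    }
}
```

With a sketch like this, memory use depends on the size of the collected values rather than on the size of the file. In an indexer, each value could be handed to Lucene as soon as it is read instead of being kept in a map.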

----- Original Message ----
From: saikrishna venkata pendyala <pvsaikrishna@gmail.com>
To: java-user@lucene.apache.org
Sent: Monday, 22 January, 2007 10:44:50 AM
Subject: Re: Big size xml file indexing


Hi,
       Nothing needs to change in the indexing process. What is required is a
little pre-processing.
       If the structure of your XML file is the same as what I described
earlier, then split the 35 MB file into small files, and make sure that each
new small file is well-formed XML.
       Now index the several small files instead of the one large
file.
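
The splitting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used in this thread: it assumes the large file has a single root element whose direct children are the records, that no namespaces are involved, and it uses the JDK's StAX API (javax.xml.stream). The names XmlSplitter, split and the wrapper element chunk are hypothetical. Each chunk is passed to a callback as a string; in practice the callback would write it to its own small file.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams through a large XML input and emits small,
// well-formed chunks, each holding at most recordsPerChunk records.
final class XmlSplitter {

    // Returns the number of chunks handed to the sink.
    static int split(Reader in, int recordsPerChunk, Consumer<String> sink) {
        if (recordsPerChunk < 1) {
            throw new IllegalArgumentException("recordsPerChunk must be >= 1");
        }
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            XMLEventFactory eventFactory = XMLEventFactory.newInstance();
            StringWriter buffer = null;
            XMLEventWriter writer = null;
            int depth = 0;          // 1 = root element, 2 = a record
            int recordsInChunk = 0;
            int chunkCount = 0;

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 2 && writer == null) {
                        // First record of a new chunk: open a fresh wrapper.
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLEventWriter(buffer);
                        writer.add(eventFactory.createStartElement("", "", "chunk"));
                    }
                }
                if (depth >= 2) {
                    writer.add(event);  // copy everything inside a record
                }
                if (event.isEndElement()) {
                    if (depth == 2 && ++recordsInChunk == recordsPerChunk) {
                        finish(writer, eventFactory, buffer, sink);
                        writer = null;
                        recordsInChunk = 0;
                        chunkCount++;
                    }
                    depth--;
                }
            }
            if (writer != null) {       // last, partially filled chunk
                finish(writer, eventFactory, buffer, sink);
                chunkCount++;
            }
            return chunkCount;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not split XML", e);
        }
    }

    // Closes the wrapper element and hands the finished chunk to the sink.
    private static void finish(XMLEventWriter writer, XMLEventFactory eventFactory,
                               StringWriter buffer, Consumer<String> sink)
            throws XMLStreamException {
        writer.add(eventFactory.createEndElement("", "", "chunk"));
        writer.flush();
        writer.close();
        sink.accept(buffer.toString());
    }
}
```

Because only one chunk is buffered at a time, memory use is bounded by the chunk size rather than by the size of the input file.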

       Could you describe the structure of your XML file and what you are
trying to index?

On 1/22/07, aslam bari <iamaslamok@yahoo.co.in> wrote:
>
> Hi Saikrishna,
> Thanks for the reply,
> but I don't know how to go about this. Here is my code sample; let me
> know what to change.
>
> SAXBuilder builder = new SAXBuilder();
>
> // CONTENT here is a ByteArrayInputStream; I know I can also pass a file URL
> // here. Let me know what is best.
> Document doc = builder.build(CONTENT);
>
> loop(---)
> {
>     doc.selectNodes(xpathquery);
> }
>
> Thanks...
> ----- Original Message ----
> From: saikrishna venkata pendyala <pvsaikrishna@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 22 January, 2007 10:07:27 AM
> Subject: Re: Big size xml file indexing
>
>
> Hi,
>        I have indexed a 6.2 GB XML file using Lucene. What I did was:
>         1.  I split the 6.2 GB file into small files of 10 MB each.
>         2.  Then I wrote a Python script to quantize the number of
> documents in each file.
>
>         Structure of my xml file is """
>        <document>
>         -----
>         -----
>         </document>
>         <document>
>         -----
>         -----
>         </document> """
>
> Since you cannot go beyond 500 MB of heap, this technique might help you,
> provided of course that your file structure is the same.
>
> On 1/22/07, aslam bari <iamaslamok@yahoo.co.in> wrote:
> >
> > Dear all,
> > I am using Lucene to index XML files. For parsing I am using JDOM to get
> > XPath nodes, do some manipulation on them, and index them. Everything
> > works well, but when the file is very big, about 35 - 50 MB, it runs
> > out of memory or takes a lot of time. How can I set some parameters to
> > speed it up and use less memory to parse the file? The problem is that I
> > cannot increase the heap size much, so I am limited to a heap of 300 -
> > 500 MB. Does anybody have a solution for this?
> >
> > Thanks...
> >


		