From: Ian Lea <ian.lea@gmail.com>
Date: Thu, 21 Oct 2010 09:39:11 +0100
Subject: Re: how to index large number of files?
To: java-user@lucene.apache.org

Maybe mostly 1K, but you only need one very large doc to cause a problem.

I haven't been following this thread, so apologies if I've missed things,
but you seem to be having problems running what should be a simple job, of
the sort that lucene handles every day without breaking a sweat.

Does it always fail on the same doc, or after indexing the same number of
docs?  What exactly is the exception and stack trace?  Are you sure the
problem is in your lucene calls rather than something else you are doing in
the same program?  Cutting things down to the smallest, simplest possible
standalone program or test case that demonstrates the problem often helps.
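A stripped-down harness along these lines is usually a good start (a rough
sketch only - the index path, docs path and field name are placeholders,
not taken from your code). Printing progress as it goes also gives you the
numbers asked for below:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MinimalIndexTest {
        public static void main(String[] args) throws Exception {
            // Placeholder paths: point these at a small sample of the real files.
            File docDir = new File("/tmp/sample-docs");
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/test-index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            int count = 0;
            for (File f : docDir.listFiles()) {
                Document doc = new Document();
                doc.add(new Field("contents", new FileReader(f)));
                writer.addDocument(doc);
                count++;
                // Progress report: pinpoints the failing doc and its size.
                if (count % 1000 == 0) {
                    System.out.println(count + " docs; last file " + f.getName()
                            + " (" + f.length() + " bytes)");
                }
            }
            writer.close();
            System.out.println("Done: " + count + " docs indexed.");
        }
    }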
If you do that and can't find the problem yourself, post the code here
along with the stack trace and details such as how many docs it processes
before failure and the size of the doc it fails on.


--
Ian.


On Thu, Oct 21, 2010 at 4:01 AM, Sahin Buyrukbilen wrote:
> by the way, file size is not big - mostly 1kB. I am working on wikipedia
> articles in txt format.
>
> On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen <
> sahin.buyrukbilen@gmail.com> wrote:
>
>> Unfortunately neither method went through. I am getting the memory error
>> even at reading the directory contents.
>>
>> Now I am thinking this: what if I split the 4.5 million files into
>> directories of 100,000 files (or fewer, depending on the java error),
>> index each of them separately, and merge those indexes (if possible)?
>>
>> Any suggestions?
>>
>> On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson wrote:
>>
>>> My first guess is that you're accumulating too many documents in memory
>>> before the flush gets triggered. The quick-n-dirty way to test this is
>>> to do an IndexWriter.flush after every addDocument. This will slow
>>> down indexing, but it will also tell you whether this is the problem,
>>> and you can look for more elegant solutions...
>>>
>>> You can also get some stats via IndexWriter.getRAMBufferSizeMB, and
>>> can force automatic flushes at a given RAM buffer size via
>>> IndexWriter.setRAMBufferSizeMB.
>>>
>>> One thing I'd be interested in is how big your files are. Might it be
>>> that you're trying to process a humongous file when it blows up?
>>>
>>> And if none of that helps, please post your stack trace.
>>>
>>> Best
>>> Erick
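(For reference, Erick's buffer-size suggestion might look something like
the sketch below, using indexDir, analyzer and dir as defined in the code
further down; the 32 MB buffer and 10,000-doc commit interval are arbitrary
example values, not recommendations. Merging separately built indexes, as
suggested above, should also be possible in 3.x via
IndexWriter.addIndexesNoOptimize(Directory...).)

    // Sketch: bound memory via the RAM buffer, plus periodic commits.
    IndexWriter writer = new IndexWriter(indexDir, analyzer,
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(32.0);   // flush once buffered docs pass ~32 MB
    int count = 0;
    for (File file : dir.listFiles()) {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(file)));
        writer.addDocument(doc);
        if (++count % 10000 == 0) {
            writer.commit();           // make the buffered adds durable on disk
        }
    }
    writer.commit();
    writer.close();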
>>> On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <
>>> sahin.buyrukbilen@gmail.com> wrote:
>>>
>>> > with the different parameters I still got the same error. My code is
>>> > very simple; indeed, I am only concerned with creating the index, and
>>> > then I will do some private information retrieval experiments on the
>>> > inverted index file, which I created with the information extracted
>>> > from the index. That is why I didn't go over optimization until now.
>>> > The database size I had was very small compared to 4.5 million.
>>> >
>>> > My code is as follows:
>>> >
>>> > public static void createIndex() throws CorruptIndexException,
>>> >         LockObtainFailedException, IOException {
>>> >     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>>> >     Directory indexDir = FSDirectory.open(new
>>> >             File("/media/work/WIKI/indexes/"));
>>> >     boolean recreateIndexIfExists = true;
>>> >     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>>> >             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>>> >     indexWriter.setUseCompoundFile(false);
>>> >     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>>> >     File[] files = dir.listFiles();
>>> >     for (File file : files) {
>>> >         Document document = new Document();
>>> >
>>> >         //String path = file.getCanonicalPath();
>>> >         //document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>>> >         //        Field.Index.NOT_ANALYZED));
>>> >
>>> >         Reader reader = new FileReader(file);
>>> >         document.add(new Field(FIELD_CONTENTS, reader));
>>> >
>>> >         indexWriter.addDocument(document);
>>> >     }
>>> >     indexWriter.optimize();
>>> >     indexWriter.close();
>>> > }
>>> >
>>> > On Wed, Oct 20, 2010 at 2:39 PM, Qi Li wrote:
>>> >
>>> > > 1. What is the difference when you used different VM parameters?
>>> > > 2. What merge policy and optimization strategy did you use?
>>> > > 3. How did you use commit or flush?
>>> > >
>>> > > Qi
>>> > >
>>> > > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
>>> > > sahin.buyrukbilen@gmail.com> wrote:
>>> > >
>>> > > > Thank you so much for this info. It looks pretty complicated for
>>> > > > me, but I will try.
>>> > > >
>>> > > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <
>>> > > > johnbin.wang@gmail.com> wrote:
>>> > > >
>>> > > > > You can start a fixed thread pool to index all these files in a
>>> > > > > multithreaded way. Each thread executes an indexing task which
>>> > > > > indexes a part of all the files. In the indexing task, after
>>> > > > > every 10,000 files you need to execute the indexWriter.commit()
>>> > > > > method to flush all the pending adds to disk.
>>> > > > >
>>> > > > > If you need to index all these files into only one index, you
>>> > > > > need to share a single indexWriter instance among all the
>>> > > > > indexing threads.
>>> > > > >
>>> > > > > Hope it's helpful.
>>> > > > >
>>> > > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
>>> > > > > sahin.buyrukbilen@gmail.com> wrote:
>>> > > > >
>>> > > > > > Thank you Johnbin,
>>> > > > > > do you know which parameter I have to play with?
>>> > > > > >
>>> > > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <
>>> > > > > > johnbin.wang@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > I think you can write the index once every 10,000 files, or
>>> > > > > > > fewer, have been read.
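(A rough sketch of that thread-pool approach, sharing one IndexWriter -
which is fine, since IndexWriter is thread-safe - with arbitrary example
values for the pool size, commit interval and paths:)

    import java.io.File;
    import java.io.FileReader;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ThreadedIndexer {
        public static void main(String[] args) throws Exception {
            // One IndexWriter shared by every thread; IndexWriter is thread-safe.
            final IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")),      // placeholder
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            final AtomicInteger count = new AtomicInteger();
            ExecutorService pool = Executors.newFixedThreadPool(4); // example size
            for (final File file : new File("/tmp/docs").listFiles()) { // placeholder
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            Document doc = new Document();
                            doc.add(new Field("contents", new FileReader(file)));
                            writer.addDocument(doc);
                            if (count.incrementAndGet() % 10000 == 0) {
                                writer.commit(); // periodic flush of pending adds
                            }
                        } catch (Exception e) {
                            System.err.println("Failed on " + file + ": " + e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            writer.close(); // close() commits any remaining buffered docs
        }
    }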
>>> > > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <
>>> > > > > > > sahin.buyrukbilen@gmail.com> wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > I have to index about 4.5 million txt files. When I run my
>>> > > > > > > > indexing application through Eclipse, I get this error:
>>> > > > > > > > "Exception in thread "main" java.lang.OutOfMemoryError:
>>> > > > > > > > Java heap space"
>>> > > > > > > >
>>> > > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
>>> > > > > > > > eclipse -vmargs -Xms40M -Xmx2G
>>> > > > > > > >
>>> > > > > > > > I tried running Eclipse with the above memory parameters,
>>> > > > > > > > but still had the same error. My computer is an AMD X2
>>> > > > > > > > 64-bit 2GHz processor, Ubuntu 10.04 LTS 64-bit,
>>> > > > > > > > java-6-openjdk.
>>> > > > > > > >
>>> > > > > > > > Does anybody have a suggestion?
>>> > > > > > > >
>>> > > > > > > > thank you.
>>> > > > > > > > Sahin.
>>> > > > > > >
>>> > > > > > > --
>>> > > > > > > cheers,
>>> > > > > > > Johnbin Wang
>>> > > > >
>>> > > > > --
>>> > > > > cheers,
>>> > > > > Johnbin Wang

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org