lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sahin Buyrukbilen <sahin.buyrukbi...@gmail.com>
Subject Re: how to index large number of files?
Date Thu, 21 Oct 2010 03:01:51 GMT
by the ways file size is not big. mostly 1kB. I  am working on wikipedia
articles in txt format.

On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen <
sahin.buyrukbilen@gmail.com> wrote:

> Unfortunately both methods didnt go through. I am getting memory error even
> at reading the directory contents.
>
> Now, I am thinking this: What if I split 4.5million files into 100.000 (or
> less depending on java error) files directories, index each of them
> separately and merge those indexes(if possible).
>
> Any suggestions?
>
>
> On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson <erickerickson@gmail.com>wrote:
>
>> My first guess is that you're accumulating too many documents in
>> before the flush gets triggered. The quick-n-dirty way to test this is
>> to do an IndexWriter.flush after every addDocument. This will slow
>> down indexing, but it will also tell you whether this is the problem and
>> you can look for more elegant solutions...
>>
>> You can also get some stats by IndexWriter.getRamBufferSizeMB, and
>> can force automatic flushes given a particular ram buffer size via
>> IndexWriter.setRamBufferSizeMB.
>>
>> One thing I'd be interested in is how big your files are. It might be, are
>> you trying to process a humongous file when it blows?
>>
>> And if none of that helps, please post your stack trace.
>>
>> Best
>> Erick
>>
>> On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <
>> sahin.buyrukbilen@gmail.com> wrote:
>>
>> > with the different parameters I still got the same error. My code is
>> very
>> > simple, indeed I am only concerned with creating the index and then I
>> will
>> > do some private information retrieval experiments on the inverted index
>> > file, which I created with the information extracted from the index.
>> That
>> > is
>> > why I didnt go over optimization until now. the database size I had was
>> ery
>> > small compared to 4.5million.
>> > My code is as follows:
>> >
>> > public static void createIndex() throws CorruptIndexException,
>> > LockObtainFailedException, IOException {
>> >        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>> >        Directory indexDir = FSDirectory.open(new
>> > File("/media/work/WIKI/indexes/"));
>> >        boolean recreateIndexIfExists = true;
>> >        IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>> > recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>> >        indexWriter.setUseCompoundFile(false);
>> >        File dir = new File(FILES_TO_INDEX_DIRECTORY);
>> >        File[] files = dir.listFiles();
>> >        for (File file : files) {
>> >            Document document = new Document();
>> >
>> >            //String path = file.getCanonicalPath();
>> >            //document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>> > Field.Index.NOT_ANALYZED));
>> >
>> >            Reader reader = new FileReader(file);
>> >            document.add(new Field(FIELD_CONTENTS, reader));
>> >
>> >            indexWriter.addDocument(document);
>> >        }
>> >        indexWriter.optimize();
>> >        indexWriter.close();
>> >     }
>> >
>> >
>> > On Wed, Oct 20, 2010 at 2:39 PM, Qi Li <alertli@gmail.com> wrote:
>> >
>> > > 1. What is the difference when you used different vm parameters?
>> > > 2  What merge policy and optimization strategy did you use?
>> > > 3. How did you use the commit or flush ?
>> > >
>> > > Qi
>> > >
>> > > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
>> > > sahin.buyrukbilen@gmail.com> wrote:
>> > >
>> > > > Thank you so much for this infor. it looks pretty complicated for
me
>> > but
>> > > I
>> > > > will try.
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <
>> johnbin.wang@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > You can start a fixedThreadPool to index all these files in the
>> > > multhread
>> > > > > way. Every thread execute an index task which could index a part
>> of
>> > all
>> > > > the
>> > > > > files. In the index task, when indexing 10000 files, you need
>> execute
>> > > the
>> > > > > indexWrite.commit() method to flush all the index add operation
to
>> > disk
>> > > > > file.
>> > > > >
>> > > > > If you need index all these files into only one index file, you
>> need
>> > to
>> > > > > hold
>> > > > > only one indexWriter instance among all the index thread.
>> > > > >
>> > > > > Hope it's helpful.
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
>> > > > > sahin.buyrukbilen@gmail.com> wrote:
>> > > > >
>> > > > > > Thank you Johnbin,
>> > > > > > do you know which parameter I have to play with?
>> > > > > >
>> > > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <
>> > > johnbin.wang@gmail.com
>> > > > > > >wrote:
>> > > > > >
>> > > > > > > I think you can write index file once every 10,000
files or
>> less
>> > > have
>> > > > > > been
>> > > > > > > read.
>> > > > > > >
>> > > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen
<
>> > > > > > > sahin.buyrukbilen@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > I have to index about 4.5Million txt files. When
I run the
>> my
>> > > > > indexing
>> > > > > > > > application through Eclipse, I get this error
: "Exception
>> in
>> > > > thread
>> > > > > > > "main"
>> > > > > > > > java.lang.OutOfMemoryError: Java heap space"
>> > > > > > > >
>> > > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
>> > > > > > > >
>> > > > > > > > eclipse -vmargs -Xms40M -Xmx2G
>> > > > > > > >
>> > > > > > > >  I tried running Eclipse with above memory parameters,
but
>> > still
>> > > > had
>> > > > > > the
>> > > > > > > > same error. The architecture of my computer is
AMD x2 64bit
>> > 2GHz
>> > > > > > > processor,
>> > > > > > > > Ubuntu 10.04 LTS 64bit. java-6-openjdk.
>> > > > > > > >
>> > > > > > > > Anybody has a suggestion?
>> > > > > > > >
>> > > > > > > > thank you.
>> > > > > > > > Sahin.
>> > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > cheers,
>> > > > > > > Johnbin Wang
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > cheers,
>> > > > > Johnbin Wang
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message