lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject OutOfMemoryError while indexing
Date Sun, 17 Mar 2013 15:52:58 GMT
Hi!

I'm trying to make an index of several text documents.
Their content is just tab-separated strings, one line per field:
word<\t>w1<\t>w2<\t>...<\t>wn
pos<\t>pos1<\t>pos2_a:pos2_b:pos2_c<\t>...<\t>posn_a:posn_b
...
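
As an illustration of the layout above, here is a minimal sketch of reading one such line using only the JDK. The class and method names (FeatureLineParser, fieldName, parseValues) are hypothetical and are not part of the indexing code in this message; the sketch assumes the first column is the field name and that ':' separates alternative parses within one position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class FeatureLineParser {

    /** Returns the first tab-separated column, assumed to be the field name. */
    public static String fieldName(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    /**
     * Returns one String[] per position (columns after the first).
     * Each array holds the alternative parses that were joined with ':'.
     */
    public static List<String[]> parseValues(String line) {
        String[] columns = line.split("\t", -1);
        List<String[]> positions = new ArrayList<>();
        for (int i = 1; i < columns.length; i++) {
            positions.add(columns[i].split(":"));
        }
        return positions;
    }
}
```

For the line pos<\t>pos1<\t>pos2_a:pos2_b:pos2_c this yields the field name "pos" and two positions, the second of which holds three alternative parses.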

There are 5 documents with a total size of 10 MB.
While indexing, Java uses about 2 GB of RAM and finally throws an OOM error.

        String join_token = tok.nextToken();
        // atomic tokens correspond to separate parses
        String[] atomic_tokens = StringUtils.split(join_token, ':');
        // marking each token with the parse number
        for (int token_index = 0; token_index < atomic_tokens.length; ++token_index) {
          atomic_tokens[token_index] += String.format("|%d", token_index);
        }
        String join_token_with_payloads = StringUtils.join(atomic_tokens, " ");
        // >>>> the line where the leak appears <<<<
        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41,
                                                     new StringReader(join_token_with_payloads));
        // all these parses belong to the same position in the document
        stream = new PositionFilter(stream, 0);
        stream = new DelimitedPayloadTokenFilter(stream, '|', new IntegerEncoder());
        stream.addAttribute(OffsetAttribute.class);
        stream.addAttribute(CharTermAttribute.class);
        feature = new Field(name,
                            join_token,
                            attributeFieldType);
        feature.setTokenStream(stream);
        inDocument.add(feature);
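
To show what the marking loop above produces before the string reaches the tokenizer, here is a minimal JDK-only sketch of that step. The class and method names (ParseMarker, markParses) are hypothetical. It uses String.split rather than StringUtils.split, so, unlike the snippet above, it would keep an empty token between two adjacent ':' characters; that difference is an assumption to keep in mind, not a fix.

```java
// Hypothetical helper, for illustration only.
public class ParseMarker {

    /**
     * Splits a joined token on ':' and appends "|<parse number>" to each part,
     * joining the parts with single spaces so a whitespace tokenizer can
     * separate them and a '|'-delimited payload filter can read the number.
     */
    public static String markParses(String joinToken) {
        String[] atomicTokens = joinToken.split(":");
        StringBuilder marked = new StringBuilder();
        for (int i = 0; i < atomicTokens.length; i++) {
            if (i > 0) {
                marked.append(' ');
            }
            marked.append(atomicTokens[i]).append('|').append(i);
        }
        return marked.toString();
    }

    public static void main(String[] args) {
        // prints "pos2_a|0 pos2_b|1 pos2_c|2"
        System.out.println(markParses("pos2_a:pos2_b:pos2_c"));
    }
}
```

Each position therefore becomes a short string whose parts all share one document position and carry their parse number as a payload.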

What is wrong with this code from a memory point of view, and how can I do the indexing with as
little data as possible held in RAM?

-- 
Best Regards,
Igor Shalyminov
