lucene-java-user mailing list archives

From Glen Newton <glen.new...@gmail.com>
Subject Re: IndexWriter croaks on large file
Date Fri, 14 Feb 2014 21:20:14 GMT
You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a one-entry-per-line log file).
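
Something along these lines (a rough sketch, untested; the field names
are just examples):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// One Lucene document per log line: each document stays far below the
// 2GB offset limit, and a search hit points you at the matching line.
void indexLogFile(IndexWriter iw, String pathname) throws IOException {
    try (BufferedReader br =
            Files.newBufferedReader(Paths.get(pathname), StandardCharsets.UTF_8)) {
        String line;
        long lineno = 0;
        while ((line = br.readLine()) != null) {
            lineno++;
            Document doc = new Document();
            doc.add(new StoredField("pathname", pathname));
            doc.add(new StoredField("lineno", lineno));
            doc.add(new TextField("content", line, Field.Store.NO));
            iw.addDocument(doc);
        }
    }
}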

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cecere@oracle.com> wrote:
> I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
> any rate, I don't have control over the size of the documents that go into
> my database. Sometimes my customer's log files end up really big. I'm
> willing to have huge indexes for these things.
>
> Wouldn't just changing the offsets from int to long solve the problem?
> I'm sure it would have to be changed in a lot of places, but why
> impose such a limitation? Especially since it's using an InputStream and
> only dealing with a block of data at a time.
>
> I'll take a look at your suggestion.
>
> Thanks,
> John
>
>
> On 2/14/14 3:20 PM, Michael McCandless wrote:
>>
>> Hmm, why are you indexing such immense documents?
>>
>> In 3.x, Lucene never sanity-checked the offsets, so we would silently
>> index negative (int-overflowed) offsets into e.g. term vectors.
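>>
>> (For example, the first token past offset 2,147,483,647 would get a
>> start offset of 2^31, which wraps to -2,147,483,648 as a Java int:
>> exactly the startOffset in the exception you hit.)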
>>
>> But in 4.x, we now detect this and throw the exception you're seeing,
>> because it can lead to index corruption when you index the offsets
>> into the postings.
>>
>> If you really must index such enormous documents, maybe you could
>> create a custom tokenizer (derived from StandardTokenizer) that
>> "fixes" the offsets before setting them? Or maybe just doesn't set
>> them at all.
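>>
>> For example (an untested sketch; StandardTokenizer is final, so
>> instead of subclassing it this swaps a lenient OffsetAttribute in
>> through the tokenizer's AttributeFactory constructor, clamping
>> overflowed offsets before the sanity check fires):
>>
>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>> import org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl;
>> import org.apache.lucene.util.Attribute;
>> import org.apache.lucene.util.AttributeImpl;
>> import org.apache.lucene.util.AttributeSource;
>>
>> // Clamps overflowed (negative) offsets instead of throwing.
>> final class LenientOffsetAttributeImpl extends OffsetAttributeImpl {
>>   @Override
>>   public void setOffset(int startOffset, int endOffset) {
>>     if (startOffset < 0 || endOffset < startOffset) { // wrapped past 2^31-1
>>       startOffset = Integer.MAX_VALUE;
>>       endOffset = Integer.MAX_VALUE;
>>     }
>>     super.setOffset(startOffset, endOffset);
>>   }
>> }
>>
>> // Pass an instance to
>> // new StandardTokenizer(Version.LUCENE_45, factory, reader)
>> // inside a custom Analyzer's createComponents().
>> final class LenientAttributeFactory extends AttributeSource.AttributeFactory {
>>   @Override
>>   public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
>>     if (attClass == OffsetAttribute.class) {
>>       return new LenientOffsetAttributeImpl();
>>     }
>>     return DEFAULT_ATTRIBUTE_FACTORY.createAttributeInstance(attClass);
>>   }
>> }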
>>
>> Note that positions can also overflow if your documents get too large.
>>
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cecere@oracle.com>
>> wrote:
>>>
>>> I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a
>>> file > 2GB in size, it dies with the following exception:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>
>>> Essentially, I'm doing this:
>>>
>>> Directory directory = new MMapDirectory(indexPath);
>>> // in 4.5, StandardAnalyzer requires a Version argument
>>> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
>>> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
>>> IndexWriter iw = new IndexWriter(directory, iwc);
>>>
>>> InputStream is = <my input stream>;
>>> InputStreamReader reader = new InputStreamReader(is);
>>>
>>> Document doc = new Document();
>>> doc.add(new StoredField("fileid", fileid));
>>> doc.add(new StoredField("pathname", pathname));
>>> // the entire file is tokenized as a single field value
>>> doc.add(new TextField("content", reader));
>>>
>>> iw.addDocument(doc);
>>>
>>> It's the IndexWriter addDocument method that throws the exception.
>>> Looking at the Lucene source code, it appears that the offsets used
>>> internally are ints, which makes it fairly obvious why this is
>>> happening.
>>>
>>> This never happened with Lucene 3.6.0, which was perfectly capable of
>>> handling a file over 2GB in this manner. What has changed, and how do
>>> I get around it? Is Lucene no longer capable of handling files this
>>> large, or is there some other way I should be doing this?
>>>
>>> Here's the full stack trace sans my code:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>         at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
>>>         at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
>>>         at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
>>>         at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>>>         at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>>>         at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>>>         at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>>>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
>>>         at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
>>>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
>>>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
>>>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)
>>>
>>> Thanks,
>>> John
>>>
>>> --
>>> John Cecere
>>> Principal Engineer - Oracle Corporation
>>> 732-987-4317 / john.cecere@oracle.com
>>>
>>
>
> --
> John Cecere
> Principal Engineer - Oracle Corporation
> 732-987-4317 / john.cecere@oracle.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

