lucene-java-user mailing list archives

From John Cecere <john.cec...@oracle.com>
Subject Re: IndexWriter croaks on large file
Date Wed, 19 Feb 2014 12:51:38 GMT
Thanks Tri. I've tried a variation of the approach you suggested here and it appears to work
well. Just one question: will there be a problem with adding multiple Document objects to the
IndexWriter that have the same field names and values for the StoredFields? They all have
different TextFields (the content). I've tried doing this and haven't found any problems with
it, but I'm just wondering if there's anything I should be aware of.

Regards,
John
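
A minimal sketch of the chunked indexing being discussed, assuming a fixed chunk size read
from the same Reader as in the original snippet below; the CHUNK_CHARS constant and the
"chunkno" field are illustrative, not from the thread:

    // Index one large file as many small Documents that share the same
    // stored "fileid"/"pathname" values; only the indexed content differs.
    final int CHUNK_CHARS = 1 << 20;   // illustrative: 1M chars per chunk
    char[] buf = new char[CHUNK_CHARS];
    int chunkNo = 0;
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1) {
        Document doc = new Document();
        doc.add(new StoredField("fileid", fileid));
        doc.add(new StoredField("pathname", pathname));
        doc.add(new StoredField("chunkno", chunkNo++));   // recover original order at display time
        doc.add(new TextField("content", new String(buf, 0, n), Field.Store.NO));
        iw.addDocument(doc);
    }

Stored fields are stored per Document, so repeating identical values across many Documents is
fine; the only cost is the duplicated stored bytes. A real implementation would also want to
split on line or token boundaries so terms aren't cut in half at chunk edges.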

On 2/14/14 4:37 PM, Tri Cao wrote:
> As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index
> with that approach, though :)
>
> I do agree that indexing huge documents doesn't seem to have a lot of value: even when
> you know a doc is a hit for a certain query, how are you going to display the results
> to users?
>
> John, for huge data sets, it's usually a good idea to roll your own distributed indexes
> and model your data schema very carefully. For example, if you are going to index log
> files, one reasonable idea is to make every 5 minutes of logs a document.
>
> Regards,
> Tri
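
A rough sketch of that five-minute bucketing, where parseTimestamp() is a hypothetical
helper returning epoch millis for a log line and the field names are illustrative:

    // Accumulate lines into 5-minute windows; one Document per window.
    final long WINDOW_MS = 5L * 60 * 1000;
    long currentWindow = -1;
    StringBuilder body = new StringBuilder();
    String line;
    while ((line = logReader.readLine()) != null) {
        long window = parseTimestamp(line) / WINDOW_MS;   // hypothetical helper
        if (window != currentWindow && body.length() > 0) {
            Document doc = new Document();
            // LongField makes the window start time range-searchable.
            doc.add(new LongField("window", currentWindow * WINDOW_MS, Field.Store.YES));
            doc.add(new TextField("content", body.toString(), Field.Store.NO));
            iw.addDocument(doc);
            body.setLength(0);
        }
        currentWindow = window;
        body.append(line).append('\n');
    }
    // Flush the final partial window the same way after the loop.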
>
> On Feb 14, 2014, at 01:20 PM, Glen Newton <glen.newton@gmail.com> wrote:
>
>> You should consider making each _line_ of the log file a (Lucene)
>> document (assuming it is a log-per-line log file).
>>
>> -Glen
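
A minimal sketch of the line-per-document approach; "pathname" and "lineno" are illustrative
field names for pointing a hit back at the exact source line:

    // One Lucene Document per log line keeps offsets and positions tiny.
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    long lineNo = 0;
    while ((line = br.readLine()) != null) {
        Document doc = new Document();
        doc.add(new StoredField("pathname", pathname));
        doc.add(new StoredField("lineno", ++lineNo));
        doc.add(new TextField("content", line, Field.Store.NO));
        iw.addDocument(doc);
    }

Per-document offsets never get near the int limit this way; the trade-off is document count,
which is where the 2B-docID ceiling Tri mentions above comes in.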
>>
>> On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cecere@oracle.com> wrote:
>>> I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any
>>> rate, I don't have control over the size of the documents that go into my
>>> database. Sometimes my customer's log files end up really big. I'm willing to
>>> have huge indexes for these things.
>>>
>>> Wouldn't just changing from int to long for the offsets solve the problem? I'm
>>> sure it would probably have to be changed in a lot of places, but why impose
>>> such a limitation? Especially since it's using an InputStream and only dealing
>>> with a block of data at a time.
>>>
>>> I'll take a look at your suggestion.
>>>
>>> Thanks,
>>> John
>>> On 2/14/14 3:20 PM, Michael McCandless wrote:
>>>> Hmm, why are you indexing such immense documents?
>>>>
>>>> In 3.x Lucene never sanity checked the offsets, so we would silently index
>>>> negative (int overflow'd) offsets into e.g. term vectors. But in 4.x, we now
>>>> detect this and throw the exception you're seeing, because it can lead to
>>>> index corruption when you index the offsets into the postings.
>>>>
>>>> If you really must index such enormous documents, maybe you could create a
>>>> custom tokenizer (derived from StandardTokenizer) that "fixes" the offsets
>>>> before setting them? Or maybe just doesn't even set them.
>>>>
>>>> Note that position can also overflow, if your documents get too large.
>>>>
>>>> Mike McCandless
>>>> http://blog.mikemccandless.com
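
A rough, untested sketch of that offset-fixing idea: per the stack trace below, the check
lives in OffsetAttributeImpl.setOffset, so one option is to hand StandardTokenizer a custom
AttributeFactory whose OffsetAttribute clamps overflowed values instead of throwing. The
ClampingOffsetAttributeImpl name is made up, and the exact 4.x constructor signatures should
be double-checked:

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl;
    import org.apache.lucene.util.Attribute;
    import org.apache.lucene.util.AttributeImpl;
    import org.apache.lucene.util.AttributeSource;
    import org.apache.lucene.util.Version;

    // Lenient OffsetAttribute: clamps int-overflowed offsets to 0 instead of throwing.
    class ClampingOffsetAttributeImpl extends OffsetAttributeImpl {
        @Override
        public void setOffset(int startOffset, int endOffset) {
            if (startOffset < 0 || endOffset < startOffset) {
                startOffset = 0;
                endOffset = 0;   // offsets past 2GB are meaningless anyway
            }
            super.setOffset(startOffset, endOffset);
        }
    }

    AttributeSource.AttributeFactory lenient = new AttributeSource.AttributeFactory() {
        @Override
        public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
            return attClass == OffsetAttribute.class
                ? new ClampingOffsetAttributeImpl()
                : AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
                      .createAttributeInstance(attClass);
        }
    };

    Tokenizer tok = new StandardTokenizer(Version.LUCENE_45, lenient, reader);

Note this only papers over offsets; as Mike says, positions can still overflow, so splitting
the input into smaller Documents is the more robust fix.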
>>>> On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cecere@oracle.com> wrote:
>>>>> I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file > 2GB in size, it dies with the following exception:
>>>>>
>>>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>>>> endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647
>>>>> Essentially, I'm doing this:
>>>>>
>>>>> Directory directory = new MMapDirectory(indexPath);
>>>>> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
>>>>> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
>>>>> IndexWriter iw = new IndexWriter(directory, iwc);
>>>>>
>>>>> InputStream is = <my input stream>;
>>>>> InputStreamReader reader = new InputStreamReader(is);
>>>>>
>>>>> Document doc = new Document();
>>>>> doc.add(new StoredField("fileid", fileid));
>>>>> doc.add(new StoredField("pathname", pathname));
>>>>> doc.add(new TextField("content", reader));
>>>>> iw.addDocument(doc);
>>>>> It's the IndexWriter addDocument method that throws the exception. In looking
>>>>> at the Lucene source code, it appears that the offsets being used internally
>>>>> are ints, which makes it somewhat obvious why this is happening.
>>>>>
>>>>> This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly
>>>>> capable of handling a file over 2GB in this manner. What has changed, and how
>>>>> do I get around this? Is Lucene no longer capable of handling files this
>>>>> large, or is there some other way I should be doing this?
>>>>> Here's the full stack trace sans my code:
>>>>>
>>>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>>>> endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647
>>>>>     at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
>>>>>     at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
>>>>>     at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
>>>>>     at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>>>>>     at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>>>>>     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>>>>>     at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>>>>>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
>>>>>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
>>>>>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)
>>>>> Thanks,
>>>>> John

-- 
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cecere@oracle.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

