lucene-java-user mailing list archives

From Chris Bamford <ch...@bammers.net>
Subject Re: Size of Document
Date Thu, 05 Jul 2018 06:22:27 GMT
Yes, I see. I originally missed Terry’s response, which is probably the source of the confusion.

So to clarify: I already know the size of the source document. As you say, this bears little
resemblance to what actually gets written when indexed. It is this latter figure I was hoping
to get.
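
For anyone after the same figure, here is a rough sketch of one way to approximate it, assuming the index lives in an ordinary filesystem directory: sum the file sizes in that directory before and after a commit and take the difference. It uses only the JDK; the class and method names (IndexDirSize, directorySizeBytes) are made up for illustration and are not part of Lucene. The delta is only an approximation, since segment merges and compression mean bytes on disk do not map cleanly to any single document.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not a Lucene API: measures bytes on disk under a directory.
public class IndexDirSize {

    /** Returns the total size in bytes of all regular files under dir. */
    static long directorySizeBytes(Path dir) throws IOException {
        // Files.walk must be closed, hence try-with-resources.
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // A file can vanish if the index is being modified
                        // concurrently; measure while the writer is idle.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the index directory; defaults to the current directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(directorySizeBytes(indexDir) + " bytes under " + indexDir);
    }
}
```

Calling directorySizeBytes once before adding documents and once after the commit gives a delta to feed into metrics. Measuring per batch rather than per document should give a more stable number.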

Thanks everyone.

Chris



> On 5 Jul 2018, at 03:31, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> I think we're not talking about the same thing.
> 
> You asked "How can I calculate the total size of a Lucene Document"...
> 
> I was responding to Terry's comment "In the document types I
> usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> called "stream_size" that contains the size of the document on disk. "
> 
> Two totally different beasts. One is the source document, the other is
> what you choose to put into the index from that document. Not to even
> mention that you could, for instance, choose to index only the title
> and throw everything else away so the size of the raw document on disk
> doesn't seem useful for your case.
> 
> Best,
> Erick
> 
>> On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford <chris@bammers.net> wrote:
>> Hi Erick
>> 
>> Yes, size on disk is what I’m after as it will feed into an eventual calculation
>> regarding actual bytes written (not interested in the source data document size, just real
>> disk usage).
>> Thanks
>> 
>> Chris
>> 
>>> On 4 Jul 2018, at 17:08, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>> But does size on disk help? If the doc has a zillion
>>> images in it, those aren't part of the resulting index
>>> (I'm excluding stored data here)....
>>> 
>>>> On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen <terry@net-frame.com> wrote:
>>>> In the document types I usually index (.pdf, .docx/.doc, .eml), there
>>>> exists a metadata field called "stream_size" that contains the size of
>>>> the document on disk.  You don't have to compute it.  Thus, when you
>>>> retrieve each document you can pull out the contents of this field and,
>>>> if you like, include it in each hitlist entry.
>>>> 
>>>> 
>>>>> On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
>>>>> Hi there,
>>>>> 
>>>>> How can I calculate the total size of a Lucene Document that I'm about
>>>>> to write to an index so I know how many bytes I am writing please?  I
>>>>> need it for some external metrics collection.
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> - Chris
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>> 
>>>>> 