lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Localize the largest fields (content) in index
Date Thu, 29 Mar 2012 19:54:30 GMT
I don't think there's really any reason SolrCloud won't work with
Tomcat; the setup is
probably just tricky. See:
It's about a year old, but it might prove helpful.
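For reference, the usual route with Tomcat at that time was to pass the SolrCloud system properties through JAVA_OPTS in Tomcat's setenv.sh; a minimal sketch, assuming an external ZooKeeper on localhost:2181 (the paths, config name and shard count are placeholders, not values from this thread):

```shell
# setenv.sh sketch for Tomcat using the Solr 4.x SolrCloud system properties.
# The ZooKeeper address, config directory, config name and shard count are
# placeholder assumptions to adapt to your own setup.
JAVA_OPTS="$JAVA_OPTS -DzkHost=localhost:2181"
JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/path/to/solr/collection1/conf"
JAVA_OPTS="$JAVA_OPTS -Dcollection.configName=myconf"
JAVA_OPTS="$JAVA_OPTS -DnumShards=2"
export JAVA_OPTS
```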


On Thu, Mar 29, 2012 at 3:41 PM, Vadim Kisselmann
<> wrote:
> Yes, I think so, too :)
> MLT doesn't really need termVectors, but it's faster with them. I
> found out that
> MLT works better on the title field in my case than on big text fields.
> Sharding is planned, but my setup with SolrCloud, ZK and Tomcat
> doesn't work,
> see here:
> I split my huge index (the 150GB index in this case is my test index), and
> I want to use SolrCloud,
> but it isn't runnable with Tomcat at this time.
> Best regards
> Vadim
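Restricting MLT to the title field, as mentioned above, is just a request-parameter change on the search handler; a sketch of the relevant parameters (the seed query is a placeholder):

```
q=id:123       # seed document (placeholder query)
mlt=true       # enable the MoreLikeThis component
mlt.fl=title   # build "more like this" terms from the title field only
mlt.mintf=1    # minimum term frequency in the seed document
mlt.mindf=1    # minimum document frequency across the index
```

Note that the mlt.fl fields need to be stored (or have term vectors) so MLT can extract terms from them.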
> 2012/3/29 Erick Erickson <>:
>> Yeah, it's worth a try. The term vectors aren't entirely necessary for
>> highlighting,
>> although they do make things more efficient.
>> As far as MLT goes, does it really need such a big field?
>> But if you remove this info and testing shows problems, you may be
>> on your way to sharding your index....
>> Best
>> Erick
>> On Thu, Mar 29, 2012 at 9:32 AM, Vadim Kisselmann
>> <> wrote:
>>> Hi Erick,
>>> thanks:)
>>> The admin UI gives me the counts, so I can identify fields with
>>> large numbers of unique terms.
>>> I know this wiki page, but I read it one more time.
>>> List of my file extensions with sizes (index size ~150GB):
>>> tvf 90GB
>>> fdt 30GB
>>> tim 18GB
>>> prx 15GB
>>> frq 12GB
>>> tip 200MB
>>> tvx 150MB
>>> tvf is my biggest file extension.
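For reference, a breakdown like the one above can be generated straight from the index directory; a minimal shell sketch (the function name is mine, and the path argument is a placeholder):

```shell
# Sum file sizes per extension in a Lucene index directory, largest first.
# Usage: index_sizes_by_ext /path/to/solr/data/index
index_sizes_by_ext() {
  find "$1" -maxdepth 1 -type f | while read -r f; do
    # "${f##*.}" strips everything up to the last dot, leaving the extension
    printf '%s %s\n' "${f##*.}" "$(wc -c < "$f")"
  done | awk '{sum[$1] += $2} END {for (e in sum) print e, sum[e]}' | sort -k2 -rn
}
```

du with a per-extension glob would also work; this just keeps the byte counts exact.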
>>> Wiki: "This file contains, for each field that has a term vector
>>> stored, a list of the terms, their frequencies and, optionally,
>>> position and offset information."
>>> Hmm, I use termVectors on my biggest fields because of MLT and highlighting.
>>> But I think I should test my performance without termVectors. Good idea? :)
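If it helps, turning term vectors off is a per-field change in schema.xml; a hypothetical sketch (the field name and type are illustrative, not taken from this thread):

```
<!-- With term vectors: terms, positions and offsets feed the .tvx/.tvd/.tvf files -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Without: highlighting re-analyzes the stored value instead, which is
     slower but still works -->
<field name="content" type="text_general" indexed="true" stored="true"/>
```

A full reindex is needed for the change to take effect.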
>>> What do you think about my file extension sizes?
>>> Best regards
>>> Vadim
>>> 2012/3/29 Erick Erickson <>:
>>>> The admin UI (schema browser) will give you the counts of unique terms
>>>> in your fields, which is where I'd start.
>>>> I suspect you've already seen this page, but if not:
>>>> the .fdt and .fdx file extensions are where data goes when
>>>> you set 'stored="true" '. These files don't affect search speed,
>>>> they just contain the verbatim copy of the data.
>>>> The relative sizes of the various files above should give
>>>> you a hint as to what's using the most space, but it'll be a bit
>>>> of a hunt for you to pinpoint what's actually going on. Term vectors
>>>> and norms are common sources of space usage.
>>>> Best
>>>> Erick
>>>> On Wed, Mar 28, 2012 at 10:55 AM, Vadim Kisselmann
>>>> <> wrote:
>>>>> Hello folks,
>>>>> I work with Solr 4.0 r1292064 from trunk.
>>>>> My index grows fast; with 10 million docs I get an index size of 150GB
>>>>> (25% stored, 75% indexed).
>>>>> I want to find out which fields (content) are too large, so I can
>>>>> consider what to do with them.
>>>>> How can I localize/discover the largest fields in my index?
>>>>> Luke (latest from trunk) doesn't work
>>>>> with my Solr version. I built the Lucene/Solr .jars and tried to feed
>>>>> them to Luke, but I get many errors
>>>>> and can't build it.
>>>>> What other options do I have?
>>>>> Thanks and best regards
>>>>> Vadim
