lucene-dev mailing list archives

From "Erick Erickson (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-1931) Schema Browser does not scale with large indexes
Date Sat, 31 Dec 2011 14:36:30 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-1931:
---------------------------------

    Attachment: SOLR-1931-trunk.patch
                SOLR-1931-3x.patch

Well, there are a couple of issues here. I've attached patches for trunk and 3x for consideration.

I fixed a structural flaw: the code traversed all the terms in all the fields twice, once to get
the total number of terms across all the fields and once to get the per-field counts.
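
To illustrate the single-pass idea, here is a minimal sketch that does not use Lucene at all. It assumes a hypothetical in-memory map from field name to that field's terms, standing in for the real term enumeration; the class and method names (SinglePassTermStats, collect, Stats) are made up for this example and are not taken from the patch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SinglePassTermStats {

    /** Per-field term counts plus the grand total, both filled in one traversal. */
    public static final class Stats {
        public final Map<String, Integer> termsPerField;
        public final int totalTerms;

        Stats(Map<String, Integer> termsPerField, int totalTerms) {
            this.termsPerField = termsPerField;
            this.totalTerms = totalTerms;
        }
    }

    /**
     * Walks every term of every field exactly once, updating the per-field
     * count and the running total in the same loop.
     * The input map is a stand-in (assumption) for a real term enumeration.
     */
    public static Stats collect(Map<String, List<String>> termsByField) {
        // TreeMap keeps the field names in alphabetical order.
        Map<String, Integer> perField = new TreeMap<>();
        int total = 0;
        for (Map.Entry<String, List<String>> entry : termsByField.entrySet()) {
            int count = 0;
            for (String term : entry.getValue()) {
                count++; // one visit per term, no second traversal later
            }
            perField.put(entry.getKey(), count);
            total += count;
        }
        return new Stats(perField, total);
    }
}
```

The point of the sketch is only that the total falls out of the same loop that produces the individual counts, so a second traversal is unnecessary.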

But that's not where the bulk of the time gets spent. It turns out that getting the count
of documents in which each term appears is the culprit. These two lines are executed for each
field:
  Query q = new TermRangeQuery(fieldName, null, null, false, false);
  TopDocs top = searcher.search(q, 1);

and top.totalHits is reported. I have an index with 99M documents, mostly integer data, that
takes 360 seconds to return data when the above is executed and 150 seconds without it. Both
versions traverse all the terms once, so these times would be greater without the patch due to
the second traversal.

So the attached patches default to NOT doing the above, and there's a new parameter, reportDocCount,
that can be set to true to collect that information. What do people think? Is there a
better way to get the count of documents in which the term appears? And do any alternate methods
respect deleted docs like this one does?
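
As a rough model of the opt-in behavior, the sketch below skips the expensive count unless the caller asks for it, and ignores deleted documents when it does count. It is plain Java over a toy document list, not Lucene code; Doc, docCount and the class name are hypothetical and exist only for this illustration.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;

public class OptInDocCount {

    /** Toy document (assumption): the fields it has values for, and a deleted flag. */
    public static final class Doc {
        final Set<String> fields;
        final boolean deleted;

        public Doc(Set<String> fields, boolean deleted) {
            this.fields = fields;
            this.deleted = deleted;
        }
    }

    /**
     * Returns the number of live documents that have a value in the field,
     * or an empty result when the caller did not request the count.
     */
    public static OptionalInt docCount(List<Doc> docs, String field, boolean reportDocCount) {
        if (!reportDocCount) {
            return OptionalInt.empty(); // default: do not pay for the extra pass
        }
        int count = 0;
        for (Doc doc : docs) {
            if (!doc.deleted && doc.fields.contains(field)) {
                count++; // deleted documents are not counted
            }
        }
        return OptionalInt.of(count);
    }
}
```

Returning an empty value rather than zero keeps "not requested" distinguishable from "no documents have this field".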

I tried spinning through using TermDocs (3.6) but soon realized that the people who wrote
TermRangeQuery probably got there first.

So I guess my real question is whether people object to the change in behavior: users
must now explicitly request doc counts. This also means that the admin/schema browser doesn't
report this by default, and I haven't made it optional from that interface. I'm not inclined
to, since that interface is going away, but if people feel strongly I might be persuaded. That
info is available via admin/luke?fl=myfield&reportDocCount=true in a less painful fashion
for a particular field anyway.
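
For anyone scripting that request, here is a small sketch that only assembles the URL string. The base URL (for example http://localhost:8983/solr) is an assumption about a local install, and LukeRequestUrl and build are hypothetical names; only the admin/luke path and the fl and reportDocCount parameters come from the message above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LukeRequestUrl {

    /**
     * Builds the luke request for one field. The reportDocCount parameter
     * is appended only when the caller opts in.
     * baseUrl is an assumed Solr root such as http://localhost:8983/solr.
     */
    public static String build(String baseUrl, String field, boolean reportDocCount) {
        try {
            StringBuilder url = new StringBuilder(baseUrl)
                    .append("/admin/luke?fl=")
                    .append(URLEncoder.encode(field, "UTF-8"));
            if (reportDocCount) {
                url.append("&reportDocCount=true");
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```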

Along the way I alphabetized the fields without my other kludge of putting comparators in
other classes. I'll kill that JIRA if this one goes forward.

Note that this still doesn't scale all that well; on my test index it's still a 5 minute wait.
But then I guess that this kind of data gathering will take time by its nature.

If nobody objects, I'll commit this early next week, after I've had a chance to put it down
for a while, look at it with fresh eyes, and do some more testing. I think there are some
inefficiencies in the single pass that I can wring out (about 30 seconds is spent just gathering
the data in the single term enumeration loop).
                
> Schema Browser does not scale with large indexes
> ------------------------------------------------
>
>                 Key: SOLR-1931
>                 URL: https://issues.apache.org/jira/browse/SOLR-1931
>             Project: Solr
>          Issue Type: Improvement
>          Components: web gui
>    Affects Versions: 3.6, 4.0
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: SOLR-1931-3x.patch, SOLR-1931-trunk.patch
>
>
> The Schema Browser JSP by default causes the Luke handler to "scan the world". In large
> indexes this makes the UI useless.
> On an index with 64m documents & 8gb of disk space, the Schema Browser took 6 minutes
> to open and hogged all disk I/O, making Solr useless.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

