lucene-solr-user mailing list archives

From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: custom field type plugin
Date Wed, 24 Jul 2013 02:45:30 GMT
Kevin,

Those are some good query response times, but they could be better.  You've
configured the field type sub-optimally.  Look again at
http://wiki.apache.org/solr/SpatialForTimeDurations and note maxDistErr in
particular.  You've left it at the value that comes pre-configured with
Solr, 0.000000009, which is ~1 meter measured in degrees, and this value
makes no sense when your numeric range is in whole numbers.  I suspect you
inherited this value from Hoss's slides.  **Instead use 1.** (as shown on
the wiki).  This affects performance in a big way, since you've configured
the prefix tree to hold 2.22e18 values (calculated via (max-min) /
maxDistErr) as opposed to "just" 2e10.  Your log shows maxLevels is 50 for
the quad tree.  The comments in QuadPrefixTree (I put them there at one
point) indicate that a maxLevels of 50 is about as much as is supported.
But again, I'm not certain what the limit really is without validating.
Hopefully you can stay clear of 50.

To test, try querying just on either side of the edge of an indexed value:
make sure you match the indexed point where the instructions say you should,
and don't match it where they say you shouldn't.  Also, be sure to read
more of the details under "Search" on the wiki page, which advises you to
buffer the query shape slightly; you didn't do this in your examples below.
This is all a bit of a hack when using a field that internally uses
floating point instead of fixed precision.
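
To make the (max-min) / maxDistErr arithmetic concrete, here is a rough
sketch in Python.  It assumes the -10 billion to +10 billion coordinate
range described in the thread (the worldBounds in the posted schema are
-100 billion to +100 billion, which would add a factor of 10).  It also
assumes that each quad tree level halves the cell size, so the level count
is only an estimate and not necessarily what QuadPrefixTree computes.  The
function names are made up for illustration.

```python
import math


def grid_values(min_value, max_value, max_dist_err):
    """Number of grid positions along one axis: (max - min) / maxDistErr."""
    return (max_value - min_value) / max_dist_err


def estimated_levels(min_value, max_value, max_dist_err):
    """Rough level estimate, ASSUMING every level halves the cell size.

    Illustration only; the real QuadPrefixTree logic may differ.
    """
    return math.ceil(math.log2(grid_values(min_value, max_value, max_dist_err)))


if __name__ == "__main__":
    lo, hi = -10_000_000_000, 10_000_000_000  # range described in the thread
    for max_dist_err in (0.000000009, 1):
        values = grid_values(lo, hi, max_dist_err)
        levels = estimated_levels(lo, hi, max_dist_err)
        print(f"maxDistErr={max_dist_err}: {values:.3g} values, ~{levels} levels")
```

Under those assumptions the estimate drops from about 61 levels, which is
above the maxLevels of 50 shown in the log, to about 35 with maxDistErr=1.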

~ David Smiley
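
For the edge tests and the query buffering mentioned above, a small Python
sketch may help when generating test queries.  It assumes ranges are
indexed as points (x=start, y=end), as on the wiki page, and that the
rectangle syntax is Intersects(minX minY maxX maxY), as in the logged
queries below (where "+" is just a URL-encoded space).  The helper names
and the buffer amount are hypothetical; the wiki is the authority on how
much to buffer.

```python
def _fmt(value):
    """Render whole numbers without a trailing '.0'."""
    return str(int(value)) if float(value).is_integer() else repr(float(value))


def containing_query(field, point, world_min, world_max, buffer=0):
    """Solr query string for ranges [start, end] with start <= point <= end.

    Ranges are ASSUMED to be indexed as points (x=start, y=end), so the match
    region is the rectangle x in [world_min, point], y in [point, world_max].
    `buffer` widens that rectangle; the right amount is an assumption here.
    """
    min_x, max_x = world_min, point + buffer
    min_y, max_y = point - buffer, world_max
    coords = " ".join(_fmt(v) for v in (min_x, min_y, max_x, max_y))
    return f'{field}:"Intersects({coords})"'


def edge_probes(start, end, step=1):
    """(point, should_match) pairs just outside and on each end of a range."""
    return [
        (start - step, False),
        (start, True),
        (end, True),
        (end + step, False),
    ]


if __name__ == "__main__":
    # Probe the range 10000:16090 from the original example.
    for point, should_match in edge_probes(10000, 16090):
        query = containing_query(
            "humanCoordinate", point, 0, 10_000_000_000, buffer=0.25
        )
        print(should_match, query)
```

The buffer has to stay smaller than the probe step, otherwise the probes
just outside the range would be expected to match as well.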

On 7/23/13 9:32 PM, "Kevin Stone" <Kevin.Stone@jax.org> wrote:

>Sorry for the late response. I needed to find the time to load a lot of
>extra data (closer to what we're anticipating). I have an index with close
>to 220,000 documents, each with at least two coordinate regions anywhere
>between -10 billion and +10 billion, though a document could potentially
>have up to maybe half a dozen regions. The reason for the negatives is
>that you can read a chromosome either backwards or forwards, so many
>coordinates can be negative.
>
>Here is the schema field definition:
>
>        <fieldType name="geneticLocation"
>         class="solr.SpatialRecursivePrefixTreeFieldType"
>         multiValued="true"
>         geo="false"
>         worldBounds="-100000000000 -100000000000 100000000000 100000000000"
>         distErrPct="0"
>         maxDistErr="0.000000009"
>         units="degrees"
>         />
>
>
>Here is the first query in the log:
>
>INFO:
>geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,
>analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,
>args={distErrPct=0, geo=false, multiValued=true,
>worldBounds=-100000000000 -100000000000 100000000000 100000000000,
>maxDistErr=0.000000009, units=degrees}} strat:
>RecursivePrefixTreeStrategy(prefixGridScanLevel:46,
>SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false,
>calculator=CartesianDistCalc,
>worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)})))
>maxLevels: 50
>Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
>INFO: [testIndex] webapp=/solr path=/select
>params={wt=xml&q=humanCoordinate:"Intersects(0+60330+6033041244+10000000000)"&rows=100}
>hits=81112 status=0 QTime=122
>
>
>
>
>
>Here are some other queries to give different timings (the one above
>brings back quite a lot):
>
>INFO: [testIndex] webapp=/solr path=/select
>params={wt=xml&q=humanCoordinate:"Intersects(0+6000000000+6900000000+10000000000)"&rows=100}
>hits=6031 status=0 QTime=10
>Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
>INFO: [testIndex] webapp=/solr path=/select
>params={wt=xml&q=humanCoordinate:"Intersects(0+0+10000000+10000000000)"&rows=100}
>hits=500 status=0 QTime=15
>Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
>INFO: [testIndex] webapp=/solr path=/select
>params={wt=xml&q=humanCoordinate:"Intersects(0+7831329+7831329+10000000000)"&rows=100}
>hits=4 status=0 QTime=17
>INFO: [testIndex] webapp=/solr path=/select
>params={wt=xml&q=humanCoordinate:"Intersects(-10000000000+-1051057963+-1001057963+0)"&rows=100}
>hits=661 status=0 QTime=8
>
>
>
>The query times look pretty fast to me. Certainly I'm pretty impressed.
>Our other backup solutions (involving SQL) likely wouldn't even touch this
>in terms of speed.
>
>
>
>We will be testing this more in depth in the coming month. I am sort of
>jumping ahead of our team to research possible solutions, since this is
>something that worried us. Looks like it might work!
>
>Thanks,
>-Kevin
>
>On 7/23/13 1:47 PM, "David Smiley (@MITRE.org)" <DSMILEY@mitre.org> wrote:
>
>>Oh cool!  I'm glad it at least seemed to work.  Can you post your
>>configuration of the field type and report from Solr's logs what
>>"maxLevels" is used for this field?  It is logged the first time you use
>>the field type.
>>
>>Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
>>calculations I just did indicate there shouldn't be a problem, but
>>real-world usage will be a better proof.  Indexing probably won't be
>>terribly slow, but queries could get pretty slow if the amount of indexed
>>data is really high.  I'd love to hear how it works out for you.  Your
>>use-case would benefit a lot from an improved prefix tree implementation.
>>
>>I don't gather how a 3rd dimension would play into this.  Support for
>>multi-dimensional spatial is on the drawing board.
>>
>>~ David
>>
>>
>>Kevin Stone wrote
>>> What are the dangers of trying to use a range of 10 billion? Simply a
>>> slower index time? Or will I get inaccurate results?
>>> I have tried it on a very small sample of documents, and it seemed to
>>> work. I could spend some time this week trying to get a more robust (and
>>> accurate) dataset loaded to play around with. The reason for the 10
>>> billion is to support being able to query for a region on a chromosome.
>>> 
>>> A user might want to know what genes overlap a point on a specific
>>> chromosome. Unless I can use 3 dimensional coordinates (which gave an
>>> error when I tried it), I'll need to multiply the coordinates by some
>>> offset for each chromosome to be able to normalise the data (at both
>>> index and query time). The largest chromosome (chr 1) has almost
>>> 250,000,000 base pairs. I could probably squeeze the rest a bit smaller,
>>> but I'd rather use one size for all chromosomes, since we have more than
>>> just human data to deal with. It would get quite messy otherwise.
>>> 
>>> 
>>> On 7/22/13 11:50 AM, "David Smiley (@MITRE.org)" <DSMILEY@> wrote:
>>> 
>>>>Like Hoss said, you're going to have to solve this using
>>>>http://wiki.apache.org/solr/SpatialForTimeDurations
>>>>Using PointType is *not* going to work because your durations are
>>>>multi-valued per document.
>>>>
>>>>It would be useful to create a custom field type that wraps the
>>>>capability outlined on the wiki to make it easier to use without
>>>>requiring the user to think spatially.
>>>>
>>>>You mentioned that these numeric ranges extend upwards of 10 billion or
>>>>so.  Unfortunately, the current "prefix tree" implementation under the
>>>>hood for non-geodetic spatial, the QuadTree, is unlikely to scale to
>>>>numbers that big.  I don't know where the boundary is, but I doubt 10B.
>>>>You could try and see what happens.  I'm working (very slowly on very
>>>>little spare time) on improving the PrefixTree implementations to scale
>>>>to such large numbers; I hope something will be available this fall.
>>>>
>>>>~ David Smiley
>>>>
>>>>
>>>>Kevin Stone wrote
>>>>> I have a particular use case that I think might require a custom field
>>>>> type, however I am having trouble getting the plugin to work.
>>>>> My use case has to do with genetics data, and we are running into
>>>>> several situations where we need to be able to query multiple regions
>>>>> of a chromosome (or gene, or other object types). All that really boils
>>>>> down to is being able to give a number, e.g. 10234, and return
>>>>> documents that have regions containing the number. So you'd have a
>>>>> document with a list like ["10000:16090","400:8000","40123:43564"], and
>>>>> it should come back because 10234 falls within "10000:16090". If there
>>>>> is a better or easier way to do this please speak up. I'd rather not
>>>>> have to use a "join" on another index, because 1) it's more complex to
>>>>> set up, and 2) we might need to join against something else and you can
>>>>> only do one join at a time.
>>>>>
>>>>> Anyway... I tried creating a field type similar to a PointType just to
>>>>> see if I could get one working. I added the following jars to get it to
>>>>> compile:
>>>>> apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0,
>>>>> apache-solr-solrj-4.0.0.
>>>>> I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib
>>>>> folder, and specified it in my solr.xml (I have multiple cores).
>>>>>
>>>>> After starting up solr, I got the line that it picked up the jar:
>>>>> INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader
>>>>>
>>>>> But I get this error about it not being able to find the
>>>>> AbstractSubTypeFieldType class.
>>>>> Here is the first bit of the trace:
>>>>>
>>>>> SEVERE: null:java.lang.NoClassDefFoundError:
>>>>> org/apache/solr/schema/AbstractSubTypeFieldType
>>>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
>>>>> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>>> ...etc...
>>>>>
>>>>>
>>>>> Any hints as to what I did wrong? I can provide source code, or a
>>>>> fuller stack trace, config settings, etc.
>>>>>
>>>>> Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
>>>>> then repack. However, when I did that, I got a NoClassDefFoundError for
>>>>> my plugin itself.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>> The information in this email, including attachments, may be
>>>>> confidential and is intended solely for the addressee(s). If you
>>>>> believe you received this email by mistake, please notify the sender by
>>>>> return email as soon as possible.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>-----
>>>> Author:
>>>>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>>>--
>>>>View this message in context:
>>>>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079494.html
>>>>Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>>> 
>>
>>
>>
>>
>>
>>-----
>> Author: 
>>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>--
>>View this message in context:
>>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079822.html
>>Sent from the Solr - User mailing list archive at Nabble.com.
>
>

