lucene-solr-user mailing list archives

From Kevin Stone <Kevin.St...@jax.org>
Subject Re: custom field type plugin
Date Wed, 24 Jul 2013 01:32:04 GMT
Sorry for the late response. I needed to find the time to load a lot of
extra data (closer to what we're anticipating). I have an index with close
to 220,000 documents, each with at least two coordinate regions anywhere
between -10 billion and +10 billion; a single document could potentially
have up to half a dozen regions. The coordinates can be negative because
you can read a chromosome either backwards or forwards, so many of them
are minus.
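
For illustration only, here is one hypothetical way to fold a chromosome, a position range and a read direction into a single signed coordinate range, along the lines of the per-chromosome offset mentioned in the earlier message quoted below. The slot width of 250,000,000 (just above the size quoted for chr 1), the function name `encode_region` and the sign convention for reverse reads are all assumptions, not the scheme actually used for this index.

```python
# Hypothetical encoding sketch; SLOT and the sign convention are assumptions.
SLOT = 250_000_000  # assumed fixed slot width reserved for every chromosome


def encode_region(chrom_index, start, end, reverse=False):
    """Map a (chromosome, start, end) region onto one global number line.

    chrom_index is a zero-based chromosome number. Regions read in the
    reverse direction are mirrored onto the negative half of the line.
    """
    if not (0 <= start <= end < SLOT):
        raise ValueError("region must satisfy 0 <= start <= end < SLOT")
    lo = chrom_index * SLOT + start
    hi = chrom_index * SLOT + end
    # Mirroring swaps the endpoints so the result is still (low, high).
    return (-hi, -lo) if reverse else (lo, hi)
```

With this layout, 40 slots of 250,000,000 fit inside a 10 billion range on each side of zero.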

Here is the schema field definition:

        <fieldType name="geneticLocation"
         class="solr.SpatialRecursivePrefixTreeFieldType"
         multiValued="true"
         geo="false"
         worldBounds="-100000000000 -100000000000 100000000000 100000000000"
         distErrPct="0"
         maxDistErr="0.000000009"
         units="degrees"
         />


Here is the first query in the log:

INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000000009, units=degrees}} strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)}))) maxLevels: 50
Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+60330+6033041244+10000000000)"&rows=100} hits=81112 status=0 QTime=122
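
As a rough check on the logged values, the sketch below estimates the smallest grid cell for the world bounds and maxLevels shown above. It assumes a plain quad tree in which each level halves the cell width on each axis; the helper name `cell_width` is made up for this example.

```python
# Values taken from the log line above.
WORLD_MIN = -1.0e11
WORLD_MAX = 1.0e11
MAX_LEVELS = 50


def cell_width(world_min, world_max, levels):
    """Width of one grid cell after `levels` halvings of the world extent."""
    return (world_max - world_min) / (2 ** levels)


width = cell_width(WORLD_MIN, WORLD_MAX, MAX_LEVELS)
print(width)
```

Under that assumption the finest cell is roughly 0.00018 units wide, far smaller than 1, so whole-number coordinates should still be distinguishable at 50 levels.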





Here are some other queries to give different timings (the one above
brings back quite a lot):

INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+6000000000+6900000000+10000000000)"&rows=100} hits=6031 status=0 QTime=10
Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+0+10000000+10000000000)"&rows=100} hits=500 status=0 QTime=15
Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(0+7831329+7831329+10000000000)"&rows=100} hits=4 status=0 QTime=17
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:"Intersects(-10000000000+-1051057963+-1001057963+0)"&rows=100} hits=661 status=0 QTime=8
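
The queries above appear to follow the pattern from the SpatialForTimeDurations wiki page referenced in the quoted thread: each region is indexed as a point (x = start, y = end), and an overlap search becomes a rectangle `Intersects(minX minY maxX maxY)`. The sketch below builds such a query string; the function name `overlap_query` and the assumption that regions are indexed as (start, end) points are mine, so treat it as illustrative rather than a description of the actual indexing code.

```python
def overlap_query(field, start, end, world_min, world_max):
    """Build a Solr query for indexed ranges overlapping [start, end].

    Assumes each region [s, e] was indexed as the point (x=s, y=e).
    A region overlaps the query when s <= end and e >= start, which is
    the rectangle x in [world_min, end], y in [start, world_max].
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return '{}:"Intersects({} {} {} {})"'.format(
        field, world_min, start, end, world_max
    )


# Reproduces the first query in the log (spaces appear as '+' once URL-encoded).
print(overlap_query("humanCoordinate", 60330, 6033041244, 0, 10000000000))
```

A point lookup, such as the 7831329 query above, is the same call with `start` equal to `end`.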



The query times look pretty fast to me. Certainly I'm pretty impressed.
Our other backup solutions (involving SQL) likely wouldn't even touch this
in terms of speed. 



We will be testing this more in depth in the coming month. I am sort of
jumping ahead of our team to research possible solutions, since this is
something that worried us. Looks like it might work!

Thanks,
-Kevin

On 7/23/13 1:47 PM, "David Smiley (@MITRE.org)" <DSMILEY@mitre.org> wrote:

>Oh cool!  I'm glad it at least seemed to work.  Can you post your
>configuration of the field type and report from Solr's logs what
>"maxLevels" is used for this field?  It is logged the first time you use
>the field type.
>
>Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
>calculations I just did indicate there shouldn't be a problem but
>real-world usage will be a better proof.  Indexing probably won't be
>terribly slow, queries could get pretty slow if the amount of indexed
>data is really high.
>I'd love to hear how it works out for you.  Your use-case would benefit a
>lot from an improved prefix tree implementation.
>
>I don't gather how a 3rd dimension would play into this.  Support for
>multi-dimensional spatial is on the drawing board.
>
>~ David
>
>
>Kevin Stone wrote
>> What are the dangers of trying to use a range of 10 billion? Simply a
>> slower index time? Or will I get inaccurate results?
>> I have tried it on a very small sample of documents, and it seemed to
>> work. I could spend some time this week trying to get a more robust (and
>> accurate) dataset loaded to play around with. The reason for the 10
>> billion is to support being able to query for a region on a chromosome.
>> 
>> A user might want to know what genes overlap a point on a specific
>> chromosome. Unless I can use 3 dimensional coordinates (which gave an
>> error when I tried it), I'll need to multiply the coordinates by some
>> offset for each chromosome to be able to normalise the data (at both
>> index and query time). The largest chromosome (chr 1) has almost
>> 250,000,000 base pairs. I could probably squeeze the rest a bit smaller,
>> but I'd rather use one size for all chromosomes, since we have more than
>> just human data to deal with. It would get quite messy otherwise.
>> 
>> 
>> On 7/22/13 11:50 AM, "David Smiley (@MITRE.org)" <DSMILEY@...> wrote:
>> 
>>>Like Hoss said, you're going to have to solve this using
>>>http://wiki.apache.org/solr/SpatialForTimeDurations
>>>Using PointType is *not* going to work because your durations are
>>>multi-valued per document.
>>>
>>>It would be useful to create a custom field type that wraps the
>>>capability outlined on the wiki to make it easier to use without
>>>requiring the user to think spatially.
>>>
>>>You mentioned that these numeric ranges extend upwards of 10 billion or
>>>so.  Unfortunately, the current "prefix tree" implementation under the
>>>hood for non-geodetic spatial, the QuadTree, is unlikely to scale to
>>>numbers that big.  I don't know where the boundary is, but I doubt 10B.
>>>You could try and see what happens.  I'm working (very slowly on very
>>>little spare time) on improving the PrefixTree implementations to scale
>>>to such large numbers; I hope something will be available this fall.
>>>
>>>~ David Smiley
>>>
>>>
>>>Kevin Stone wrote
>>>> I have a particular use case that I think might require a custom field
>>>> type, however I am having trouble getting the plugin to work.
>>>> My use case has to do with genetics data, and we are running into
>>>> several situations where we need to be able to query multiple regions
>>>> of a chromosome (or gene, or other object types). All that really
>>>> boils down to is being able to give a number, e.g. 10234, and return
>>>> documents that have regions containing the number. So you'd have a
>>>> document with a list like ["10000:16090","400:8000","40123:43564"],
>>>> and it should come back because 10234 falls between "10000:16090". If
>>>> there is a better or easier way to do this please speak up. I'd rather
>>>> not have to use a "join" on another index, because 1) it's more
>>>> complex to set up, and 2) we might need to join against something else
>>>> and you can only do one join at a time.
>>>>
>>>> Anyway... I tried creating a field type similar to a PointType just to
>>>> see if I could get one working. I added the following jars to get it
>>>> to compile:
>>>>
>>>> apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0,
>>>> apache-solr-solrj-4.0.0.
>>>> I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib
>>>> folder, and specified it in my solr.xml (I have multiple cores).
>>>>
>>>> After starting up solr, I got the line that it picked up the jar:
>>>> INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader
>>>>
>>>> But I get this error about it not being able to find the
>>>> AbstractSubTypeFieldType class.
>>>> Here is the first bit of the trace:
>>>>
>>>> SEVERE: null:java.lang.NoClassDefFoundError:
>>>> org/apache/solr/schema/AbstractSubTypeFieldType
>>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
>>>> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>> ...etc...
>>>>
>>>>
>>>> Any hints as to what I did wrong? I can provide source code, or a
>>>> fuller stack trace, config settings, etc.
>>>>
>>>> Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
>>>> then repack. However, when I did that, I get a NoClassDefFoundError
>>>> for my plugin itself.
>>>>
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> The information in this email, including attachments, may be
>>>>confidential
>>>> and is intended solely for the addressee(s). If you believe you
>>>>received
>>>> this email by mistake, please notify the sender by return email as
>>>>soon
>>>>as
>>>> possible.
>>>
>>>
>>>
>>>
>>>
>>>-----
>>> Author:
>>>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>>--
>>>View this message in context:
>>>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079494.html
>>>Sent from the Solr - User mailing list archive at Nabble.com.
>> 
>> 
>
>
>
>
>
>-----
> Author: 
>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079822.html
>Sent from the Solr - User mailing list archive at Nabble.com.


