lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "SolrAdaptersForLuceneSpatial4" by DavidSmiley
Date Fri, 05 Oct 2012 05:36:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "SolrAdaptersForLuceneSpatial4" page has been changed by DavidSmiley:
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4?action=diff&rev1=5&rev2=6

Comment:
More progress; more to come tomorrow

  
   * Multi-valued indexed fields.  This is critical for storing the results of automatic place
extraction from text using natural language processing techniques with a gazetteer (a variant
of "geocoding"), since a variable number of locations will be found.
   * Index shapes with area, not just points.  An indexed shape is essentially pixelated (i.e.
gridded) to a configured resolution per shape.  By default that resolution is defined by a
percentage of the overall shape size, and it applies to query shapes too.  Note: If extremely
high precision of shape edges needs to be retained for accurate indexing, then this solution
probably won't scale too well at indexing time (big indexes, slow indexing).  On the other
hand, query shapes generally scale well to the maximum configured precision regardless of
shape size.  Note: indexing shapes with area sorely [[https://issues.apache.org/jira/browse/LUCENE-4419|needs
testing]].
-  * Polygon, LineString and other new shapes.  All shapes are supported as indexed shapes
and query shapes.  Note: Shapes other than point, rectangle and circle are supported via JTS
-- an otherwise optional dependency.  JTS views the world as a flat plane; the latitude and
longitude are mapped to this plane directly.  It uses Euclidean math operations, not Geodesic
ones.  By and large this isn't a problem, although it can be if the vertices are particularly
far apart longitudinally.  Spatial4j adapts shapes that cross the dateline to be compatible
with JTS, and you shouldn't notice a problem (notwithstanding unknown bugs).  It does not
support shapes covering the poles yet.  Consequently if you want to index or query by the
Antarctica polygon for example, you are out of luck for now.
+  * Polygon, LineString and other new shapes.  All shapes are supported as indexed shapes
and query shapes.  Note: Shapes other than point, rectangle and circle are supported via JTS
-- an otherwise optional dependency.  JTS views the world as a flat plane; the latitude and
longitude are mapped to this plane directly.  It uses Euclidean math operations, not Geodesic
ones.  By and large this isn't a problem, although it can be if the vertices are particularly
far apart longitudinally.  Spatial4j adapts shapes that cross the dateline to be compatible
with JTS, and so you shouldn't notice a problem (notwithstanding unknown bugs).  It does not
support shapes covering the poles yet.  Consequently if you want to index or query by the
Antarctica polygon for example, you are out of luck for now.
   * Rectangles with user-specifiable corners.  Oddly, Solr 3 spatial only supports the bounding
box of a circle. 
-  * Multi-value distance sort / score boost.  Note: this is a preliminary unoptimized implementation
that uses a fair amount of RAM.  An alternative should be provided in the future.
+  * Multi-value distance sort / score boost.  Note: this is a preliminary unoptimized implementation
that uses a fair amount of RAM, even when multiValued=false.  An alternative should be provided
in the future.
   * Configurable precision which can vary per shape at query time (and sort of at index time).
 This improves performance.
   * Fast filtering.  The code was benchmarked once showing it outperforms Solr 3's "LatLonType"
at its own game (single valued indexed points), and several 3rd parties anecdotally reported
the same, especially for multi-million document indices.  It is based on SOLR-2155 which was
benchmarked in January 2010; so a new benchmark is a TODO item.  Also, Solr 3 LatLonType sometimes
requires all the points to be in memory, whereas the new spatial module here doesn't for filtering.
+  * [[http://en.wikipedia.org/wiki/Well-known_text|Well Known Text]] (WKT) support via JTS.
 WKT is arguably the most widely supported textual format for shapes.  However, standard WKT
doesn't specify a format for circles.
  
  Of course, the basics in Solr 3 not mentioned here are implemented in this framework.  For
example, lat-lon bounding boxes and circles.
  
@@ -31, +32 @@

  
  First, you must register a spatial field type in the Solr schema.xml file.  The instructions
in this whole document imply the RecursivePrefixTreeStrategy based field type used in a geospatial
context.
  {{{
-     <fieldType name="geo"   class="solr.SpatialRecursivePrefixTreeFieldType"
+     <fieldType name="location_rpt"   class="solr.SpatialRecursivePrefixTreeFieldType"
                 spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
                 distErrPct="0.025"
-                maxDetailDist="0.001"
+                maxDistErr="0.000009"
+                units="degrees"
              />
  }}}
  The XML attributes are parameters for configuring the field type:
-  * spatialContextFactory: If polygons or WKT formatted shape support is needed, then use
the JTS based class as shown above, otherwise this can be omitted.  The JTS jar file must
be on Solr's classpath as well.
-  * distErrPct="0.025": When indexing shapes other than points, this is used to specify the
default precision of a shape's area which is basically pixelated on an indexed grid.  This
number is approximated as the fraction of a shape's average radius.  The name of the parameter
suggests it is a percentage but this is in error, it is a fraction between 0 and 1.  The closer
this number is to zero, the more space an indexed shape will take up and will take longer
to index.
-  * maxDetailDist="0.001": The highest level of detail indexed, expressed in kilometers,
i.e. the minimum required precision.  The actual detail will be even more precise than this,
and it will be higher towards the poles.  Internally, this number is used to derive a "maxLevels"
trie length for the trie encoding used, which is logged at startup.
+  * spatialContextFactory: If polygons or other WKT formatted shape support is needed, then
use the JTS based class as shown above, otherwise this can be omitted.  The JTS jar file must
be on Solr's classpath as well.  Due to a combination of things, JTS can't simply be referenced
by a "<lib>" entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file,
basically. 
+  * distErrPct="0.025": When indexing shapes other than points, this is used to specify the
default precision of a shape's area which is basically pixelated on an indexed grid.  This
number is approximated as the fraction of the distance between the center of a shape and the
farthest corner of its bounding box.  As a fraction it is between 0 and 1, although this one
is capped at 0.5.  The closer this number is to zero, the more accurate the shape will be
but it will use more disk space and it will take longer to index.
+  * maxDistErr="0.000009": The highest level of detail required for indexed data.  If you
specify nothing then it is an amount equivalent to a meter -- which is just a hair less than
0.000009 degrees.  The units of this are as indicated in the "units" attribute.  The actual
detail will generally be somewhat more precise than this.  Internally, this number is used
to derive a "maxLevels" trie length for the trie encoding used, which is logged at startup.
+  * units="degrees": This parameter is mandatory, and currently the only value supported
is "degrees".  This affects the interpretation of maxDistErr, circle radius distances, and
other absolute distances.  There are approximately 111.2 kilometers in a degree, based on
the average earth radius.
+ 
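As a rough illustration of the degree-based units described above, the following Python sketch converts between kilometers and degrees.  The radius constant is an assumption (a commonly used mean earth radius); the page itself only says "average earth radius", and the function names are hypothetical, not part of Solr or Spatial4j.

```python
import math

# Assumption: a commonly used mean earth radius in kilometers.
# The wiki page only refers to "the average earth radius".
EARTH_MEAN_RADIUS_KM = 6371.0087714

# Kilometers covered by one degree of arc along a great circle.
KM_PER_DEGREE = 2 * math.pi * EARTH_MEAN_RADIUS_KM / 360.0


def km_to_degrees(km: float) -> float:
    """Convert a distance in kilometers to degrees of arc (hypothetical helper)."""
    return km / KM_PER_DEGREE


def degrees_to_km(degrees: float) -> float:
    """Convert degrees of arc to kilometers (hypothetical helper)."""
    return degrees * KM_PER_DEGREE


if __name__ == "__main__":
    print(KM_PER_DEGREE)          # kilometers per degree
    print(km_to_degrees(0.001))   # one meter, expressed in degrees
    print(km_to_degrees(7.89))    # a 7.89 km radius, expressed in degrees
```

Under this assumption one degree comes to about 111.2 km, one meter is slightly less than 0.000009 degrees, and a 7.89 km radius is roughly 0.071 degrees.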
+ For non-geospatial uses, there are some other attributes to be aware of:
+  * geo="false": Set geospatial to false. It defaults to true.  If you set it to false,
you really should also specify worldBounds and probably maxDistErr.
+  * worldBounds="minX minY maxX maxY": Set the valid numerical ranges for x & y.  For
non-geospatial use, the default is the limits of a Java double; however, those values have been
shown not to work (yet).
+ 
- There are other parameters not yet documented as they are more obscure, such as using for
non-geo, using other distance calculation formulas, other default units besides km, internal
trie encodings other than geohash.
+ There are other parameters not yet documented as they are more obscure, such as using other
distance calculation formulas, and specifying the grid encoding (geohash vs quad).
  
  And finally, specify a field that uses this field type:
- {{{   <field name="geo"  type="geo"  indexed="true" stored="true"  multiValued="true"
/>  }}}
+ {{{   <field name="geo"  type="location_rpt"  indexed="true" stored="true"  multiValued="true"
/>  }}}
  
  A key feature of the new spatial module is multi-value support but you certainly aren't
required to declare the field multiValued if it isn't.
  
@@ -57, +65 @@

  If a comma is omitted, then it is in x-y (lon-lat) order:
  {{{	<field name="geo">-90.57341 43.17614</field> }}}
  
- A lat-lon rectangle can be indexed with 4 numbers in minX maxX minY maxY order:
+ A lat-lon rectangle can be indexed with 4 numbers in minX minY maxX maxY order:
  {{{	<field name="geo">-74.093 41.042 -69.347 44.558</field> }}}
  
  A circle is specified like so:
- {{{	<field name="geo">Circle(4.56,1.23  distance=7.89)</field> }}}
+ {{{	<field name="geo">Circle(4.56,1.23 d=0.0710)</field> }}}
- The first part of it is the center point, in either "lon lat" or "lat,lon" format, then
the distance in km.  "d" can be used to abbreviate "distance".
+ The first part is the center point, in either "lon lat" or "lat,lon" format; the "d" value
that follows is the radius, in degrees.
  
  For polygons, use the WKT standard (Well Known Text) like so:
  {{{	 <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
}}}
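The field value formats above are plain strings, so a client can assemble them with simple formatting.  The Python sketch below is only an illustration; the helper names are hypothetical and nothing here is part of Solr's API.  It assumes the circle radius is already in degrees and prints it with four decimal places.

```python
def format_point(lat: float, lon: float, comma: bool = True) -> str:
    """Format a point as "lat,lon" (with comma) or "lon lat" (x-y order, without)."""
    return f"{lat},{lon}" if comma else f"{lon} {lat}"


def format_rectangle(min_x: float, min_y: float, max_x: float, max_y: float) -> str:
    """Format a rectangle in minX minY maxX maxY order."""
    return f"{min_x} {min_y} {max_x} {max_y}"


def format_circle(lat: float, lon: float, radius_degrees: float) -> str:
    """Format a circle with a "lat,lon" center and a radius in degrees."""
    return f"Circle({lat},{lon} d={radius_degrees:.4f})"


if __name__ == "__main__":
    print(format_point(43.17614, -90.57341, comma=False))
    print(format_rectangle(-74.093, 41.042, -69.347, 44.558))
    print(format_circle(4.56, 1.23, 0.071))
```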
@@ -70, +78 @@

  
  == Shape / Polygon / WKT notes ==
  
-  * Only Polygon, and Multipolygon WKT types have been tested.  GeometryCollection will not
work but the others should in theory.  Holes in polygons haven't been tested but may work.
+  * Only Polygon and Multipolygon WKT types have been tested.  GeometryCollection will not
work but the others should in theory.  Holes in polygons haven't been tested but there
is code to support them.
   * The implementation doesn't support WKT that encompasses a pole.  The only shape that
can encompass a pole is a Circle.  Technically a longitude-wrapping (-180 to +180) lat-lon
box that touches a pole will too though.
   * Each vertex of a polygon or other WKT shape must differ by less than 180 degrees in longitude
from the vertex before it, or else the edge will be interpreted as going the other way around the globe.
 Dateline crossing is supported.
-  * All input coordinates are normalized into the standard geospatial lat-lon boundaries.
 So, -184 longitude becomes +176, for example.  Both +180 and -180 are kept distinct.
+  * All WKT input coordinates are normalized into the standard geospatial lat-lon boundaries.
 So, -184 longitude becomes +176, for example.  Both +180 and -180 are kept distinct.
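The longitude normalization described in the last bullet can be sketched as follows.  This is a minimal Python illustration of the stated behavior (-184 becomes +176, while +180 and -180 stay distinct), not the actual Spatial4j code; the function name is hypothetical and latitude handling is deliberately left out.

```python
def normalize_longitude(lon: float) -> float:
    """Wrap a longitude into [-180, 180], leaving in-range values untouched.

    Values already in range are returned as-is, so +180 and -180 remain distinct.
    """
    if -180.0 <= lon <= 180.0:
        return lon
    # Shift so the valid range starts at 0, wrap with modulo, then shift back.
    return (lon + 180.0) % 360.0 - 180.0


if __name__ == "__main__":
    for value in (-184.0, 184.0, 180.0, -180.0):
        print(value, normalize_longitude(value))
```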
  
  == Search ==
  
  Searching with the new spatial module works significantly differently than in Solr 3 spatial.
 Here is a Solr filter query parameter for a lat-lon bounding box:
  
- {{{	fq={!needScore=false}geo:"Intersects(-74.093 41.042 -69.347 44.558)"  }}}
+ {{{	fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"  }}}
  
- The needScore local-param is optional but provides an optimization hint that should be used
for using the new spatial module in a filter query.  Notice that the query uses the standard
default Lucene query parser and uses its fielded-query syntax in which a field is referenced
followed by a colon.  The spatial operation and shape are provided in the double-quotes. 
Just use Intersects operation for now, as the other's aren't well supported.  The contents
of the parenthesis are a shape in the very same format used when indexing.
+ Notice that the query uses the standard default Lucene query parser and uses its fielded-query
syntax in which a field is referenced followed by a colon.  The spatial operation and shape
are provided in the double-quotes.  Just use the Intersects operation for now, as the others
aren't well supported.  The contents of the parentheses are a shape in the very same format
used when indexing.
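Because the filter contains quotes, colons and spaces, it must be URL-encoded when sent over HTTP.  The Python sketch below builds the filter string and an encoded query string; the helper names and the match-all "q" value are illustrative assumptions, not something this page prescribes.

```python
from urllib.parse import urlencode


def intersects_filter(field: str, shape: str) -> str:
    """Build a fielded spatial filter such as geo:"Intersects(<shape>)"."""
    return f'{field}:"Intersects({shape})"'


def build_query_string(field: str, shape: str, q: str = "*:*") -> str:
    """URL-encode the main query and the spatial filter query.

    Assumption: the main query matches all documents and only the
    filter restricts results by location.
    """
    return urlencode({"q": q, "fq": intersects_filter(field, shape)})


if __name__ == "__main__":
    shape = "-74.093 41.042 -69.347 44.558"  # minX minY maxX maxY
    print(intersects_filter("geo", shape))
    print(build_query_string("geo", shape))
```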
  
  Keep in mind that the query shape will by default have a non-zero precision of 0.025 (2.5%),
calculated in the same way that distErrPct is on the field type declaration.  Here is an example
polygon query setting it to 0:
  
- {{{	fq={!needScore=false}geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10
30))) distPrec=0" }}}
+ {{{	fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
}}}
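To get a feel for what a given distErrPct means in absolute terms, the Python sketch below multiplies the fraction by the distance from the center of a shape's bounding box to a corner.  It is a simplification under stated assumptions: it measures plain Euclidean distance in degrees (so all four corners are equally far), whereas the real calculation in a geospatial context may differ; the function name is hypothetical.

```python
import math


def approx_dist_err(points, dist_err_pct=0.025):
    """Approximate the allowed error distance for a shape given its vertices.

    points: iterable of (x, y) pairs, i.e. (lon, lat).
    dist_err_pct: fraction between 0 and 0.5 (the cap mentioned for the field type).
    Assumption: planar Euclidean distance in degrees, not geodesic distance.
    """
    if not 0.0 <= dist_err_pct <= 0.5:
        raise ValueError("dist_err_pct must be between 0 and 0.5")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center_x = (min(xs) + max(xs)) / 2.0
    center_y = (min(ys) + max(ys)) / 2.0
    # Distance from the bounding-box center to one of its corners.
    center_to_corner = math.hypot(max(xs) - center_x, max(ys) - center_y)
    return dist_err_pct * center_to_corner


if __name__ == "__main__":
    polygon = [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]
    print(approx_dist_err(polygon))         # default 2.5% precision
    print(approx_dist_err(polygon, 0.0))    # distErrPct=0
```

For the example polygon the bounding box spans -40 to 40 in x and -20 to 40 in y, so the center-to-corner distance is 50 degrees and the default fraction yields about 1.25 degrees under these assumptions.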
  
- For skinny snake-like polygons, this is often desired.
- 
- The search results presently show the field value in "x y" order, but in the future it will
be "y,x" order for a geospatial context.  And also know that, at least for now, the point
detail reflects rounding to the maxDetailDist field type configuration parameter, so it won't
be precisely the same as that given on indexing.  This will probably be changed in the future.
  
  == Final Notes ==
  
