lucene-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "UpdateCSV" by CassandraTargett
Date Tue, 20 Dec 2016 21:40:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "UpdateCSV" page has been changed by CassandraTargett:
https://wiki.apache.org/solr/UpdateCSV?action=diff&rev1=23&rev2=24

Comment:
remove outdated content; point users to Ref Guide

  = Updating a Solr Index with CSV =
  
+ {{{#!wiki important
+ This page exists for the Solr Community to share Tips, Tricks, and Advice about
+ [[https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates|CSV
Update Handler]].
+  
+ Reference material previously located on this page has been migrated to the
+ [[https://cwiki.apache.org/solr/|Official Solr Reference Guide]].
+ If you need help, please consult the Reference Guide for the version of Solr you are using
+ for the specific details about using [[https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates|this
feature]].
+  
+ If you'd like to share information about how you use this feature, please [[FrontPage#How_to_edit_this_Wiki|add
it to this page]].
+ /* cwikimigrated */
- Solr accepts index updates in [[http://en.wikipedia.org/wiki/Comma-separated_values|CSV]]
(Comma Separated Values) format.  Different separators and escape mechanisms are configurable,
and multi-valued fields are supported.  You may also be interested in the syntax for the
XML-based [[UpdateXmlMessages|update]] directive, or Solr's [[CSVResponseWriter|CSV output
capability]].
- 
- <!> [[Solr1.2]]
- 
- <<TableOfContents>>
- 
- == Requirements ==
- <!> [[Solr1.2]] is the first version with CSV support for updates.
- 
- The CSV request handler needs to be configured in solrconfig.xml.
- This should already be present in the example solrconfig.xml
- {{{
-   <!-- CSV update handler, loaded on demand -->
-   <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
-   </requestHandler>
  }}}
  
- <!> In [[Solr4.0]], CSV support is included in the standard [[UpdateRequestHandler]]
- {{{
-   <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
- }}}
- Note: requests need to include Content-type:application/csv or Content-type:text/csv.
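- 
- For instance, the {{{books.csv}}} sample used later on this page could be sent to the combined handler like this (a sketch assuming the stock example server and a working directory of {{{example/exampledocs}}}):
- {{{
- # The Content-type header tells the combined /update handler to parse the body as CSV.
- curl http://localhost:8983/solr/update --data-binary @books.csv -H 'Content-type:application/csv; charset=utf-8'
- }}}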
- 
- 
- 
- == Methods of uploading CSV records ==
- CSV records may be uploaded to Solr by sending the data to the /solr/update/csv URL.
- All of the normal methods for [[ContentStream|uploading content]] are supported.
- 
- === Example ===
- There is a sample CSV file at {{{example/exampledocs/books.csv}}} that may be used to add
documents to the Solr example server.
- 
- Example of using HTTP-POST to send the CSV data over the network to the Solr server:
- {{{
- cd example/exampledocs
- curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain;
charset=utf-8'
- }}}
- 
- Uploading a local CSV file can be more efficient than sending it over the network via HTTP.
- Remote streaming must be enabled for this method to work.  See the following line in {{{solrconfig.xml}}},
change it to {{{enableRemoteStreaming="true"}}}, and restart Solr.
- {{{
-   <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
- }}}
- 
- The following request will cause Solr to directly read the input file:
- {{{
- curl 'http://localhost:8983/solr/update/csv?stream.file=exampledocs/books.csv&stream.contentType=text/plain;charset=utf-8'
- #NOTE: The full path, or a path relative to the CWD of the running solr server must be used.
- }}}
- 
- == Parameters ==
- Some parameters may be specified on a per field basis via {{{f.<fieldname>.param=value}}}
- 
- Example: if the CSV file contains multi-valued field(s) that use their own separator(s), they can be split by passing the additional parameters {{{f.<fieldname>.split=true&f.<fieldname>.separator=<separator>}}}. In other words, multi-valued fields with different separators can be part of the same CSV file and still be indexed into Solr using this per-field syntax.
- 
- === separator ===
- Specifies the character to act as the field separator.  Default is {{{separator=,}}}
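- 
- For example, a pipe-delimited file (here a hypothetical {{{data.psv}}}; {{{%7C}}} is the URL-encoded pipe character) could be loaded by overriding the default comma:
- {{{
- # Override the default comma separator with a pipe.
- curl 'http://localhost:8983/solr/update/csv?separator=%7C' --data-binary @data.psv -H 'Content-type:text/plain; charset=utf-8'
- }}}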
- 
- === header ===
- {{{true}}} if the first line of the CSV input contains field or column names. The default
is {{{header=true}}}.
- If the {{{fieldnames}}} parameter is absent, these field names will be used when adding
documents to the index.
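- 
- For example, if the input has no header row, you would typically disable this and name the columns yourself (a sketch reusing the field names from the {{{fieldnames}}} example below):
- {{{
- header=false&fieldnames=id,name,category
- }}}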
- 
- === fieldnames ===
- Specifies a comma separated list of field names to use when adding documents to the Solr
index.  If the CSV input already has a header, the names specified by this parameter will
override them.
- 
- Example: {{{fieldnames=id,name,category}}}
- 
- === skip ===
- A comma separated list of field names to skip in the input.
- An alternate way to skip a field is to specify its name as a zero-length string in {{{fieldnames}}}.
- 
- Example: 
- {{{
- fieldnames=id,name,category&skip=name
- }}}
- skips the name field, and is equivalent to 
- {{{
- fieldnames=id,,category
- }}}
- 
- === skipLines ===
- Specifies the number of lines in the input stream to discard before the CSV data starts
(including the header, if present). Default is {{{skipLines=0}}}.
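- 
- For example, to discard the first two lines of the input stream before parsing begins (a hypothetical file with two preamble lines):
- {{{
- skipLines=2
- }}}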
- 
- === trim ===
- If {{{true}}} remove leading and trailing whitespace from values.  CSV parsing already ignores
leading whitespace by default, but there may be trailing whitespace, or there may be leading
whitespace that is encapsulated by quotes and is thus not removed.  This may be specified
globally, or on a per-field basis.  The default is {{{trim=false}}}
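- 
- For example, to strip stray whitespace from a single field only (a sketch using the {{{name}}} field from the examples above):
- {{{
- f.name.trim=true
- }}}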
- 
- === encapsulator ===
- The character optionally used to surround values to preserve characters such as the CSV
separator or whitespace.
- The standard CSV format handles the encapsulator itself appearing inside an encapsulated value
by doubling the encapsulator.
- 
- CSV Example of quotes inside an encapsulated value:
- {{{
- 100,"this is a ""quoted"" string inside an encapsulated value"
- }}}
- The default is {{{encapsulator="}}}
- 
- === escape ===
- <!> [[Solr1.3]]
- The character used for escaping CSV separators or other reserved characters.  If an escape
is specified, the encapsulator is not used unless also explicitly specified since most formats
use either encapsulation or escaping, not both.
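- 
- For example, a file that escapes embedded separators with a backslash instead of quoting values (here a hypothetical {{{escaped.csv}}}; {{{%5c}}} is the URL-encoded backslash) could be loaded with:
- {{{
- curl 'http://localhost:8983/solr/update/csv?escape=%5c' --data-binary @escaped.csv -H 'Content-type:text/plain; charset=utf-8'
- }}}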
- 
- === keepEmpty ===
- Keep and index empty (zero length) field values.  This may be specified globally, or on
a per-field basis.  The default is {{{keepEmpty=false}}}.
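- 
- For example, to index zero-length values for one field while keeping the global default (the field name {{{foo}}} is illustrative):
- {{{
- f.foo.keepEmpty=true
- }}}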
- 
- === literal ===
- <!> [[Solr4.0]] Adds a fixed field name/value pair to all documents.
- 
- Example: Adds a "datasource" field with value equal to "products" for every document indexed
from the CSV
- {{{
- literal.datasource=products
- }}}
- 
- === map ===
- Specifies a mapping between one value and another.  The string on the LHS of the colon will
be replaced with the string on the RHS.  This parameter can be specified globally or on a
per-field basis.
- 
- 
- Example: replaces "Absolutely" with "true" in every field
- {{{
- map=Absolutely:true
- }}}
- 
- Example: removes any values of "RemoveMe" in the field "foo"
- {{{
- f.foo.map=RemoveMe:&f.foo.keepEmpty=false
- }}}
- 
- === split ===
- If true, the field value is split into multiple values by another CSV parser.  The CSV parsing
rules such as {{{separator}}} and {{{encapsulator}}} may be specified as field parameters.
-  
- Example: for the following input
- {{{
- id,tags
- 101,"movie,spiderman,action"
- }}}
- to index the 3 separate tags into a multi-valued Solr field called "tags", use
- {{{
- f.tags.split=true
- }}}
- 
- Example: for the following input with a space separator and single quote encapsulator for
the tags field
- {{{
- id,tags
- 101,movie 'spider man' action
- }}}
- to index the 3 separate tags into a multi-valued Solr field called "tags", use
- {{{
- f.tags.split=true&f.tags.separator=%20&f.tags.encapsulator='
- }}}
- 
- The target Solr field of any split should be declared {{{multiValued}}} in the schema.
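- 
- A minimal {{{schema.xml}}} sketch for such a field (the field type and stored/indexed flags are illustrative):
- {{{
-   <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
- }}}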
- 
- === rowid ===
- 
- <!> [[Solr4.4]]
- 
- If not null, add a new field to each document, where the value passed in this parameter is the name of the field to add and the current line/row number is the field value.  This is useful if your CSV doesn't already have a unique id in it and you want to use the line number as one, or if you simply want to record where exactly in the original CSV file each row came from.
- 
- Example:
- {{{
- curl "http://localhost:8983/solr/update?rowid=id" --data-binary @1987.csv -H 'Content-type:application/csv;
charset=utf-8'
- }}}
- 
- === rowidOffset ===
- 
- <!> [[Solr4.4]]
- 
- Used in conjunction with the {{{rowid}}} parameter: this integer value will be added to the row number before adding it to the field.
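- 
- For example, to number rows starting from 100 in the field named by {{{rowid}}} (a sketch reusing the {{{1987.csv}}} file from the rowid example above):
- {{{
- curl "http://localhost:8983/solr/update?rowid=id&rowidOffset=100" --data-binary @1987.csv -H 'Content-type:application/csv; charset=utf-8'
- }}}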
- 
- === overwrite ===
- If {{{true}}} (the default), check for and overwrite duplicate documents, based on the uniqueKey
field declared in the Solr schema. If you know the documents you are indexing do not contain
any duplicates then you may see a considerable speed up with {{{&overwrite=false}}}.
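- 
- For example, a bulk load of data you already know contains no duplicate keys (an assumption about your data, not something Solr verifies for you) might use:
- {{{
- curl 'http://localhost:8983/solr/update/csv?overwrite=false' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
- }}}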
- 
- === commit ===
- Commit changes after all records in this request have been indexed.  The default is {{{commit=false}}}
to avoid the potential performance impact of frequent commits.
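- 
- For example, to make the uploaded documents searchable as soon as the request finishes:
- {{{
- curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
- }}}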
- 
- == Disadvantages ==
- There is no way to provide document or field index-time boosts with the CSV format; however, many indices do not use that feature.
- 
- Because the UpdateCSV handler functions at a lower level than the DataImportHandler (DIH), built-in features that the DIH provides, such as Transformers, EntityProcessors, and Import commands, aren't available when using UpdateCSV. Consequently, extra consideration should be given to the data format within the CSV file and how it is consumed by your Solr schema.
- 
- Unlike DIH, there isn't a queryable way to know the status of the import after it has been executed.
- 
- == Tab-delimited importing ==
- Don't let the "CSV" name fool you: this loader can load tab-delimited files as well, and can even handle backslash escaping rather than CSV encapsulation.
- 
- For example, one can dump a MySQL table to a tab-delimited file with
- {{{
- SELECT * INTO OUTFILE '/tmp/result.text' FROM mytable;
- }}}
- 
- This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash (%5c):
- {{{
- curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=\&stream.file=/tmp/result.text'
- }}}
- <!> [[Solr1.3]] is required to specify an escape.
- 
