lucene-solr-user mailing list archives

From "O'Shaughnessy, Devon" <dev...@ulfoods.com>
Subject Indexing a CSV that contains double quotes
Date Mon, 07 Aug 2017 16:57:23 GMT
Hello all,

I'm pretty new to Solr, having only worked with it for a couple of weeks, and I'm guessing I'm having
a newbie problem of some sort. I'm a little confused about how Solr handles double quotes
within strings. I'm uploading a CSV to Solr once a day containing some item data, some of
which contains quotes, and I'm getting some errors. I'll do my best to explain my problem.

Here is my schema:

  <field name="Cat1_Description" type="text_en"/>
  <field name="Cat2_Description" type="text_en"/>
  <field name="Cat3_Description" type="text_en"/>
  <field name="Cat1_Facet" type="string"/>
  <field name="Cat2_Facet" type="string"/>
  <field name="Cat3_Facet" type="string"/>
  <field name="Item_Cat1" type="string"/>
  <field name="Item_Cat2" type="string"/>
  <field name="Item_Cat3" type="string"/>
  <field name="Item_Combined" type="string" indexed="false"/>
  <field name="Item_Description" type="text_en"/>
  <field name="Item_Number" type="string" indexed="true" required="true" stored="true"/>
  <field name="Item_Status" type="string"/>
  <field name="Keywords" type="text_en"/>
  <field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="long" indexed="false" stored="false"/>
  <copyField source="*" dest="_text_"/>
  <copyField source="Cat1_Description" dest="Cat1_Facet"/>
  <copyField source="Cat2_Description" dest="Cat2_Facet"/>
  <copyField source="Cat3_Description" dest="Cat3_Facet"/>

The command I am using to update the data:

curl 'http://10.0.1.24:8983/solr/products/update?commit=true' --data-binary @solrItmList.csv \
  -H 'Content-type:application/csv'

This is the error I receive in response:

[cid:745aeeaa-f63d-4eed-8b9b-5ca9d2b258cb]

If for some reason the image doesn't show, it's an XML response indicating an IOException
with the message "CSVLoader: input=null, line=2014, can't read line: 2013 values={NO LINES
AVAILABLE}" and a code of 400.

In the solr.log file, the java.io.IOException is explained further:

"(line 2013) invalid char between encapsulated token end delimiter"

Here is an example of my data that is coming from the CSV that is giving me trouble.

(Headings at the top of the CSV)
Item Number,Item Description,Item Combined,Item Status,Item Cat1,Cat1 Description,Item Cat2,Cat2 Description,Item Cat3,Cat3 Description,Keywords

(Specific entry that Solr stops at.)
152600,YOGURT "PARFAIT PRO" LF,152600 YOGURT "PARFAIT PRO" LF,A,1002,Dairy,2231,Yogurt,11408,Yogurt Bulk,"PARFAIT INC FAT FOODS FREE GF GLUTEN INC LOW MILL MILLS PARFAIT PRO PRO" SMART SNACK VANILLA VAQNILLA YOGURT

Notice the double quotes in Item Description, Item Combined, and Keywords.
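
To make the position of the quotes in each field concrete, here is a minimal Python sketch. It is an illustration only, not part of the Solr setup; the helper name quote_positions is hypothetical, and the naive comma split is an assumption that holds only because none of these values contains a comma. It reports, per field, whether a double quote appears at the very start of the value or somewhere inside it. A parser that treats a leading quote as the start of an encapsulated value would presumably expect a separator straight after the closing quote, so that difference may be relevant.

```python
# The problem row as a single line.
row = ('152600,YOGURT "PARFAIT PRO" LF,152600 YOGURT "PARFAIT PRO" LF,A,1002,Dairy,'
       '2231,Yogurt,11408,Yogurt Bulk,'
       '"PARFAIT INC FAT FOODS FREE GF GLUTEN INC LOW MILL MILLS PARFAIT PRO PRO" '
       'SMART SNACK VANILLA VAQNILLA YOGURT')

headers = ["Item Number", "Item Description", "Item Combined", "Item Status",
           "Item Cat1", "Cat1 Description", "Item Cat2", "Cat2 Description",
           "Item Cat3", "Cat3 Description", "Keywords"]


def quote_positions(line, names):
    """Classify each field that contains a double quote.

    Uses a naive comma split, which is only safe here because
    no value in this row contains a comma.
    """
    result = {}
    for name, value in zip(names, line.split(",")):
        if '"' in value:
            result[name] = "leading" if value.startswith('"') else "embedded"
    return result


positions = quote_positions(row, headers)
for name, where in positions.items():
    print(f"{name}: {where} quote")
```

Of the three fields that contain quotes, only Keywords begins with one; in Item Description and Item Combined the first quote comes after other text.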

So the strange thing is, if I remove the Keywords field from the schema and generate a CSV
that does not include the Keywords data, but otherwise make no other changes, the data is
able to load just fine, even though there are still double quotes in the Item Description
and Item Combined fields.

I know there shouldn't be any double quotes in the data, which I am working on getting rectified,
but I'm just wondering: why is this an issue with one of my fields but not others, seeing
as they have the same data type?
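
As a sketch of one way the quotes could be kept rather than stripped from the data, the snippet below writes a row with Python's standard csv module, which wraps any value containing a double quote in quotes and doubles the embedded ones. This assumes the CSV is produced by a script that can be changed, and that the Solr CSV loader is left at its default double-quote encapsulator, which as far as I understand accepts doubled quotes inside an encapsulated value. The helper name to_csv_line and the shortened three-column row are hypothetical.

```python
import csv
import io


def to_csv_line(values):
    """Serialize one row, quoting only the values that need it.

    QUOTE_MINIMAL wraps a value in double quotes when it contains the
    separator, a quote, or a line break; embedded quotes are doubled.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(values)
    return buf.getvalue()


# Shortened, illustrative row: item number, description, status.
values = ["152600", 'YOGURT "PARFAIT PRO" LF', "A"]
line = to_csv_line(values)
print(line, end="")

# Reading the line back should give the original values unchanged.
round_trip = next(csv.reader(io.StringIO(line)))
```

The description comes out as "YOGURT ""PARFAIT PRO"" LF", and reading the line back with csv.reader returns the original three values.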



Wow, this email ended up really long for such a simple question! Any enlightenment would be
much appreciated.

Thanks,



Devon O'Shaughnessy

Developer/Analyst

Upper Lakes Foods

p: 800.879.1265 | ext: 4135

w: http://upperlakesfoods.com/


