lucene-solr-user mailing list archives

From "Meraj A. Khan" <mera...@gmail.com>
Subject Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?
Date Fri, 02 Jan 2015 20:57:00 GMT
Is this SolrCloud or a single Solr instance?
On Jan 2, 2015 3:44 PM, <jiag@ece.ubc.ca> wrote:

> Happy New Year Everyone :)
>
> I am trying to automatically generate document Id when indexing a csv
> file that contains multiple lines of documents. The desired case: if the
> csv file contains 2 lines (each line is a document), then the index
> should contain 2 documents.
>
> What I observed: if the csv file contains 2 lines, then the index
> contains 3 documents, because the 1st document is repeated once. An
> example output:
> <doc>
>   <str name="col1">doc1</str>
>   <str name="col2">rank1</str>
>   <str name="id">randomlyGeneratedId1</str>
> </doc>
> <doc>
>   <str name="col1">doc1</str>
>   <str name="col2">rank1</str>
>   <str name="id">randomlyGeneratedId2</str>
> </doc>
> <doc>
>   <str name="col1">doc2</str>
>   <str name="col2">rank2</str>
>   <str name="id">randomlyGeneratedId3</str>
> </doc>
>
> And if the csv file contains 3 lines, then the index contains 6 documents,
> because document 1 appears 3 times and document 2 appears twice,
> as follows:
> <doc>
>   <str name="col1">doc1</str>
>   <str name="col2">rank1</str>
>   <str name="id">randomlyGeneratedId1</str>
> </doc>
> <doc>
>   <str name="col1">doc1</str>
>   <str name="col2">rank1</str>
>   <str name="id">randomlyGeneratedId2</str>
> </doc>
> <doc>
>   <str name="col1">doc2</str>
>   <str name="col2">rank2</str>
>   <str name="id">randomlyGeneratedId3</str>
> </doc>
> <doc>
>   <str name="col1">doc1</str>
>   <str name="col2">rank1</str>
>   <str name="id">randomlyGeneratedId4</str>
> </doc>
> <doc>
>   <str name="col1">doc2</str>
>   <str name="col2">rank2</str>
>   <str name="id">randomlyGeneratedId5</str>
> </doc>
> <doc>
>   <str name="col1">doc3</str>
>   <str name="col2">rank3</str>
>   <str name="id">randomlyGeneratedId6</str>
> </doc>
>
> Here's what I have done:
> 1. In my solrconfig.xml:
> <updateRequestProcessorChain name="autoGenId">
>   <processor class="solr.UUIDUpdateProcessorFactory">
>     <str name="fieldName">doc_key</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">autoGenId</str>
>   </lst>
> </requestHandler>
> 2. In schema.xml:
> <field name="doc_key" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <field name="col1" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <field name="col2" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <uniqueKey>id</uniqueKey>
>
> This problem doesn't occur when I assign an id field myself instead of
> using the UUIDUpdateProcessorFactory, so I assume the problem is there.
> It looks like the csv file is processed one line at a time, and the index
> shows the entire process, so we see each previous line repeated in the
> output. Is there a way to not show the 'appending of previous lines', and
> rather just the 'final results', so that the total number of indexed
> documents matches the number of input documents in the csv file?
>
> Many thanks,
> Jia
>
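
For reference, the counts reported in the quoted message (2 lines giving 3 documents, 3 lines giving 6) fit a simple pattern: after line k is read, lines 1..k all end up in the index again, each under a new generated id, so n lines yield n(n+1)/2 documents. The Python sketch below only models that reported symptom; it is not Solr code and does not claim to describe what Solr does internally. The function name simulate_cumulative_indexing is hypothetical, and the ids simply mimic the placeholder names used in the message.

```python
from itertools import count


def simulate_cumulative_indexing(rows):
    """Model of the reported symptom (an assumption, not Solr's real logic):
    after reading row k, rows 1..k are all indexed again, each with a new id."""
    ids = count(1)  # stand-in for the randomly generated UUIDs
    index = []
    for k in range(1, len(rows) + 1):
        for row in rows[:k]:
            index.append({"id": f"randomlyGeneratedId{next(ids)}", **row})
    return index


# Three csv lines, named as in the quoted example output.
rows = [{"col1": f"doc{i}", "col2": f"rank{i}"} for i in (1, 2, 3)]
index = simulate_cumulative_indexing(rows)

print(len(index))                   # 6, i.e. 3 * 4 / 2
print([d["col1"] for d in index])   # doc1, doc1, doc2, doc1, doc2, doc3
```

With only the first two rows the same model gives 3 documents, which matches the first example in the message. Because every copy gets a different id, the uniqueKey cannot collapse the repeats, which would be consistent with the problem disappearing when the id is supplied in the csv itself.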
