lucene-solr-user mailing list archives

From Vineeth Dasaraju <vineeth.ii...@gmail.com>
Subject Re: Tips for faster indexing
Date Tue, 21 Jul 2015 19:38:38 GMT
Hi,

Thank you, Erick, for your input. I tried creating batches of 1,000 objects
and indexing them into Solr. The performance is much better than before, but
the number of indexed documents shown in the dashboard is lower than the
number of documents that I actually indexed through SolrJ. My code is as
follows:

private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException,
SolrServerException, ParseException {
        File file = new File(JSON_FILE_PATH);
        Scanner scn=new Scanner(file,"UTF-8");
        JSONObject object;
        int i = 0;
        Collection<SolrInputDocument> batch = new
ArrayList<SolrInputDocument>();
        while(scn.hasNext()){
            object= (JSONObject) parser.parse(scn.nextLine());
            SolrInputDocument doc = indexJSON(object);
            batch.add(doc);
            if(i%1000==0){
                System.out.println("Indexed " + (i+1) + " objects." );
                solr.add(batch);
                batch = new ArrayList<SolrInputDocument>();
            }
            i++;
        }
        solr.add(batch);
        solr.commit();
        System.out.println("Indexed " + (i+1) + " objects." );
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
ParseException, IOException, SolrServerException {
    Collection<SolrInputDocument> batch = new
ArrayList<SolrInputDocument>();

    SolrInputDocument mainEvent = new SolrInputDocument();
    mainEvent.addField("id", generateID());
    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));

    Object obj = parser.parse(jsonOBJ.get("User").toString());
    JSONObject userObj = (JSONObject) obj;

    SolrInputDocument childUserEvent = new SolrInputDocument();
    childUserEvent.addField("id", generateID());
    childUserEvent.addField("User", userObj.get("User"));

    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
    JSONObject eventdescriptionObj = (JSONObject) obj;

    SolrInputDocument childEventDescEvent = new SolrInputDocument();
    childEventDescEvent.addField("id", generateID());
    childEventDescEvent.addField("EventApplicationName",
eventdescriptionObj.get("EventApplicationName"));
    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));

    obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
    JSONArray informationArray = (JSONArray) obj;

    for(int i = 0; i<informationArray.size(); i++){
        JSONObject domain = (JSONObject) informationArray.get(i);

        SolrInputDocument domainDoc = new SolrInputDocument();
        domainDoc.addField("id", generateID());
        domainDoc.addField("domainName", domain.get("domainName"));

        String s = domain.get("columns").toString();
        obj= JSONValue.parse(s);
        JSONArray ColumnsArray = (JSONArray) obj;

        SolrInputDocument columnsDoc = new SolrInputDocument();
        columnsDoc.addField("id", generateID());

        for(int j = 0; j<ColumnsArray.size(); j++){
            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
            SolrInputDocument columnDoc = new SolrInputDocument();
            columnDoc.addField("id", generateID());
            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
            columnsDoc.addChildDocument(columnDoc);
        }
        domainDoc.addChildDocument(columnsDoc);
        childEventDescEvent.addChildDocument(domainDoc);
    }

    mainEvent.addChildDocument(childEventDescEvent);
    mainEvent.addChildDocument(childUserEvent);
    return mainEvent;
}

I would be grateful if you could let me know what I am missing.
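
One thing worth checking in the loop in main above: because i starts at 0
and the test is i%1000==0, the first call to solr.add sends a batch holding
a single document, and the final message prints i+1, which is one more than
the number of objects read. This does not by itself explain the lower count
in the dashboard, but it makes the printed totals hard to compare against
it. Below is a minimal sketch of size-based batching that counts what was
actually handed over. The class BatchSketch, the method sendInBatches, and
the sink callback are hypothetical names used for illustration; in the
indexing program the sink would stand in for a call such as solr.add(batch).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchSketch {

    /**
     * Hands items to the sink in batches of at most batchSize and returns
     * the number of items handed over. The sink is a plain callback so the
     * sketch runs without a Solr server (assumption: in real code it would
     * wrap something like solr.add(batch)).
     */
    public static <T> int sendInBatches(Iterable<T> items, int batchSize,
                                        Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int sent = 0;
        for (T item : items) {
            batch.add(item);
            // Flush on the batch's own size, not on a separate counter,
            // so every full batch holds exactly batchSize items.
            if (batch.size() == batchSize) {
                sink.accept(batch);
                sent += batch.size();
                batch = new ArrayList<>(batchSize);
            }
        }
        // Flush the remainder, skipping the call when nothing is left.
        if (!batch.isEmpty()) {
            sink.accept(batch);
            sent += batch.size();
        }
        return sent;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int n = 0; n < 2500; n++) {
            docs.add(n);
        }
        List<Integer> batchSizes = new ArrayList<>();
        int sent = sendInBatches(docs, 1000, b -> batchSizes.add(b.size()));
        // prints: Sent 2500 items in batches [1000, 1000, 500]
        System.out.println("Sent " + sent + " items in batches " + batchSizes);
    }
}
```

The returned count covers top-level objects only. Each object in the code
above also carries child documents, so the two figures may not be directly
comparable with whatever the dashboard reports.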

On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> First thing is it looks like you're only sending one document at a
> time, perhaps with child objects. This is not optimal at all. I
> usually batch my docs up in groups of 1,000, and there is anecdotal
> evidence that there may (depending on the docs) be some gains above
> that number. Gotta balance the batch size off against how big the docs
> are of course.
>
> Assuming that you really are calling this method for one doc (and
> children) at a time, the far bigger problem other than calling
> server.add for each parent/children is that you're then calling
> solr.commit() every time. This is an anti-pattern. Generally, let the
> autoCommit setting in solrconfig.xml handle the intermediate commits
> while the indexing program is running and only issue a commit at the
> very end of the job if at all.
>
> Best,
> Erick
>
> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
> <vineeth.iitgn@gmail.com> wrote:
> > Hi,
> >
> > I am trying to index JSON objects (which contain nested JSON objects and
> > Arrays in them) into solr.
> >
> > My JSON object looks like the following (this is fake data that I am
> > using for this example):
> >
> > {
> >     "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam dolor orci, placerat ac pretium a, tincidunt consectetur mauris. Etiam sollicitudin sapien id odio tempus, non sodales odio iaculis. Donec fringilla diam at placerat interdum. Proin vitae arcu non augue facilisis auctor id non neque. Integer non nibh sit amet justo facilisis semper a vel ligula. Pellentesque commodo vulputate consequat. ",
> >     "EventUid": "1279706565",
> >     "TimeOfEvent": "2015-05-01-08-07-13",
> >     "TimeOfEventUTC": "2015-05-01-01-07-13",
> >     "EventCollector": "kafka",
> >     "EventMessageType": "kafka-@column",
> >     "User": {
> >         "User": "Lorem ipsum",
> >         "UserGroup": "Manager",
> >         "Location": "consectetur adipiscing",
> >         "Department": "Legal"
> >     },
> >     "EventDescription": {
> >         "EventApplicationName": "",
> >         "Query": "SELECT * FROM MOVIES",
> >         "Information": [
> >             {
> >                 "domainName": "English",
> >                 "columns": [
> >                     {
> >                         "movieName": "Casablanca",
> >                         "duration": "154"
> >                     },
> >                     {
> >                         "movieName": "Die Hard",
> >                         "duration": "127"
> >                     }
> >                 ]
> >             },
> >             {
> >                 "domainName": "Hindi",
> >                 "columns": [
> >                     {
> >                         "movieName": "DDLJ",
> >                         "duration": "176"
> >                     }
> >                 ]
> >             }
> >         ]
> >     }
> > }
> >
> >
> >
> > My function for indexing the object is as follows:
> >
> > public static void indexJSON(JSONObject jsonOBJ) throws ParseException,
> > IOException, SolrServerException {
> >     Collection<SolrInputDocument> batch = new
> > ArrayList<SolrInputDocument>();
> >
> >     SolrInputDocument mainEvent = new SolrInputDocument();
> >     mainEvent.addField("id", generateID());
> >     mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
> >     mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> >     mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> >     mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
> >     mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> >     mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> >
> >     Object obj = parser.parse(jsonOBJ.get("User").toString());
> >     JSONObject userObj = (JSONObject) obj;
> >
> >     SolrInputDocument childUserEvent = new SolrInputDocument();
> >     childUserEvent.addField("id", generateID());
> >     childUserEvent.addField("User", userObj.get("User"));
> >
> >     obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> >     JSONObject eventdescriptionObj = (JSONObject) obj;
> >
> >     SolrInputDocument childEventDescEvent = new SolrInputDocument();
> >     childEventDescEvent.addField("id", generateID());
> >     childEventDescEvent.addField("EventApplicationName",
> > eventdescriptionObj.get("EventApplicationName"));
> >     childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> >
> >     obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
> >     JSONArray informationArray = (JSONArray) obj;
> >
> >     for(int i = 0; i<informationArray.size(); i++){
> >         JSONObject domain = (JSONObject) informationArray.get(i);
> >
> >         SolrInputDocument domainDoc = new SolrInputDocument();
> >         domainDoc.addField("id", generateID());
> >         domainDoc.addField("domainName", domain.get("domainName"));
> >
> >         String s = domain.get("columns").toString();
> >         obj= JSONValue.parse(s);
> >         JSONArray ColumnsArray = (JSONArray) obj;
> >
> >         SolrInputDocument columnsDoc = new SolrInputDocument();
> >         columnsDoc.addField("id", generateID());
> >
> >         for(int j = 0; j<ColumnsArray.size(); j++){
> >             JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
> >             SolrInputDocument columnDoc = new SolrInputDocument();
> >             columnDoc.addField("id", generateID());
> >             columnDoc.addField("movieName", ColumnsObj.get("movieName"));
> >             columnsDoc.addChildDocument(columnDoc);
> >         }
> >         domainDoc.addChildDocument(columnsDoc);
> >         childEventDescEvent.addChildDocument(domainDoc);
> >     }
> >
> >     mainEvent.addChildDocument(childEventDescEvent);
> >     mainEvent.addChildDocument(childUserEvent);
> >     batch.add(mainEvent);
> >     solr.add(batch);
> >     solr.commit();
> > }
> >
> > When I try to index using the above code, I am able to index only 12
> > objects per second. Is there a faster way to do the indexing? I believe I
> > am using the json-fast parser, which is one of the fastest parsers for
> > JSON.
> >
> > Your help will be very valuable to me.
> >
> > Thanks,
> > Vineeth
>
