From: Andy Seaborne
Date: Thu, 30 Jun 2011 12:27:20 +0100
To: jena-dev@incubator.apache.org
Subject: Re: BulkLoader error with large data and fast harddrive

On 29/06/11 00:36, jp wrote:
> tdbloader2 was able to load the file. Log can be found here
> http://www.kosmyna.com/tdbloader2.log
> I guess the questions now are: what's the difference between tdbloader2
> and the test application? And why does tdbloader fail?

tdbloader also works for me on my SSD. The test application uses the pure
Java loader - tdbloader is a command-line wrapper around that same loader.
tdbloader2 is different.

In its data phase, tdbloader loads the node table (one index, one object
file) and the primary triple index (SPO - also an index file), so it has
two indexes open, each of which is a pair of memory-mapped files.
tdbloader2 only loads the node table during its first phase, so only one
index (two memory-mapped files) is open.

The only difference seems to be the complexity of the memory management
needed. SystemTDB.setFileMode(FileMode.direct); changes the file management
to internal caching rather than memory-mapped files.
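For reference, here is roughly what that looks like in a self-contained
program - a minimal sketch only: the store directory is illustrative and
the imports assume the TDB 0.8.x package layout.

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.tdb.TDBFactory;
  import com.hp.hpl.jena.tdb.base.file.FileMode;
  import com.hp.hpl.jena.tdb.sys.SystemTDB;

  public class DirectModeLoad {
      public static void main(String[] args) {
          // Must run before any TDB dataset is created or opened;
          // otherwise the default memory-mapped mode is already in effect.
          SystemTDB.setFileMode(FileMode.direct);

          // Illustrative location - point this at your own store directory.
          Dataset dataset = TDBFactory.createDataset("/tmp/tdb-direct-test");
          try {
              // ... load data here (BulkLoader, Model.read, etc.) ...
          } finally {
              dataset.close();
          }
      }
  }

Give it a generous heap (e.g. java -Xmx2g), since in direct mode the block
caches are held on the Java heap rather than in mapped files.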
Unfortunately, the built-in cache sizes are set for a 32-bit machine, so it
is slower anyway (32-bit TDB does not use memory-mapped files because of the
limitations of Java addressing). SystemTDB.BlockWriteCacheSize and
SystemTDB.BlockReadCacheSize are the key settings.

That's the only difference I can think of. It really does seem that it's not
a TDB code thing. The code is single-threaded because the bottlenecks are
memory and persistent-storage I/O, not CPU.

My SSD setup is currently only used for TDB testing:

  /dev/sdb1  /mnt/ssd1  ext4  noatime  0  2

which looks close enough to your

  /dev/sda5  /  ext4  rw,errors=remount-ro,commit=0  0  0

I have the SSD configured as a secondary disk, so I don't care about atime
(or diratime). I couldn't find any clues on the web with regard to mmap and
SSDs (commit=0 is the same as commit=5, which is the default, if I
understand the manual correctly).

You could try increasing the size of the direct-mode caches and the heap if
you need to load within an app. Note that if there is data already in the
store, no special loading is done - triples are just added one at a time and
indexed each time. If you can bulk load using tdbloader2, then that is one
option.

Other than that, I'm at a loss as to how to explain what you're seeing.

	Andy

> -jp
>
> On Tue, Jun 28, 2011 at 7:21 PM, jp wrote:
>> Sorry for any confusion - tdbloader2 is working fine; I had a typo in my
>> $PATH variable. I'll post results of the load asap.
>>
>> -jp
>>
>> On Tue, Jun 28, 2011 at 7:02 PM, jp wrote:
>>> The complete log file is over 13 GB. I have posted the first 5000 lines
>>> here: http://www.kosmyna.com/ReportLoadOnSSD.log.5000lines
>>> The run of tdbloader failed as well; the first 5000 lines can be found
>>> here: http://www.kosmyna.com/tdbloader.log.5000lines
>>>
>>> I could not run tdbloader2 - I get the following error:
>>> ./tdbloader2: line 14: make_classpath: command not found
>>>
>>> I have the TDBROOT environment variable correctly set and am using this
>>> version of TDB:
>>> http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/tags/TDB-0.8.10/bin
>>>
>>> -jp
>>>
>>> On Tue, Jun 28, 2011 at 4:30 PM, Andy Seaborne wrote:
>>>>> Aside from shipping you my laptop is there anything I can provide you
>>>>> with to help track down the issue?
>>>>
>>>> A complete log, with the exception, would help to identify the point
>>>> where it fails. It's a possible clue.
>>>>
>>>> Could you also try running tdbloader and tdbloader2 to bulk load the
>>>> files?
>>>>
>>>> Andy
>>>>
>>>> On 28/06/11 21:19, jp wrote:
>>>>>
>>>>> Hey Andy,
>>>>>
>>>>> Saw the twitter message - a 29% load speed increase is pretty nice.
>>>>> Glad I could give you the excuse to upgrade :) Though it worries me
>>>>> that you don't receive the same exception I do. I consistently have
>>>>> loading issues using the file posted at
>>>>> http://www.kosmyna.com/mappingbased_properties_en.nt.bz2. I can get
>>>>> the test program to complete by making the following changes, but it's
>>>>> slow (30 minutes).
>>>>>
>>>>> SystemTDB.setFileMode(FileMode.direct) ;
>>>>>
>>>>> if ( true ) {
>>>>>     String dir = "/home/jp/scratch/ssdtest/DB-X" ;
>>>>>     FileOps.clearDirectory(dir) ;
>>>>>     datasetGraph = TDBFactory.createDatasetGraph(dir);
>>>>> }
>>>>>
>>>>> Running the program with the sections of code below fails every time.
>>>>>
>>>>> //SystemTDB.setFileMode(FileMode.direct) ;
>>>>>
>>>>> if ( true ) {
>>>>>     String dir = "/home/jp/scratch/ssdtest/DB-X" ;
>>>>>     FileOps.clearDirectory(dir) ;
>>>>>     datasetGraph = TDBFactory.createDatasetGraph(dir);
>>>>> }
>>>>>
>>>>> The exception:
>>>>> java.lang.IllegalArgumentException
>>>>>   at java.nio.Buffer.position(Buffer.java:235)
>>>>>   at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:94)
>>>>>   at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:95)
>>>>>   at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:41)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeRecords.getSplitKey(BPTreeRecords.java:141)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.split(BPTreeNode.java:435)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:387)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
>>>>>   at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
>>>>>   at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
>>>>>   at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
>>>>>   at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
>>>>>   at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
>>>>> http://dbpedia.org/resource/Spirea_X
>>>>> http://dbpedia.org/ontology/associatedBand
>>>>> http://dbpedia.org/resource/Adventures_in_Stereo
>>>>>
>>>>> If I continue to let it run I start seeing this error as well:
>>>>> com.hp.hpl.jena.tdb.TDBException: No known block type for 4
>>>>>   at com.hp.hpl.jena.tdb.base.block.BlockType.extract(BlockType.java:64)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.getType(BPTreeNodeMgr.java:166)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$200(BPTreeNodeMgr.java:22)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:136)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
>>>>>   at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
>>>>>   at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
>>>>>   at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
>>>>>   at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
>>>>>   at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
>>>>>   at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
>>>>>   at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
>>>>>   at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
>>>>>
>>>>> Aside from shipping you my laptop, is there anything I can provide you
>>>>> with to help track down the issue? I am comfortable building TDB from
>>>>> source and setting conditional breakpoints while debugging if that can
>>>>> be of any benefit.
>>>>>
>>>>> Thanks for your help.
>>>>> -jp
>>>>>
>>>>> On Tue, Jun 28, 2011 at 7:17 AM, Andy Seaborne wrote:
>>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I now have an SSD (256G from Crucial) :-)
>>>>>>
>>>>>> /dev/sdb1 on /mnt/ssd1 type ext4 (rw,noatime)
>>>>>>
>>>>>> and I ran the test program on jamendo-rdf and on
>>>>>> mappingbased_properties_en.nt, then on jamendo-rdf with existing data
>>>>>> as in the test case.
>>>>>>
>>>>>> Everything works for me - the loads complete without an exception.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 21/06/11 09:10, Andy Seaborne wrote:
>>>>>>>
>>>>>>> On 21/06/11 06:01, jp wrote:
>>>>>>>>
>>>>>>>> Hey Andy
>>>>>>>>
>>>>>>>> I wasn't able to unzip the file
>>>>>>>> http://people.apache.org/~andy/jamendo.nt.gz
>>>>>>>> However, I ran it on my dataset and I received an out-of-memory
>>>>>>>> exception. I then changed line 42 to true and received the original
>>>>>>>> error. You can download the data file I have been testing with from
>>>>>>>> http://www.kosmyna.com/mappingbased_properties_en.nt.bz2
>>>>>>>> Unzipped, it's 2.6 GB. This file has consistently failed to load.
>>>>>>>
>>>>>>> downloads.dbpedia.org is back - I downloaded that file and loaded it
>>>>>>> with the test program - no problems.
>>>>>>>
>>>>>>>> While trying other datasets and variations of the simple program I
>>>>>>>> had what seemed to be a successful BulkLoad; however, when I opened
>>>>>>>> the dataset and tried to query it there were no results. I don't
>>>>>>>> have the exact details of this run but can try to reproduce it if
>>>>>>>> you think it would be useful.
>>>>>>>
>>>>>>> Yes please. At this point, any details help.
>>>>>>>
>>>>>>> Also, a complete log of the failed load of
>>>>>>> mappingbased_properties_en.nt.bz2 would be useful.
>>>>>>>
>>>>>>> Having looked at the stacktraces and aligned them with the source
>>>>>>> code, it appears the code passes an internal consistency check, then
>>>>>>> fails on something that the check tests for.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>>
>>>>>>>> -jp
>>>>>>>>
>>>>>>>> On Mon, Jun 20, 2011 at 4:57 PM, Andy Seaborne wrote:
>>>>>>>>>
>>>>>>>>> Fixed - sorry about that.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On 20/06/11 21:50, jp wrote:
>>>>>>>>>>
>>>>>>>>>> Hey Andy,
>>>>>>>>>>
>>>>>>>>>> I assume the file you want me to run is
>>>>>>>>>> http://people.apache.org/~andy/ReportLoadOnSSD.java
>>>>>>>>>>
>>>>>>>>>> When I try to download it I get a permissions error. Let me know
>>>>>>>>>> when I should try again.
>>>>>>>>>>
>>>>>>>>>> -jp
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 20, 2011 at 3:30 PM, Andy Seaborne wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> I tried to recreate this but couldn't - I don't have an SSD to
>>>>>>>>>>> hand at the moment (being fixed :-)
>>>>>>>>>>>
>>>>>>>>>>> I've put my test program and the data from the jamendo-rdf you
>>>>>>>>>>> sent me in:
>>>>>>>>>>>
>>>>>>>>>>> http://people.apache.org/~andy/
>>>>>>>>>>>
>>>>>>>>>>> so we can agree on an exact test case. This code is single
>>>>>>>>>>> threaded.
>>>>>>>>>>>
>>>>>>>>>>> The conversion from .rdf to .nt wasn't pure.
>>>>>>>>>>>
>>>>>>>>>>> I tried running using the in-memory store as well.
>>>>>>>>>>> downloads.dbpedia.org was down at the weekend - I'll try to get
>>>>>>>>>>> the same dbpedia data.
>>>>>>>>>>>
>>>>>>>>>>> Could you run exactly what I was running? The file name needs
>>>>>>>>>>> changing.
>>>>>>>>>>>
>>>>>>>>>>> You can also try uncommenting
>>>>>>>>>>> SystemTDB.setFileMode(FileMode.direct) ;
>>>>>>>>>>> and run it using non-mapped files in about 1.2 G of heap.
>>>>>>>>>>>
>>>>>>>>>>> Looking through the stacktrace, there is a point where the code
>>>>>>>>>>> has passed an internal consistency test and then fails with
>>>>>>>>>>> something that should be caught by that test - and the code is
>>>>>>>>>>> sync'ed or single threaded. This is, to put it mildly, worrying.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On 18/06/11 16:38, jp wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hey Andy,
>>>>>>>>>>>>
>>>>>>>>>>>> My entire program is run in one JVM as follows.
>>>>>>>>>>>>
>>>>>>>>>>>> public static void main(String[] args) throws IOException {
>>>>>>>>>>>>     DatasetGraphTDB datasetGraph =
>>>>>>>>>>>>         TDBFactory.createDatasetGraph(tdbDir);
>>>>>>>>>>>>
>>>>>>>>>>>>     /* I saw the BulkLoader had two ways of loading data based
>>>>>>>>>>>>        on whether the dataset already existed. I did two runs,
>>>>>>>>>>>>        one with the following two lines commented out, to test
>>>>>>>>>>>>        both ways the BulkLoader runs. Hopefully this had the
>>>>>>>>>>>>        desired effect.
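>>>>>>>>>>>>
>>>>>>>>>>>>        (If I read the BulkLoader behaviour right - this is just
>>>>>>>>>>>>        my understanding, not checked against its source - an
>>>>>>>>>>>>        empty dataset takes the bulk-load path, while a store
>>>>>>>>>>>>        that already holds data simply has triples added and
>>>>>>>>>>>>        indexed one at a time; hence the two runs.)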
>>>>>>>>>>>>     */
>>>>>>>>>>>>     datasetGraph.getDefaultGraph().add(new
>>>>>>>>>>>>         Triple(Node.createURI("urn:hello"), RDF.type.asNode(),
>>>>>>>>>>>>                Node.createURI("urn:house")));
>>>>>>>>>>>>     datasetGraph.sync();
>>>>>>>>>>>>
>>>>>>>>>>>>     InputStream inputStream = new FileInputStream(dbpediaData);
>>>>>>>>>>>>
>>>>>>>>>>>>     BulkLoader bulkLoader = new BulkLoader();
>>>>>>>>>>>>     bulkLoader.loadDataset(datasetGraph, inputStream, true);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> The data can be found here:
>>>>>>>>>>>> http://downloads.dbpedia.org/3.6/en/mappingbased_properties_en.nt.bz2
>>>>>>>>>>>> I appended the ontology to the end of the file; it can be found
>>>>>>>>>>>> here:
>>>>>>>>>>>> http://downloads.dbpedia.org/3.6/dbpedia_3.6.owl.bz2
>>>>>>>>>>>>
>>>>>>>>>>>> The tdbDir is an empty directory.
>>>>>>>>>>>> On my system the error starts occurring after about 2-3 minutes
>>>>>>>>>>>> and 8-12 million triples loaded.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for looking over this, and please let me know if I can be
>>>>>>>>>>>> of further assistance.
>>>>>>>>>>>>
>>>>>>>>>>>> -jp
>>>>>>>>>>>> jp@nimblegraph.com
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 17, 2011 9:29 am, andy wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> jp,
>>>>>>>>>>>>>
>>>>>>>>>>>>> How does this fit with running:
>>>>>>>>>>>>>
>>>>>>>>>>>>> datasetGraph.getDefaultGraph().add(new
>>>>>>>>>>>>>     Triple(Node.createURI("urn:hello"), RDF.type.asNode(),
>>>>>>>>>>>>>            Node.createURI("urn:house")));
>>>>>>>>>>>>> datasetGraph.sync();
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the preload of one triple a separate JVM or the same JVM as
>>>>>>>>>>>>> the BulkLoader call - could you provide a single complete
>>>>>>>>>>>>> minimal example?
>>>>>>>>>>>>>
>>>>>>>>>>>>> In attempting to reconstruct this, I don't want to hide the
>>>>>>>>>>>>> problem by guessing how things are wired together.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also - exactly which dbpedia file are you loading (URL?),
>>>>>>>>>>>>> although I doubt the exact data is the cause here.