accumulo-user mailing list archives

From "Massimilian Mattetti" <MASSI...@il.ibm.com>
Subject Re: empty tablet directories on HDFS
Date Tue, 23 May 2017 14:54:24 GMT
I have just started a compaction, it will take a while to complete.

"When a tablet is split, the files are inspected and potentially assigned 
to both new tablets"
I thought about this, so in my case the 3 directories containing 130GB of 
data (divided among 500 files) are actually holding the data of all the 
other tablets whose directories are empty. Am I right?

"what does inspecting the rfile show you about the range of data in those?"
I am using accumulo rfile-info to inspect the files, but it does not tell 
me whether a file is shared among different tablets. Am I missing 
something?
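If I understand the tool correctly, rfile-info does report the first and last key of each file, and comparing that key range against the tablet's split points should show whether a file spans more than one tablet. A minimal sketch (the table id, tablet directory, and file name below are made up; substitute paths from your own table directory):

```shell
# Hypothetical path -- replace with a real table id / tablet dir / rfile.
# rfile-info prints per-file metadata; the first and last keys show the
# range of data the file actually covers.
accumulo rfile-info hdfs://namenode/accumulo/tables/2/t-0000070/F0000123.rf
```

If the first and last keys straddle a split point, the file is shared by the tablets on both sides of it.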

"Did you create 3 splits when you bulk imported?  Or did you create the 
1.01 splits?"
I created 3 splits for the bulk ingestion too. A separate file is written 
and then imported for each split point.
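For reference, the per-split import step looks roughly like this in the shell (the table and directory names are placeholders, not my real paths):

```shell
# In the Accumulo shell; directory names are hypothetical.
# importdirectory moves the pre-written rfiles in the source dir into the
# current table, writing any files it cannot assign to the failures dir.
root@instance> table mytable
root@instance mytable> importdirectory /tmp/bulk/split1 /tmp/bulk/fail1 true
```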

Thanks.

Max



From:   Michael Wall <mjwall@gmail.com>
To:     user@accumulo.apache.org
Date:   23/05/2017 17:20
Subject:        Re: empty tablet directories on HDFS



Was your cluster with the batch writer done splitting and moving data?  
That is a lot of splits that got generated.  When a tablet is split, the 
files are inspected and potentially assigned to both new tablets.  
Compacting that range will rewrite the data into files for each tablet so 
rfiles contain only data for their range.  Dave is suggesting a compaction 
for that reason, as it will redistribute the data in the rfiles.  
Eventually, it should get to the same state you saw with the bulk import 
test.  For the batch writer test and the 3 tablets with all the data, what 
does inspecting the rfiles show you about the range of data in those?
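Concretely, that compaction can be kicked off from the shell; a sketch with a hypothetical table name (-w just makes the command wait until the compaction finishes):

```shell
# Compact the whole (hypothetical) table; afterwards each rfile should
# contain only data within its own tablet's range.
root@instance> compact -t mytable -w
```

You could also bound it with -b/-e to compact only the rows covered by the three big tablets.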

Did you create 3 splits when you bulk imported?  Or did you create the 
1.01K splits?

On Tue, May 23, 2017 at 8:51 AM Massimilian Mattetti <MASSIMIL@il.ibm.com> 
wrote:
Sorry Dave, but I don't get what you mean by "get distributed". Running a 
compaction from the shell will create one file per tablet; there is no 
data repartitioning involved in this process.




From:        Dave Marion <dlmarion@comcast.net>
To:        user@accumulo.apache.org
Date:        23/05/2017 15:10
Subject:        Re: empty tablet directories on HDFS



Does the data get distributed if you compact the table?
On May 23, 2017 at 5:04 AM Massimilian Mattetti <MASSIMIL@il.ibm.com> 
wrote:

Hi all,

I created a table with 3 initial split points (I used a sharding 
mechanism to evenly distribute the data among them) and started 
ingesting data using the batch writer API. At the end of the ingestion 
process I had around 1.01K tablets (the threshold for splitting was set to 
1GB) for a total of 600GB of space on HDFS (measured using the command 
hadoop fs -du -h on the table directory). Digging into the table directory 
on HDFS, I noticed that there are around 700 tablet directories (starting 
with t-) that are empty, another 300 tablets that hold around 1GB or less 
of data, and 3 tablets (default_tablet included) containing 130GB of data 
each. Is this normal behavior? (I am working with a cluster of 3 
servers running Accumulo 1.8.1.)
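In case it helps to reproduce what I am seeing, the per-tablet file assignments can be checked by scanning the metadata table (the table id 2 below is hypothetical; the real id comes from the tables command):

```shell
# List the rfiles each tablet of (hypothetical) table id 2 references.
# An empty t-* directory can still back a live tablet whose entries
# point at files under another tablet's directory.
root@instance> scan -t accumulo.metadata -b 2; -e 2< -c file
```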

I also ran another experiment, importing the same data into a different 
table configured the same way as the previous one, but this time using 
bulk import. For this table I did not end up with empty tablets, although 
most of them contain only a few MBs, and the final space on HDFS was 
around 450GB. What can be the reason for such a big difference in disk 
space between the batch writer API and bulk import?
Thanks.

Best Regards,
Max




