accumulo-user mailing list archives

From: Michael Wall <mjw...@gmail.com>
Subject: Re: empty tablet directories on HDFS
Date: Tue, 23 May 2017 14:20:24 GMT
Was your cluster with the batch writer done splitting and moving data? That
is a lot of splits that got generated. When a tablet is split, its files
are inspected and potentially assigned to both new tablets. Compacting that
range will rewrite the data into new files, one per tablet, so each rfile
contains only data for its tablet's range. Dave is suggesting a compaction
for that reason, as it will redistribute the data in the rfiles. Eventually,
it should get to the same state you saw with the bulk import test. For the
batch writer test and the 3 tablets with all the data, what does inspecting
the rfiles show you about the range of data in those?
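
To see an rfile's key range, something like the following works on 1.8; a
rough sketch against the RFile client API that shipped in 1.8.0, where the
class name and HDFS path are placeholders (substitute a file from one of
the big tablets' directories):

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RFileKeyRange {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Placeholder path; substitute an rfile from one of the 130GB tablets.
    String rfile = "/accumulo/tables/1/t-0000000/F0000000.rf";
    Scanner scanner = RFile.newScanner().from(rfile).withFileSystem(fs).build();
    Key first = null;
    Key last = null;
    // Full scan: fine for a spot check, but slow on very large files.
    for (Entry<Key,Value> entry : scanner) {
      if (first == null) {
        first = new Key(entry.getKey());
      }
      last = new Key(entry.getKey());
    }
    scanner.close();
    System.out.println("first key: " + first);
    System.out.println("last key:  " + last);
  }
}

The rfile-info utility (accumulo rfile-info <file>) will also dump an
rfile's metadata and index information if you want to avoid a full scan.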

Did you create 3 splits when you bulk imported, or did you create the 1.01K
splits?
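
If it helps, pre-splitting a table before a bulk import is just an
addSplits call; a minimal sketch, assuming an existing Connector named
conn and hypothetical shard prefixes:

import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class PreSplitTable {
  // Hypothetical shard prefixes; 3 split points yield 4 tablets.
  static void addShardSplits(Connector conn, String table) throws Exception {
    SortedSet<Text> splits = new TreeSet<>();
    splits.add(new Text("1"));
    splits.add(new Text("2"));
    splits.add(new Text("3"));
    conn.tableOperations().addSplits(table, splits);
  }
}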

On Tue, May 23, 2017 at 8:51 AM Massimilian Mattetti <MASSIMIL@il.ibm.com>
wrote:

> Sorry Dave, but I don't get what you mean by "get distributed". Running a
> compaction from the shell will create one file per tablet; there is no
> data repartitioning involved in this process.
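
For reference, the shell's compact command corresponds to the following
call in the Java API; a minimal illustrative sketch, assuming an existing
Connector named conn:

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class CompactTable {
  static void compactAll(Connector conn, String table) throws Exception {
    // null start/end rows cover the whole table; flush memory first and
    // block until done. Each tablet rewrites its assigned data into a
    // single new rfile for that tablet; rows never move between tablets.
    conn.tableOperations().compact(table, (Text) null, (Text) null, true, true);
  }
}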
>
>
>
>
> From:        Dave Marion <dlmarion@comcast.net>
> To:        user@accumulo.apache.org
> Date:        23/05/2017 15:10
> Subject:        Re: empty tablet directories on HDFS
> ------------------------------
>
>
>
> Does the data get distributed if you compact the table?
>
> On May 23, 2017 at 5:04 AM Massimilian Mattetti <MASSIMIL@il.ibm.com>
> wrote:
>
> Hi all,
>
> I created a table with 3 initial split points (I used a sharding
> mechanism to evenly distribute the data across them) and started
> ingesting data using the batch writer API. At the end of the ingestion
> process I had around 1.01K tablets (the split threshold was set to 1GB)
> for a total of 600GB of space on HDFS (measured by running hadoop fs -du
> -h on the table directory). Digging into the table directory on HDFS, I
> noticed that around 700 tablet directories (names starting with t-) are
> empty, another 300 tablets hold around 1GB or less of data each, and 3
> tablets (default_tablet included) contain 130GB of data each. Is this
> normal behavior? (I am working with a cluster of 3 servers running
> Accumulo 1.8.1.)
>
> I also ran another experiment, importing the same data into a different
> table configured the same way as the previous one, but this time using
> bulk import. This table ended up with no empty tablets, although most of
> them contain only a few MBs of data, and the final space on HDFS was
> around 450GB. What could be the reason for such a big difference in
> on-disk space between the batch writer API and bulk import?
> Thanks.
>
> Best Regards,
> Max
>
>
>
