hadoop-common-issues mailing list archives

From "Edward Nevill (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
Date Tue, 03 Mar 2015 19:34:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345588#comment-14345588
] 

Edward Nevill commented on HADOOP-11660:
----------------------------------------

On 3 March 2015 at 18:38, Colin Patrick McCabe (JIRA) <jira@apache.org> wrote:


The USE_PIPELINED option optimises CRC on x86 because x86 has 3 CRC units
and can therefore essentially calculate 3 CRCs in parallel. The x86 loop
does

      while (likely(counter)) {
        __asm__ __volatile__(
        "crc32q (%7), %0;\n\t"
        "crc32q (%7,%6,1), %1;\n\t"
        "crc32q (%7,%6,2), %2;\n\t"
         : "=r"(c1), "=r"(c2), "=r"(c3)
         : "0"(c1), "1"(c2), "2"(c3), "r"(block_size), "r"(data)
        );
        data++;
        counter--;
      }

So, the 2nd crc32q executes in the shadow of the 1st (i.e. it executes in
what would be a result delay from the 1st crc32q in any case). Similarly,
the 3rd crc32q executes in the shadow of the 1st and 2nd.
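
As a rough illustration of the shape of that loop (this is not the Hadoop
code), the sketch below replaces each crc32q with a software, bit-at-a-time
CRC32C step that consumes one byte rather than eight. The names crc32c_step,
crc32c_block and crc32c_three_blocks are made up for this example. The only
point is that the three running CRCs never feed into each other, which is
the property the x86 loop relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Software stand-in for one hardware crc32 step: fold one byte into crc. */
static uint32_t crc32c_step(uint32_t crc, uint8_t byte) {
  crc ^= byte;
  for (int bit = 0; bit < 8; bit++)
    crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
  return crc;
}

/* One block processed on its own (the non-pipelined shape). */
static uint32_t crc32c_block(const uint8_t *data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++)
    crc = crc32c_step(crc, data[i]);
  return ~crc;
}

/* Three consecutive blocks of block_size bytes, processed in lockstep.
 * c1, c2 and c3 are independent of each other, so nothing in the data
 * flow forces one step to wait for another. */
static void crc32c_three_blocks(const uint8_t *data, size_t block_size,
                                uint32_t out[3]) {
  assert(data != NULL && out != NULL);
  uint32_t c1 = 0xFFFFFFFFu, c2 = 0xFFFFFFFFu, c3 = 0xFFFFFFFFu;
  for (size_t i = 0; i < block_size; i++) {
    c1 = crc32c_step(c1, data[i]);
    c2 = crc32c_step(c2, data[i + block_size]);
    c3 = crc32c_step(c3, data[i + 2 * block_size]);
  }
  out[0] = ~c1;
  out[1] = ~c2;
  out[2] = ~c3;
}
```

Each entry of out matches what crc32c_block would return for the
corresponding block, so interleaving changes only the scheduling, not the
results.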

On aarch64 a crc32 takes 3 exec cycles, but it only has one crc unit, so if
we did the same on aarch64 and had say

        crc32   w0, w0, x3
        crc32   w1, w1, x4
        crc32   w2, w2, x5

The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete
and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so
there is no benefit to pipelining.

I did some tests with pipelining on aarch64 and it worked out marginally
slower than the non-pipelined version.

Note that the pipelined_crc32 is within a #ifdef x86 section of code so
from that point of view it is architecture specific and if I enabled it I
would then have to redundantly implement the pipelined version on aarch64.



No, because not all aarch64 HW has the CRC instructions. So we need all
three:

crc32c_hardware - when called to do a Castagnoli CRC on a CPU with HW support
crc32_zlib_hardware - when called to do a zlib CRC on a CPU with HW support
crc32_zlib_sb8 - when called to do a zlib CRC on a CPU without HW support

The problem is that the existing code assumes that HW support means only HW
support for the Castagnoli CRC.
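
To make the Castagnoli/zlib distinction concrete, here is a minimal
bit-at-a-time sketch in which the two CRCs differ only in the polynomial
passed in. It is not the slicing-by-8 method that the _sb8 suffix refers to,
and crc32_bitwise and the two POLY_ macros are names invented for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected forms of the two polynomials (hypothetical macro names). */
#define POLY_CASTAGNOLI_REFLECTED 0x82F63B78u /* CRC32C */
#define POLY_ZLIB_REFLECTED       0xEDB88320u /* CRC32 as used by zlib */

/* Bit-at-a-time CRC update. Pass crc = 0 for a fresh checksum, or the
 * previous return value to continue over more data. */
static uint32_t crc32_bitwise(uint32_t poly, uint32_t crc,
                              const uint8_t *data, size_t length) {
  assert(data != NULL || length == 0);
  crc = ~crc;
  for (size_t i = 0; i < length; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
  }
  return ~crc;
}
```

For the standard check string "123456789" this gives 0xCBF43926 with the
zlib polynomial and 0xE3069283 with the Castagnoli one, so a CPU that only
accelerates one of them cannot be used for the other.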

The way to get around the #ifdef __aarch64__ is to add support for zlib CRC
on all archs. This would mean adding code to the x86 side to say that zlib
CRC is not available.

So what this would involve is:

- On the x86 side add a dummy definition of crc32_zlib_hardware which will
never get called but is just there to satisfy the reference. It would have
an assert to ensure it is never called. This is much like the existing
dummy crc32c_hardware at the very end.

static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data,
                                size_t length) {
  // never called!
  assert(0 && "hardware crc called on an unsupported platform");
  return 0;
}

- Add an additional variable cached_cpu_supports_crc32_zlib to complement
the existing cached_cpu_supports_crc32 variable.

- In the init section for x86

 void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  uint32_t ecx = cpuid(CPUID_FEATURES);
  cached_cpu_supports_crc32 = ecx & SSE42_FEATURE_BIT;
}

add an initialisation

  cached_cpu_supports_crc32_zlib = 0;

(alternatively this could be left out and rely on the static initialisation
of 0).

- In the init section for aarch64

 void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  unsigned long auxv = getauxval(AT_HWCAP);
  cached_cpu_supports_crc32 = auxv & HWCAP_CRC32;
}

add an initialisation

  cached_cpu_supports_crc32_zlib = cached_cpu_supports_crc32;

(if CRC is supported on aarch64 both variants must be supported).

- Then instead of

    case CRC32_ZLIB_POLYNOMIAL:
      crc_update_func = crc32_zlib_sb8;
#ifdef __aarch64__
      if (likely(cached_cpu_supports_crc32))
        crc_update_func = crc32_zlib_hardware;
#endif

we do

    case CRC32_ZLIB_POLYNOMIAL:
      crc_update_func = crc32_zlib_sb8;
      if (likely(cached_cpu_supports_crc32_zlib))
        crc_update_func = crc32_zlib_hardware;

This would have the advantage that there is a generic framework for
architectures to add HW support for one, both, or neither of the CRC
variants without additional #ifdefs.
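
Putting the steps above together, the selection logic might look roughly
like the sketch below. It is only an outline under some assumptions: the
enum values, the select_crc_update wrapper, the crc_update_func_t typedef
and the crc32c_sb8 name are invented here, and all four CRC functions are
placeholders that assert if called, since only their addresses matter for
showing the dispatch. The flags are set by hand rather than by the
per-architecture init functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc_update_func_t)(uint32_t, const uint8_t *, size_t);

/* Hypothetical selector values; the real constants may differ. */
enum { CRC32C_POLYNOMIAL = 1, CRC32_ZLIB_POLYNOMIAL = 2 };

/* One flag per CRC variant. Static storage means both start at 0. */
static int cached_cpu_supports_crc32;      /* Castagnoli CRC in hardware */
static int cached_cpu_supports_crc32_zlib; /* zlib CRC in hardware */

/* Placeholder bodies: a real build supplies these per architecture. */
#define PLACEHOLDER_CRC(name)                                          \
  static uint32_t name(uint32_t crc, const uint8_t *data, size_t n) { \
    (void)crc; (void)data; (void)n;                                    \
    assert(0 && #name ": placeholder, not implemented in this sketch"); \
    return 0;                                                          \
  }
PLACEHOLDER_CRC(crc32c_sb8)
PLACEHOLDER_CRC(crc32_zlib_sb8)
PLACEHOLDER_CRC(crc32c_hardware)
PLACEHOLDER_CRC(crc32_zlib_hardware)

/* Pick the update function: software by default, hardware when the
 * matching flag is set. No architecture #ifdef is needed here. */
static crc_update_func_t select_crc_update(int checksum_type) {
  crc_update_func_t crc_update_func = NULL;
  switch (checksum_type) {
    case CRC32_ZLIB_POLYNOMIAL:
      crc_update_func = crc32_zlib_sb8;
      if (cached_cpu_supports_crc32_zlib)
        crc_update_func = crc32_zlib_hardware;
      break;
    case CRC32C_POLYNOMIAL:
      crc_update_func = crc32c_sb8;
      if (cached_cpu_supports_crc32)
        crc_update_func = crc32c_hardware;
      break;
    default:
      break; /* unknown type: caller sees NULL */
  }
  return crc_update_func;
}
```

With cached_cpu_supports_crc32 set and cached_cpu_supports_crc32_zlib clear
(the x86 case) the zlib request falls back to crc32_zlib_sb8, while with
both set (the aarch64 case) both requests go to hardware.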

So this would get rid of 2 #ifdef __aarch64__ statements, leaving just the
#ifdef around USE_PIPELINED. I do feel that adding a significant amount of
code to the aarch64 side to implement a pipelined version which offers no
performance advantage would be undesirable.

But I'm happy to do whichever you feel is best. Just let me know.

All the best,
Ed.





> Add support for hardware crc on ARM aarch64 architecture
> --------------------------------------------------------
>
>                 Key: HADOOP-11660
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11660
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: native
>    Affects Versions: 3.0.0
>         Environment: ARM aarch64 development platform
>            Reporter: Edward Nevill
>            Assignee: Edward Nevill
>            Priority: Minor
>              Labels: performance
>             Fix For: 3.0.0
>
>         Attachments: jira-11660.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This patch adds support for hardware crc for ARM's new 64 bit architecture.
> The patch is completely conditionalized on __aarch64__.
> I have only added support for the non-pipelined version as I benchmarked the pipelined
> version on aarch64 and it showed no performance improvement.
> The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported
> on ARM aarch64 hardware.
> To benchmark this I modified the test_bulk_crc32 test to print out the time taken to
> CRC a 1MB dataset 1000 times.
> Before:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> After:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> So this represents a roughly 4.5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
