모두의 코드
VDBPSADBW (Intel x86/64 assembly instruction)

Published: 2020-09-01. This post has been read 689 times.

VDBPSADBW

Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

Note

For how to interpret the table below, please refer to the article "How to Read the x86-64 Instruction Reference".

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.NDS.128.66.0F3A.W0 42 /r ib
VDBPSADBW xmm1 {k1}{z}, xmm2, xmm3/m128, imm8

FVM

V/V

AVX512VL
AVX512BW

Compute packed SAD word results of unsigned bytes in dword block from xmm2 with unsigned bytes of dword blocks transformed from xmm3/m128 using the shuffle controls in imm8. Results are written to xmm1 under the writemask k1.

EVEX.NDS.256.66.0F3A.W0 42 /r ib
VDBPSADBW ymm1 {k1}{z}, ymm2, ymm3/m256, imm8

FVM

V/V

AVX512VL
AVX512BW

Compute packed SAD word results of unsigned bytes in dword block from ymm2 with unsigned bytes of dword blocks transformed from ymm3/m256 using the shuffle controls in imm8. Results are written to ymm1 under the writemask k1.

EVEX.NDS.512.66.0F3A.W0 42 /r ib
VDBPSADBW zmm1 {k1}{z}, zmm2, zmm3/m512, imm8

FVM

V/V

AVX512BW

Compute packed SAD word results of unsigned bytes in dword block from zmm2 with unsigned bytes of dword blocks transformed from zmm3/m512 using the shuffle controls in imm8. Results are written to zmm1 under the writemask k1.

Instruction Operand Encoding

Op/En Operand 1 Operand 2 Operand 3 Operand 4

FVM ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) Imm8

Description

Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword elements. Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results in each 64-bit superblock of the destination register.

Within each super block of packed word results, the SAD results from two 32-bit dword elements are calculated as follows:

  • The lower two word results are each calculated from the SAD operation between a sliding dword element within a qword superblock of an intermediate vector and a stationary dword element in the corresponding qword superblock of the first source operand. The intermediate vector (see "Tmp1" in Figure 5-8) is constructed from the second source operand, using the imm8 byte as shuffle control to select dword elements within a 128-bit lane of the second source operand. The two sliding dword elements in a qword superblock of Tmp1 are located at byte offsets 0 and 1 within the superblock, respectively. The stationary dword element in the qword superblock from the first source operand is located at byte offset 0.

  • The next two word results are each calculated from the SAD operation between a sliding dword element within a qword superblock of the intermediate vector Tmp1 and a second stationary dword element in the corresponding qword superblock of the first source operand. The two sliding dword elements in a qword superblock of Tmp1 are located at byte offsets 2 and 3 within the superblock, respectively. The stationary dword element in the qword superblock from the first source operand is located at byte offset 4.

  • The intermediate vector is constructed in 128-bit lanes. Within each 128-bit lane, each dword element of the intermediate vector is selected by a two-bit field within the imm8 byte from the corresponding 128 bits of the second source operand. The imm8 byte serves as dword shuffle control within each 128-bit lane of the intermediate vector and the second source operand, similarly to PSHUFD.
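The three steps above can be sketched as a scalar C model of a single 128-bit lane. This is only an illustration of the described behavior, not Intel's implementation: the function names `dbsad_shuffle_lane`, `sad4`, and `dbsad_lane` are hypothetical, and bytes are assumed to be stored in little-endian order (byte 0 is bits 7:0 of the lane).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build one 128-bit lane of Tmp1.
   Dword i of tmp1 is the dword of src2 selected by imm8 bits [2*i+1 : 2*i],
   the same control layout PSHUFD uses. */
void dbsad_shuffle_lane(const uint8_t src2[16], uint8_t imm8, uint8_t tmp1[16])
{
    for (int i = 0; i < 4; i++) {
        int sel = (imm8 >> (2 * i)) & 3;
        memcpy(&tmp1[4 * i], &src2[4 * sel], 4);
    }
}

/* Hypothetical helper: sum of absolute differences of 4 unsigned bytes. */
uint16_t sad4(const uint8_t *a, const uint8_t *b)
{
    uint16_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum = (uint16_t)(sum + abs((int)a[i] - (int)b[i]));
    return sum;
}

/* Hypothetical model of one 128-bit lane: two qword superblocks,
   each producing four 16-bit SAD results. */
void dbsad_lane(const uint8_t src1[16], const uint8_t src2[16],
                uint8_t imm8, uint16_t dst[8])
{
    uint8_t tmp1[16];

    assert(src1 != NULL && src2 != NULL && dst != NULL);
    dbsad_shuffle_lane(src2, imm8, tmp1);

    for (int q = 0; q < 2; q++) {
        const uint8_t *s = &src1[8 * q];  /* superblock of first source */
        const uint8_t *t = &tmp1[8 * q];  /* superblock of Tmp1 */

        /* Stationary dword at byte offset 0, sliding dwords at offsets 0, 1. */
        dst[4 * q + 0] = sad4(s, t + 0);
        dst[4 * q + 1] = sad4(s, t + 1);
        /* Stationary dword at byte offset 4, sliding dwords at offsets 2, 3. */
        dst[4 * q + 2] = sad4(s + 4, t + 2);
        dst[4 * q + 3] = sad4(s + 4, t + 3);
    }
}
```

With imm8 = 0xE4 the two-bit fields are 0, 1, 2, 3 from low to high, so Tmp1 equals the second source unchanged; imm8 = 0x1B reverses the four dwords of the lane. A 256-bit or 512-bit model would presumably call `dbsad_lane` once per 128-bit lane.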

The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The destination operand is conditionally updated based on writemask k1 at 16-bit word granularity.
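The word-granular writemask update can be illustrated with a short C sketch. It assumes the SAD results have already been computed into a temporary array; the function name `apply_word_writemask` and its parameters are hypothetical, with `kl` standing for the number of 16-bit results (8, 16, or 32) and `zeroing` selecting between zeroing-masking and merging-masking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: conditionally update kl word results under mask k1.
   Bit j of k1 controls word j of the destination. */
void apply_word_writemask(uint16_t *dest, const uint16_t *tmp_dest,
                          uint32_t k1, size_t kl, bool zeroing)
{
    for (size_t j = 0; j < kl && j < 32; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = tmp_dest[j];   /* mask bit set: write the result */
        } else if (zeroing) {
            dest[j] = 0;             /* zeroing-masking: clear the word */
        }
        /* merging-masking: dest[j] keeps its previous value */
    }
}
```

For example, with k1 = 0x05 only words 0 and 2 receive new results; the remaining words either keep their old contents (merging) or become zero (zeroing).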

Figure 5-8. 64-bit Super Block of SAD Operation in VDBPSADBW

Operation

VDBPSADBW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
Selection of quadruplets:
FOR I = 0 to VL-1 step 128
    TMP1[I+31:I] <-  select (SRC2[I+127: I], imm8[1:0])
    TMP1[I+63: I+32] <-  select (SRC2[I+127: I], imm8[3:2])
    TMP1[I+95: I+64] <-  select (SRC2[I+127: I], imm8[5:4])
    TMP1[I+127: I+96]<-  select (SRC2[I+127: I], imm8[7:6])
ENDFOR
SAD of quadruplets:
FOR I = 0 to VL-1 step 64
    TMP_DEST[I+15:I] <-  ABS(SRC1[I+7: I] - TMP1[I+7: I]) +
          ABS(SRC1[I+15: I+8]- TMP1[I+15: I+8]) +
          ABS(SRC1[I+23: I+16]- TMP1[I+23: I+16]) +
          ABS(SRC1[I+31: I+24]- TMP1[I+31: I+24]) 
    TMP_DEST[I+31: I+16] <- ABS(SRC1[I+7: I] - TMP1[I+15: I+8]) +
          ABS(SRC1[I+15: I+8]- TMP1[I+23: I+16]) +
          ABS(SRC1[I+23: I+16]- TMP1[I+31: I+24]) +
          ABS(SRC1[I+31: I+24]- TMP1[I+39: I+32])
    TMP_DEST[I+47: I+32] <- ABS(SRC1[I+39: I+32] - TMP1[I+23: I+16]) +
          ABS(SRC1[I+47: I+40]- TMP1[I+31: I+24]) +
          ABS(SRC1[I+55: I+48]- TMP1[I+39: I+32]) +
          ABS(SRC1[I+63: I+56]- TMP1[I+47: I+40]) 
    TMP_DEST[I+63: I+48] <- ABS(SRC1[I+39: I+32] - TMP1[I+31: I+24]) +
          ABS(SRC1[I+47: I+40] - TMP1[I+39: I+32]) +
          ABS(SRC1[I+55: I+48] - TMP1[I+47: I+40]) +
          ABS(SRC1[I+63: I+56] - TMP1[I+55: I+48])
ENDFOR
FOR j <-  0 TO KL-1
    i <-  j * 16
    IF k1[j] OR *no writemask*
          THEN DEST[i+15:i] <-  TMP_DEST[i+15:i]
          ELSE 
                IF *merging-masking* ; merging-masking
                      THEN *DEST[i+15:i] remains unchanged*
                      ELSE  ; zeroing-masking
                            DEST[i+15:i] <-  0
                FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <-  0

Intel C/C++ Compiler Intrinsic Equivalent

VDBPSADBW __m512i _mm512_dbsad_epu8(__m512i a, __m512i b, int imm8);
VDBPSADBW __m512i _mm512_mask_dbsad_epu8(__m512i s, __mmask32 m, __m512i a,
                                         __m512i b, int imm8);
VDBPSADBW __m512i _mm512_maskz_dbsad_epu8(__mmask32 m, __m512i a, __m512i b,
                                          int imm8);
VDBPSADBW __m256i _mm256_dbsad_epu8(__m256i a, __m256i b, int imm8);
VDBPSADBW __m256i _mm256_mask_dbsad_epu8(__m256i s, __mmask16 m, __m256i a,
                                         __m256i b, int imm8);
VDBPSADBW __m256i _mm256_maskz_dbsad_epu8(__mmask16 m, __m256i a, __m256i b,
                                          int imm8);
VDBPSADBW __m128i _mm_dbsad_epu8(__m128i a, __m128i b, int imm8);
VDBPSADBW __m128i _mm_mask_dbsad_epu8(__m128i s, __mmask8 m, __m128i a,
                                      __m128i b, int imm8);
VDBPSADBW __m128i _mm_maskz_dbsad_epu8(__mmask8 m, __m128i a, __m128i b,
                                       int imm8);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type E4NF.nb.
