모두의 코드
VPERMI2W, VPERMI2D, VPERMI2Q, VPERMI2PS, VPERMI2PDs (Intel x86/64 assembly instruction)

작성일 : 2020-09-01 이 글은 446 번 읽혔습니다.

VPERMI2W, VPERMI2D, VPERMI2Q, VPERMI2PS, VPERMI2PD

Full Permute From Two Tables Overwriting the Index

참고 사항

아래 표를 해석하는 방법은 x86-64 명령어 레퍼런스 읽는 법 글을 참조하시기 바랍니다.

Opcode/
Instruction

Op /
En

64/32
bit Mode
Support

CPUID
Feature
Flag

Description

EVEX.DDS.128.66.0F38.W1 75 /r
VPERMI2W xmm1 {k1}{z} xmm2 xmm3/m128

FVM

V/V

AVX512VL
AVX512BW

Permute word integers from two tables in xmm3/m128 and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 75 /r
VPERMI2W ymm1 {k1}{z} ymm2 ymm3/m256

FVM

V/V

AVX512VL
AVX512BW

Permute word integers from two tables in ymm3/m256 and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 75 /r
VPERMI2W zmm1 {k1}{z} zmm2 zmm3/m512

FVM

V/V

AVX512BW

Permute word integers from two tables in zmm3/m512 and zmm2 using indexes in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 76 /r
VPERMI2D xmm1 {k1}{z} xmm2 xmm3/m128/m32bcst

FV

V/V

AVX512VL
AVX512F

Permute double-words from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 76 /r
VPERMI2D ymm1 {k1}{z} ymm2 ymm3/m256/m32bcst

FV

V/V

AVX512VL
AVX512F

Permute double-words from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 76 /r
VPERMI2D zmm1 {k1}{z} zmm2 zmm3/m512/m32bcst

FV

V/V

AVX512F

Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W1 76 /r
VPERMI2Q xmm1 {k1}{z} xmm2 xmm3/m128/m64bcst

FV

V/V

AVX512VL
AVX512F

Permute quad-words from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 76 /r
VPERMI2Q ymm1 {k1}{z} ymm2 ymm3/m256/m64bcst

FV

V/V

AVX512VL
AVX512F

Permute quad-words from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 76 /r
VPERMI2Q zmm1 {k1}{z} zmm2 zmm3/m512/m64bcst

FV

V/V

AVX512F

Permute quad-words from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 77 /r
VPERMI2PS xmm1 {k1}{z} xmm2 xmm3/m128/m32bcst

FV

V/V

AVX512VL
AVX512F

Permute single-precision FP values from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 77 /r
VPERMI2PS ymm1 {k1}{z} ymm2 ymm3/m256/m32bcst

FV

V/V

AVX512VL
AVX512F

Permute single-precision FP values from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 77 /r
VPERMI2PS zmm1 {k1}{z} zmm2 zmm3/m512/m32bcst

FV

V/V

AVX512F

Permute single-precision FP values from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

Opcode/
Instruction

Op /
En

64/32
bit Mode
Support

CPUID
Feature
Flag

Description

EVEX.DDS.128.66.0F38.W1 77 /r
VPERMI2PD xmm1 {k1}{z} xmm2 xmm3/m128/m64bcst

FV

V/V

AVX512VL
AVX512F

Permute double-precision FP values from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 77 /r
VPERMI2PD ymm1 {k1}{z} ymm2 ymm3/m256/m64bcst

FV

V/V

AVX512VL
AVX512F

Permute double-precision FP values from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 77 /r
VPERMI2PD zmm1 {k1}{z} zmm2 zmm3/m512/m64bcst

FV

V/V

AVX512F

Permute double-precision FP values from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

FVM

ModRM:reg (r,w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

FV

ModRM:reg (r, w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

Permutes 16-bit/32-bit/64-bit values in the second operand (the first source operand) and the third operand (the second source operand) using indices in the first operand to select elements from the second and third operands. The selected elements are written to the destination operand (the first operand) according to the writemask k1.

The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the result.

D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table2).

Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits [3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands, If the id bit is 0, table1 (the first source) is selected; otherwise the second source operand is selected.

Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0] /bit 0 selects element within each input table.

Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] selects element within each input table.

Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than one location in the destination operand. Note also that in this case, the same table can be reused for example for a second iteration, while the index elements are overwritten.

Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.

Operation

VPERMI2W (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    id <-  2
FI;
IF VL = 256
    id <-  3
FI;
IF VL = 512
    id <-  4
FI;
TMP_DEST<-  DEST
FOR j <-  0 TO KL-1
    i <-  j * 16
    off <-  16*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
          THEN 
                DEST[i+15:i]=TMP_DEST[i+id+1] ? SRC2[off+15:off] 
                             : SRC1[off+15:off]
          ELSE 
                IF *merging-masking* ; merging-masking
                      THEN *DEST[i+15:i] remains unchanged*
                      ELSE  ; zeroing-masking
                            DEST[i+15:i] <-  0
                FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <-  0

VPERMI2D/VPERMI2PS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
    id <-  1
FI;
IF VL = 256
    id <-  2
FI;
IF VL = 512
    id <-  3
FI;
TMP_DEST<-  DEST
FOR j <-  0 TO KL-1
    i <-  j * 32
    off <-  32*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
          THEN 
                IF (EVEX.b = 1) AND (SRC2 *is memory*)
                      THEN 
                            DEST[i+31:i]  <-  TMP_DEST[i+id+1] ? SRC2[31:0] 
                             : SRC1[off+31:off]
                ELSE 
                      DEST[i+31:i] <-  TMP_DEST[i+id+1] ? SRC2[off+31:off] 
                             : SRC1[off+31:off]
FI 
          ELSE 
                IF *merging-masking* ; merging-masking
                      THEN *DEST[i+31:i] remains unchanged*
                      ELSE  ; zeroing-masking
                            DEST[i+31:i] <-  0
                FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <-  0

VPERMI2Q/VPERMI2PD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8 512)
IF VL = 128
    id <-  0
FI;
IF VL = 256
    id <-  1
FI;
IF VL = 512
    id <-  2
FI;
TMP_DEST<-  DEST
FOR j <-  0 TO KL-1
    i <-  j * 64
    off <-  64*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
          THEN 
                IF (EVEX.b = 1) AND (SRC2 *is memory*)
                      THEN 
                            DEST[i+63:i] <-  TMP_DEST[i+id+1] ? SRC2[63:0] 
                             : SRC1[off+63:off]
                ELSE 
                      DEST[i+63:i] <-  TMP_DEST[i+id+1] ? SRC2[off+63:off] 
                             : SRC1[off+63:off]
                FI 
          ELSE 
                IF *merging-masking* ; merging-masking
                      THEN *DEST[i+63:i] remains unchanged*
                      ELSE  ; zeroing-masking
                            DEST[i+63:i] <-  0
                FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <-  0

Intel C/C++ Compiler Intrinsic Equivalent

VPERMI2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b);
VPERMI2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k,
                                                __m512i idx, __m512i b);
VPERMI2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx,
                                                 __mmask16 k, __m512i b);
VPERMI2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a,
                                                 __m512i idx, __m512i b);
VPERMI __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k,
                                                __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx,
                                                 __mmask8 k, __m256i b);
VPERMI2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a,
                                                 __m256i idx, __m256i b);
VPERMI2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMI2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx,
                                             __m128i b);
VPERMI2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx,
                                              __mmask8 k, __m128i b);
VPERMI2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a,
                                              __m128i idx, __m128i b);
VPERMI2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k,
                                              __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx,
                                               __mmask8 k, __m512d b);
VPERMI2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a,
                                               __m512i idx, __m512d b);
VPERMI2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k,
                                              __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx,
                                               __mmask8 k, __m256d b);
VPERMI2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a,
                                               __m256i idx, __m256d b);
VPERMI2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMI2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx,
                                           __m128d b);
VPERMI2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k,
                                            __m128d b);
VPERMI2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx,
                                            __m128d b);
VPERMI2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMI2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx,
                                             __m512 b);
VPERMI2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx,
                                              __mmask16 k, __m512 b);
VPERMI2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a,
                                              __m512i idx, __m512 b);
VPERMI2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMI2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx,
                                             __m256 b);
VPERMI2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k,
                                              __m256 b);
VPERMI2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx,
                                              __m256 b);
VPERMI2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMI2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx,
                                          __m128 b);
VPERMI2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k,
                                           __m128 b);
VPERMI2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx,
                                           __m128 b);
VPERMI2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k,
                                                __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx,
                                                 __mmask8 k, __m512i b);
VPERMI2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a,
                                                 __m512i idx, __m512i b);
VPERMI2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k,
                                                __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx,
                                                 __mmask8 k, __m256i b);
VPERMI2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a,
                                                 __m256i idx, __m256i b);
VPERMI2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMI2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx,
                                             __m128i b);
VPERMI2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx,
                                              __mmask8 k, __m128i b);
VPERMI2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a,
                                              __m128i idx, __m128i b);
VPERMI2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k,
                                                __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx,
                                                 __mmask32 k, __m512i b);
VPERMI2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a,
                                                 __m512i idx, __m512i b);
VPERMI2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k,
                                                __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx,
                                                 __mmask16 k, __m256i b);
VPERMI2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a,
                                                 __m256i idx, __m256i b);
VPERMI2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx,
                                             __m128i b);
VPERMI2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx,
                                              __mmask8 k, __m128i b);
VPERMI2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a,
                                              __m128i idx, __m128i b);

SIMD Floating-Point Exceptions

None

Other Exceptions

VPERMI2D/Q/PS/PD: See Exceptions Type E4NF.

VPERMI2W: See Exceptions Type E4NF.nb.

첫 댓글을 달아주세요!
프로필 사진 없음
강좌에 관련 없이 궁금한 내용은 여기를 사용해주세요

    댓글을 불러오는 중입니다..