Chapter 11
SIMD Instructions
This chapter discusses the vector instructions on the x86-64. This special class of instructions provides parallel processing, traditionally known as single-instruction, multiple-data (SIMD) instructions because, quite literally, a single instruction operates on several pieces of data concurrently.
As a result of this concurrency, SIMD instructions can often execute several times faster (in theory, as much as 32 to 64 times faster) than the comparable single instruction, single-data (SISD), or scalar, instructions that compose the standard x86-64 instruction set.
The x86-64 actually provides three sets of vector instructions: the Multimedia Extensions (MMX) instruction set, the Streaming SIMD Extensions (SSE) instruction set, and the Advanced Vector Extensions (AVX) instruction set.
The SSE/AVX Architecture
The SSE/AVX architectures have three main generations:
The SSE architecture, which (on 64-bit CPUs) provided sixteen 128-bit XMM registers supporting integer and floating-point data types
The AVX/AVX2 architecture, which supported sixteen 256-bit YMM registers (also supporting integer and floating-point data types)
The AVX-512 architecture, which supported up to thirty-two 512-bit ZMM registers
Streaming Data Types
The SSE and AVX programming models support two basic data types: scalars and vectors. Scalars hold one single- or double-precision floating-point value. Vectors hold multiple floating-point or integer values (between 2 and 32 values, depending on the scalar data type of byte, word, dword, qword, single precision, or double precision, and the register and memory size of 128 or 256 bits).
The XMM registers (XMM0 to XMM15) can hold a single 32-bit floating-point value (a scalar) or four single-precision floating-point values (a vector). The YMM registers (YMM0 to YMM15) can hold eight single-precision (32-bit) floating-point values.

The XMM registers can hold a single double-precision scalar value or a vector containing a pair of double-precision values. The YMM registers can hold a vector containing four double-precision floating-point values.

The XMM registers can hold 16 byte values (YMM registers can hold 32 byte values), allowing the CPU to perform 16 (32) byte-sized computations with one instruction.

The XMM registers can hold eight word values (YMM registers can hold sixteen word values), allowing the CPU to perform eight (sixteen) 16-bit word-sized integer computations with one instruction.

The XMM registers can hold four dword values (YMM registers can hold eight dword values), allowing the CPU to perform four (eight) 32-bit dword-sized integer computations with one instruction.

The XMM registers can hold two qword values (YMM registers can hold four qword values), allowing the CPU to perform two (four) 64-bit qword computations with one instruction.

Using cpuid to Differentiate Instruction Sets
The cpuid instruction expects a single parameter (called a leaf function) passed in the EAX register. It returns various pieces of information about the CPU in different 32-bit registers based on the value passed in EAX. An application can test the return information to see if certain CPU features are available.
To determine the highest function supported, you load EAX with 0 or 8000_0000h and execute the cpuid instruction. The return value is the maximum you can pass to cpuid in EAX.
In addition to providing the highest function supported, the cpuid instruction with EAX = 0h also returns a 12-character vendor ID in the EBX, EDX, and ECX registers (in that order).
To determine if the CPU can execute most SSE and AVX instructions, you must execute cpuid with EAX = 01h and test various bits placed in the ECX register.

; Test for BMI1 and BMI2 instructions
option casemap:none
nl = 10
.const
ttlStr byte "Listing 11-2", 0
.data
maxFeature dword ?
VendorID byte 14 dup (0)
.code
externdef printf:proc
; Used for debugging:
print proc
push rax
push rbx
push rcx
push rdx
push r8
push r9
push r10
push r11
push rbp
mov rbp, rsp
sub rsp, 40
and rsp, -16
mov rcx, [rbp+72] ;Return address
call printf
mov rcx, [rbp+72]
dec rcx
skipTo0: inc rcx
cmp byte ptr [rcx], 0
jne skipTo0
inc rcx
mov [rbp+72], rcx
leave
pop r11
pop r10
pop r9
pop r8
pop rdx
pop rcx
pop rbx
pop rax
ret
print endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbx
push rbp
mov rbp, rsp
sub rsp, 56 ;Shadow storage
xor eax, eax
cpuid
mov maxFeature, eax
mov dword ptr VendorID, ebx
mov dword ptr VendorID[4], edx
mov dword ptr VendorID[8], ecx
lea rdx, VendorID
mov r8d, eax
call print
byte "CPUID(0): Vendor ID='%s', max feature=0%xh", nl, 0
; Leaf function 1 is available on all CPUs that support
; CPUID, no need to test for it.
mov eax, 1
cpuid
mov r8d, edx
mov edx, ecx
call print
byte "cpuid(1), ECX=%08x, EDX=%08x", nl, 0
; Most likely, leaf function 7 is supported on all modern CPUs
; (e.g., x86-64), but we'll test its availability nonetheless.
cmp maxFeature, 7
jb Unsupported
mov eax, 7
xor ecx, ecx
cpuid
mov edx, ebx
mov r8d, ecx
call print
byte "cpuid(7), EBX=%08x, ECX=%08x", nl, 0
; Test for extended bit manipulation instructions
; (BMI1 and BMI2):
and ebx, 108h ;Test bits 3 & 8
cmp ebx, 108h ;Both must be set
jne Unsupported
call print
byte "CPU supports BMI1 & BMI2", nl, 0
jmp allDone
Unsupported:
call print
byte "CPU does not support BMI1 & BMI2 "
byte "instructions", nl, 0
allDone: leave
pop rbx
ret ;Returns to caller
asmMain endp
end
Memory Operand Alignment
Almost all SSE, AVX, and AVX2 instructions will generate a memory alignment fault if you attempt to access a 128-bit object at an address that is not 16-byte aligned. Always ensure that your SSE packed operands are properly aligned.
Because segments have a default alignment of para (16 bytes), you can easily ensure that any 16-byte packed data objects are 16-byte-aligned by using the align directive:
align 16
If you are using AVX instructions to access 256-bit (32-byte) memory operands, you must ensure that those memory operands begin on a 32-byte address boundary. Unfortunately, align 32 won’t work, because the default segment alignment is para (16-byte) alignment, and the segment’s alignment must be greater than or equal to the operand field of any align directives appearing within that segment. Therefore, to be able to define 256-bit variables usable by AVX instructions, you must explicitly define a segment that is aligned on a 32-byte boundary, such as the following:
avxData segment align(32)
align 32 ; This is actually redundant here
someData oword 0, 1 ; 256 bits of data
.
.
.
avxData ends
Inside a procedure, you can allocate storage for a 16-, 32-, or 64-byte-aligned variable by over-allocating the storage, adding the alignment minus 1 to the allocated address, and then using the and instruction to zero out the LO bits of the address (4 bits for 16-byte-aligned objects, 5 bits for 32-byte-aligned objects, and 6 bits for 64-byte-aligned objects). Then reference the object by using this pointer.
sseproc proc
sseptr equ <[rbp - 8]>
avxptr equ <[rbp - 16]>
avx2ptr equ <[rbp - 24]>
push rbp
mov rbp, rsp
sub rsp, 224 ;Data (112 bytes) + worst-case alignment (63) + locals
; Load RAX with an address 63 bytes
; above the current stack pointer. A
; 64-byte-aligned address will be somewhere
; between RSP and RSP + 63.
lea rax, [rsp + 63]
; Mask out the LO 6 bits of RAX. This
; generates an address in RAX that is
; aligned on a 64-byte boundary and is
; between RSP and RSP + 63:
and rax, -64 ; 0FFFF...FC0h
; Save this 64-byte-aligned address as
; the pointer to the AVX2 data:
mov avx2ptr, rax
; Add 64 to AVX2's address. This skips
; over AVX2's data. The address is also
; 64-byte-aligned (which means it is
; also 32-byte-aligned). Use this as
; the address of AVX's data:
add rax, 64
mov avxptr, rax
; Add 32 to AVX's address. This skips
; over AVX's data. The address is also
; 32-byte-aligned (which means it is
; also 16-byte-aligned). Use this as
; the address of SSE's data:
add rax, 32
mov sseptr, rax
.
. Code that accesses the
. AVX2, AVX, and SSE data
. areas using avx2ptr,
. avxptr, and sseptr
leave
ret
sseproc endp
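The lea/and pair above computes (rsp + 63) and -64. As a quick sanity check of that arithmetic, here is a small Python model (align_up is a hypothetical helper name, not part of the listing):

```python
def align_up(addr: int, alignment: int) -> int:
    """Model of the lea/and sequence: add (alignment - 1), then mask LO bits."""
    return (addr + alignment - 1) & -alignment

# The masked address always lands between addr and addr + alignment - 1.
assert align_up(0x1000, 64) == 0x1000      # already aligned: unchanged
assert align_up(0x1001, 64) == 0x1040      # rounded up to the next boundary
assert align_up(0x1001, 16) == 0x1010
```

Because the mask only ever rounds up by at most alignment - 1 bytes, over-allocating by that amount guarantees the aligned block still fits inside the reserved storage.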
SIMD Data Movement Instructions
The (v)movd and (v)movq Instructions
For the SSE instruction set, the movd (move dword) and movq (move qword) instructions copy the value from a 32- or 64-bit general-purpose register or memory location into the LO dword or qword of an XMM register.
movd xmmn, reg32/mem32
movq xmmn, reg64/mem64


The following instructions store the LO 32 or 64 bits of an XMM register into a dword or qword memory location or general-purpose register:
movd reg32/mem32, xmmn
movq reg64/mem64, xmmn
For the AVX instructions, you use the following forms:
vmovd xmmn, reg32/mem32
vmovd reg32/mem32, xmmn
vmovq xmmn, reg64/mem64
vmovq reg64/mem64, xmmn
The (v)movaps, (v)movapd, and (v)movdqa Instructions
The movaps (move aligned, packed single), movapd (move aligned, packed double), and movdqa (move double quad-word aligned) instructions move 16 bytes of data between memory and an XMM register or between two XMM registers. The AVX versions (with the v prefix) move 16 or 32 bytes between memory and an XMM or a YMM register or between two XMM or YMM registers.
movaps xmmn, mem128 vmovaps xmmn, mem128 vmovaps ymmn, mem256
movaps mem128, xmmn vmovaps mem128, xmmn vmovaps mem256, ymmn
movaps xmmn, xmmn vmovaps xmmn, xmmn vmovaps ymmn, ymmn
movapd xmmn, mem128 vmovapd xmmn, mem128 vmovapd ymmn, mem256
movapd mem128, xmmn vmovapd mem128, xmmn vmovapd mem256, ymmn
movapd xmmn, xmmn vmovapd xmmn, xmmn vmovapd ymmn, ymmn
movdqa xmmn, mem128 vmovdqa xmmn, mem128 vmovdqa ymmn, mem256
movdqa mem128, xmmn vmovdqa mem128, xmmn vmovdqa mem256, ymmn
movdqa xmmn, xmmn vmovdqa xmmn, xmmn vmovdqa ymmn, ymmn
The mem128 operand should be a vector (array) of four single-precision floating-point values for the (v)movaps instruction; it should be a vector of two double-precision floating-point values for the (v)movapd instruction; it should be a 16-byte value (16 bytes, 8 words, 4 dwords, or 2 qwords) when using the (v)movdqa instruction.
The mem256 operand should be a vector (array) of eight single-precision floating-point values for the vmovaps instruction; it should be a vector of four double-precision floating-point values for the vmovapd instruction; it should be a 32-byte value (32 bytes, 16 words, 8 dwords, or 4 qwords) when using the vmovdqa instruction.
You will need to use operand type coercion if the operand doesn't match one of the following types:
The movaps instruction allows real4, dword, and oword operands.
The movapd instruction allows real8, qword, and oword operands.
The movdqa instruction allows only oword operands.
The vmovaps instruction allows real4, dword, and ymmword ptr operands (when using a YMM register).
The vmovapd instruction allows real8, qword, and ymmword ptr operands (when using a YMM register).
The vmovdqa instruction allows only ymmword ptr operands (when using a YMM register).
The (v)movups, (v)movupd, and (v)movdqu Instructions
When you cannot guarantee that packed data memory operands lie on a 16- or 32-byte address boundary, you can use the (v)movups (move unaligned packed single-precision), (v)movupd (move unaligned packed double-precision), and (v)movdqu (move double quad-word unaligned) instructions to move data between XMM or YMM registers and memory.
These instructions typically run slower than their aligned equivalents.
The (v)movlps and (v)movlpd Instructions
The (v)movl* instructions and (v)movh* instructions allow you to merge two pairs of single-precision or a pair of double-precision floating-point operands from two different sources into a single XMM register.
movlps xmmdest, mem64
movlps mem64, xmmsrc
vmovlps xmmdest, xmmsrc, mem64
vmovlps mem64, xmmsrc
The movlps xmmdest, mem64 form copies a pair of single-precision floating-point values into the two LO 32-bit lanes of a destination XMM register.

The movlps mem64, xmmsrc form copies the LO 64 bits (the two LO single-precision lanes) from the XMM source register to the specified memory location.
The vmovlps instruction has three operands: a destination XMM register, a source XMM register, and a source (64-bit) memory location. This instruction copies the two single-precision values from the memory location into the LO 64 bits of the destination XMM register. It copies the HO 64 bits of the source register into the HO 64 bits of the destination register.

Like movsd, the movlpd (move low packed double) instruction copies the LO 64 bits (a double-precision floating-point value) of the source operand to the LO 64 bits of the destination operand. The difference is that the movlpd instruction doesn’t zero-extend the value when moving data from memory into an XMM register, whereas the movsd instruction will zero-extend.
The movhps and movhpd Instructions
The movhps and movhpd instructions move a 64-bit value into the HO quad word of a destination XMM register.

The movhps and movhpd instructions can also store the HO quad word of an XMM register into memory. These instructions do not affect bits 128 to 255 of the YMM registers.
movhps xmmn, mem64
movhps mem64, xmmn
movhpd xmmn, mem64
movhpd mem64, xmmn
You would normally use a movlps instruction followed by a movhps instruction to load four single-precision floating-point values into an XMM register, taking the floating-point values from two different data sources.
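This two-source load pattern can be sketched with a small Python model of the movlps/movhps lane behavior (lists stand in for the four single-precision lanes of an XMM register):

```python
def movlps(dest, mem2):
    """mem2 supplies lanes 0-1; lanes 2-3 of dest are unchanged."""
    return mem2[0:2] + dest[2:4]

def movhps(dest, mem2):
    """mem2 supplies lanes 2-3; lanes 0-1 of dest are unchanged."""
    return dest[0:2] + mem2[0:2]

xmm = [0.0, 0.0, 0.0, 0.0]
xmm = movlps(xmm, [1.0, 2.0])   # low half from one data source
xmm = movhps(xmm, [3.0, 4.0])   # high half from another data source
assert xmm == [1.0, 2.0, 3.0, 4.0]
```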
The vmovhps and vmovhpd Instructions
Although the AVX instruction extensions provide vmovhps and vmovhpd instructions, they are not a simple extension of the SSE movhps and movhpd instructions.
vmovhps xmmdest, xmmsrc, mem64
vmovhps mem64, xmmsrc
vmovhpd xmmdest, xmmsrc, mem64
vmovhpd mem64, xmmsrc
The instructions that store data into a 64-bit memory location behave similarly to the movhps and movhpd instructions.
The instructions that load data into an XMM register have two source operands. They load a full 128 bits into the destination XMM register. The HO 64 bits come from the memory operand and the LO 64 bits come from the LO quad word of the source XMM register.

Unlike for the movhps instruction, MASM properly accepts real4 source operands for the vmovhps instruction.
The movlhps and vmovlhps Instructions
The movlhps instruction moves a pair of 32-bit single-precision floating-point values from the LO qword of the source XMM register into the HO 64 bits of a destination XMM register. It leaves the LO 64 bits of the destination register unchanged.
If the destination register is on a CPU that supports 256-bit AVX registers, this instruction also leaves the HO 128 bits of the overlaid YMM register unchanged.
movlhps xmmdest, xmmsrc
vmovlhps xmmdest, xmmsrc1, xmmsrc2
You cannot use this instruction to move data between memory and an XMM register; it transfers data only between XMM registers. No double-precision version of this instruction exists.
The vmovlhps instruction is similar to movlhps, with the following differences:
vmovlhps requires three operands: two source XMM registers and a destination XMM register.
vmovlhps copies the LO quad word of the first source register into the LO quad word of the destination register.
vmovlhps copies the LO quad word of the second source register into bits 64 to 127 of the destination register.
vmovlhps zero-extends the result into the upper 128 bits of the overlaid YMM register.
The movhlps and vmovhlps Instructions
The movhlps instruction copies the pair of 32-bit single-precision floating-point values from the HO qword of the source operand to the LO qword of the destination register, leaving the HO 64 bits of the destination register unchanged. This instruction copies data only between XMM registers; it does not allow a memory operand.
movhlps xmmdest, xmmsrc
The vmovhlps instruction requires three XMM register operands. This instruction copies the HO 64 bits of the first source register into the HO 64 bits of the destination register, copies the HO 64 bits of the second source register into bits 0 to 63 of the destination register, and finally, zero-extends the result into the upper bits of the overlaid YMM register.
vmovhlps xmmdest, xmmsrc1, xmmsrc2
The (v)movshdup and (v)movsldup Instructions
The movshdup instruction copies the two odd-index single-precision floating-point values from the source operand (memory or XMM register) and duplicates each element into the destination XMM register.
This instruction ignores the single-precision floating-point values at even-lane indexes into the XMM register.
The vmovshdup instruction works the same way but on YMM registers, copying four single-precision values rather than two.
movshdup xmmdest, mem128/xmmsrc
vmovshdup xmmdest, mem128/xmmsrc
vmovshdup ymmdest, mem256/ymmsrc

The movsldup instruction works just like the movshdup instruction, except it copies and duplicates the two single-precision values at even indexes in the source XMM register to the destination XMM register. Likewise, the vmovsldup instruction copies and duplicates the four single-precision values at even indexes in the source YMM register.
movsldup xmmdest, mem128/xmmsrc
vmovsldup xmmdest, mem128/xmmsrc
vmovsldup ymmdest, mem256/ymmsrc
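The two duplication patterns can be sketched with a small Python model (lists stand in for the single-precision lanes):

```python
def movshdup(src):
    """Duplicate each odd-index lane into the even/odd pair below it."""
    return [src[i | 1] for i in range(len(src))]

def movsldup(src):
    """Duplicate each even-index lane into the even/odd pair above it."""
    return [src[i & ~1] for i in range(len(src))]

lanes = [0.0, 1.0, 2.0, 3.0]
assert movshdup(lanes) == [1.0, 1.0, 3.0, 3.0]   # odd lanes duplicated
assert movsldup(lanes) == [0.0, 0.0, 2.0, 2.0]   # even lanes duplicated
```

The same index arithmetic applies unchanged to the eight lanes of a YMM register.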

The (v)movddup Instruction
The movddup instruction copies and duplicates a double-precision value from the LO 64 bits of an XMM register or a 64-bit memory location into the LO 64 bits of a destination XMM register; then it also duplicates this value into bits 64 to 127 of that same destination register.

This instruction does not disturb the HO 128 bits of a YMM register.
movddup xmmdest, mem64/xmmsrc
The vmovddup instruction operates on an XMM or a YMM destination register and an XMM or a YMM source register, or a 64- or 256-bit memory location.
The 128-bit version works just like the movddup instruction except it zeroes the HO bits of the destination YMM register. The 256-bit version copies a pair of double-precision values at even indexes (0 and 2) in the source value to their corresponding indexes in the destination YMM register and duplicates those values at the odd indexes in the destination.

vmovddup xmmdest, mem64/xmmsrc
vmovddup ymmdest, mem256/ymmsrc
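The 256-bit duplication pattern can be sketched with a small Python model (the list stands in for the four double-precision lanes of a YMM register):

```python
def vmovddup_256(src):
    """Duplicate the even-index lanes (0 and 2) into the odd lanes above them."""
    return [src[0], src[0], src[2], src[2]]

assert vmovddup_256([1.0, 2.0, 3.0, 4.0]) == [1.0, 1.0, 3.0, 3.0]
```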
The (v)lddqu Instruction
The (v)lddqu instruction is operationally identical to (v)movdqu. You can sometimes use this instruction to improve performance if the (memory) source operand is not aligned properly and crosses a cache line boundary in memory.
lddqu xmmdest, mem128
vlddqu xmmdest, mem128
vlddqu ymmdest, mem256
The Shuffle and Unpack Instructions
The (v)pshufb Instructions
The pshufb instruction was the first packed byte shuffle SIMD instruction. The first (destination) operand is an XMM register whose byte lanes pshufb will shuffle (rearrange). The second operand (either an XMM register or a 128-bit oword memory location) is an array of 16 byte values holding indexes that control the shuffle operation.
pshufb xmmdest, xmm/mem128
If the second operand is a memory location, that oword value must be aligned on a 16-byte boundary. Each byte (lane) in the second operand selects a value for the corresponding byte lane in the first operand.

The 16 index bytes in the second operand each encode a lane number in their LO 4 bits and a clear bit in bit 7.
The pshufb instruction ignores bits 4 to 6 in an index byte. Bit 7 is the clear bit; if this bit contains a 1, the pshufb instruction ignores the lane index bits and stores a 0 into the corresponding byte in XMMdest. If the clear bit contains a 0, the pshufb instruction does a shuffle operation.
The pshufb shuffle operation takes place on a lane-by-lane basis. The instruction first makes a temporary copy of XMMdest. Then, for each index byte whose HO bit is 0, pshufb copies the byte from the temporary-copy lane specified by the LO 4 bits of that index into the destination lane corresponding to the index's position.
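The lane selection and clear-bit behavior can be sketched with a small Python model (byte values stand in for the 16 lanes):

```python
def pshufb(dest, index):
    """Byte shuffle: bit 7 of each index byte clears the lane to 0;
    otherwise bits 0-3 select the source lane from a copy of dest."""
    src = dest[:]                    # the instruction works from a temporary copy
    return [0 if idx & 0x80 else src[idx & 0x0F] for idx in index]

data = list(range(100, 116))         # 16 byte lanes: 100, 101, ..., 115
rev  = list(range(15, -1, -1))       # indexes 15 down to 0 reverse the lanes
assert pshufb(data, rev) == list(range(115, 99, -1))
assert pshufb(data, [0x80] * 16) == [0] * 16   # clear bit set: every lane zeroed
```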

The AVX variant adds a source register (rather than using XMMdest as both the source and destination register); rather than making a temporary copy of XMMdest and picking values from that copy, the vpshufb instruction selects the source bytes from the XMMsrc register. Other than that, and the fact that it zeroes the HO bits of YMMdest, the 128-bit variant operates identically to the SSE pshufb instruction.
The AVX instruction allows you to specify 256-bit YMM registers in addition to 128-bit XMM registers.
vpshufb xmmdest, xmmsrc, xmmindex/mem128
vpshufb ymmdest, ymmsrc, ymmindex/mem256
The (v)pshufd Instructions
These instructions shuffle dwords in XMM and YMM registers (not double-precision values) similarly to the (v)pshufb instructions. However, the shuffle index is specified differently from (v)pshufb.
pshufd xmmdest, xmmsrc/mem128, imm8
vpshufd xmmdest, xmmsrc/mem128, imm8
vpshufd ymmdest, ymmsrc/mem256, imm8
The first operand (XMMdest or YMMdest) is the destination operand where the shuffled values will be stored. The second operand is the source from which the instruction will select the double words to place in the destination register. The third operand is an 8-bit immediate value that specifies the indexes for the double words to select from the source operand.

The value in bits 0 to 1 selects a particular dword from the source operand to place in dword 0 of the XMMdest operand. The value in bits 2 to 3 selects a dword from the source operand to place in dword 1 of the XMMdest operand. The value in bits 4 to 5 selects a dword from the source operand to place in dword 2 of the XMMdest operand. Finally, the value in bits 6 to 7 selects a dword from the source operand to place in dword 3 of the XMMdest operand.
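The imm8 encoding can be sketched with a small Python model (the list stands in for the four dword lanes):

```python
def pshufd(src, imm8):
    """Each 2-bit field of imm8 picks the source dword for one destination lane."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

src = [10, 20, 30, 40]
assert pshufd(src, 0b00_01_10_11) == [40, 30, 20, 10]   # reverse the dwords
assert pshufd(src, 0) == [10, 10, 10, 10]               # broadcast dword 0
```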
The difference between the 128-bit pshufd and vpshufd instructions is that pshufd leaves the HO 128 bits of the underlying YMM register unchanged and vpshufd zeroes the HO 128 bits of the underlying YMM register.

The 256-bit variant of vpshufd (when using YMM registers as the source and destination operands) still uses an 8-bit immediate operand as the index value. Each 2-bit index value manipulates two dword values in the YMM registers. Bits 0 to 1 control dwords 0 and 4, bits 2 to 3 control dwords 1 and 5, bits 4 to 5 control dwords 2 and 6, and bits 6 to 7 control dwords 3 and 7.
The 256-bit version is slightly less flexible as it copies two dwords at a time, rather than one.
The (v)pshuflw and (v)pshufhw Instructions
The pshuflw/vpshuflw and pshufhw/vpshufhw instructions provide support for 16-bit word shuffles within an XMM or a YMM register.
pshuflw xmmdest, xmmsrc/mem128, imm8
pshufhw xmmdest, xmmsrc/mem128, imm8
vpshuflw xmmdest, xmmsrc/mem128, imm8
vpshufhw xmmdest, xmmsrc/mem128, imm8
vpshuflw ymmdest, ymmsrc/mem256, imm8
vpshufhw ymmdest, ymmsrc/mem256, imm8
The 128-bit lw variants copy the HO 64 bits of the source operand to the same positions in the XMMdest operand. Then they use the index (imm8) operand to select word lanes 0 to 3 in the LO qword of the XMMsrc/mem128 operand to move to the LO 4 lanes of the destination operand. Note that pshuflw does not modify the HO 128 bits of the overlaid YMM register, whereas vpshuflw zeroes those HO bits.

The 256-bit vpshuflw instruction (with a YMM destination register) performs the same word shuffle independently on the LO and HO 128-bit halves of the 256-bit source. The index (imm8) selection is the same for both halves.

The 128-bit hw variants copy the LO 64 bits of the source operand to the same positions in the destination operand. Then they use the index operand to select words 4 to 7 (indexed as 0 to 3) in the 128-bit source operand to move to the HO four word lanes of the destination operand.

The 256-bit vpshufhw instruction (with a YMM destination register) performs the same HO-word shuffle independently on the LO and HO 128-bit halves of the 256-bit source, again using the same imm8 selection for both halves.

The shufps and shufpd Instructions
The shuffle instructions (shufps and shufpd) extract single- or double-precision values from the source operands and place them in specified positions in the destination operand. The third operand, an 8-bit immediate value, selects which values to extract from the source to move into the destination register.
shufps xmmsrc1/dest, xmmsrc2/mem128, imm8
shufpd xmmsrc1/dest, xmmsrc2/mem128, imm8
For the shufps instruction, the third (imm8) operand is actually a four-element array of 2-bit values.
Bits 0 and 1 of imm8 select one of the four lanes in the XMMsrc1/dest operand to store into lane 0 of the destination (the destination operand is also XMMsrc1/dest). Bits 2 and 3 select the XMMsrc1/dest lane to store into lane 1 of the destination. Bits 4 and 5 select one of the four lanes in the XMMsrc2/mem128 operand to store into lane 2 of the destination. Bits 6 and 7 select the XMMsrc2/mem128 lane to store into lane 3 of the destination.

The shufpd instruction works similarly, shuffling double-precision values. As there are only two double-precision values in an XMM register, it takes only a single bit to choose between the values.
As a result, the third operand, the imm8 value, is actually just a 2-bit value; the instruction ignores bits 2 to 7 in the imm8 operand. Bit 0 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc1/dest operand to place into lane 0 and bits 0 to 63 of XMMdest. Bit 1 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc/mem128 operand to place into lane 1 and bits 64 to 127 of XMMdest.

The vshufps and vshufpd Instructions
The vshufps and vshufpd instructions are similar to shufps and shufpd. They allow you to shuffle the values in 128-bit XMM registers or 256-bit YMM registers.
The vshufps and vshufpd instructions have four operands: a destination XMM or YMM register, two source operands (src1 must be an XMM or a YMM register, and src2 can be an XMM or a YMM register or a 128- or 256-bit memory location), and an imm8 operand.
vshufps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vshufpd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vshufps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
vshufpd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
For the 256-bit vshufps instruction, the imm8 operand is an array of four 2-bit values. Each 2-bit value selects one of four single-precision values from the corresponding 128-bit half of the source operands; the same selection applies independently to the LO and HO halves.

The vshufps instruction also allows you to specify XMM and 128-bit memory operands. In this form, it behaves quite similarly to the shufps instruction except that you get to specify two different 128-bit source operands and it zeroes the HO 128 bits of the corresponding YMM register.
The vshufpd instruction is an extension of shufpd to 256 bits. As there are four double-precision values present in a 256-bit YMM register, vshufpd needs 4 bits (imm8 bits 0 to 3) to select the source indexes.

Like the vshufps instruction, vshufpd also allows you to specify XMM registers if you want a three-operand version of shufpd.
The (v)unpcklps, (v)unpckhps, (v)unpcklpd, and (v)unpckhpd Instructions
The unpack (and merge) instructions are a simplified variant of the shuffle instructions. These instructions copy single- and double-precision values from fixed locations in their source operands and insert those values into fixed locations in the destination operand.
The unpcklps and unpckhps instructions choose half their single-precision operands from one of two sources, merge these values, and then store the merged result into the destination operand.
unpcklps xmmdest, xmmsrc/mem128
unpckhps xmmdest, xmmsrc/mem128
The XMMdest operand serves as both the first source operand and the destination operand. The XMMsrc/mem128 operand is the second source operand.
The unpcklps instruction copies the two LO single-precision values from the source operand to bit positions 32 to 63 (dword 1) and 96 to 127 (dword 3). It leaves dword 0 in the destination operand alone and copies the value originally in dword 1 to dword 2 in the destination.

The unpckhps instruction copies the two HO single-precision values from the two sources to the destination register.

The unpcklpd and unpckhpd instructions do the same thing as unpcklps and unpckhps except, of course, they operate on double-precision values rather than single-precision values.
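The four interleaving patterns can be sketched with a small Python model (lists stand in for the lanes; the first argument plays the role of XMMdest as a source):

```python
def unpcklps(a, b):
    """Interleave the two LO single-precision lanes of both sources."""
    return [a[0], b[0], a[1], b[1]]

def unpckhps(a, b):
    """Interleave the two HO single-precision lanes of both sources."""
    return [a[2], b[2], a[3], b[3]]

def unpcklpd(a, b):
    """Merge the LO double-precision lanes."""
    return [a[0], b[0]]

def unpckhpd(a, b):
    """Merge the HO double-precision lanes."""
    return [a[1], b[1]]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
assert unpcklps(a, b) == [1.0, 5.0, 2.0, 6.0]
assert unpckhps(a, b) == [3.0, 7.0, 4.0, 8.0]
```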


The vunpcklps, vunpckhps, vunpcklpd, and vunpckhpd instructions have the following syntax:
vunpcklps xmmdest, xmmsrc1, xmmsrc2/mem128
vunpckhps xmmdest, xmmsrc1, xmmsrc2/mem128
vunpcklpd xmmdest, xmmsrc1, xmmsrc2/mem128
vunpckhpd xmmdest, xmmsrc1, xmmsrc2/mem128
vunpcklps ymmdest, ymmsrc1, ymmsrc2/mem256
vunpckhps ymmdest, ymmsrc1, ymmsrc2/mem256
vunpcklpd ymmdest, ymmsrc1, ymmsrc2/mem256
vunpckhpd ymmdest, ymmsrc1, ymmsrc2/mem256
They work similarly to the non-v variants, with a couple of differences:
The AVX variants support using the YMM registers as well as the XMM registers.
The AVX variants require three operands. The first (destination) and second (source1) operands must be XMM or YMM registers. The third (source2) operand can be an XMM or a YMM register or a 128- or 256-bit memory location. The two-operand form is just a special case of the three-operand form, where the first and second operands specify the same register name.
The 128-bit variants zero out the HO bits of the YMM register rather than leaving those bits unchanged.
The interleaving extension to 256 bits happens in the intuitive way: vunpcklps interleaves the LO lanes of each 128-bit half of the two sources, and vunpckhps likewise interleaves the HO lanes of each 128-bit half.
The punpck* Instructions
The punpck* instructions provide a set of integer unpack instructions to complement the floating-point variants.

The punpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination SSE register. These instructions do not modify the upper 128 bits of any overlaid YMM register.
punpcklbw xmmdest, xmmsrc
punpcklbw xmmdest, memsrc
punpckhbw xmmdest, xmmsrc
punpckhbw xmmdest, memsrc
punpcklwd xmmdest, xmmsrc
punpcklwd xmmdest, memsrc
punpckhwd xmmdest, xmmsrc
punpckhwd xmmdest, memsrc
punpckldq xmmdest, xmmsrc
punpckldq xmmdest, memsrc
punpckhdq xmmdest, xmmsrc
punpckhdq xmmdest, memsrc
punpcklqdq xmmdest, xmmsrc
punpcklqdq xmmdest, memsrc
punpckhqdq xmmdest, xmmsrc
punpckhqdq xmmdest, memsrc



The 128-Bit vpunpck* Instructions
The vpunpck* instructions provide a set of AVX integer unpack instructions to complement the SSE punpck* variants.


The vpunpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination AVX or SSE register.
vpunpcklbw xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhbw xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklwd xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhwd xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckldq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhdq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklqdq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhqdq xmmdest, xmmsrc1, xmmsrc2/mem128
Functionally, the only difference between these AVX instructions (vpunpck*) and the SSE (punpck*) instructions is that the SSE variants leave the upper bits of the YMM AVX registers (bits 128 to 255) unchanged, whereas the AVX variants zero-extend the result to 256 bits.
The 256-Bit vpunpck* Instructions
The vpunpck* instructions also support the use of the AVX YMM registers, in which case the unpack-and-merge operation extends from 128 bits to 256 bits.
vpunpcklbw ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhbw ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklwd ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhwd ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckldq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhdq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklqdq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhqdq ymmdest, ymmsrc1, ymmsrc2/mem256
The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq Instructions
The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq instructions extract a byte, word, dword, or qword from a 128-bit XMM register and copy this data to a general-purpose register or memory location.
pextrb reg32, xmmsrc, imm8 ; imm8 = 0 to 15
pextrb reg64, xmmsrc, imm8 ; imm8 = 0 to 15
pextrb mem8, xmmsrc, imm8 ; imm8 = 0 to 15
vpextrb reg32, xmmsrc, imm8 ; imm8 = 0 to 15
vpextrb reg64, xmmsrc, imm8 ; imm8 = 0 to 15
vpextrb mem8, xmmsrc, imm8 ; imm8 = 0 to 15
pextrw reg32, xmmsrc, imm8 ; imm8 = 0 to 7
pextrw reg64, xmmsrc, imm8 ; imm8 = 0 to 7
pextrw mem16, xmmsrc, imm8 ; imm8 = 0 to 7
vpextrw reg32, xmmsrc, imm8 ; imm8 = 0 to 7
vpextrw reg64, xmmsrc, imm8 ; imm8 = 0 to 7
vpextrw mem16, xmmsrc, imm8 ; imm8 = 0 to 7
pextrd reg32, xmmsrc, imm8 ; imm8 = 0 to 3
pextrd mem32, xmmsrc, imm8 ; imm8 = 0 to 3
vpextrd reg32, xmmsrc, imm8 ; imm8 = 0 to 3
vpextrd mem32, xmmsrc, imm8 ; imm8 = 0 to 3
pextrq reg64, xmmsrc, imm8 ; imm8 = 0 to 1
pextrq mem64, xmmsrc, imm8 ; imm8 = 0 to 1
vpextrq reg64, xmmsrc, imm8 ; imm8 = 0 to 1
vpextrq mem64, xmmsrc, imm8 ; imm8 = 0 to 1
The byte and word instructions expect a 32- or 64-bit general-purpose register as their destination, or a memory location that is the same size as the data the instruction extracts.
The double-word and quad-word variants require a 32-bit or 64-bit general-purpose register, respectively. If the destination operand is a 32- or 64-bit general-purpose register, the instruction zero-extends the value to 32 or 64 bits, if necessary.
These instructions cannot extract data from the upper 128 bits of a YMM register.
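The extraction itself is just a shift and mask. As a behavioral sketch (treating the 128-bit register as a Python integer, lane 0 in the LO bits), pextrw works like this:

```python
def pextrw(xmm, index):
    """Simulate pextrw: extract word lane `index` (0 to 7) from a
    128-bit value and zero-extend it for a general-purpose register."""
    assert 0 <= index <= 7
    return (xmm >> (index * 16)) & 0xFFFF

x = 0x0123_4567_89AB_CDEF_0011_2233_4455_6677
print(hex(pextrw(x, 0)))   # LO word -> 0x6677
print(hex(pextrw(x, 7)))   # HO word -> 0x123
```

The byte, dword, and qword variants differ only in the lane width and the range of the imm8 index.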
The (v)pinsrb, (v)pinsrw, (v)pinsrd, and (v)pinsrq Instructions
The (v)pinsr{b,w,d,q} instructions take a byte, word, dword, or qword from a general-purpose register or memory location and store that data to a lane of an XMM register.
pinsrb xmmdest, reg32, imm8 ; imm8 = 0 to 15
pinsrb xmmdest, mem8, imm8 ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, mem8, imm8 ; imm8 = 0 to 15
pinsrw xmmdest, reg32, imm8 ; imm8 = 0 to 7
pinsrw xmmdest, mem16, imm8 ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, mem16, imm8 ; imm8 = 0 to 7
pinsrd xmmdest, reg32, imm8 ; imm8 = 0 to 3
pinsrd xmmdest, mem32, imm8 ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, mem32, imm8 ; imm8 = 0 to 3
pinsrq xmmdest, reg64, imm8 ; imm8 = 0 to 1
pinsrq xmmdest, mem64, imm8 ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, reg64, imm8 ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, mem64, imm8 ; imm8 = 0 to 1
These instructions fetch a byte, word, dword, or qword from the general-purpose register or memory location and copy it to the lane of the XMM register specified by the 8-bit immediate value. The pinsr{b,w,d,q} instructions leave any HO bits in the underlying YMM register unchanged.
The vpinsr{b,w,d,q} instructions copy the data from the XMM source register into the destination register and then copy the byte, word, dword, or qword to the specified lane of the destination register. These instructions zero-extend the value throughout the HO bits of the underlying YMM register.
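Insertion is the mirror image of extraction: clear the target lane, then OR in the new value. A Python sketch of pinsrw (again modeling the register as an integer, lane 0 in the LO bits):

```python
def pinsrw(xmm, value, index):
    """Simulate pinsrw: replace word lane `index` (0 to 7) of a 128-bit
    value with the LO 16 bits of `value`; other lanes are unchanged."""
    shift = index * 16
    mask = 0xFFFF << shift
    return (xmm & ~mask) | ((value & 0xFFFF) << shift)

x = pinsrw(0, 0x1234, 3)   # put 0x1234 into word lane 3
print(hex(x))              # -> 0x1234000000000000
```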
The (v)extractps and (v)insertps Instructions
The extractps and vextractps instructions are functionally equivalent to pextrd and vpextrd. They extract a 32-bit value from an XMM register and move it into a 32-bit general-purpose register or a 32-bit memory location.
extractps reg32, xmmsrc, imm8
extractps mem32, xmmsrc, imm8
vextractps reg32, xmmsrc, imm8
vextractps mem32, xmmsrc, imm8
The insertps and vinsertps instructions insert a 32-bit floating-point value into an XMM register and, optionally, zero out other lanes in the XMM register.
insertps xmmdest, xmmsrc, imm8
insertps xmmdest, mem32, imm8
vinsertps xmmdest, xmmsrc1, xmmsrc2, imm8
vinsertps xmmdest, xmmsrc1, mem32, imm8

On CPUs with the AVX extensions, insertps does not modify the upper bits of the YMM registers. The vinsertps instruction first copies the XMMsrc1 register to XMMdest before performing the insertion operation and then sets the HO bits of the corresponding YMM register to 0.
SIMD Arithmetic and Logical Operations
Logical Instructions

The syntax for these instructions is the following:
andpd xmmdest, xmmsrc/mem128
vandpd xmmdest, xmmsrc1, xmmsrc2/mem128
vandpd ymmdest, ymmsrc1, ymmsrc2/mem256
andnpd xmmdest, xmmsrc/mem128
vandnpd xmmdest, xmmsrc1, xmmsrc2/mem128
vandnpd ymmdest, ymmsrc1, ymmsrc2/mem256
orpd xmmdest, xmmsrc/mem128
vorpd xmmdest, xmmsrc1, xmmsrc2/mem128
vorpd ymmdest, ymmsrc1, ymmsrc2/mem256
xorpd xmmdest, xmmsrc/mem128
vxorpd xmmdest, xmmsrc1, xmmsrc2/mem128
vxorpd ymmdest, ymmsrc1, ymmsrc2/mem256
The SSE instructions (without the v prefix) leave the HO bits of the underlying YMM register unchanged (if applicable). The AVX instructions (with the v prefix) that have 128-bit operands will zero-extend their result into the HO bits of the YMM register.

The (v)ptest Instructions
The ptest instruction (packed test) is similar to the standard integer test instruction. The ptest instruction performs a logical AND between the two operands and sets the zero flag if the result is 0.
The ptest instruction sets the carry flag if the logical AND of the second operand with the inverted bits of the first operand produces 0.
ptest xmmsrc1, xmmsrc2/mem128
vptest xmmsrc1, xmmsrc2/mem128
vptest ymmsrc1, ymmsrc2/mem256
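The two flag results fall out of two AND operations. This Python sketch models the flag computation on 128-bit values (a behavioral model only; the real instruction sets EFLAGS directly):

```python
def ptest(src1, src2):
    """Simulate the ptest flag results on 128-bit integers:
    ZF = 1 if (src1 AND src2) == 0
    CF = 1 if (src2 AND NOT src1) == 0"""
    mask = (1 << 128) - 1
    zf = int((src1 & src2) == 0)
    cf = int((src2 & ~src1 & mask) == 0)
    return zf, cf

print(ptest(0x0F, 0xF0))              # no common bits -> ZF = 1, CF = 0
print(ptest((1 << 128) - 1, 0x1234))  # src2's bits all in src1 -> CF = 1
```

A common idiom is vptest followed by jc: the carry flag tells you that every 1 bit in the second operand is also set in the first.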
The Byte Shift Instructions
The SSE and AVX instruction set extensions also support a set of logical and arithmetic shift instructions. The first two to consider are pslldq and psrldq.
pslldq xmmdest, imm8
vpslldq xmmdest, xmmsrc, imm8
vpslldq ymmdest, ymmsrc, imm8
psrldq xmmdest, imm8
vpsrldq xmmdest, xmmsrc, imm8
vpsrldq ymmdest, ymmsrc, imm8
The pslldq instruction shifts its destination XMM register to the left by the number of bytes specified by the imm8 operand. This instruction shifts 0s into the vacated LO bytes.
The vpslldq instruction takes the value in the source register (XMM or YMM), shifts that value to the left by imm8 bytes, and then stores the result into the destination register. For the 128-bit variant, this instruction zero-extends the result into bits 128 to 255 of the underlying YMM register.
The psrldq and vpsrldq instructions operate similarly to (v)pslldq except, of course, they shift their operands to the right rather than to the left. These are logical shift-right operations, so they shift 0s into the HO bytes of their operand, and bits shifted out of bit 0 are lost.
Note that the pslldq and psrldq instructions shift bytes rather than bits. This is often what you want: many SSE instructions produce byte masks of 0 or 0FFh to represent Boolean results, and shifting whole bytes at a time moves the equivalent of one bit in such a mask.
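Because the shift count is in bytes, the model multiplies by 8 to get a bit count. A Python sketch of pslldq on a 128-bit value:

```python
def pslldq(xmm, count):
    """Simulate pslldq: shift a 128-bit value left by `count` BYTES,
    filling the vacated LO bytes with zeros; bits shifted out are lost."""
    mask = (1 << 128) - 1
    if count > 15:
        return 0                      # shifting 16+ bytes clears the register
    return (xmm << (count * 8)) & mask

print(hex(pslldq(0xFF, 1)))           # -> 0xff00
```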
The Bit Shift Instructions
These instructions provide word, dword, and qword variants of the logical shift-left, logical shift-right, and arithmetic shift-right operations.
shift xmmdest, imm8
shift xmmdest, xmmsrc/mem128
vshift xmmdest, xmmsrc, imm8
vshift xmmdest, xmmsrc, mem128
vshift ymmdest, ymmsrc, imm8
vshift ymmdest, ymmsrc, xmm/mem128
shift = psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, or psrad
vshift = vpsllw, vpslld, vpsllq, vpsrlw, vpsrld, vpsrlq, vpsraw, vpsrad, or vpsraq
The (v)psl* instructions shift their operands to the left; the (v)psr* instructions shift their operands to the right.
The (v)psll* and (v)psrl* instructions are logical shift instructions and shift 0s into the bits vacated by the shift. Any bits shifted out of the operand are lost.
The (v)psra* instructions are arithmetic shift-right instructions. They replicate the HO bit in each lane when shifting that lane’s bits to the right; all bits shifted out of the LO bit are lost.
The w suffix instructions shift 16-bit operands (eight lanes for 128-bit destination operands, sixteen lanes for 256-bit destinations).
The d suffix instructions shift 32-bit dword operands (four lanes for 128-bit destination operands, eight lanes for 256-bit destination operands).
The q suffix instructions shift 64-bit operands (two lanes for 128-bit operands, four lanes for 256-bit operands).
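The arithmetic variant is the one with a subtlety: the sign bit is replicated into the vacated HO bits of each lane. A Python sketch of psraw on signed word lanes (relying on the fact that Python's >> on negative integers is already arithmetic):

```python
def psraw(lanes, count):
    """Simulate psraw: arithmetic shift right of each signed 16-bit lane;
    the sign bit is replicated into the vacated HO bits."""
    def sar16(v, n):
        n = min(n, 15)        # counts of 16+ fill the lane with the sign bit
        return v >> n         # Python's >> on ints is an arithmetic shift
    return [sar16(v, count) for v in lanes]

print(psraw([-32768, 16, -2, 7], 2))   # -> [-8192, 4, -1, 1]
```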
SIMD Integer Addition


These addition instructions are known as vertical additions because if we stack the two source operands on top of each other, the lane additions occur vertically.
The packed additions ignore any overflow from the addition operation, keeping only the LO byte, word, dword, or qword of each sum. If overflow can never occur, this is not an issue.
A cleaner solution is to use saturation arithmetic. For unsigned addition, saturation arithmetic clips (or saturates) an overflow to the largest possible value that the instruction’s size can handle. For signed saturation arithmetic, clipping occurs at the largest positive and smallest negative values (for example, 7Fh/+127 for positive values and 80h/–128 for negative values).
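The clipping behavior is simple to state as code. This Python sketch models the signed and unsigned saturating byte additions (paddsb and paddusb), treating each lane as a plain integer:

```python
def paddsb(a, b):
    """Simulate paddsb: lane-wise signed byte addition, saturating the
    result to the range -128..+127 instead of wrapping around."""
    return [max(-128, min(127, x + y)) for x, y in zip(a, b)]

def paddusb(a, b):
    """Simulate paddusb: lane-wise unsigned byte addition, saturating
    the result to the range 0..255."""
    return [min(255, x + y) for x, y in zip(a, b)]

print(paddsb([100, -100, 1], [100, -100, 1]))   # -> [127, -128, 2]
print(paddusb([200, 1], [200, 1]))              # -> [255, 2]
```

Saturation matches what you usually want in signal and image processing: a pixel that overflows stays white rather than wrapping around to black.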


The padd* instructions leave the HO bits of any corresponding YMM destination undisturbed; the vpadd* variants clear the HO bits.
Horizontal Additions
The SSE/AVX instruction sets also support three horizontal addition instructions. The horizontal addition instructions add adjacent words or dwords in their two source operands and store each sum into a destination lane.


The phaddw instruction does not affect the HO bits of the corresponding YMM destination register, only the LO 128 bits. It has the following syntax:
phaddw xmmdest, xmmsrc/mem128
; Computes the following
temp[0 to 15] = xmmdest[0 to 15] + xmmdest[16 to 31]
temp[16 to 31] = xmmdest[32 to 47] + xmmdest[48 to 63]
temp[32 to 47] = xmmdest[64 to 79] + xmmdest[80 to 95]
temp[48 to 63] = xmmdest[96 to 111] + xmmdest[112 to 127]
temp[64 to 79] = xmmsrc/mem128[0 to 15] + xmmsrc/mem128[16 to 31]
temp[80 to 95] = xmmsrc/mem128[32 to 47] + xmmsrc/mem128[48 to 63]
temp[96 to 111] = xmmsrc/mem128[64 to 79] + xmmsrc/mem128[80 to 95]
temp[112 to 127] = xmmsrc/mem128[96 to 111] + xmmsrc/mem128[112 to 127]
xmmdest = temp
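The pseudocode above can be condensed into a short Python model (lane 0 first, keeping only the LO 16 bits of each sum, as the instruction does):

```python
def phaddw(dest, src):
    """Simulate phaddw on word lanes (lane 0 first): sums of adjacent
    pairs of dest fill lanes 0-3; pairs of src fill lanes 4-7.
    Only the LO 16 bits of each sum survive."""
    pairs = [dest[i] + dest[i + 1] for i in range(0, 8, 2)] + \
            [src[i] + src[i + 1] for i in range(0, 8, 2)]
    return [s & 0xFFFF for s in pairs]

print(phaddw([1, 2, 3, 4, 5, 6, 7, 8],
             [10, 20, 30, 40, 50, 60, 70, 80]))
# -> [3, 7, 11, 15, 30, 70, 110, 150]
```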
The 128-bit vphaddw instruction zeroes out the HO 128 bits of the corresponding YMM destination register. It has the following syntax:
vphaddw xmmdest, xmmsrc1, xmmsrc2/mem128
; computes the following
xmmdest[0 to 15] = xmmsrc1[0 to 15] + xmmsrc1[16 to 31]
xmmdest[16 to 31] = xmmsrc1[32 to 47] + xmmsrc1[48 to 63]
xmmdest[32 to 47] = xmmsrc1[64 to 79] + xmmsrc1[80 to 95]
xmmdest[48 to 63] = xmmsrc1[96 to 111] + xmmsrc1[112 to 127]
xmmdest[64 to 79] = xmmsrc2/mem128[0 to 15] + xmmsrc2/mem128[16 to 31]
xmmdest[80 to 95] = xmmsrc2/mem128[32 to 47] + xmmsrc2/mem128[48 to 63]
xmmdest[96 to 111] = xmmsrc2/mem128[64 to 79] + xmmsrc2/mem128[80 to 95]
xmmdest[112 to 127] = xmmsrc2/mem128[96 to 111] + xmmsrc2/mem128[112 to 127]
vphaddw does not simply extend the 128-bit version in the intuitive way. Instead, it mixes up computations as follows (where SRC1 is YMMsrc1 and SRC2 is YMMsrc2/mem256):
vphaddw ymmdest, ymmsrc1, ymmsrc2/mem256
; compute the following
ymmdest[0 to 15] = SRC1[16 to 31] + SRC1[0 to 15]
ymmdest[16 to 31] = SRC1[48 to 63] + SRC1[32 to 47]
ymmdest[32 to 47] = SRC1[80 to 95] + SRC1[64 to 79]
ymmdest[48 to 63] = SRC1[112 to 127] + SRC1[96 to 111]
ymmdest[64 to 79] = SRC2[16 to 31] + SRC2[0 to 15]
ymmdest[80 to 95] = SRC2[48 to 63] + SRC2[32 to 47]
ymmdest[96 to 111] = SRC2[80 to 95] + SRC2[64 to 79]
ymmdest[112 to 127] = SRC2[112 to 127] + SRC2[96 to 111]
ymmdest[128 to 143] = SRC1[144 to 159] + SRC1[128 to 143]
ymmdest[144 to 159] = SRC1[176 to 191] + SRC1[160 to 175]
ymmdest[160 to 175] = SRC1[208 to 223] + SRC1[192 to 207]
ymmdest[176 to 191] = SRC1[240 to 255] + SRC1[224 to 239]
ymmdest[192 to 207] = SRC2[144 to 159] + SRC2[128 to 143]
ymmdest[208 to 223] = SRC2[176 to 191] + SRC2[160 to 175]
ymmdest[224 to 239] = SRC2[208 to 223] + SRC2[192 to 207]
ymmdest[240 to 255] = SRC2[240 to 255] + SRC2[224 to 239]
Double-Word–Sized Horizontal Additions
The phaddd instruction has the following syntax:
phaddd xmmdest, xmmsrc/mem128
; computes the following
temp[0 to 31] = xmmdest[0 to 31] + xmmdest[32 to 63]
temp[32 to 63] = xmmdest[64 to 95] + xmmdest[96 to 127]
temp[64 to 95] = xmmsrc/mem128[0 to 31] + xmmsrc/mem128[32 to 63]
temp[96 to 127] = xmmsrc/mem128[64 to 95] + xmmsrc/mem128[96 to 127]
xmmdest = temp
The 128-bit vphaddd instruction has this syntax:
vphaddd xmmdest, xmmsrc1, xmmsrc2/mem128
; computes the following
xmmdest[0 to 31] = xmmsrc1[0 to 31] + xmmsrc1[32 to 63]
xmmdest[32 to 63] = xmmsrc1[64 to 95] + xmmsrc1[96 to 127]
xmmdest[64 to 95] = xmmsrc2/mem128[0 to 31] + xmmsrc2/mem128[32 to 63]
xmmdest[96 to 127] = xmmsrc2/mem128[64 to 95] + xmmsrc2/mem128[96 to 127]
(ymmdest[128 to 255] = 0)
The 256-bit vphaddd instruction has the following syntax:
vphaddd ymmdest, ymmsrc1, ymmsrc2/mem256
; computes the following
ymmdest[0 to 31] = ymmsrc1[32 to 63] + ymmsrc1[0 to 31]
ymmdest[32 to 63] = ymmsrc1[96 to 127] + ymmsrc1[64 to 95]
ymmdest[64 to 95] = ymmsrc2/mem256[32 to 63] + ymmsrc2/mem256[0 to 31]
ymmdest[96 to 127] = ymmsrc2/mem256[96 to 127] + ymmsrc2/mem256[64 to 95]
ymmdest[128 to 159] = ymmsrc1[160 to 191] + ymmsrc1[128 to 159]
ymmdest[160 to 191] = ymmsrc1[224 to 255] + ymmsrc1[192 to 223]
ymmdest[192 to 223] = ymmsrc2/mem256[160 to 191] + ymmsrc2/mem256[128 to 159]
ymmdest[224 to 255] = ymmsrc2/mem256[224 to 255] + ymmsrc2/mem256[192 to 223]
The (v)phaddsw instructions take the following forms:
phaddsw xmmdest, xmmsrc/mem128
vphaddsw xmmdest, xmmsrc1, xmmsrc2/mem128
vphaddsw ymmdest, ymmsrc1, ymmsrc2/mem256
The (v)phaddsw instruction (horizontal signed integer add with saturate, word) is a slightly different form of (v)phaddw: rather than storing only the LO bits into the result in the destination lane, this instruction saturates the result.
SIMD Integer Subtraction
Like the SIMD addition instructions, the SIMD integer subtraction instructions do not affect any flags; any carry, borrow, overflow, or underflow information is lost. These instructions subtract the second source operand from the first source operand and store the result into the destination operand.

Likewise, there is a set of signed and unsigned byte and word saturating subtraction instructions. For the signed instructions, the byte-sized instructions saturate positive overflow to 7Fh (+127) and negative underflow to 80h (–128). The word-sized instructions saturate to 7FFFh (+32,767) and 8000h (–32,768). The unsigned byte instructions saturate to 0FFh (+255) and 0; the unsigned word instructions saturate to 0FFFFh (+65,535) and 0.

SIMD Integer Multiplication

The (v)pmullw instruction multiplies the 16-bit values appearing in corresponding lanes of its source operands and stores the LO word of each product into the corresponding destination lane. Because the LO word of a product is the same whether the operands are signed or unsigned, this instruction is applicable to both.
The (v)pmulhw instruction computes the product of two signed word values and stores the HO word of the result into the destination lanes.
For unsigned operands, (v)pmulhuw performs the same task. By executing both (v)pmullw and (v)pmulh(u)w with the same operands, you can compute the full 32-bit result of a 16×16-bit multiplication.
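The split into LO and HO halves is easy to demonstrate. This Python sketch models the pmullw/pmulhw pair on signed word lanes and reassembles the full 32-bit product:

```python
def full_products(a, b):
    """Model the pmullw/pmulhw pair: for each pair of signed 16-bit
    lanes, compute the LO word, the HO word, and show that
    (hi << 16) | lo reassembles the full 32-bit two's-complement product."""
    out = []
    for x, y in zip(a, b):
        p = x * y
        lo = p & 0xFFFF               # what pmullw keeps
        hi = (p >> 16) & 0xFFFF       # what pmulhw keeps
        full = (hi << 16) | lo        # full 32-bit product
        out.append((lo, hi, full))
    return out

for lo, hi, full in full_products([1000, -3], [1000, 7]):
    print(hex(lo), hex(hi), hex(full))
```

Note how the negative product (-3 × 7 = -21) comes out as 0xFFFFFFEB, the 32-bit two's-complement encoding of -21.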

At some point along the way, Intel introduced (v)pmuldq and (v)pmuludq to perform signed and unsigned 32×32-bit multiplications, producing a 64-bit result.
pmuldq xmmdest, xmm/mem128
vpmuldq xmmdest, xmmsrc1, xmm/mem128
vpmuldq ymmdest, ymmsrc1, ymm/mem256
pmuludq xmmdest, xmm/mem128
vpmuludq xmmdest, xmmsrc1, xmm/mem128
vpmuludq ymmdest, ymmsrc1, ymm/mem256
The 128-bit variants multiply the double words appearing in lanes 0 and 2 and store the 64-bit results into qword lanes 0 and 1. On CPUs with AVX registers, pmuldq and pmuludq do not affect the HO 128 bits of the YMM register. The vpmuldq and vpmuludq instructions zero-extend the result to 256 bits.
The 256-bit variants multiply the double words appearing in lanes 0, 2, 4, and 6, producing 64-bit results that they store in qword lanes 0, 1, 2, and 3
The pclmulqdq instruction multiplies two qword values using a carry-less (polynomial) multiplication, producing a 128-bit result.
pclmulqdq xmmdest, xmm/mem128, imm8
vpclmulqdq xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
These instructions multiply a pair of qword values found in XMMdest and XMMsrc and leave the 128-bit result in XMMdest. The imm8 operand specifies which qwords to use as the source operands.

As usual, pclmulqdq leaves the HO 128 bits of the corresponding YMM destination register unchanged, while vpclmulqdq zeroes those bits.
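A carry-less multiply is an ordinary shift-and-add multiply with the adds replaced by XORs, which is multiplication of polynomials over GF(2). This Python sketch shows the idea (a model of the operation, not of the instruction's qword selection via imm8):

```python
def clmul(a, b):
    """Carry-less (polynomial) multiply, as performed by pclmulqdq:
    shift-and-XOR instead of shift-and-add, so no carries propagate."""
    result = 0
    while b:
        if b & 1:          # for each set bit of b...
            result ^= a    # ...XOR in a shifted copy of a
        a <<= 1
        b >>= 1
    return result

print(hex(clmul(0b101, 0b11)))   # (x^2+1)(x+1) = x^3+x^2+x+1 -> 0xf
```

This operation is the workhorse of CRC computation and GF(2^128) multiplication in AES-GCM, which is why Intel added it.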
SIMD Integer Averages
The (v)pavgb and (v)pavgw instructions compute the average of two sets of bytes or words. These instructions sum the value in the byte or word lanes of their source and destination operands, divide the result by 2, round the results, and leave the averaged results sitting in the destination operand lanes.
pavgb xmmdest, xmm/mem128
vpavgb xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgb ymmdest, ymmsrc1, ymmsrc2/mem256
pavgw xmmdest, xmm/mem128
vpavgw xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgw ymmdest, ymmsrc1, ymmsrc2/mem256
The 128-bit pavgb and vpavgb instructions compute 16 byte-sized averages (for the 16 lanes in the source and destination operands). The 256-bit variant of the vpavgb instruction computes 32 byte-sized averages.
The 128-bit pavgw and vpavgw instructions compute eight word-sized averages (for the eight lanes in the source and destination operands). The 256-bit variant of the vpavgw instruction computes 16 word-sized averages.
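The rounding in these instructions is round-half-up, which you can model as adding 1 before the divide-by-2 shift. A Python sketch of pavgb:

```python
def pavgb(a, b):
    """Simulate pavgb: unsigned byte average with rounding, computed
    as (a + b + 1) >> 1 so that halves round up."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]

print(pavgb([0, 1, 254], [1, 2, 255]))   # -> [1, 2, 255]
```

Because the intermediate sum is computed with an extra bit of precision, the average never overflows a byte.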
SIMD Integer Minimum and Maximum
These instructions scan the lanes of a pair of 128- or 256-bit operands and copy the maximum or minimum value from that lane to the same lane in the destination operand.


The general syntax is as follows:
pmxxyz xmmdest, xmmsrc/mem128
vpmxxyz xmmdest, xmmsrc1, xmmsrc2/mem128
vpmxxyz ymmdest, ymmsrc1, ymmsrc2/mem256
SIMD Integer Absolute Value
The SSE/AVX instruction set extensions provide three sets of instructions for computing the absolute values of signed byte, word, and double-word integers: (v)pabsb, (v)pabsw, and (v)pabsd.
pabsb xmmdest, xmmsrc/mem128
vpabsb xmmdest, xmmsrc/mem128
vpabsb ymmdest, ymmsrc/mem256
pabsw xmmdest, xmmsrc/mem128
vpabsw xmmdest, xmmsrc/mem128
vpabsw ymmdest, ymmsrc/mem256
pabsd xmmdest, xmmsrc/mem128
vpabsd xmmdest, xmmsrc/mem128
vpabsd ymmdest, ymmsrc/mem256
SIMD Integer Sign Adjustment Instructions
The (v)psignb, (v)psignw, and (v)psignd instructions apply the sign found in each source lane to the corresponding destination lane: if the source lane is negative, the instruction negates the value in the destination lane; if it is 0, it zeroes the destination lane; and if it is positive, it leaves the destination lane unchanged.
psignb xmmdest, xmmsrc/mem128
vpsignb xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignb ymmdest, ymmsrc1, ymmsrc2/mem256
psignw xmmdest, xmmsrc/mem128
vpsignw xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignw ymmdest, ymmsrc1, ymmsrc2/mem256
psignd xmmdest, xmmsrc/mem128
vpsignd xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignd ymmdest, ymmsrc1, ymmsrc2/mem256
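The three-way sign rule is compact as code. A Python sketch of psignb's per-lane behavior:

```python
def psignb(dest, src):
    """Simulate psignb: negate a dest lane when the matching src lane
    is negative, zero it when src is zero, keep it when src is positive."""
    out = []
    for d, s in zip(dest, src):
        if s < 0:
            out.append(-d)
        elif s == 0:
            out.append(0)
        else:
            out.append(d)
    return out

print(psignb([5, 5, 5, -7], [-1, 0, 9, -1]))   # -> [-5, 0, 5, 7]
```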
SSE Compare-for-Equality Instructions
pcmpeqb xmmdest, xmmsrc/mem128 ; Compares 16 bytes
pcmpeqw xmmdest, xmmsrc/mem128 ; Compares 8 words
pcmpeqd xmmdest, xmmsrc/mem128 ; Compares 4 dwords
pcmpeqq xmmdest, xmmsrc/mem128 ; Compares 2 qwords
; computes -
; xmmdest[lane] = xmmdest[lane] == xmmsrc/mem128[lane]
; where lane varies from 0 to 15 for pcmpeqb, 0 to 7 for pcmpeqw, 0 to 3 for pcmpeqd,
; and 0 to 1 for pcmpeqq.
The == operator produces a value of all 1 bits if the two values in the same lane are equal; it produces all 0 bits if the values are not equal.
SSE Compare-for-Greater-Than Instructions
pcmpgtb xmmdest, xmmsrc/mem128 ; Compares 16 bytes
pcmpgtw xmmdest, xmmsrc/mem128 ; Compares 8 words
pcmpgtd xmmdest, xmmsrc/mem128 ; Compares 4 dwords
pcmpgtq xmmdest, xmmsrc/mem128 ; Compares 2 qwords
; computes -
; xmmdest[lane] = xmmdest[lane] > xmmsrc/mem128[lane]
; where lane is the same as for the compare-for-equality instructions and
; the > operator produces a value of all 1 bits if the signed integer in
; the XMMdest lane is greater than the signed value in the corresponding
; XMMsrc/MEM128 lane.
AVX Comparison Instructions
The 128-bit variants of these instructions have the following syntax:
vpcmpeqb xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 16 bytes
vpcmpeqw xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 8 words
vpcmpeqd xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 4 dwords
vpcmpeqq xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 2 qwords
vpcmpgtb xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 16 bytes
vpcmpgtw xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 8 words
vpcmpgtd xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 4 dwords
vpcmpgtq xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 2 qwords
; computes -
; xmmdest[lane] = xmmsrc1[lane] == xmmsrc2/mem128[lane]
; xmmdest[lane] = xmmsrc1[lane] > xmmsrc2/mem128[lane]
The 256-bit variants of these instructions have the following syntax:
vpcmpeqb ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 32 bytes
vpcmpeqw ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 16 words
vpcmpeqd ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 8 dwords
vpcmpeqq ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 4 qwords
vpcmpgtb ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 32 bytes
vpcmpgtw ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 16 words
vpcmpgtd ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 8 dwords
vpcmpgtq ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 4 qwords
; computes -
; ymmdest[lane] = ymmsrc1[lane] == ymmsrc2/mem256[lane]
; ymmdest[lane] = ymmsrc1[lane] > ymmsrc2/mem256[lane]
The (v)pmovmskb Instructions
The (v)pmovmskb instruction extracts the HO bit from each of the bytes in an XMM or YMM register and stores the resulting 16 or 32 bits (respectively) into a general-purpose register. These instructions set all HO bits of the general-purpose register to 0.
pmovmskb reg, xmmsrc
vpmovmskb reg, xmmsrc
vpmovmskb reg, ymmsrc
The 128-bit instructions fill only bits 0 through 15 of the destination register (zeroing out all other bits). The 256-bit form of the vpmovmskb instruction fills bits 0 through 31 of the destination register (zeroing out HO bits if you specify a 64-bit register).
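Compare instructions and pmovmskb work together: the compare produces 0FFh/0 byte masks, and pmovmskb condenses them into a bitmask you can test with ordinary integer code. A Python sketch of the pair:

```python
def pcmpeqb(a, b):
    """Simulate pcmpeqb: equal byte lanes become 0xFF, unequal 0x00."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]

def pmovmskb(lanes):
    """Simulate pmovmskb: gather the HO bit of each byte lane into an
    integer mask (lane 0 -> bit 0); all other destination bits are 0."""
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= ((lane >> 7) & 1) << i
    return mask

eq = pcmpeqb([1, 2, 3, 4], [1, 0, 3, 0])
print(bin(pmovmskb(eq)))   # lanes 0 and 2 match -> 0b101
```

This is the core of many SIMD string-search routines: compare 16 bytes at once, take the mask, then use a bit-scan instruction to find the first match.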
Integer Conversions
Packed Zero-Extension Instructions


Packed Sign-Extension Instructions


Packed Sign Extension with Saturation

The saturate operation checks its operand to see if the value exceeds the range of the result (–128 to +127 for signed bytes, 0 to 255 for unsigned bytes, –32,768 to +32,767 for signed words, and 0 to 65,535 for unsigned words).
When saturating to a byte, if the signed source value is less than –128, byte saturation sets the value to –128. When saturating to a word, if the signed source value is less than –32,768, signed saturation sets the value to –32,768. Similarly, if a signed byte or word value exceeds +127 or +32,767, then saturation replaces the value with +127 or +32,767, respectively.
For unsigned operations, saturation limits the value to +255 (for bytes) or +65,535 (for words). Unsigned values are never less than 0, so unsigned saturation clips values to only +255 or +65,535.
AVX-capable CPUs provide 128-bit variants of these instructions that support three operands: two source operands and an independent destination operand.
vpacksswb xmmdest, xmmsrc1, xmmsrc2/mem128
vpackuswb xmmdest, xmmsrc1, xmmsrc2/mem128
vpackssdw xmmdest, xmmsrc1, xmmsrc2/mem128
vpackusdw xmmdest, xmmsrc1, xmmsrc2/mem128
Also, the SSE instructions do not modify the upper bits of the YMM register (if present on the CPU), whereas the AVX instructions store 0s into the upper YMM register bits.
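The pack operation narrows two registers' worth of words into one register of bytes, clipping each value on the way. A Python sketch of packsswb (lane 0 first; the first operand's words fill the LO byte lanes):

```python
def packsswb(dest_words, src_words):
    """Simulate packsswb: narrow signed words to signed bytes with
    saturation; dest's 8 words fill byte lanes 0-7, src's fill 8-15."""
    def sat8(v):
        return max(-128, min(127, v))   # clip to signed byte range
    return [sat8(w) for w in dest_words] + [sat8(w) for w in src_words]

print(packsswb([300, -300, 5, 6, 7, 8, 9, 10],
               [0, 1, 2, 3, 4, 5, 6, 7]))
# -> [127, -128, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 6, 7]
```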

SIMD Floating-Point Instructions
Arithmetic Instructions
The 128-bit SSE packed floating-point instructions have the following generic syntax (where instr is one of the floating-point instructions in the table below):
instrps xmmdest, xmmsrc/mem128
instrpd xmmdest, xmmsrc/mem128


The packed single (*ps) instructions perform four single-precision floating-point operations simultaneously. The packed double (*pd) instructions perform two double-precision floating-point operations simultaneously. They compute:
xmmdest[lane] = xmmdest[lane] op xmmsrc/mem128[lane]
; where lane varies from 0 to 3 for packed single-precision instructions
; and from 0 to 1 for packed double-precision instructions. op represents the
; operation (such as addition or subtraction)
The 128-bit AVX packed floating-point instructions have this syntax:
vinstrps xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrpd xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrps xmmdest, xmmsrc/mem128 ; For monadic operations
vinstrpd xmmdest, xmmsrc/mem128 ; For monadic operations
; they compute
;
; xmmdest[lane] = xmmsrc1[lane] op xmmsrc2/mem128[lane]
;
; where op corresponds to the operation associated with the specific instruction
These 128-bit AVX instructions clear the HO bits of the underlying YMMdest register.
The 256-bit AVX packed floating-point instructions have this syntax:
vinstrps ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrpd ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrps ymmdest, ymmsrc/mem256 ; For monadic operations
vinstrpd ymmdest, ymmsrc/mem256 ; For monadic operations
; These instructions compute
;
; ymmdest[lane] = ymmsrc1[lane] op ymmsrc2/mem256[lane]
;
; where op corresponds to the operation associated with the specific instruction
The SSE/AVX instruction set extensions also include floating-point horizontal addition and subtraction instructions.
haddps xmmdest, xmmsrc/mem128
vhaddps xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddps ymmdest, ymmsrc1, ymmsrc2/mem256
haddpd xmmdest, xmmsrc/mem128
vhaddpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddpd ymmdest, ymmsrc1, ymmsrc2/mem256
hsubps xmmdest, xmmsrc/mem128
vhsubps xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubps ymmdest, ymmsrc1, ymmsrc2/mem256
hsubpd xmmdest, xmmsrc/mem128
vhsubpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubpd ymmdest, ymmsrc1, ymmsrc2/mem256
SSE and AVX Comparisons
There are two sets of basic floating-point comparisons: (v)cmpps, which compares a set of packed single-precision values, and (v)cmppd, which compares a set of packed double-precision values. These instructions use an imm8 operand whose value specifies the type of comparison.
cmpps xmmdest, xmmsrc/mem128, imm8
vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
cmppd xmmdest, xmmsrc/mem128, imm8
vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
The imm8 operand specifies the type of the comparison; there are 32 possible comparisons.

Instruction Synonyms
MASM supports the use of certain synonyms so you don’t have to memorize the 32 encodings.

AVX Extended Comparisons
The AVX versions of these instructions allow three operands: a destination XMM or YMM register, a source XMM or YMM register, and a second source XMM or YMM register or a 128-bit or 256-bit memory location.
vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
The 128-bit vcmpps instruction compares the four single-precision floating-point values in each lane of the XMMsrc1 register against the values in the corresponding XMMsrc2/mem128 lanes and stores the true (all 1 bits) or false (all 0 bits) result into the corresponding lane of the XMMdest register.
The 256-bit vcmpps instruction compares the eight single-precision floating-point values in each lane of the YMMsrc1 register against the values in the corresponding YMMsrc2/mem256 lanes and stores the true or false result into the corresponding lane of the YMMdest register.
The vcmppd instructions compare the double-precision values in the two lanes (128-bit version) or four lanes (256-bit version) and store the result into the corresponding lane of the destination register.
The AVX instructions provide synonyms that eliminate the need to memorize 32 imm8 values.


The vcmpfalse* instructions always set the destination lanes to false (0 bits), and the vcmptrue* instructions always set the destination lanes to true (1 bits).
The (v)movmskps, (v)movmskpd Instructions
The movmskps and movmskpd instructions extract the sign bits from their packed single- and double-precision floating-point source operands and store these bits into the LO 2, 4, or 8 bits of a general-purpose register, depending on the instruction and operand size.
movmskps reg, xmmsrc
movmskpd reg, xmmsrc
vmovmskps reg, ymmsrc
vmovmskpd reg, ymmsrc
The movmskps instruction extracts the sign bits from the four single-precision floating-point values in the XMM source register and copies these bits to the LO 4 bits of the destination register.
The movmskpd instruction copies the sign bits from the two double-precision floating-point values in the source XMM register to bits 0 and 1 of the destination register.
The vmovmskps instruction extracts the sign bits from the eight single-precision floating-point values in the YMM source register and copies these bits to the LO 8 bits of the destination register.

The vmovmskpd instruction copies the sign bits from the four double-precision floating-point values in the source YMM register to bits 0 to 3 of the destination register.

Floating-Point Conversion Instructions
The table below contains various SIMD floating-point conversion instructions:



Dynamically Testing for a CPU Feature
; Generic print procedure and dynamically
; selecting CPU features.
option casemap:none
nl = 10
; SSE4.2 feature flags (in ECX):
SSE42 = 00180000h ;Bits 19 & 20
AVXSupport = 10000000h ;Bit 28
; CPUID bits (EAX=7, EBX register)
AVX2Support = 20h ;Bit 5 = AVX2
.data
align qword
print qword choosePrint ;Pointer to print function
; Floating-point values for testing purposes:
fp1 real8 1.0
fp2 real8 2.0
fp3 real8 3.0
fp4 real8 4.0
fp5 real8 5.0
.code
externdef printf:proc
;***************************************************************
;
; print-
;
; "Quick" form of printf that allows the format string to
; follow the call in the code stream. Supports up to five
; additional parameters in RDX, R8, R9, R10, and R11.
;
; This function saves all the Microsoft-ABI volatile,
; parameter, and return result registers so that code
; can call it without worrying about any registers being
; modified (this code assumes that Windows ABI treats
; YMM4..YMM15 as non-volatile).
;
; Of course, this code assumes that AVX instructions are
; available on the CPU.
; Allows up to 5 arguments in:
;
; RDX- Arg #1
; R8- Arg #2
; R9- Arg #3
; R10- Arg #4
; R11- Arg #5
;
; Note that you must pass floating-point values in
; these registers, as well. The printf function
; expects real values in the integer registers.
;
;
; There are two versions of this function, one that
; will run on CPUs without AVX capabilities (no YMM
; registers) and one that will run on CPUs that
; have AVX capabilities (YMM registers). The difference
; between the two is which registers they preserve
; (print_SSE only preserves XMM registers and will
; run properly on CPUs that don't have YMM register
; support; print_AVX will preserve the volatile YMM
; registers on CPUs with AVX support);
; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:
choosePrint proc
push rax ;Preserve registers that get
push rbx ; tweaked by CPUID
push rcx
push rdx
mov eax, 1
cpuid
test ecx, AVXSupport ;Test bit 28 for AVX
jnz doAVXPrint
lea rax, print_SSE ;From now on, call
mov print, rax ;print_SSE directly
; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.
pop rdx
pop rcx
pop rbx
pop rax
jmp print_SSE
doAVXPrint: lea rax, print_AVX ;From now on, call
mov print, rax ;print_AVX directly
; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.
pop rdx
pop rcx
pop rbx
pop rax
jmp print_AVX
choosePrint endp
; Version of print that will preserve volatile
; AVX registers (YMM0..YMM3):
print_AVX proc
; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):
push rax
push rbx
push rcx
push rdx
push r8
push r9
push r10
push r11
; YMM0..YMM3 are considered volatile, so preserve them
sub rsp, 128
vmovdqu ymmword ptr [rsp+000], ymm0
vmovdqu ymmword ptr [rsp+032], ymm1
vmovdqu ymmword ptr [rsp+064], ymm2
vmovdqu ymmword ptr [rsp+096], ymm3
push rbp
returnAdrs textequ <[rbp+200]>
mov rbp, rsp
sub rsp, 128
and rsp, -16
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:
mov rcx, returnAdrs
; To handle more than three arguments (four counting
; RCX), data must be passed on the stack. However, the
; caller of print cannot place arguments on the stack
; (the format string follows the call), so use R10 and
; R11 as extra parameters (these registers could just
; contain junk, but pass them along in case):
mov [rsp+32], r10
mov [rsp+40], r11
call printf
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
mov rcx, returnAdrs
dec rcx
skipTo0: inc rcx
cmp byte ptr [rcx], 0
jne skipTo0
inc rcx
mov returnAdrs, rcx
leave
vmovdqu ymm0, ymmword ptr [rsp+000]
vmovdqu ymm1, ymmword ptr [rsp+032]
vmovdqu ymm2, ymmword ptr [rsp+064]
vmovdqu ymm3, ymmword ptr [rsp+096]
add rsp, 128
pop r11
pop r10
pop r9
pop r8
pop rdx
pop rcx
pop rbx
pop rax
ret
print_AVX endp
; Version that will run on CPUs without
; AVX support and will preserve the
; volatile SSE registers (XMM0..XMM3):
print_SSE proc
; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):
push rax
push rbx
push rcx
push rdx
push r8
push r9
push r10
push r11
; XMM0..XMM3 are considered volatile, so preserve them
sub rsp, 64
movdqu xmmword ptr [rsp+00], xmm0
movdqu xmmword ptr [rsp+16], xmm1
movdqu xmmword ptr [rsp+32], xmm2
movdqu xmmword ptr [rsp+48], xmm3
push rbp
returnAdrs textequ <[rbp+136]>
mov rbp, rsp
sub rsp, 128
and rsp, -16
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:
mov rcx, returnAdrs
; To handle more than three arguments (four counting
; RCX), data must be passed on the stack. However, the
; caller of print cannot place arguments on the stack
; (the format string follows the call), so use R10 and
; R11 as extra parameters (these registers could just
; contain junk, but pass them along in case):
mov [rsp+32], r10
mov [rsp+40], r11
call printf
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
mov rcx, returnAdrs
dec rcx
skipTo0: inc rcx
cmp byte ptr [rcx], 0
jne skipTo0
inc rcx
mov returnAdrs, rcx
leave
movdqu xmm0, xmmword ptr [rsp+00]
movdqu xmm1, xmmword ptr [rsp+16]
movdqu xmm2, xmmword ptr [rsp+32]
movdqu xmm3, xmmword ptr [rsp+48]
add rsp, 64
pop r11
pop r10
pop r9
pop r8
pop rdx
pop rcx
pop rbx
pop rax
ret
print_SSE endp
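; The print pointer that client code calls is assumed
; to be declared in the .data section, initialized with
; the address of choosePrint so that the very first
; call runs the CPU test; a sketch of that declaration:
;
;          .data
; print    qword  choosePrint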
;***************************************************************
;
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbx
push rsi
push rdi
push rbp
mov rbp, rsp
sub rsp, 56 ;Shadow storage
; Trivial example, no arguments:
call print
byte "Hello, World!", nl, 0
; Simple example with integer arguments:
mov rdx, 1 ;Argument #1 for printf
mov r8, 2 ;Argument #2 for printf
mov r9, 3 ;Argument #3 for printf
mov r10, 4 ;Argument #4 for printf
mov r11, 5 ;Argument #5 for printf
call print
byte "Arg 1=%d, Arg2=%d, Arg3=%d "
byte "Arg 4=%d, Arg5=%d", nl, 0
; Demonstration of floating-point operands. Note that
; this function expects args 1, 2, and 3 in RDX, R8,
; and R9 (even though the Microsoft ABI would normally
; pass real values in XMM registers). Parameters 4 and
; 5 must be loaded into R10 and R11.
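; The fp1..fp5 operands are assumed to be real8 values
; declared in the .data section; a sketch of those
; declarations, with hypothetical values:
;
;          .data
; fp1      real8  1.0
; fp2      real8  2.0
; fp3      real8  3.0
; fp4      real8  4.0
; fp5      real8  5.0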
mov rdx, qword ptr fp1
mov r8, qword ptr fp2
mov r9, qword ptr fp3
mov r10, qword ptr fp4
mov r11, qword ptr fp5
call print
byte "Arg1=%6.1f, Arg2=%6.1f, Arg3=%6.1f "
byte "Arg4=%6.1f, Arg5=%6.1f ", nl, 0
allDone: leave
pop rdi
pop rsi
pop rbx
ret ;Returns to caller
asmMain endp
end