SIMD Instructions

Chapter 11

  • This chapter discusses the vector instructions on the x86-64. This special class of instructions provides parallel processing, traditionally known as single-instruction, multiple-data (SIMD) instructions because, quite literally, a single instruction operates on several pieces of data concurrently.

  • As a result of this concurrency, SIMD instructions can often execute several times faster (in theory, as much as 32 to 64 times faster) than the comparable single-instruction, single-data (SISD), or scalar, instructions that compose the standard x86-64 instruction set.

  • The x86-64 actually provides three sets of vector instructions: the Multimedia Extensions (MMX) instruction set, the Streaming SIMD Extensions (SSE) instruction set, and the Advanced Vector Extensions (AVX) instruction set.

The SSE/AVX Architecture

  • The SSE/AVX architectures have three main generations:

    • The SSE architecture, which (on 64-bit CPUs) provided sixteen 128-bit XMM registers supporting integer and floating-point data types

    • The AVX/AVX2 architecture, which supported sixteen 256-bit YMM registers (also supporting integer and floating-point data types)

    • The AVX-512 architecture, which supported up to thirty-two 512-bit ZMM registers

Streaming Data Types

  • The SSE and AVX programming models support two basic data types: scalars and vectors. Scalars hold one single- or double-precision floating-point value. Vectors hold multiple floating-point or integer values (between 2 and 32 values, depending on the scalar data type of byte, word, dword, qword, single precision, or double precision, and the register and memory size of 128 or 256 bits).

  • The XMM registers (XMM0 to XMM15) can hold a single 32-bit floating-point value (a scalar) or four single-precision floating-point values (a vector). The YMM registers (YMM0 to YMM15) can hold eight single-precision (32-bit) floating-point values.

Packed and scalar single-precision floating-point data type
  • The XMM registers can hold a single double-precision scalar value or a vector containing a pair of double-precision values. The YMM registers can hold a vector containing four double-precision floating-point values.

Packed and scalar double-precision floating-point type
  • The XMM registers can hold 16 byte values (YMM registers can hold 32 byte values), allowing the CPU to perform 16 (32) byte-sized computations with one instruction.

Packed byte data type
  • The XMM registers can hold eight word values (YMM registers can hold sixteen word values), allowing the CPU to perform eight (sixteen) 16-bit word-sized integer computations with one instruction.

Packed word data type
  • The XMM registers can hold four dword values (YMM registers can hold eight dword values), allowing the CPU to perform four (eight) 32-bit dword-sized integer computations with one instruction.

Packed double-word data type
  • The XMM registers can hold two qword values (YMM registers can hold four qword values), allowing the CPU to perform two (four) 64-bit qword computations with one instruction.

Packed quad-word data type

Using cpuid to Differentiate Instruction Sets

  • The cpuid instruction expects a single parameter (called a leaf function) passed in the EAX register. It returns various pieces of information about the CPU in different 32-bit registers based on the value passed in EAX. An application can test the return information to see if certain CPU features are available.

  • To determine the highest function supported, you load EAX with 0 or 8000_0000h and execute the cpuid instruction. The return value is the maximum you can pass to cpuid in EAX.

  • In addition to providing the highest function supported, the cpuid instruction with EAX = 0h also returns a 12-character vendor ID in the EBX, EDX, and ECX registers (in that order). Leaves 8000_0002h through 8000_0004h return the longer processor brand string, not the vendor ID.

  • To determine if the CPU can execute most SSE and AVX instructions, you must execute cpuid with EAX = 01h and test various bits placed in the ECX register.
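
  • As a rough illustration of testing those ECX bits, the following Python sketch decodes a value that cpuid leaf 01h would have returned in ECX. It models only the bit tests and does not execute cpuid; the names ECX_FEATURE_BITS, decode_leaf1_ecx, and sample_ecx are hypothetical, and the bit positions are the ones documented for leaf 01h.

```python
# Bit positions in ECX after executing cpuid with EAX = 01h.
ECX_FEATURE_BITS = {
    "SSE3": 0,
    "SSSE3": 9,
    "FMA": 12,
    "SSE4.1": 19,
    "SSE4.2": 20,
    "AVX": 28,
}

def decode_leaf1_ecx(ecx):
    """Return {feature name: True/False} for a 32-bit ECX value."""
    return {name: bool((ecx >> bit) & 1) for name, bit in ECX_FEATURE_BITS.items()}

# Hypothetical ECX value with only SSE3, SSSE3, and SSE4.1 set.
sample_ecx = (1 << 0) | (1 << 9) | (1 << 19)
for name, present in decode_leaf1_ecx(sample_ecx).items():
    print(f"{name:7s} {'yes' if present else 'no'}")
```

  • In assembly, the same test is typically a bt or test instruction on ECX followed by a conditional jump. Note that SSE and SSE2 are reported in EDX (bits 25 and 26) rather than ECX.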

Memory Operand Alignment

  • Almost all SSE, AVX, and AVX2 instructions will generate a memory alignment fault if you attempt to access a 128-bit object at an address that is not 16-byte aligned. Always ensure that your SSE packed operands are properly aligned.

  • Because segments have a default alignment of para (16 bytes), you can easily ensure that any 16-byte packed data objects are 16-byte-aligned by placing an align 16 directive immediately before their declarations.

  • If you are using AVX instructions to access 256-bit (32-byte) memory operands, you must ensure that those memory operands begin on a 32-byte address boundary. Unfortunately, align 32 won't work, because the default segment alignment is para (16-byte) alignment, and the segment's alignment must be greater than or equal to the operand field of any align directives appearing within that segment. Therefore, to be able to define 256-bit variables usable by AVX instructions, you must explicitly define a segment that is aligned on a 32-byte boundary (in MASM, by adding an align(32) option to the segment directive).

  • Inside a procedure, you can allocate storage for a 16-, 32-, or 64-byte-aligned variable by over-allocating the storage, adding the size minus 1 of the object to the allocated address, and then using the and instruction to zero out the LO bits of the address (4 bits for 16-byte-aligned objects, 5 bits for 32-byte-aligned objects, and 6 bits for 64-byte-aligned objects). Then reference the object by using this pointer.
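
  • The add-then-mask arithmetic from the previous point can be sketched in Python on plain integers standing in for addresses. This is only a model of the address computation (in assembly it is an add or lea followed by an and); align_up and raw_address are hypothetical names.

```python
def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    assert alignment > 0 and alignment & (alignment - 1) == 0
    # Adding (alignment - 1) and clearing the LO bits never moves the
    # pointer backward, so it stays inside the over-allocated block.
    return (addr + alignment - 1) & ~(alignment - 1)

raw_address = 0x1021  # hypothetical unaligned address of the over-allocated block
for n in (16, 32, 64):
    print(f"{n}-byte aligned pointer: {align_up(raw_address, n):#x}")
```

  • The over-allocation must be at least the object size plus the alignment minus 1 bytes so that the rounded-up pointer still leaves room for the whole object.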

SIMD Data Movement Instructions

The (v)movd and (v)movq Instructions

  • For the SSE instruction set, the movd (move dword) and movq (move qword) instructions copy the value from a 32- or 64-bit general-purpose register or memory location into the LO dword or qword of an XMM register.

Moving a 32-bit value from memory to an XMM register (with zero extension)
Moving a 64-bit value from memory to an XMM register (with zero extension)
  • With the operands reversed, movd and movq store the LO 32 or 64 bits of an XMM register into a dword or qword memory location or general-purpose register.

  • The AVX equivalents are vmovd and vmovq, which accept the same operand combinations.

The (v)movaps, (v)movapd, and (v)movdqa Instructions

  • The movaps (move aligned, packed single), movapd (move aligned, packed double), and movdqa (move double quad-word aligned) instructions move 16 bytes of data between memory and an XMM register or between two XMM registers. The AVX versions (with the v prefix) move 16 or 32 bytes between memory and an XMM or a YMM register or between two XMM or YMM registers.

  • The mem128 operand should be a vector (array) of four single-precision floating-point values for the (v)movaps instruction; it should be a vector of two double-precision floating-point values for the (v)movapd instruction; it should be a 16-byte value (16 bytes, 8 words, 4 dwords, or 2 qwords) when using the (v)movdqa instruction.

  • The mem256 operand should be a vector (array) of eight single-precision floating-point values for the vmovaps instruction; it should be a vector of four double-precision floating-point values for the vmovapd instruction; it should be a 32-byte value (32 bytes, 16 words, 8 dwords, or 4 qwords) when using the vmovdqa instruction.

  • You will need to use operand type coercion if the memory operand's declared type doesn't match one of the following:

    • The movaps instruction allows real4, dword, and oword operands.

    • The movapd instruction allows real8, qword, and oword operands.

    • The movdqa instruction allows only oword operands.

    • The vmovaps instruction allows real4, dword, and ymmword ptr operands (when using a YMM register).

    • The vmovapd instruction allows real8, qword, and ymmword ptr operands (when using a YMM register).

    • The vmovdqa instruction allows only ymmword ptr operands (when using a YMM register).

The (v)movups, (v)movupd, and (v)movdqu Instructions

  • When you cannot guarantee that packed data memory operands lie on a 16- or 32-byte address boundary, you can use the (v)movups (move unaligned packed single-precision), (v)movupd (move unaligned packed double-precision), and (v)movdqu (move double quad-word unaligned) instructions to move data between XMM or YMM registers and memory.

  • These instructions typically run slower than their aligned equivalents.

The (v)movlps and (v)movlpd Instructions

  • The (v)movl* instructions and (v)movh* instructions allow you to merge two pairs of single-precision or a pair of double-precision floating-point operands from two different sources into a single XMM register.

  • The movlps xmmdest, mem64 form copies a pair of single-precision floating-point values into the two LO 32-bit lanes of a destination XMM register.

movlps instruction
  • The movlps mem64, xmmsrc form copies the LO 64 bits (the two LO single-precision lanes) from the XMM source register to the specified memory location.

  • The vmovlps instruction has three operands: a destination XMM register, a source XMM register, and a source (64-bit) memory location. This instruction copies the two single-precision values from the memory location into the LO 64 bits of the destination XMM register. It copies the HO 64 bits of the source register into the HO 64 bits of the destination register.

vmovlps instruction
  • Like movsd, the movlpd (move low packed double) instruction copies the LO 64 bits (a double-precision floating-point value) of the source operand to the LO 64 bits of the destination operand. The difference is that the movlpd instruction doesn’t zero-extend the value when moving data from memory into an XMM register, whereas the movsd instruction will zero-extend.

The movhps and movhpd Instructions

  • The movhps and movhpd instructions move a 64-bit value into the HO quad word of a destination XMM register.

  • The movhps and movhpd instructions can also store the HO quad word of an XMM register into memory. These instructions do not affect bits 128 to 255 of the YMM registers.

  • You would normally use a movlps instruction followed by a movhps instruction to load four single-precision floating-point values into an XMM register, taking the floating-point values from two different data sources.

The vmovhps and vmovhpd Instructions

  • Although the AVX instruction extensions provide vmovhps and vmovhpd instructions, they are not a simple extension of the SSE movhps and movhpd instructions.

  • The instructions that store data into a 64-bit memory location behave similarly to the movhps and movhpd instructions.

  • The instructions that load data into an XMM register have two source operands. They load a full 128 bits into the destination XMM register. The HO 64 bits come from the memory operand and the LO 64 bits come from the LO quad word of the source XMM register.

vmovhpd and vmovhps instructions
  • Unlike for the movhps instruction, MASM properly accepts real4 source operands for the vmovhps instruction.

The movlhps and vmovlhps Instructions

  • The movlhps instruction moves a pair of 32-bit single-precision floating-point values from the LO qword of the source XMM register into the HO 64 bits of a destination XMM register. It leaves the LO 64 bits of the destination register unchanged.

  • If the destination register is on a CPU that supports 256-bit AVX registers, this instruction also leaves the HO 128 bits of the overlaid YMM register unchanged.

  • You cannot use this instruction to move data between memory and an XMM register; it transfers data only between XMM registers. No double-precision version of this instruction exists.

  • The vmovlhps instruction is similar to movlhps, with the following differences:

    • vmovlhps requires three operands: two source XMM registers and a destination XMM register.

    • vmovlhps copies the LO quad word of the first source register into the LO quad word of the destination register.

    • vmovlhps copies the LO quad word of the second source register into bits 64 to 127 of the destination register.

    • vmovlhps zero-extends the result into the upper 128 bits of the overlaid YMM register.

The movhlps and vmovhlps Instructions

  • The movhlps instruction copies the pair of 32-bit single-precision floating-point values from the HO qword of the source operand to the LO qword of the destination register, leaving the HO 64 bits of the destination register unchanged. This instruction copies data only between XMM registers; it does not allow a memory operand.

  • The vmovhlps instruction requires three XMM register operands. This instruction copies the HO 64 bits of the first source register into the HO 64 bits of the destination register, copies the HO 64 bits of the second source register into bits 0 to 63 of the destination register, and finally, zero-extends the result into the upper bits of the overlaid YMM register.
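
  • The qword moves performed by movlhps, movhlps, and their AVX forms can be modeled in Python with four-element lists, where index 0 is the LO single-precision lane. This is a sketch of the data movement only (it ignores the YMM upper bits), and the function names simply mirror the mnemonics.

```python
def movlhps(dest, src):
    """LO qword of src goes to the HO qword of dest; dest's LO qword is kept."""
    return dest[:2] + src[:2]

def movhlps(dest, src):
    """HO qword of src goes to the LO qword of dest; dest's HO qword is kept."""
    return src[2:] + dest[2:]

def vmovlhps(src1, src2):
    """LO qword of src1 becomes the LO qword; LO qword of src2 becomes bits 64 to 127."""
    return src1[:2] + src2[:2]

def vmovhlps(src1, src2):
    """HO qword of src2 becomes the LO qword; HO qword of src1 becomes the HO qword."""
    return src2[2:] + src1[2:]

a = [0.0, 1.0, 2.0, 3.0]
b = [10.0, 11.0, 12.0, 13.0]
print(movlhps(a, b), movhlps(a, b))
```

  • Written this way, the two-operand SSE forms are just the three-operand AVX forms with the destination reused as the first source.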

The (v)movshdup and (v)movsldup Instructions

  • The movshdup instruction moves the two odd-index single-precision floating-point values from the source operand (memory or XMM register) and duplicates each element into the destination XMM register.

  • This instruction ignores the single-precision floating-point values at even-lane indexes into the XMM register.

  • The vmovshdup instruction works the same way but on YMM registers, copying four single-precision values rather than two.

movshdup and vmovshdup instructions
  • The movsldup instruction works just like the movshdup instruction, except it copies and duplicates the two single-precision values at even indexes in the source XMM register to the destination XMM register. Likewise, the vmovsldup instruction copies and duplicates the four single-precision values in the source YMM register at even indexes.

movsldup and vmovsldup instructions
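
  • A small Python model can make the two duplication patterns concrete. Registers are lists of single-precision lanes with index 0 as the LO lane; a 4-element list stands for an XMM register and an 8-element list for a YMM register. The function names mirror the mnemonics and are otherwise illustrative.

```python
def movshdup(src):
    """Each destination lane takes the odd-index lane of its pair."""
    return [src[i | 1] for i in range(len(src))]

def movsldup(src):
    """Each destination lane takes the even-index lane of its pair."""
    return [src[i & ~1] for i in range(len(src))]

xmm = [0.0, 1.0, 2.0, 3.0]
print(movshdup(xmm))  # odd lanes duplicated
print(movsldup(xmm))  # even lanes duplicated
```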

The (v)movddup Instruction

  • The movddup instruction copies and duplicates a double-precision value from the LO 64 bits of an XMM register or a 64-bit memory location into the LO 64 bits of a destination XMM register; then it also duplicates this value into bits 64 to 127 of that same destination register.

  • This instruction does not disturb the HO 128 bits of a YMM register.

  • The vmovddup instruction operates on an XMM or a YMM destination register and an XMM or a YMM source register or memory location (64 bits for the XMM form, 256 bits for the YMM form).

  • The 128-bit version works just like the movddup instruction except it zeroes the HO bits of the destination YMM register. The 256-bit version copies a pair of double-precision values at even indexes (0 and 2) in the source value to their corresponding indexes in the destination YMM register and duplicates those values at the odd indexes in the destination.

vmovddup instruction behavior

The (v)lddqu Instruction

  • The (v)lddqu instruction is operationally identical to (v)movdqu. You can sometimes use this instruction to improve performance if the (memory) source operand is not aligned properly and crosses a cache line boundary in memory.

The Shuffle and Unpack Instructions

The (v)pshufb Instructions

  • The pshufb instruction was the first packed byte shuffle SIMD instruction. The first (destination) operand is an XMM register whose byte lanes pshufb will shuffle (rearrange). The second operand (either an XMM register or a 128-bit oword memory location) is an array of 16 byte values holding indexes that control the shuffle operation.

  • If the second operand is a memory location, that oword value must be aligned on a 16-byte boundary. Each byte (lane) in the second operand selects a value for the corresponding byte lane in the first operand.

Lane index correspondence for pshufb instruction
  • Each of the 16 index bytes in the second operand has the same layout: the LO 4 bits hold a lane index and bit 7 is a clear bit.

pshufb byte index
  • The pshufb instruction ignores bits 4 to 6 in an index byte. Bit 7 is the clear bit; if this bit contains a 1, the pshufb instruction ignores the lane index bits and stores a 0 into the corresponding byte in XMMdest. If the clear bit contains a 0, the pshufb instruction does a shuffle operation.

  • The pshufb shuffle operation takes place on a lane-by-lane basis. The instruction first makes a temporary copy of XMMdest. Then, for each index byte whose HO bit is 0, pshufb copies the byte in the lane of the temporary copy selected by the LO 4 bits of the index into the XMMdest lane that matches the index's lane.

Shuffle operation
  • The AVX variant adds a source register (rather than using XMMdest as both the source and destination registers), and, rather than creating a temporary copy of XMMdest prior to the operation and picking the values from that copy, the vpshufb instructions select the source bytes from the XMMsrc register. Other than that, and the fact that these instructions zero the HO bits of YMMdest, the 128-bit variant operates identically to the SSE pshufb instruction.

  • The AVX instruction allows you to specify 256-bit YMM registers in addition to 128-bit XMM registers.
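
  • The 128-bit pshufb rules (lane index in bits 0 to 3, bits 4 to 6 ignored, bit 7 as the clear bit) can be sketched in Python. Registers are modeled as 16-element lists of byte values with index 0 as the LO byte; this is a model of the semantics, not a replacement for the instruction, and the function name is illustrative.

```python
def pshufb(dest, index):
    """Model of the 128-bit pshufb: shuffle dest's bytes according to index."""
    tmp = list(dest)  # the instruction reads from a copy of the original dest
    result = []
    for idx in index:
        if idx & 0x80:
            result.append(0)               # clear bit set: store 0
        else:
            result.append(tmp[idx & 0x0F])  # only the LO 4 bits select a lane
    return result

data = list(range(0x10, 0x20))
reverse = list(range(15, -1, -1))  # index bytes 15, 14, ..., 0
print([hex(b) for b in pshufb(data, reverse)])
```

  • Passing an index vector of 15 down to 0, as in the usage lines, reverses the byte order of the register, which is a common use of this instruction.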

The (v)pshufd Instructions

  • These instructions shuffle dwords in XMM and YMM registers (not double-precision values) similarly to the (v)pshufb instructions. However, the shuffle index is specified differently from (v)pshufb.

  • The first operand (XMMdest or YMMdest) is the destination operand where the shuffled values will be stored. The second operand is the source from which the instruction will select the double words to place in the destination register. The third operand is an 8-bit immediate value that specifies the indexes for the double words to select from the source operand.

  • The value in bits 0 to 1 selects a particular dword from the source operand to place in dword 0 of the XMMdest operand. The value in bits 2 to 3 selects a dword from the source operand to place in dword 1 of the XMMdest operand. The value in bits 4 to 5 selects a dword from the source operand to place in dword 2 of the XMMdest operand. Finally, the value in bits 6 to 7 selects a dword from the source operand to place in dword 3 of the XMMdest operand.

  • The difference between the 128-bit pshufd and vpshufd instructions is that pshufd leaves the HO 128 bits of the underlying YMM register unchanged and vpshufd zeroes the HO 128 bits of the underlying YMM register.

  • The 256-bit variant of vpshufd (when using YMM registers as the source and destination operands) still uses an 8-bit immediate operand as the index value. Each 2-bit index value manipulates two dword values in the YMM registers. Bits 0 to 1 control dwords 0 and 4, bits 2 to 3 control dwords 1 and 5, bits 4 to 5 control dwords 2 and 6, and bits 6 to 7 control dwords 3 and 7.

  • The 256-bit version is slightly less flexible as it copies two dwords at a time, rather than one.
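
  • The imm8 decoding for (v)pshufd can be modeled in Python as four 2-bit selectors. Registers are lists of dwords with index 0 as the LO dword; pshufd models the 128-bit form and vpshufd256 (a hypothetical name for the YMM form) applies the same selectors to each 128-bit half.

```python
def pshufd(src, imm8):
    """128-bit form: destination dword i takes src[selector i]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

def vpshufd256(src, imm8):
    """256-bit form: the same selectors are applied to each 128-bit half."""
    return pshufd(src[:4], imm8) + pshufd(src[4:], imm8)

xmm = [10, 11, 12, 13]
print(pshufd(xmm, 0x1B))  # selectors 3, 2, 1, 0 reverse the dwords
print(pshufd(xmm, 0x00))  # all selectors 0 broadcast dword 0
```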

The (v)pshuflw and (v)pshufhw Instructions

  • The pshuflw and vpshuflw and the pshufhw and vpshufhw instructions provide support for 16-bit word shuffles within an XMM or a YMM register.

  • The 128-bit lw variants copy the HO 64 bits of the source operand to the same positions in the XMMdest operand. Then they use the index (imm8) operand to select word lanes 0 to 3 in the LO qword of the XMMsrc/mem128 operand to move to the LO 4 lanes of the destination operand. Note that pshuflw does not modify the HO 128 bits of the overlaid YMM register, whereas vpshuflw zeroes those HO bits.

(v)pshuflw operation
  • The 256-bit vpshuflw instruction (with a YMM destination register) copies two pairs of words at a time: one pair in the HO 128 bits and one pair in the LO 128 bits of the YMM destination register and 256-bit source locations. The index (imm8) selection is the same for the LO and HO 128 bits.

vpshuflw operation
  • The 128-bit hw variants copy the LO 64 bits of the source operand to the same positions in the destination operand. Then they use the index operand to select words 4 to 7 (indexed as 0 to 3) in the 128-bit source operand to move to the HO four word lanes of the destination operand.

(v)pshufhw operation
  • The 256-bit vpshufhw instruction (with a YMM destination register) copies two pairs of words at a time: one in the HO 128 bits and one in the LO 128 bits of the YMM destination register and 256-bit source locations.

vpshufhw operation

The shufps and shufpd Instructions

  • The shuffle instructions (shufps and shufpd) extract single- or double-precision values from the source operands and place them in specified positions in the destination operand. The third operand, an 8-bit immediate value, selects which values to extract from the source to move into the destination register.

  • For the shufps instruction, the third operand is an 8-bit immediate value that is actually a four-element array of 2-bit values.

  • imm8 bits 0 and 1 select a single-precision value from one of the four lanes in the XMMsrc1/dest operand to store into lane 0 of the destination operand. Bits 2 and 3 select a single-precision value from one of the four lanes in the XMMsrc1/dest operand to store into lane 1 of the destination operand (the destination operand is also XMMsrc1/dest). imm8 bits 4 and 5 select a single-precision value from one of the four lanes in the XMMsrc2/memsrc2 operand to store into lane 2 of the destination operand. Bits 6 and 7 select a single-precision value from one of the four lanes in the XMMsrc2/memsrc2 operand to store into lane 3 of the destination operand.

shufps operation
  • The shufpd instruction works similarly, shuffling double-precision values. As there are only two double-precision values in an XMM register, it takes only a single bit to choose between the values.

  • As a result, the third operand, the imm8 value, is actually just a 2-bit value; the instruction ignores bits 2 to 7 in the imm8 operand. Bit 0 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc1/dest operand to place into lane 0 and bits 0 to 63 of XMMdest. Bit 1 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc/mem128 operand to place into lane 1 and bits 64 to 127 of XMMdest.

shufpd operation
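
  • The selection rules for shufps and shufpd can be sketched in Python. Registers are lists of lanes with index 0 as the LO lane; dest plays the role of XMMsrc1/dest and src the role of the second source. The function names mirror the mnemonics; everything else is illustrative.

```python
def shufps(dest, src, imm8):
    """Lanes 0 and 1 come from dest, lanes 2 and 3 from src."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [dest[sel[0]], dest[sel[1]], src[sel[2]], src[sel[3]]]

def shufpd(dest, src, imm8):
    """Bit 0 picks the dest lane for lane 0; bit 1 picks the src lane for lane 1."""
    return [dest[imm8 & 1], src[(imm8 >> 1) & 1]]

a = [0.0, 1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0, 7.0]
print(shufps(a, b, 0xE4))        # selectors 0, 1, 2, 3
print(shufpd([0.0, 1.0], [2.0, 3.0], 0b01))
```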

The vshufps and vshufpd Instructions

  • The vshufps and vshufpd instructions are similar to shufps and shufpd. They allow you to shuffle the values in 128-bit XMM registers or 256-bit YMM registers.

  • The vshufps and vshufpd instructions have four operands: a destination XMM or YMM register, two source operands (src1 must be an XMM or a YMM register, and src2 can be an XMM or a YMM register or a 128- or 256-bit memory location), and an imm8 operand.

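  • For the 256-bit vshufps instructions, the imm8 operand is an array of four 2-bit values, applied identically to the LO and HO 128-bit halves of the operands. Bits 0 to 1 and bits 2 to 3 each select one of the four single-precision values in a half of src1 for destination lanes 0 and 1 (LO half) and lanes 4 and 5 (HO half). Bits 4 to 5 and bits 6 to 7 each select one of the four values in the same half of src2 for destination lanes 2 and 3 (LO half) and lanes 6 and 7 (HO half).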

  • The vshufps instruction also allows you to specify XMM and 128-bit memory operands. In this form, it behaves quite similarly to the shufps instruction except that you get to specify two different 128-bit source operands and it zeroes the HO 128 bits of the corresponding YMM register.

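  • The vshufpd instruction is an extension of shufpd to 256 bits. As there are four double-precision values present in a 256-bit YMM register, vshufpd needs 4 bits to select the source indexes: bits 0 and 1 choose between the two values in the LO 128 bits of src1 and src2, respectively, and bits 2 and 3 do the same for the HO 128 bits.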

  • Like the vshufps instruction, vshufpd also allows you to specify XMM registers if you want a three-operand version of shufpd.

The (v)unpcklps, (v)unpckhps, (v)unpcklpd, and (v)unpckhpd Instructions

  • The unpack (and merge) instructions are a simplified variant of the shuffle instructions. These instructions copy single- and double-precision values from fixed locations in their source operands and insert those values into fixed locations in the destination operand.

  • The unpcklps and unpckhps instructions choose half their single-precision operands from one of two sources, merge these values, and then store the merged result into the destination operand.

  • The XMMdest operand serves as both the first source operand and the destination operand. The XMMsrc/mem128 operand is the second source operand.

  • The unpcklps instruction copies the two LO single-precision values from the source operand to bit positions 32 to 63 (dword 1) and 96 to 127 (dword 3). It leaves dword 0 in the destination operand alone and copies the value originally in dword 1 to dword 2 in the destination.

unpcklps instruction operation
  • The unpckhps instruction copies the two HO single-precision values from the two sources to the destination register.

unpckhps instruction operation
  • The unpcklpd and unpckhpd instructions do the same thing as unpcklps and unpckhps except, of course, they operate on double-precision values rather than single-precision values.

unpcklpd instruction operation
unpckhpd instruction operation
  • The vunpcklps, vunpckhps, vunpcklpd, and vunpckhpd instructions work similarly to the non-v variants, with a few differences:

    • The AVX variants support using the YMM registers as well as the XMM registers.

    • The AVX variants require three operands. The first (destination) and second (source1) operands must be XMM or YMM registers. The third (source2) operand can be an XMM or a YMM register or a 128- or 256-bit memory location. The two-operand form is just a special case of the three-operand form, where the first and second operands specify the same register name.

    • The 128-bit variants zero out the HO bits of the YMM register rather than leaving those bits unchanged.

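  • With YMM operands, the interleave is applied separately to the LO and HO 128 bits of the sources. The vunpcklps instruction interleaves the two LO single-precision values of each 128-bit half.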

vunpcklps instruction operation
  • The vunpckhps instruction interleaves the two HO single-precision values of each 128-bit half in the same way.

vunpckhps instruction operation
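
  • The interleaving performed by the single-precision unpack instructions can be modeled in Python. Registers are lists of lanes with index 0 as the LO lane; vunpcklps256 and vunpckhps256 are hypothetical names for the YMM forms, which apply the 128-bit operation to each half independently.

```python
def unpcklps(dest, src):
    """Interleave the two LO lanes of dest and src."""
    return [dest[0], src[0], dest[1], src[1]]

def unpckhps(dest, src):
    """Interleave the two HO lanes of dest and src."""
    return [dest[2], src[2], dest[3], src[3]]

def vunpcklps256(src1, src2):
    """YMM form: unpcklps on the LO halves, then on the HO halves."""
    return unpcklps(src1[:4], src2[:4]) + unpcklps(src1[4:], src2[4:])

def vunpckhps256(src1, src2):
    """YMM form: unpckhps on the LO halves, then on the HO halves."""
    return unpckhps(src1[:4], src2[:4]) + unpckhps(src1[4:], src2[4:])

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
print(vunpcklps256(a, b))
print(vunpckhps256(a, b))
```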

The punpck* Instructions

  • The punpck* instructions provide a set of integer unpack instructions to complement the floating-point variants.

  • The punpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination SSE register. These instructions do not modify the upper 128 bits of any overlaid YMM register.

The vpunpck* SSE Instructions

  • The AVX vpunpck* instructions provide a set of AVX integer unpack instructions to complement the SSE variants.

  • The vpunpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination AVX or SSE register.

  • Functionally, the only difference between these AVX instructions (vpunpck*) and the SSE (punpck*) instructions is that the SSE variants leave the upper bits of the YMM AVX registers (bits 128 to 255) unchanged, whereas the AVX variants zero-extend the result to 256 bits.

The vpunpck* AVX Instructions

  • The AVX vpunpck* instructions also support the use of the AVX YMM registers, in which case the unpack and merge operation extends from 128 bits to 256 bits.

The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq Instructions

  • The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq instructions extract a byte, word, dword, or qword from a 128-bit XMM register and copy this data to a general-purpose register or memory location.

  • The byte and word instructions expect a 32- or 64-bit general-purpose register as their destination or a memory location that is the same size as the extracted element.

  • The double-word and quad-word variants require a 32-bit or 64-bit general-purpose register, respectively. If the destination operand is a 32- or 64-bit general-purpose register, the instruction zero-extends the value to 32 or 64 bits, if necessary.

  • These instructions do not support extracting data from the upper 128 bits of a YMM register.

The (v)pinsrb, (v)pinsrw, (v)pinsrd, and (v)pinsrq Instructions

  • The (v)pinsr{b,w,d,q} instructions take a byte, word, dword, or qword from a general-purpose register or memory location and store that data to a lane of an XMM register.

  • These instructions fetch a byte, word, dword, or qword from the general-purpose register or memory location and copy that to the lane in the XMM register specified by the 8-bit immediate value. The pinsr{b,w,d,q} instructions leave any HO bits in the underlying YMM register unchanged.

  • The vpinsr{b,w,d,q} instructions copy the data from the XMM source register into the destination register and then copy the byte, word, dword, or quad word to the specified location in the destination register. These instructions zero-extend the value throughout the HO bits of the underlying YMM register.

The (v)extractps and (v)insertps Instructions

  • The extractps and vextractps instructions are functionally equivalent to pextrd and vpextrd. They extract a 32-bit value from an XMM register and move it into a 32-bit general-purpose register or a 32-bit memory location.

  • The insertps and vinsertps instructions insert a 32-bit floating-point value into an XMM register and, optionally, zero out other lanes in the XMM register.

  • On CPUs with the AVX extensions, insertps does not modify the upper bits of the YMM registers. The vinsertps instruction first copies the XMMsrc1 register to XMMdest before performing the insertion operation, and it sets the HO bits of the corresponding YMM register to 0.

SIMD Arithmetic and Logical Operations

Logical Instructions

SSE/AVX Logical Instructions
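  • The packed logical instructions are (v)pand, (v)pandn, (v)por, and (v)pxor. The SSE forms take two operands, with the destination also serving as the first source; the AVX forms take a destination and two source operands.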

  • The SSE instructions (without the v prefix) leave the HO bits of the underlying YMM register unchanged (if applicable). The AVX instructions (with the v prefix) that have 128-bit operands will zero-extend their result into the HO bits of the YMM register.

The (v)ptest Instructions

  • The ptest instruction (packed test) is similar to the standard integer test instruction. The ptest instruction performs a logical AND between the two operands and sets the zero flag if the result is 0.

  • The ptest instruction sets the carry flag if the logical AND of the second operand with the inverted bits of the first operand produces 0.
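
  • The two flag rules for ptest can be modeled in Python with 128-bit integers standing in for XMM values. This sketch only computes the zero and carry flags as described in the two points above; the name MASK128 and the tuple return value are illustrative choices.

```python
MASK128 = (1 << 128) - 1

def ptest(first, second):
    """Return (ZF, CF) for a 128-bit ptest first, second."""
    zf = int((first & second) == 0)               # AND of both operands is 0
    cf = int((~first & second & MASK128) == 0)    # second AND (NOT first) is 0
    return zf, cf

print(ptest(0xFF, 0x0F))
print(ptest(0xF0, 0x0F))
```

  • Read this way, ZF = 1 means the two operands share no set bits, and CF = 1 means every bit set in the second operand is also set in the first.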

The Byte Shift Instructions

  • The SSE and AVX instruction set extensions also support a set of logical and arithmetic shift instructions. The first two to consider are pslldq and psrldq.

  • The pslldq instruction shifts its destination XMM register to the left by the number of bytes specified by the imm8 operand. This instruction shifts 0s into the vacated LO bytes.

  • The vpslldq instruction takes the value in the source register (XMM or YMM), shifts that value to the left by imm8 bytes, and then stores the result into the destination register. For the 128-bit variant, this instruction zero-extends the result into bits 128 to 255 of the underlying YMM register.

  • The psrldq and vpsrldq instructions operate similarly to (v)pslldq except, of course, they shift their operands to the right rather than to the left. These are logical shift-right operations, so they shift 0s into the HO bytes of their operand, and bits shifted out of bit 0 are lost.

  • The pslldq and psrldq instructions shift bytes rather than bits. For example, many SSE instructions produce byte masks 0 or 0FFh, representing Boolean results. These instructions shift the equivalent of a bit in one of these byte masks by shifting whole bytes at a time.
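
  • A byte-list model in Python shows what the 128-bit byte shifts do. The register is a 16-element list with index 0 as the LO byte, so shifting left moves bytes toward higher indexes. The function names mirror the mnemonics; counts above 15 are treated as producing all zeros, which is how the instructions behave.

```python
def pslldq(reg, count):
    """Shift 16 bytes left by count bytes, filling the LO bytes with 0."""
    count = min(count, 16)
    return [0] * count + reg[:16 - count]

def psrldq(reg, count):
    """Shift 16 bytes right by count bytes, filling the HO bytes with 0."""
    count = min(count, 16)
    return reg[count:] + [0] * count

# Byte mask as produced by a packed compare: lanes 0 and 1 true, rest false.
mask = [0xFF, 0xFF] + [0] * 14
print(pslldq(mask, 1))
print(psrldq(mask, 1))
```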

The Bit Shift Instructions

  • These instructions provide word, dword, and qword variants of the logical shift-left, logical shift-right, and arithmetic shift-right operations.

  • shift = psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, or psrad

  • vshift = vpsllw, vpslld, vpsllq, vpsrlw, vpsrld, vpsrlq, vpsraw, vpsrad, or vpsraq

  • The (v)psl* instructions shift their operands to the left; the (v)psr* instructions shift their operands to the right.

  • The (v)psll* and (v)psrl* instructions are logical shift instructions and shift 0s into the bits vacated by the shift. Any bits shifted out of the operand are lost.

  • The (v)psra* instructions are arithmetic shift-right instructions. They replicate the HO bit in each lane when shifting that lane's bits to the right; all bits shifted out of the LO bit are lost.

  • The w suffix instructions shift 16-bit operands (eight lanes for 128-bit destination operands, sixteen lanes for 256-bit destinations).

  • The d suffix instructions shift 32-bit dword operands (four lanes for 128-bit destination operands, eight lanes for 256-bit destination operands).

  • The q suffix instructions shift 64-bit operands (two lanes for 128-bit operands, four lanes for 256-bit operands).
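
  • A minimal Python sketch of the per-lane shift semantics described above. The helper name shift_lanes and its kind argument are assumptions made for illustration; lanes are modeled as a list of unsigned bit patterns with index 0 as the LO lane.

```python
def shift_lanes(lanes, count, bits, kind):
    """Shift every lane of width `bits` by `count` bit positions.

    kind: "ll" = logical left, "rl" = logical right, "ra" = arithmetic right.
    """
    mask = (1 << bits) - 1
    result = []
    for lane in lanes:
        lane &= mask
        if kind == "ll":
            shifted = (lane << count) & mask  # bits leaving the HO end are lost
        elif kind == "rl":
            shifted = lane >> count           # 0s enter at the HO end
        elif kind == "ra":
            # Reinterpret the lane as signed so Python's >> replicates the HO bit.
            signed = lane - (1 << bits) if lane >> (bits - 1) else lane
            # Counts of `bits` or more fill the lane with copies of the sign bit.
            shifted = (signed >> min(count, bits - 1)) & mask
        else:
            raise ValueError("kind must be 'll', 'rl', or 'ra'")
        result.append(shifted)
    return result


# psllw, psrlw, and psraw applied to the word 8001h with a count of 1:
left = shift_lanes([0x8001], 1, 16, "ll")    # [0x0002]
right = shift_lanes([0x8001], 1, 16, "rl")   # [0x4000]
arith = shift_lanes([0x8001], 1, 16, "ra")   # [0xC000]
```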

SIMD Integer Addition

  • These addition instructions are known as vertical additions because if we stack the two source operands on top of each other, the lane additions occur vertically.

  • The packed additions ignore any overflow from the addition operation, keeping only the LO byte, word, dword, or qword of each addition. As long as overflow is never possible, this is not an issue.

  • A cleaner solution is to use saturation arithmetic. For unsigned addition, saturation arithmetic clips (or saturates) an overflow to the largest possible value that the instruction’s size can handle. For signed saturation arithmetic, clipping occurs at the largest positive and smallest negative values (for example, 7Fh/+127 for positive values and 80h/–128 for negative values).

  • The padd* instructions leave the HO bits of any corresponding YMM destination undisturbed; the vpadd* variants clear the HO bits.
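
  • The difference between wraparound and saturating addition can be illustrated with a small Python model of three byte-sized instructions. The function names are hypothetical helpers that mirror the mnemonics; paddb and paddusb take unsigned lane values (0 to 255), while paddsb is assumed to take signed lane values (–128 to +127).

```python
def paddb(a, b):
    """Wraparound add: keep only the LO byte of each sum."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def paddusb(a, b):
    """Unsigned saturating add: clip each sum at 0FFh (255)."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def paddsb(a, b):
    """Signed saturating add: clip each sum to the range -128..+127."""
    return [max(-128, min(127, x + y)) for x, y in zip(a, b)]


wrapped = paddb([200, 1], [100, 2])    # [44, 3]: 300 wraps to 44
clipped = paddusb([200, 1], [100, 2])  # [255, 3]: 300 saturates to 255
```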

Horizontal Additions

  • The SSE/AVX instruction sets also support three horizontal addition instructions: (v)phaddw, (v)phaddd, and (v)phaddsw. These instructions add adjacent pairs of words or dwords within each of their two source operands and store each sum into a destination lane.

Horizontal Addition Instructions
Horizontal addition operation
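  • The phaddw instruction does not affect the HO bits of the corresponding YMM destination register, only the LO 128 bits. It has the following syntax: phaddw xmmdest, xmmsrc/mem128

  • The 128-bit vphaddw instruction zeroes out the HO 128 bits of the corresponding YMM destination register. It has the following syntax: vphaddw xmmdest, xmmsrc1, xmmsrc2/mem128

  • The 256-bit vphaddw instruction (vphaddw ymmdest, ymmsrc1, ymmsrc2/mem256) does not simply extend the 128-bit version in the intuitive way. Instead, it processes each 128-bit half separately, where SRC1 is YMMsrc1 and SRC2 is YMMsrc2/mem256: the LO four word lanes of the destination receive the pair sums from the LO 128 bits of SRC1, the next four receive the pair sums from the LO 128 bits of SRC2, and the HO eight word lanes receive the pair sums from the HO 128 bits of SRC1 followed by those from the HO 128 bits of SRC2.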
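
  • The lane ordering of the 128-bit and 256-bit word-sized horizontal additions can be sketched in Python as follows. The function names are hypothetical helpers; registers are assumed to be lists of unsigned words with index 0 as the LO lane.

```python
def hadd_pairs(words):
    """Add each adjacent pair of words, keeping the LO 16 bits of every sum."""
    return [(words[i] + words[i + 1]) & 0xFFFF for i in range(0, len(words), 2)]


def phaddw(dest, src):
    """128-bit form: sums from dest fill lanes 0-3, sums from src fill lanes 4-7."""
    return hadd_pairs(dest) + hadd_pairs(src)


def vphaddw256(src1, src2):
    """256-bit form: each 128-bit half is processed on its own."""
    lo_half = hadd_pairs(src1[:8]) + hadd_pairs(src2[:8])
    hi_half = hadd_pairs(src1[8:]) + hadd_pairs(src2[8:])
    return lo_half + hi_half


# SRC1 holds 0..15 and SRC2 holds 100 in every lane.
result = vphaddw256(list(range(16)), [100] * 16)
# result == [1, 5, 9, 13, 200, 200, 200, 200, 17, 21, 25, 29, 200, 200, 200, 200]
```

  • Note how the sums from SRC2 land in the middle of the result rather than in the HO half, which is the non-intuitive interleaving mentioned above.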

Double-Word–Sized Horizontal Additions

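  • The phaddd instruction has the following syntax: phaddd xmmdest, xmmsrc/mem128

  • The 128-bit vphaddd instruction has this syntax: vphaddd xmmdest, xmmsrc1, xmmsrc2/mem128

  • The 256-bit vphaddd instruction has the following syntax: vphaddd ymmdest, ymmsrc1, ymmsrc2/mem256

  • The (v)phaddsw instruction (horizontal signed integer add with saturate, word) is a slightly different form of (v)phaddw: rather than storing only the LO bits of each sum into the destination lane, this instruction saturates the result to the signed word range.

  • The (v)phaddsw instructions take the following forms: phaddsw xmmdest, xmmsrc/mem128; vphaddsw xmmdest, xmmsrc1, xmmsrc2/mem128; and vphaddsw ymmdest, ymmsrc1, ymmsrc2/mem256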

SIMD Integer Subtraction

  • As with the SIMD addition instructions, the SIMD subtraction instructions do not affect any flags; any carry, borrow, overflow, or underflow information is lost. These instructions subtract the second source operand from the first source operand and store the result into the destination operand.

SIMD Integer Subtraction Instructions
  • Likewise, there is a set of signed and unsigned byte and word saturating subtraction instructions. For the signed instructions, the byte-sized instructions saturate positive overflow to 7Fh (+127) and negative underflow to 80h (–128). The word-sized instructions saturate to 7FFFh (+32,767) and 8000h (–32,768). An unsigned subtraction can only underflow, so the unsigned saturation instructions clip any result below 0 to 0; the upper limits of 0FFh (+255) for bytes and 0FFFFh (+65,535) for words are never exceeded.
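
  • A short Python sketch of the saturating subtractions, under the same modeling assumptions as before: the function names are hypothetical helpers named after the mnemonics, psubusb takes unsigned byte values, and psubsw takes signed word values. In both cases the second operand is subtracted from the first.

```python
def psubusb(dest, src):
    """Unsigned saturating byte subtract: results below 0 clip to 0."""
    return [max(d - s, 0) for d, s in zip(dest, src)]


def psubsw(dest, src):
    """Signed saturating word subtract: results clip to -32768..+32767."""
    return [max(-32768, min(32767, d - s)) for d, s in zip(dest, src)]


unsigned_result = psubusb([10, 200], [20, 100])  # [0, 100]
```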

SIMD Integer Saturating Subtraction Instructions

SIMD Integer Multiplication

SIMD 16-Bit Packed Integer Multiplication Instructions
  • The (v)pmullw instruction multiplies the 16-bit values appearing in the lanes of the source operand and stores the LO word of the result into the corresponding destination lane. This instruction is applicable to both signed and unsigned values.

  • The (v)pmulhw instruction computes the product of two signed word values and stores the HO word of the result into the destination lanes.

  • For unsigned operands, (v)pmulhuw performs the same task. By executing both (v)pmullw and (v)pmulh(u)w with the same operands, you can compute the full 32-bit result of a 16×16-bit multiplication.
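
  • Combining the LO and HO halves into a full 32-bit product can be sketched in Python. The function names are hypothetical helpers that mirror the mnemonics, and lanes are assumed to be lists of 16-bit patterns (0 to 0FFFFh).

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmullw(a, b):
    """LO word of each product (identical for signed and unsigned inputs)."""
    return [(x * y) & 0xFFFF for x, y in zip(a, b)]


def pmulhw(a, b):
    """HO word of each signed product."""
    return [((to_signed16(x) * to_signed16(y)) >> 16) & 0xFFFF for x, y in zip(a, b)]


def pmulhuw(a, b):
    """HO word of each unsigned product."""
    return [((x & 0xFFFF) * (y & 0xFFFF)) >> 16 for x, y in zip(a, b)]


def full_products(a, b, signed=True):
    """Join the HO and LO words into 32-bit products, one per lane."""
    lo = pmullw(a, b)
    hi = pmulhw(a, b) if signed else pmulhuw(a, b)
    return [(h << 16) | l for h, l in zip(hi, lo)]


# 0FFFFh is -1 when signed and 65,535 when unsigned.
signed_result = full_products([0xFFFF, 300], [2, 400])            # [0xFFFFFFFE, 120000]
unsigned_result = full_products([0xFFFF, 300], [2, 400], False)   # [0x1FFFE, 120000]
```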

SIMD 32- and 64-Bit Packed Integer Multiplication Instructions
  • At some point along the way, Intel introduced (v)pmuldq and (v)pmuludq to perform signed and unsigned 32×32-bit multiplications, producing a 64-bit result.

  • The 128-bit variants multiply the double words appearing in lanes 0 and 2 and store the 64-bit results into qword lanes 0 and 1. On CPUs with AVX registers, pmuldq and pmuludq do not affect the HO 128 bits of the YMM register. The vpmuldq and vpmuludq instructions zero-extend the result to 256 bits.

  • The 256-bit variants multiply the double words appearing in lanes 0, 2, 4, and 6, producing 64-bit results that they store in qword lanes 0, 1, 2, and 3.

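  • The pclmulqdq instruction multiplies two qword values, producing a 128-bit result. It performs a carry-less multiplication (the partial products are combined with XOR rather than with addition), so its result generally differs from an ordinary integer product.

  • These instructions multiply a pair of qword values found in XMMdest and XMMsrc and leave the 128-bit result in XMMdest. The imm8 operand specifies which qwords to use as the source operands: bit 0 selects the LO or HO qword of XMMdest, and bit 4 selects the LO or HO qword of XMMsrc.

  • As usual, pclmulqdq leaves the HO 128 bits of the corresponding YMM destination register unchanged, while vpclmulqdq zeroes those bits.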
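
  • The lane selection used by pmuludq and the carry-less product computed by pclmulqdq can be modeled in Python as shown here. All function names are hypothetical helpers; pmuludq takes lists of dwords (index 0 is the LO lane), and pclmulqdq is assumed to take its 128-bit operands as Python integers.

```python
M32 = (1 << 32) - 1
M64 = (1 << 64) - 1


def pmuludq(a, b):
    """Multiply the even dword lanes (0, 2, ...) as unsigned values.

    Returns one 64-bit product per pair of input lanes.
    """
    return [(a[i] & M32) * (b[i] & M32) for i in range(0, len(a), 2)]


def clmul64(a, b):
    """Carry-less product of two qwords: partial products are XORed, not added."""
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= (a & M64) << bit
    return result


def pclmulqdq(dest, src, imm8):
    """imm8 bit 0 picks the qword of dest, imm8 bit 4 picks the qword of src."""
    a = (dest >> 64) & M64 if imm8 & 0x01 else dest & M64
    b = (src >> 64) & M64 if imm8 & 0x10 else src & M64
    return clmul64(a, b)


# Carry-less 3 * 3 is 5 (binary 11 XOR 110), not 9.
carry_less = pclmulqdq(3, 3, 0x00)
```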

SIMD Integer Averages

  • The (v)pavgb and (v)pavgw instructions compute the average of two sets of unsigned bytes or words. These instructions sum the values in the byte or word lanes of their source and destination operands, add 1 so that the division rounds up, divide the result by 2, and leave the averaged results in the destination operand lanes.

  • The 128-bit pavgb and vpavgb instructions compute 16 byte-sized averages (for the 16 lanes in the source and destination operands). The 256-bit variant of the vpavgb instruction computes 32 byte-sized averages.

  • The 128-bit pavgw and vpavgw instructions compute eight word-sized averages (for the eight lanes in the source and destination operands). The 256-bit variant of the vpavgw instruction computes 16 word-sized averages.
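
  • The rounding behavior of the averaging instructions can be shown with a one-line Python model. The name pavg is a hypothetical helper covering both the byte and word forms; lanes are assumed to hold unsigned values.

```python
def pavg(a, b):
    """Rounded unsigned average per lane: (x + y + 1) >> 1.

    The intermediate sum is wider than the lane, so it never overflows.
    """
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]


averages = pavg([1, 255, 0], [2, 255, 0])  # [2, 255, 0]
```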

SIMD Integer Minimum and Maximum

  • These instructions compare corresponding lanes of a pair of 128- or 256-bit operands and copy the maximum or minimum value of each pair to the same lane in the destination operand.

  • The general syntax is as follows:

SIMD Integer Absolute Value

  • The SSE/AVX instruction set extensions provide three sets of instructions for computing the absolute values of signed byte, word, and double-word integers: (v)pabsb, (v)pabsw, and (v)pabsd.

SIMD Integer Sign Adjustment Instructions

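  • The (v)psignb, (v)psignw, and (v)psignd instructions apply the sign found in a source lane to the corresponding destination lane: a negative source lane negates the destination value, a zero source lane sets it to 0, and a positive source lane leaves it unchanged.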
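
  • A minimal Python sketch of the byte-sized sign adjustment. The function name is a hypothetical helper, and lanes are assumed to hold signed values in the range –128 to +127.

```python
def psignb(dest, src):
    """Adjust each dest lane according to the sign of the matching src lane."""
    result = []
    for d, s in zip(dest, src):
        if s < 0:
            # Negate with 8-bit wraparound, so -128 stays -128.
            result.append(((-d + 128) & 0xFF) - 128)
        elif s == 0:
            result.append(0)
        else:
            result.append(d)
    return result


adjusted = psignb([5, 5, 5, -128], [-1, 0, 3, -7])  # [-5, 0, 5, -128]
```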

SSE Compare-for-Equality Instructions

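  • The compare-for-equality instructions (pcmpeqb, pcmpeqw, pcmpeqd, and pcmpeqq) compare corresponding lanes of their two operands. The == comparison produces a value of all 1 bits if the two values in the same lane are equal; it produces all 0 bits if the values are not equal.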

SSE Compare-for-Greater-Than Instructions

AVX Comparison Instructions

  • The 128-bit variants of these instructions have the following syntax:

  • The 256-bit variants of these instructions have the following syntax:

The (v)pmovmskb Instructions

  • The (v)pmovmskb instruction extracts the HO bit from all the bytes in an XMM or YMM register and stores the 16 or 32 bits (respectively) into a general-purpose register. These instructions set all HO bits of the general-purpose register to 0.

  • The 128-bit instructions fill only bits 0 through 15 of the destination register (zeroing out all other bits). The 256-bit form of the vpmovmskb instruction fills bits 0 through 31 of the destination register (zeroing out HO bits if you specify a 64-bit register).
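
  • A byte compare followed by a mask extraction is a common pairing, and it can be sketched in Python as follows. The function names are hypothetical helpers named after the mnemonics; registers are assumed to be lists of unsigned bytes with index 0 as the LO lane, and the lists here are shortened to four lanes for readability.

```python
def pcmpeqb(a, b):
    """Per-lane equality: 0FFh where the bytes match, 0 where they differ."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pmovmskb(lanes):
    """Collect the HO bit of each byte lane into bit i of an integer mask."""
    mask = 0
    for i, byte in enumerate(lanes):
        mask |= ((byte >> 7) & 1) << i
    return mask


# Lanes 0 and 2 match, so bits 0 and 2 of the mask are set.
matches = pmovmskb(pcmpeqb([1, 2, 3, 4], [1, 0, 3, 0]))  # 0b0101
```

  • Reducing the comparison to a bit mask in a general-purpose register makes it possible to test or scan the result with ordinary scalar instructions.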

Integer Conversions

Packed Zero-Extension Instructions

SSE4.1 and AVX Packed Zero-Extension Instructions
AVX2 Packed Zero-Extension Instructions

Packed Sign-Extension Instructions

SSE Packed Sign-Extension Instructions
AVX Packed Sign-Extension Instructions

Packed Sign Extension with Saturation

SSE Packed Sign-Extension with Saturation Instructions
  • The saturate operation checks its operand to see if the value exceeds the range of the result (–128 to +127 for signed bytes, 0 to 255 for unsigned bytes, –32,768 to +32,767 for signed words, and 0 to 65,535 for unsigned words).

  • When saturating to a byte, if the signed source value is less than –128, byte saturation sets the value to –128. When saturating to a word, if the signed source value is less than –32,768, signed saturation sets the value to –32,768. Similarly, if a signed byte or word value exceeds +127 or +32,767, then saturation replaces the value with +127 or +32,767, respectively.

  • For unsigned saturation, the upper limit is +255 (for bytes) or +65,535 (for words). The unsigned-saturating pack instructions (packuswb and packusdw) still interpret their wider source values as signed, so a negative source value clips to 0 and a value above the limit clips to +255 or +65,535.

  • AVX-capable CPUs provide 128-bit variants of these instructions that support three operands: two source operands and an independent destination operand.

  • Also, the SSE instructions do not modify the upper bits of the YMM register (if present on the CPU), whereas the AVX instructions store 0s into the upper YMM register bits.
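
  • The saturation rules above can be sketched in Python for the word-to-byte case. The function names are hypothetical helpers that mirror the packsswb and packuswb mnemonics; each operand is assumed to be a list of eight signed word values, with the destination's converted lanes placed in the LO half of the result.

```python
def packsswb(dest, src):
    """Signed words to signed bytes, clipping to -128..+127."""
    def clamp(w):
        return max(-128, min(127, w))
    return [clamp(w) for w in dest] + [clamp(w) for w in src]


def packuswb(dest, src):
    """Signed words to unsigned bytes, clipping to 0..255."""
    def clamp(w):
        return max(0, min(255, w))
    return [clamp(w) for w in dest] + [clamp(w) for w in src]


words = [300, -300, 5, 0, 0, 0, 0, 0]
signed_bytes = packsswb(words, [1] * 8)    # starts with 127, -128, 5
unsigned_bytes = packuswb(words, [1] * 8)  # starts with 255, 0, 5
```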

AVX Packed Sign-Extension with Saturation Instructions

SIMD Floating-Point Instructions

Arithmetic Instructions

  • The 128-bit SSE packed floating-point instructions have the following generic syntax (where instr is one of the floating-point instructions in the table below): instr xmmdest, xmmsrc/mem128

  • The packed single (*ps) instructions perform four single-precision floating-point operations simultaneously. The packed double (*pd) instructions perform two double-precision floating-point operations simultaneously. Each destination lane receives the result of applying the operation to the corresponding lanes of the operands.

  • The 128-bit AVX packed floating-point instructions have this syntax:

  • These 128-bit AVX instructions clear the HO bits of the underlying YMMdest register.

  • The 256-bit AVX packed floating-point instructions have this syntax:

  • The SSE/AVX instruction set extensions also include floating-point horizontal addition and subtraction instructions.

SSE and AVX Comparisons

  • There are two sets of basic floating-point comparisons: (v)cmpps, which compares a set of packed single-precision values, and (v)cmppd, which compares a set of packed double-precision values. These instructions use an imm8 operand whose value specifies the type of comparison.

  • The legacy SSE forms accept eight comparison types (imm8 values 0 to 7); the AVX forms extend this to 32 possible comparisons.
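
  • The eight basic comparison types (imm8 values 0 to 7: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD) can be modeled in Python as shown here. The function name cmpps and the PREDICATES table are hypothetical helpers for illustration; lanes are assumed to be lists of Python floats, and each result lane is a 32-bit mask.

```python
import math

# imm8 value -> predicate applied to each pair of lanes
PREDICATES = {
    0: lambda a, b: a == b,                                # EQ
    1: lambda a, b: a < b,                                 # LT
    2: lambda a, b: a <= b,                                # LE
    3: lambda a, b: math.isnan(a) or math.isnan(b),        # UNORD
    4: lambda a, b: not (a == b),                          # NEQ
    5: lambda a, b: not (a < b),                           # NLT
    6: lambda a, b: not (a <= b),                          # NLE
    7: lambda a, b: not (math.isnan(a) or math.isnan(b)),  # ORD
}


def cmpps(dest, src, imm8):
    """Compare lane by lane; true is all 1 bits, false is all 0 bits."""
    predicate = PREDICATES[imm8]
    return [0xFFFFFFFF if predicate(a, b) else 0 for a, b in zip(dest, src)]


nan = float("nan")
less_than = cmpps([1.0, 2.0, nan, 4.0], [1.0, 3.0, 1.0, 3.0], 1)
# less_than == [0, 0xFFFFFFFF, 0, 0]: a NaN operand makes LT false
```

  • The negated forms (NEQ, NLT, NLE) are true when either operand is a NaN, which is why NLT is not the same comparison as "greater than or equal."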

imm8 Values for cmpps and cmppd Instructions

Instruction Synonyms

  • MASM supports the use of certain synonyms so you don’t have to memorize the 32 encodings.

AVX Extended Comparisons

  • The AVX versions of these instructions allow three register operands: a destination XMM or YMM register, a source XMM or YMM register, and a source XMM or YMM register or 128-bit or 256-bit memory location.

  • The 128-bit vcmpps instruction compares the four single-precision floating-point values in the lanes of the XMMsrc1 register against the values in the corresponding XMMsrc2/mem128 lanes and stores the true (all 1 bits) or false (all 0 bits) result into the corresponding lane of the XMMdest register.

  • The 256-bit vcmpps instruction compares the eight single-precision floating-point values in the lanes of the YMMsrc1 register against the values in the corresponding YMMsrc2/mem256 lanes and stores the true or false result into the corresponding lane of the YMMdest register.

  • The vcmppd instructions compare the double-precision values in the two lanes (128-bit version) or four lanes (256-bit version) and store the result into the corresponding lane of the destination register.

  • The AVX instructions provide synonyms that eliminate the need to memorize 32 imm8 values.

AVX Packed Compare Instructions
  • The vcmpfalse* instructions always set the destination lanes to false (0 bits), and the vcmptrue* instructions always set the destination lanes to true (1 bits).

The (v)movmskps, (v)movmskpd Instructions

  • The (v)movmskps and (v)movmskpd instructions extract the sign bits from their packed single- and double-precision floating-point source operands and store these bits, one per lane, into the LO 2, 4, or 8 bits of a general-purpose register.

  • The movmskps instruction extracts the sign bits from the four single-precision floating-point values in the XMM source register and copies these bits to the LO 4 bits of the destination register.

  • The movmskpd instruction copies the sign bits from the two double-precision floating-point values in the source XMM register to bits 0 and 1 of the destination register.

  • The vmovmskps instruction extracts the sign bits from the four or eight single-precision floating-point values in the XMM or YMM source register and copies these bits to the LO 4 or 8 bits of the destination register.

  • The vmovmskpd instruction copies the sign bits from the four double-precision floating-point values in the source YMM register to bits 0 to 3 of the destination register.
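
  • The sign-bit extraction can be sketched in Python. The name movmsk is a hypothetical helper that covers both the single- and double-precision forms; lanes are assumed to be a list of Python floats with index 0 as the LO lane.

```python
import math


def movmsk(values):
    """Copy the sign bit of lane i into bit i of an integer mask."""
    mask = 0
    for i, value in enumerate(values):
        # copysign reads the sign bit itself, so -0.0 counts as negative.
        if math.copysign(1.0, value) < 0:
            mask |= 1 << i
    return mask


sign_mask = movmsk([1.0, -2.0, -0.0, 3.0])  # 0b0110
```

  • Because only the sign bit is examined, the same model applies to the all-1s and all-0s lane masks produced by the packed comparison instructions.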

vmovmskpd operation

Floating-Point Conversion Instructions

  • The table below contains various SIMD floating point conversion instructions:
