SIMD Instructions

Chapter 11

  • This chapter discusses the vector instructions on the x86-64. This special class of instructions provides parallel processing, traditionally known as single-instruction, multiple-data (SIMD) instructions because, quite literally, a single instruction operates on several pieces of data concurrently.

  • As a result of this concurrency, SIMD instructions can often execute several times faster (in theory, as much as 32 to 64 times faster) than the comparable single instruction, single-data (SISD), or scalar, instructions that compose the standard x86-64 instruction set.

  • The x86-64 actually provides three sets of vector instructions: the Multimedia Extensions (MMX) instruction set, the Streaming SIMD Extensions (SSE) instruction set, and the Advanced Vector Extensions (AVX) instruction set.

The SSE/AVX Architecture

  • The SSE/AVX architectures have three main generations:

    • The SSE architecture, which (on 64-bit CPUs) provided sixteen 128-bit XMM registers supporting integer and floating-point data types

    • The AVX/AVX2 architecture, which supported sixteen 256-bit YMM registers (also supporting integer and floating-point data types)

    • The AVX-512 architecture, which supported up to thirty-two 512-bit ZMM registers

Streaming Data Types

  • The SSE and AVX programming models support two basic data types: scalars and vectors. Scalars hold one single- or double-precision floating-point value. Vectors hold multiple floating-point or integer values (between 2 and 32 values, depending on the scalar data type of byte, word, dword, qword, single precision, or double precision, and the register and memory size of 128 or 256 bits).

  • The XMM registers (XMM0 to XMM15) can hold a single 32-bit floating-point value (a scalar) or four single-precision floating-point values (a vector). The YMM registers (YMM0 to YMM15) can hold eight single-precision (32-bit) floating-point values.

Packed and scalar single-precision floating-point data type
  • The XMM registers can hold a single double-precision scalar value or a vector containing a pair of double-precision values. The YMM registers can hold a vector containing four double-precision floating-point values.

Packed and scalar double-precision floating-point type
  • The XMM registers can hold 16 byte values (YMM registers can hold 32 byte values), allowing the CPU to perform 16 (32) byte-sized computations with one instruction.

Packed byte data type
  • The XMM registers can hold eight word values (YMM registers can hold sixteen word values), allowing the CPU to perform eight (sixteen) 16-bit word-sized integer computations with one instruction.

Packed word data type
  • The XMM registers can hold four dword values (YMM registers can hold eight dword values), allowing the CPU to perform four (eight) 32-bit dword-sized integer computations with one instruction.

Packed double-word data type
  • The XMM registers can hold two qword values (YMM registers can hold four qword values), allowing the CPU to perform two (four) 64-bit qword computations with one instruction.

Packed quad-word data type

Using cpuid to Differentiate Instruction Sets

  • The cpuid instruction expects a single parameter (called a leaf function) passed in the EAX register. It returns various pieces of information about the CPU in different 32-bit registers based on the value passed in EAX. An application can test the return information to see if certain CPU features are available.

  • To determine the highest leaf function supported, you load EAX with 0 (or 8000_0000h for the extended functions) and execute the cpuid instruction. The value cpuid returns in EAX is the maximum you can pass to cpuid in EAX.

  • In addition to providing the highest function supported, the cpuid instruction with EAX = 0h also returns a 12-character vendor ID in the EBX, EDX, and ECX registers (in that order).

  • To determine if the CPU can execute most SSE and AVX instructions, you must execute cpuid with EAX = 01h and test various bits placed in the ECX register.

; Test for BMI1 and BMI2 Instructions

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-2", 0
           

            .data
maxFeature  dword   ?
VendorID    byte    14 dup (0)

            
            .code
            externdef printf:proc

; Used for debugging:

print       proc
            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11
            
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40
            and     rsp, -16
            
            mov     rcx, [rbp+72]   ;Return address
            call    printf
            
            mov     rcx, [rbp+72]
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     [rbp+72], rcx
            
            leave
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print       endp

         
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ;Shadow storage

            xor     eax, eax
            cpuid
            mov     maxFeature, eax
            mov     dword ptr VendorID, ebx 
            mov     dword ptr VendorID[4], edx 
            mov     dword ptr VendorID[8], ecx
            
            lea     rdx, VendorID
            mov     r8d, eax
            call    print
            byte    "CPUID(0): Vendor ID='%s', max feature=0%xh", nl, 0
            
            
; Leaf function 1 is available on all CPUs that support
; CPUID, no need to test for it. 

            mov     eax, 1
            cpuid
            mov     r8d, edx
            mov     edx, ecx
            call    print
            byte    "cpuid(1), ECX=%08x, EDX=%08x", nl, 0
            
; Most likely, leaf function 7 is supported on all modern CPUs
; (e.g., x86-64), but we'll test its availability nonetheless.

            cmp     maxFeature, 7
            jb      Unsupported
            
            mov     eax, 7
            xor     ecx, ecx
            cpuid
            mov     edx, ebx
            mov     r8d, ecx
            call    print
            byte    "cpuid(7), EBX=%08x, ECX=%08x", nl, 0
            
; Test for extended bit manipulation instructions 
; (BMI1 and BMI2):

            and     ebx, 108h       ;Test bits 3 & 8
            cmp     ebx, 108h       ;Both must be set
            jne     Unsupported
            call    print
            byte    "CPU supports BMI1 & BMI2", nl, 0
            jmp     allDone 
             
             
Unsupported:
            call    print
            byte    "CPU does not support BMI1 & BMI2 "
            byte    "instructions", nl, 0
             
allDone:    leave
            pop     rbx
            ret     ;Returns to caller
asmMain     endp
            end

Memory Operand Alignment

  • Most SSE instructions that access a 128-bit memory operand will generate a memory alignment fault if that operand's address is not 16-byte aligned. The AVX aligned-move instructions (such as vmovaps, vmovapd, and vmovdqa) likewise fault on unaligned 16- or 32-byte operands, though most other VEX-encoded instructions tolerate unaligned accesses. Always ensure that your SSE packed operands are properly aligned.

  • Because segments have a default alignment of para (16 bytes), you can easily ensure that any 16-byte packed data objects are 16-byte-aligned by using the align directive:

align 16
  • If you are using AVX instructions to access 256-bit (32-byte) memory operands, you must ensure that those memory operands begin on a 32-byte address boundary. Unfortunately, align 32 won’t work, because the default segment alignment is para (16-byte) alignment, and the segment’s alignment must be greater than or equal to the operand field of any align directives appearing within that segment. Therefore, to be able to define 256-bit variables usable by AVX instructions, you must explicitly define a segment that is aligned on a 32-byte boundary, such as the following:

avxData  segment align(32)
         align 32         ; This is actually redundant here
someData oword 0, 1       ; 256 bits of data
           .
           .
           .
avxData  ends
  • Inside a procedure, you can allocate storage for a 16-, 32-, or 64-byte-aligned variable by over-allocating the storage, adding the size minus 1 of the object to the allocated address, and then using the and instruction to zero out LO bits of the address (4 bits for 16-byte-aligned objects, 5 bits for 32-byte-aligned objects, and 6 bits for 64-byte-aligned objects). Then reference the object by using this pointer.

sseproc  proc
sseptr   equ <[rbp - 8]>
avxptr   equ <[rbp - 16]>
avx2ptr  equ <[rbp - 24]>
         push rbp
         mov rbp, rsp
         sub rsp, 160

; Load RAX with an address 63 bytes
; above the current stack pointer. A
; 64-byte-aligned address will be somewhere
; between RSP and RSP + 63.

         lea rax, [rsp + 63]

; Mask out the LO 6 bits of RAX. This
; generates an address in RAX that is
; aligned on a 64-byte boundary and is
; between RSP and RSP + 63:
 
         and rax, -64 ; 0FFFF...FC0h

; Save this 64-byte-aligned address as
; the pointer to the AVX2 data:
 
         mov avx2ptr, rax

; Add 64 to AVX2's address. This skips
; over AVX2's data. The address is also
; 64-byte-aligned (which means it is
; also 32-byte-aligned). Use this as
; the address of AVX's data:
 
         add rax, 64
         mov avxptr, rax

; Add 32 to AVX's address. This skips
; over AVX's data. The address is also
; 32-byte-aligned (which means it is
; also 16-byte-aligned). Use this as
; the address of SSE's data:
 
         add rax, 32
         mov sseptr, rax
          .
          . Code that accesses the
          . AVX2, AVX, and SSE data
          . areas using avx2ptr,
          . avxptr, and sseptr
 
         leave
         ret
sseproc  endp

SIMD Data Movement Instructions

The (v)movd and (v)movq Instructions

  • For the SSE instruction set, the movd (move dword) and movq (move qword) instructions copy the value from a 32- or 64-bit general-purpose register or memory location into the LO dword or qword of an XMM register.

movd xmmn, reg32/mem32
movq xmmn, reg64/mem64
Moving a 32-bit value from memory to an XMM register (with zero extension)
Moving a 64-bit value from memory to an XMM register (with zero extension)
  • The following instructions store the LO 32 or 64 bits of an XMM register into a dword or qword memory location or general-purpose register:

movd reg32/mem32, xmmn
movq reg64/mem64, xmmn
  • For the AVX instructions, you use the following forms:

vmovd xmmn, reg32/mem32
vmovd reg32/mem32, xmmn
vmovq xmmn, reg64/mem64
vmovq reg64/mem64, xmmn

The (v)movaps, (v)movapd, and (v)movdqa Instructions

  • The movaps (move aligned, packed single), movapd (move aligned, packed double), and movdqa (move double quad-word aligned) instructions move 16 bytes of data between memory and an XMM register or between two XMM registers. The AVX versions (with the v prefix) move 16 or 32 bytes between memory and an XMM or a YMM register or between two XMM or YMM registers.

movaps xmmn, mem128     vmovaps xmmn, mem128      vmovaps ymmn, mem256
movaps mem128, xmmn     vmovaps mem128, xmmn      vmovaps mem256, ymmn
movaps xmmn, xmmn       vmovaps xmmn, xmmn        vmovaps ymmn, ymmn
movapd xmmn, mem128     vmovapd xmmn, mem128      vmovapd ymmn, mem256
movapd mem128, xmmn     vmovapd mem128, xmmn      vmovapd mem256, ymmn
movapd xmmn, xmmn       vmovapd xmmn, xmmn        vmovapd ymmn, ymmn
movdqa xmmn, mem128     vmovdqa xmmn, mem128      vmovdqa ymmn, mem256
movdqa mem128, xmmn     vmovdqa mem128, xmmn      vmovdqa mem256, ymmn
movdqa xmmn, xmmn       vmovdqa xmmn, xmmn        vmovdqa ymmn, ymmn
  • The mem128 operand should be a vector (array) of four single-precision floating-point values for the (v)movaps instruction; it should be a vector of two double-precision floating-point values for the (v)movapd instruction; it should be a 16-byte value (16 bytes, 8 words, 4 dwords, or 2 qwords) when using the (v)movdqa instruction.

  • The mem256 operand should be a vector (array) of eight single-precision floating-point values for the vmovaps instruction; it should be a vector of four double-precision floating-point values for the vmovapd instruction; it should be a 32-byte value (32 bytes, 16 words, 8 dwords, or 4 qwords) when using the vmovdqa instruction.

  • You will need to use operand type coercion if the instruction doesn’t match one of the following types:

    • The movaps instruction allows real4, dword, and oword operands.

    • The movapd instruction allows real8, qword, and oword operands.

    • The movdqa instruction allows only oword operands.

    • The vmovaps instruction allows real4, dword, and ymmword ptr operands (when using a YMM register).

    • The vmovapd instruction allows real8, qword, and ymmword ptr operands (when using a YMM register).

    • The vmovdqa instruction allows only ymmword ptr operands (when using a YMM register).

The (v)movups, (v)movupd, and (v)movdqu Instructions

  • When you cannot guarantee that packed data memory operands lie on a 16- or 32-byte address boundary, you can use the (v)movups (move unaligned packed single-precision), (v)movupd (move unaligned packed double-precision), and (v)movdqu (move double quad-word unaligned) instructions to move data between XMM or YMM registers and memory.

  • These instructions typically run slower than their aligned equivalents.

The (v)movlps and (v)movlpd Instructions

  • The (v)movl* instructions and (v)movh* instructions allow you to merge two pairs of single-precision or a pair of double-precision floating-point operands from two different sources into a single XMM register.

movlps xmmdest, mem64
movlps mem64, xmmsrc
vmovlps xmmdest, xmmsrc, mem64
vmovlps mem64, xmmsrc
  • The movlps xmmdest, mem64 form copies a pair of single-precision floating-point values into the two LO 32-bit lanes of a destination XMM register.

movlps instruction
  • The movlps mem64, xmmsrc form copies the LO 64 bits (the two LO single-precision lanes) from the XMM source register to the specified memory location.

  • The vmovlps instruction has three operands: a destination XMM register, a source XMM register, and a source (64-bit) memory location. This instruction copies the two single-precision values from the memory location into the LO 64 bits of the destination XMM register. It copies the HO 64 bits of the source register into the HO 64 bits of the destination register.

vmovlps instruction
  • Like movsd, the movlpd (move low packed double) instruction copies the LO 64 bits (a double-precision floating-point value) of the source operand to the LO 64 bits of the destination operand. The difference is that the movlpd instruction doesn’t zero-extend the value when moving data from memory into an XMM register, whereas the movsd instruction will zero-extend.

The movhps and movhpd Instructions

  • The movhps and movhpd instructions move a 64-bit value into the HO quad word of a destination XMM register.

  • The movhps and movhpd instructions can also store the HO quad word of an XMM register into memory. These instructions do not affect bits 128 to 255 of the YMM registers.

movhps xmmn, mem64
movhps mem64, xmmn
movhpd xmmn, mem64
movhpd mem64, xmmn
  • You would normally use a movlps instruction followed by a movhps instruction to load four single-precision floating-point values into an XMM register, taking the floating-point values from two different data sources.

The vmovhps and vmovhpd Instructions

  • Although the AVX instruction extensions provide vmovhps and vmovhpd instructions, they are not a simple extension of the SSE movhps and movhpd instructions.

vmovhps xmmdest, xmmsrc, mem64
vmovhps mem64, xmmsrc
vmovhpd xmmdest, xmmsrc, mem64
vmovhpd mem64, xmmsrc
  • The instructions that store data into a 64-bit memory location behave similarly to the movhps and movhpd instructions.

  • The instructions that load data into an XMM register have two source operands. They load a full 128 bits into the destination XMM register. The HO 64 bits come from the memory operand and the LO 64 bits come from the LO quad word of the source XMM register.

vmovhpd and vmovhps instructions
  • Unlike for the movhps instruction, MASM properly accepts real4 source operands for the vmovhps instruction.

The movlhps and vmovlhps Instructions

  • The movlhps instruction moves a pair of 32-bit single-precision floating-point values from the LO qword of the source XMM register into the HO 64 bits of a destination XMM register. It leaves the LO 64 bits of the destination register unchanged.

  • If the destination register is on a CPU that supports 256-bit AVX registers, this instruction also leaves the HO 128 bits of the overlaid YMM register unchanged.

movlhps xmmdest, xmmsrc
vmovlhps xmmdest, xmmsrc1, xmmsrc2
  • You cannot use this instruction to move data between memory and an XMM register; it transfers data only between XMM registers. No doubleprecision version of this instruction exists.

  • The vmovlhps instruction is similar to movlhps, with the following differences:

    • vmovlhps requires three operands: two source XMM registers and a destination XMM register.

    • vmovlhps copies the LO quad word of the first source register into the LO quad word of the destination register.

    • vmovlhps copies the LO quad word of the second source register into bits 64 to 127 of the destination register.

    • vmovlhps zero-extends the result into the upper 128 bits of the overlaid YMM register.

The movhlps and vmovhlps Instructions

  • The movhlps instruction copies the pair of 32-bit single-precision floating-point values from the HO qword of the source operand to the LO qword of the destination register, leaving the HO 64 bits of the destination register unchanged. This instruction copies data only between XMM registers; it does not allow a memory operand.

movhlps xmmdest, xmmsrc
  • The vmovhlps instruction requires three XMM register operands. This instruction copies the HO 64 bits of the first source register into the HO 64 bits of the destination register, copies the HO 64 bits of the second source register into bits 0 to 63 of the destination register, and finally, zero-extends the result into the upper bits of the overlaid YMM register.

vmovhlps xmmdest, xmmsrc1, xmmsrc2

The (v)movshdup and (v)movsldup Instructions

  • The movshdup instruction moves the two odd-index single-precision floating-point values from the source operand (memory or XMM register) and duplicates each element into the destination XMM register.

  • This instruction ignores the single-precision floating-point values at even-lane indexes into the XMM register.

  • The vmovshdup instruction works the same way but also operates on YMM registers, duplicating four odd-indexed single-precision values rather than two.

movshdup xmmdest, mem128/xmmsrc
vmovshdup xmmdest, mem128/xmmsrc
vmovshdup ymmdest, mem256/ymmsrc
movshdup and vmovshdup instructions
  • The movsldup instruction works just like the movshdup instruction, except it copies and duplicates the two single-precision values at even indexes in the source XMM register to the destination XMM register. Likewise, the vmovsldup instruction copies and duplicates the four single-precision values at even indexes in the source YMM register.

movsldup xmmdest, mem128/xmmsrc
vmovsldup xmmdest, mem128/xmmsrc
vmovsldup ymmdest, mem256/ymmsrc
movsldup and vmovsldup instructions

The (v)movddup Instruction

  • The movddup instruction copies and duplicates a double-precision value from the LO 64 bits of an XMM register or a 64-bit memory location into the LO 64 bits of a destination XMM register; then it also duplicates this value into bits 64 to 127 of that same destination register.

  • This instruction does not disturb the HO 128 bits of a YMM register.

movddup xmmdest, mem64/xmmsrc
  • The vmovddup instruction operates on an XMM or a YMM destination register and an XMM or a YMM source register or a 64- or 256-bit memory location.

  • The 128-bit version works just like the movddup instruction except it zeroes the HO bits of the destination YMM register. The 256-bit version copies a pair of double-precision values at even indexes (0 and 2) in the source value to their corresponding indexes in the destination YMM register and duplicates those values at the odd indexes in the destination.

vmovddup instruction behavior
vmovddup xmmdest, mem64/xmmsrc
vmovddup ymmdest, mem256/ymmsrc

The (v)lddqu Instruction

  • The (v)lddqu instruction is operationally identical to (v)movdqu. You can sometimes use this instruction to improve performance if the (memory) source operand is not aligned properly and crosses a cache line boundary in memory.

lddqu xmmdest, mem128
vlddqu xmmdest, mem128
vlddqu ymmdest, mem256

The Shuffle and Unpack Instructions

The (v)pshufb Instructions

  • The pshufb instruction was the first packed byte shuffle SIMD instruction. The first (destination) operand is an XMM register whose byte lanes pshufb will shuffle (rearrange). The second operand (either an XMM register or a 128-bit oword memory location) is an array of 16 byte values holding indexes that control the shuffle operation.

pshufb xmmdest, xmm/mem128
  • If the second operand is a memory location, that oword value must be aligned on a 16-byte boundary. Each byte (lane) in the second operand selects a value for the corresponding byte lane in the first operand.

Lane index correspondence for pshufb instruction
  • The 16-byte indexes in the second operand each take the form shown below

pshufb byte index
  • The pshufb instruction ignores bits 4 to 6 in an index byte. Bit 7 is the clear bit; if this bit contains a 1, the pshufb instruction ignores the lane index bits and stores a 0 into the corresponding byte in XMMdest. If the clear bit contains a 0, the pshufb instruction does a shuffle operation.

  • The pshufb shuffle operation takes place on a lane-by-lane basis. The instruction first makes a temporary copy of XMMdest. Then, for each index byte whose HO bit is 0, pshufb copies the byte from the temporary-copy lane selected by the LO 4 bits of that index into the destination lane corresponding to the index byte's position.

Shuffle operation
  • The AVX variant adds a source register (rather than using XMMdest as both the source and destination registers), and, rather than creating a temporary copy of XMMdest prior to the operation and picking the values from that copy, the vpshufb instructions select the source bytes from the XMMsrc register. Other than that, and the fact that these instructions zero the HO bits of YMMdest, the 128-bit variant operates identically to the SSE pshufb instruction.

  • The AVX instruction allows you to specify 256-bit YMM registers in addition to 128-bit XMM registers. The 256-bit variant shuffles the two 128-bit lanes independently; each index byte selects a source byte only from its own 128-bit lane.

vpshufb xmmdest, xmmsrc, xmmindex/mem128
vpshufb ymmdest, ymmsrc, ymmindex/mem256

The (v)pshufd Instructions

  • These instructions shuffle dwords in XMM and YMM registers (not double-precision values) similarly to the (v)pshufb instructions. However, the shuffle index is specified differently from (v)pshufb.

pshufd xmmdest, xmmsrc/mem128, imm8
vpshufd xmmdest, xmmsrc/mem128, imm8
vpshufd ymmdest, ymmsrc/mem256, imm8
  • The first operand (XMMdest or YMMdest) is the destination operand where the shuffled values will be stored. The second operand is the source from which the instruction will select the double words to place in the destination register. The third operand is an 8-bit immediate value that specifies the indexes for the double words to select from the source operand.

  • The value in bits 0 to 1 selects a particular dword from the source operand to place in dword 0 of the XMMdest operand. The value in bits 2 to 3 selects a dword from the source operand to place in dword 1 of the XMMdest operand. The value in bits 4 to 5 selects a dword from the source operand to place in dword 2 of the XMMdest operand. Finally, the value in bits 6 to 7 selects a dword from the source operand to place in dword 3 of the XMMdest operand.

  • The difference between the 128-bit pshufd and vpshufd instructions is that pshufd leaves the HO 128 bits of the underlying YMM register unchanged and vpshufd zeroes the HO 128 bits of the underlying YMM register.

  • The 256-bit variant of vpshufd (when using YMM registers as the source and destination operands) still uses an 8-bit immediate operand as the index value. Each 2-bit index value manipulates two dword values in the YMM registers. Bits 0 to 1 control dwords 0 and 4, bits 2 to 3 control dwords 1 and 5, bits 4 to 5 control dwords 2 and 6, and bits 6 to 7 control dwords 3 and 7.

  • The 256-bit version is slightly less flexible as it copies two dwords at a time, rather than one.

The (v)pshuflw and (v)pshufhw Instructions

  • The pshuflw and vpshuflw and the pshufhw and vpshufhw instructions provide support for 16-bit word shuffles within an XMM or a YMM register.

pshuflw xmmdest, xmmsrc/mem128, imm8
pshufhw xmmdest, xmmsrc/mem128, imm8

vpshuflw xmmdest, xmmsrc/mem128, imm8
vpshufhw xmmdest, xmmsrc/mem128, imm8

vpshuflw ymmdest, ymmsrc/mem256, imm8
vpshufhw ymmdest, ymmsrc/mem256, imm8
  • The 128-bit lw variants copy the HO 64 bits of the source operand to the same positions in the XMMdest operand. Then they use the index (imm8) operand to select word lanes 0 to 3 in the LO qword of the XMMsrc/mem128 operand to move to the LO 4 lanes of the destination operand. Note that pshuflw does not modify the HO 128 bits of the overlaid YMM register, whereas vpshuflw zeroes those HO bits.

(v)pshuflw operation
  • The 256-bit vpshuflw instruction (with a YMM destination register) performs the same operation independently on each 128-bit lane of the YMM destination register and 256-bit source location: the HO qword of each lane is copied unchanged, and the LO four words of each lane are shuffled. The index (imm8) selection is the same for the LO and HO 128-bit lanes.

vpshuflw operation
  • The 128-bit hw variants copy the LO 64 bits of the source operand to the same positions in the destination operand. Then they use the index operand to select words 4 to 7 (indexed as 0 to 3) in the 128-bit source operand to move to the HO four word lanes of the destination operand.

(v)pshufhw operation
  • The 256-bit vpshufhw instruction (with a YMM destination register) likewise performs the same operation independently on each 128-bit lane of the YMM destination register and 256-bit source location, using the same imm8 selection for both lanes.

vpshufhw operation

The shufps and shufpd Instructions

  • The shuffle instructions (shufps and shufpd) extract single- or double-precision values from the source operands and place them in specified positions in the destination operand. The third operand, an 8-bit immediate value, selects which values to extract from the source to move into the destination register.

shufps xmmsrc1/dest, xmmsrc2/mem128, imm8
shufpd xmmsrc1/dest, xmmsrc2/mem128, imm8
  • For the shufps instruction, the third operand is an 8-bit immediate value that is actually a four-element array of 2-bit values.

  • imm8 bits 0 and 1 select a single-precision value from one of the four lanes in the XMMsrc1/dest operand to store into lane 0 of the destination operand (the destination operand is also XMMsrc1/dest). Bits 2 and 3 select a lane from XMMsrc1/dest to store into lane 1 of the destination. imm8 bits 4 and 5 select a single-precision value from one of the four lanes in the XMMsrc2/mem128 operand to store into lane 2 of the destination, and bits 6 and 7 select a lane from XMMsrc2/mem128 to store into lane 3 of the destination.

shufps operation
  • The shufpd instruction works similarly, shuffling double-precision values. As there are only two double-precision values in an XMM register, it takes only a single bit to choose between the values.

  • As a result, the third operand, the imm8 value, is actually just a 2-bit value; the instruction ignores bits 2 to 7 in the imm8 operand. Bit 0 of the imm8 operand selects either lane 0 (bits 0 to 63, if it is 0) or lane 1 (bits 64 to 127, if it is 1) from the XMMsrc1/dest operand to place into lane 0 (bits 0 to 63) of XMMdest. Bit 1 of the imm8 operand selects either lane 0 or lane 1 from the XMMsrc2/mem128 operand to place into lane 1 (bits 64 to 127) of XMMdest.

shufpd operation

The vshufps and vshufpd Instructions

  • The vshufps and vshufpd instructions are similar to shufps and shufpd. They allow you to shuffle the values in 128-bit XMM registers or 256-bit YMM registers.

  • The vshufps and vshufpd instructions have four operands: a destination XMM or YMM register, two source operands (src1 must be an XMM or a YMM register, and src2 can be an XMM or a YMM register or a 128- or 256-bit memory location), and an imm8 operand.

vshufps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vshufpd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8

vshufps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
vshufpd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
  • For the 256-bit vshufps instruction, the imm8 operand is an array of four 2-bit values. These 2-bit values select one of four single-precision values from the corresponding 128-bit lane of the source operands.

  • The vshufps instruction also allows you to specify XMM and 128-bit memory operands. In this form, it behaves quite similarly to the shufps instruction except that you get to specify two different 128-bit source operands and it zeroes the HO 128 bits of the corresponding YMM register.

  • The vshufpd instruction is an extension of shufpd to 256 bits. As there are four double-precision values present in a 256-bit YMM register, vshufpd needs 4 bits to select the source indexes.

  • Like the vshufps instruction, vshufpd also allows you to specify XMM registers if you want a three-operand version of shufpd.

The (v)unpcklps, (v)unpckhps, (v)unpcklpd, and (v)unpckhpd Instructions

  • The unpack (and merge) instructions are a simplified variant of the shuffle instructions. These instructions copy single- and double-precision values from fixed locations in their source operands and insert those values into fixed locations in the destination operand.

  • The unpcklps and unpckhps instructions choose half their single-precision operands from one of two sources, merge these values, and then store the merged result into the destination operand.

unpcklps xmmdest, xmmsrc/mem128
unpckhps xmmdest, xmmsrc/mem128
  • The XMMdest operand serves as both the first source operand and the destination operand. The XMMsrc/mem128 operand is the second source operand.

  • The unpcklps instruction copies the two LO single-precision values from the source operand to bit positions 32 to 63 (dword 1) and 96 to 127 (dword 3) of the destination. It leaves dword 0 in the destination operand alone and copies the value originally in dword 1 to dword 2 in the destination.

unpcklps instruction operation
  • The unpckhps instruction copies the two HO single-precision values from the two sources to the destination register.

unpckhps instruction operation
  • The unpcklpd and unpckhpd instructions do the same thing as unpcklps and unpckhps except, of course, they operate on double-precision values rather than single-precision values.

unpcklpd instruction operation
unpckhpd instruction operation
  • The vunpcklps, vunpckhps, vunpcklpd, and vunpckhpd instructions have the following syntax:

vunpcklps xmmdest, xmmsrc1, xmmsrc2/mem128
vunpckhps xmmdest, xmmsrc1, xmmsrc2/mem128
vunpcklpd xmmdest, xmmsrc1, xmmsrc2/mem128
vunpckhpd xmmdest, xmmsrc1, xmmsrc2/mem128

vunpcklps ymmdest, ymmsrc1, ymmsrc2/mem256
vunpckhps ymmdest, ymmsrc1, ymmsrc2/mem256
vunpcklpd ymmdest, ymmsrc1, ymmsrc2/mem256
vunpckhpd ymmdest, ymmsrc1, ymmsrc2/mem256
  • They work similarly to the non-v variants, with a few differences:

    • The AVX variants support using the YMM registers as well as the XMM registers.

    • The AVX variants require three operands. The first (destination) and second (source1) operands must be XMM or YMM registers. The third (source2) operand can be an XMM or a YMM register or a 128- or 256-bit memory location. The two-operand form is just a special case of the three-operand form, where the first and second operands specify the same register name.

    • The 128-bit variants zero out the HO bits of the YMM register rather than leaving those bits unchanged.

  • The interleaving extension happens in the intuitive way as shown below, with vunpcklps:

vunpcklps instruction operation
  • The vunpckhps instruction does the following:

vunpckhps instruction operation

The punpck* Instructions

  • The punpck* instructions provide a set of integer unpack instructions to complement the floating-point variants.

  • The punpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination SSE register. These instructions do not modify the upper 128 bits of any overlaid YMM register.

punpcklbw xmmdest, xmmsrc
punpcklbw xmmdest, memsrc
punpckhbw xmmdest, xmmsrc
punpckhbw xmmdest, memsrc
punpcklwd xmmdest, xmmsrc
punpcklwd xmmdest, memsrc
punpckhwd xmmdest, xmmsrc
punpckhwd xmmdest, memsrc
punpckldq xmmdest, xmmsrc
punpckldq xmmdest, memsrc
punpckhdq xmmdest, xmmsrc
punpckhdq xmmdest, memsrc
punpcklqdq xmmdest, xmmsrc
punpcklqdq xmmdest, memsrc
punpckhqdq xmmdest, xmmsrc
punpckhqdq xmmdest, memsrc
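The integer unpack semantics mirror the floating-point ones; for example, punpcklbw interleaves the LO eight bytes of its two sources (a Python sketch of the lane behavior, with lane 0 as the LO byte):

```python
def punpcklbw(dest, src):
    # Interleave the LO 8 bytes of dest and src: d0, s0, d1, s1, ..., d7, s7.
    out = []
    for i in range(8):
        out += [dest[i], src[i]]
    return out

d = list(range(16))         # destination bytes 0..15
s = list(range(100, 116))   # source bytes 100..115
print(punpcklbw(d, s)[:6])  # [0, 100, 1, 101, 2, 102]
```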

The 128-Bit vpunpck* Instructions

  • The AVX vpunpck* instructions provide a set of AVX integer unpack instructions to complement the SSE variants.

  • The vpunpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination AVX or SSE register.

vpunpcklbw xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhbw xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklwd xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhwd xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckldq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhdq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklqdq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhqdq xmmdest, xmmsrc1, xmmsrc2/mem128
  • Functionally, the only difference between these AVX instructions (vpunpck*) and the SSE (punpck*) instructions is that the SSE variants leave the upper bits of the YMM AVX registers (bits 128 to 255) unchanged, whereas the AVX variants zero-extend the result to 256 bits.

The 256-Bit vpunpck* Instructions

  • The vpunpck* instructions also support the use of the AVX YMM registers, in which case the unpack-and-merge operation extends from 128 bits to 256 bits.

vpunpcklbw ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhbw ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklwd ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhwd ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckldq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhdq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklqdq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhqdq ymmdest, ymmsrc1, ymmsrc2/mem256

The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq Instructions

  • The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq instructions extract a byte, word, dword, or qword from a 128-bit XMM register and copy this data to a general-purpose register or memory location.

pextrb reg32, xmmsrc, imm8  ; imm8 = 0 to 15
pextrb reg64, xmmsrc, imm8  ; imm8 = 0 to 15
pextrb mem8, xmmsrc, imm8   ; imm8 = 0 to 15

vpextrb reg32, xmmsrc, imm8 ; imm8 = 0 to 15
vpextrb reg64, xmmsrc, imm8 ; imm8 = 0 to 15
vpextrb mem8, xmmsrc, imm8  ; imm8 = 0 to 15

pextrw reg32, xmmsrc, imm8  ; imm8 = 0 to 7
pextrw reg64, xmmsrc, imm8  ; imm8 = 0 to 7
pextrw mem16, xmmsrc, imm8  ; imm8 = 0 to 7
vpextrw reg32, xmmsrc, imm8 ; imm8 = 0 to 7
vpextrw reg64, xmmsrc, imm8 ; imm8 = 0 to 7
vpextrw mem16, xmmsrc, imm8 ; imm8 = 0 to 7

pextrd reg32, xmmsrc, imm8  ; imm8 = 0 to 3
pextrd mem32, xmmsrc, imm8  ; imm8 = 0 to 3
vpextrd reg32, xmmsrc, imm8 ; imm8 = 0 to 3
vpextrd mem32, xmmsrc, imm8 ; imm8 = 0 to 3

pextrq reg64, xmmsrc, imm8  ; imm8 = 0 to 1
pextrq mem64, xmmsrc, imm8  ; imm8 = 0 to 1
vpextrq reg64, xmmsrc, imm8 ; imm8 = 0 to 1
vpextrq mem64, xmmsrc, imm8 ; imm8 = 0 to 1
  • The byte and word instructions expect a 32- or 64-bit general-purpose register as their destination, or a memory location the same size as the data being extracted.

  • The double-word and quad-word variants require a 32-bit or 64-bit general-purpose register, respectively. If the destination operand is a 32- or 64-bit general-purpose register, the instruction zero-extends the value to 32 or 64 bits, if necessary.

  • These instructions do not support extracting data from the upper 128 bits of a YMM register.

The (v)pinsrb, (v)pinsrw, (v)pinsrd, and (v)pinsrq Instructions

  • The (v)pinsr{b,w,d,q} instructions take a byte, word, dword, or qword from a general-purpose register or memory location and store that data to a lane of an XMM register.

pinsrb xmmdest, reg32, imm8           ; imm8 = 0 to 15
pinsrb xmmdest, mem8, imm8            ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, mem8, imm8  ; imm8 = 0 to 15

pinsrw xmmdest, reg32, imm8           ; imm8 = 0 to 7
pinsrw xmmdest, mem16, imm8           ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, mem16, imm8 ; imm8 = 0 to 7

pinsrd xmmdest, reg32, imm8           ; imm8 = 0 to 3
pinsrd xmmdest, mem32, imm8           ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, reg32, imm8 ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, mem32, imm8 ; imm8 = 0 to 3

pinsrq xmmdest, reg64, imm8           ; imm8 = 0 to 1
pinsrq xmmdest, mem64, imm8           ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, reg64, imm8 ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, mem64, imm8 ; imm8 = 0 to 1
  • These instructions fetch a byte, word, dword, or qword from the general-purpose register or memory location and copy it to the lane in the XMM register specified by the 8-bit immediate value. The pinsr{b,w,d,q} instructions leave any HO bits in the underlying YMM register unchanged.

  • The vpinsr{b,w,d,q} instructions copy the data from the XMM source register into the destination register and then copy the byte, word, dword, or quad word to the specified location in the destination register. These instructions zero-extend the value throughout the HO bits of the underlying YMM register.
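Together, the extract and insert instructions behave like indexed reads and writes on an array of lanes. A Python sketch of the pextrd/pinsrd lane semantics (a simulation, not the instructions themselves):

```python
def pextrd(xmm, imm8):
    # Read the selected 32-bit lane (zero-extended into the destination).
    return xmm[imm8 & 3]

def pinsrd(xmm, value, imm8):
    # Write the selected 32-bit lane; all other lanes are unchanged.
    out = list(xmm)
    out[imm8 & 3] = value & 0xFFFFFFFF
    return out

x = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
print(hex(pextrd(x, 2)))  # 0x33333333
print([hex(v) for v in pinsrd(x, 0xDEADBEEF, 1)])
```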

The (v)extractps and (v)insertps Instructions

  • The extractps and vextractps instructions are functionally equivalent to pextrd and vpextrd. They extract a 32-bit value from an XMM register and move it into a 32-bit general-purpose register or a 32-bit memory location.

extractps reg32, xmmsrc, imm8
extractps mem32, xmmsrc, imm8
vextractps reg32, xmmsrc, imm8
vextractps mem32, xmmsrc, imm8
  • The insertps and vinsertps instructions insert a 32-bit floating-point value into an XMM register and, optionally, zero out other lanes in the XMM register.

insertps xmmdest, xmmsrc, imm8
insertps xmmdest, mem32, imm8
vinsertps xmmdest, xmmsrc1, xmmsrc2, imm8
vinsertps xmmdest, xmmsrc1, mem32, imm8
  • On CPUs with the AVX extensions, insertps does not modify the upper bits of the YMM registers; vinsertps zeroes the upper bits. The vinsertps instruction first copies the XMMsrc1 register to XMMdest before performing the insertion operation. The HO bits of the corresponding YMM register are set to 0.

SIMD Arithmetic and Logical Operations

Logical Instructions

SSE/AVX Logical Instructions
  • The syntax for these instructions is the following:

andpd xmmdest, xmmsrc/mem128
vandpd xmmdest, xmmsrc1, xmmsrc2/mem128
vandpd ymmdest, ymmsrc1, ymmsrc2/mem256

andnpd xmmdest, xmmsrc/mem128
vandnpd xmmdest, xmmsrc1, xmmsrc2/mem128
vandnpd ymmdest, ymmsrc1, ymmsrc2/mem256

orpd xmmdest, xmmsrc/mem128
vorpd xmmdest, xmmsrc1, xmmsrc2/mem128
vorpd ymmdest, ymmsrc1, ymmsrc2/mem256

xorpd xmmdest, xmmsrc/mem128
vxorpd xmmdest, xmmsrc1, xmmsrc2/mem128
vxorpd ymmdest, ymmsrc1, ymmsrc2/mem256
  • The SSE instructions (without the v prefix) leave the HO bits of the underlying YMM register unchanged (if applicable). The AVX instructions (with the v prefix) that have 128-bit operands will zero-extend their result into the HO bits of the YMM register.

The (v)ptest Instructions

  • The ptest instruction (packed test) is similar to the standard integer test instruction. The ptest instruction performs a logical AND between the two operands and sets the zero flag if the result is 0.

  • The ptest instruction sets the carry flag if the logical AND of the second operand with the inverted bits of the first operand produces 0.

ptest xmmsrc1, xmmsrc2/mem128
vptest xmmsrc1, xmmsrc2/mem128
vptest ymmsrc1, ymmsrc2/mem256
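The flag settings follow directly from those two definitions. A Python sketch, treating the operands as 128-bit integers:

```python
MASK128 = (1 << 128) - 1

def ptest(src1, src2):
    # ZF = 1 if src1 AND src2 is all zeros;
    # CF = 1 if (NOT src1) AND src2 is all zeros.
    zf = (src1 & src2) == 0
    cf = ((~src1 & src2) & MASK128) == 0
    return zf, cf

print(ptest(0x0F, 0xF0))  # (True, False): operands share no bits
print(ptest(0xFF, 0x0F))  # (False, True): src2's bits are a subset of src1's
```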

The Byte Shift Instructions

  • The SSE and AVX instruction set extensions also support a set of logical and arithmetic shift instructions. The first two to consider are pslldq and psrldq.

pslldq xmmdest, imm8
vpslldq xmmdest, xmmsrc, imm8
vpslldq ymmdest, ymmsrc, imm8
psrldq xmmdest, imm8
vpsrldq xmmdest, xmmsrc, imm8
vpsrldq ymmdest, ymmsrc, imm8
  • The pslldq instruction shifts its destination XMM register to the left by the number of bytes specified by the imm8 operand. This instruction shifts 0s into the vacated LO bytes.

  • The vpslldq instruction takes the value in the source register (XMM or YMM), shifts that value to the left by imm8 bytes, and then stores the result into the destination register. For the 128-bit variant, this instruction zero-extends the result into bits 128 to 255 of the underlying YMM register.

  • The psrldq and vpsrldq instructions operate similarly to (v)pslldq except, of course, they shift their operands to the right rather than to the left. These are logical shift-right operations, so they shift 0s into the HO bytes of their operand, and bits shifted out of bit 0 are lost.

  • Note that the pslldq and psrldq instructions shift bytes rather than bits. This is useful because many SSE instructions produce byte masks of 0 or 0FFh to represent Boolean results; shifting whole bytes at a time moves the equivalent of one bit in such a mask.
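A Python sketch of the byte-shift behavior (lane 0 is the LO byte; the list index grows toward the HO end):

```python
def pslldq(xmm_bytes, imm8):
    # Shift the 16-byte vector left by imm8 bytes, shifting 0s into
    # the vacated LO bytes; bytes shifted past the HO end are lost.
    n = min(imm8, 16)
    return [0] * n + xmm_bytes[:16 - n]

def psrldq(xmm_bytes, imm8):
    # Logical right shift by imm8 bytes; 0s fill the vacated HO bytes.
    n = min(imm8, 16)
    return xmm_bytes[n:] + [0] * n

v = list(range(16))
print(pslldq(v, 4)[:6])  # [0, 0, 0, 0, 0, 1]
print(psrldq(v, 4)[:4])  # [4, 5, 6, 7]
```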

The Bit Shift Instructions

  • These instructions provide word, dword, and qword variants of the logical shift-left, logical shift-right, and arithmetic shift-right operations.

shift xmmdest, imm8
shift xmmdest, xmmsrc/mem128
vshift xmmdest, xmmsrc, imm8
vshift xmmdest, xmmsrc, mem128
vshift ymmdest, ymmsrc, imm8
vshift ymmdest, ymmsrc, xmm/mem128
  • shift = psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, or psrad.

  • vshift = vpsllw, vpslld, vpsllq, vpsrlw, vpsrld, vpsrlq, vpsraw, vpsrad, or vpsraq

  • The (v)psl* instructions shift their operands to the left; the (v)psr* instructions shift their operands to the right.

  • The (v)psll* and (v)psrl* instructions are logical shift instructions and shift 0s into the bits vacated by the shift. Any bits shifted out of the operand are lost.

  • The (v)psra* instructions are arithmetic shift-right instructions. They replicate the HO bit in each lane when shifting that lane’s bits to the right; all bits shifted out of the LO bit are lost.

  • The w suffix instructions shift 16-bit operands (eight lanes for 128-bit destination operands, sixteen lanes for 256-bit destinations).

  • The d suffix instructions shift 32-bit dword operands (four lanes for 128-bit destination operands, eight lanes for 256-bit destination operands).

  • The q suffix instructions shift 64-bit operands (two lanes for 128-bit operands, four lanes for 256-bit operands).
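The per-lane arithmetic shift can be sketched in Python; psraw replicates each lane's sign bit into the vacated positions (a simulation of one lane size, not the instruction itself):

```python
def psraw(lanes, count):
    # Arithmetic shift right on signed 16-bit lanes; the HO (sign) bit
    # is replicated into the vacated bit positions.
    out = []
    for v in lanes:
        if v & 0x8000:          # reinterpret the 16-bit pattern as signed
            v -= 0x10000
        out.append((v >> count) & 0xFFFF)  # Python's >> is arithmetic
    return out

print([hex(x) for x in psraw([0x8000, 0x7FFF], 4)])  # ['0xf800', '0x7ff']
```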

SIMD Integer Addition

  • These addition instructions are known as vertical additions because if we stack the two source operands on top of each other, the lane additions occur vertically.

  • The packed additions ignore any overflow from the addition operation, keeping only the LO byte, word, dword, or qword of each addition. As long as overflow is never possible, this is not an issue.

  • A cleaner solution is to use saturation arithmetic. For unsigned addition, saturation arithmetic clips (or saturates) an overflow to the largest possible value that the instruction’s size can handle. For signed saturation arithmetic, clipping occurs at the largest positive and smallest negative values (for example, 7Fh/+127 for positive values and 80h/–128 for negative values).

  • The padd* instructions leave the HO bits of any corresponding YMM destination undisturbed; the vpadd* variants clear the HO bits.
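A Python sketch contrasting wraparound addition (as paddb does) with signed saturating addition (as paddsb does) on byte lanes:

```python
def paddb(a, b):
    # Wraparound addition: keep only the LO byte of each lane's sum.
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def paddsb(a, b):
    # Signed saturating addition: clip each sum to the range -128..+127.
    return [max(-128, min(127, x + y)) for x, y in zip(a, b)]

print(paddb([0xFF, 0x01], [0x01, 0x01]))  # [0, 2] -- overflow wraps
print(paddsb([100, -100], [50, -50]))     # [127, -128] -- overflow saturates
```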

Horizontal Additions

  • The SSE/AVX instruction sets also support three horizontal addition instructions. The horizontal addition instructions add adjacent words or dwords in their two source operands and store the sums into the destination lanes.

Horizontal Addition Instructions
Horizontal addition operation
  • The phaddw instruction does not affect the HO bits of the corresponding YMM destination register, only the LO 128 bits. It has the following syntax:

phaddw xmmdest, xmmsrc/mem128

; Computes the following

temp[0 to 15]    = xmmdest[0 to 15]         + xmmdest[16 to 31]
temp[16 to 31]   = xmmdest[32 to 47]        + xmmdest[48 to 63]
temp[32 to 47]   = xmmdest[64 to 79]        + xmmdest[80 to 95]
temp[48 to 63]   = xmmdest[96 to 111]       + xmmdest[112 to 127]
temp[64 to 79]   = xmmsrc/mem128[0 to 15]   + xmmsrc/mem128[16 to 31]
temp[80 to 95]   = xmmsrc/mem128[32 to 47]  + xmmsrc/mem128[48 to 63]
temp[96 to 111]  = xmmsrc/mem128[64 to 79]  + xmmsrc/mem128[80 to 95]
temp[112 to 127] = xmmsrc/mem128[96 to 111] + xmmsrc/mem128[112 to 127]
xmmdest = temp
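The computation above can be sketched in Python, with each operand modeled as eight 16-bit lanes (sums wrap to 16 bits, as phaddw keeps only the LO word):

```python
def phaddw(dest, src):
    # Pairwise ("horizontal") sums: the dest pairs fill the LO four
    # result lanes, and the src pairs fill the HO four.
    pairs = [dest[i] + dest[i + 1] for i in range(0, 8, 2)] + \
            [src[i] + src[i + 1] for i in range(0, 8, 2)]
    return [s & 0xFFFF for s in pairs]

d = [1, 2, 3, 4, 5, 6, 7, 8]
s = [10, 20, 30, 40, 50, 60, 70, 80]
print(phaddw(d, s))  # [3, 7, 11, 15, 30, 70, 110, 150]
```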
  • The 128-bit vphaddw instruction zeroes out the HO 128 bits of the corresponding YMM destination register. It has the following syntax:

vphaddw xmmdest, xmmsrc1, xmmsrc2/mem128

; computes the following

xmmdest[0 to 15]    = xmmsrc1[0 to 15]          + xmmsrc1[16 to 31]
xmmdest[16 to 31]   = xmmsrc1[32 to 47]         + xmmsrc1[48 to 63]
xmmdest[32 to 47]   = xmmsrc1[64 to 79]         + xmmsrc1[80 to 95]
xmmdest[48 to 63]   = xmmsrc1[96 to 111]        + xmmsrc1[112 to 127]
xmmdest[64 to 79]   = xmmsrc2/mem128[0 to 15]   + xmmsrc2/mem128[16 to 31]
xmmdest[80 to 95]   = xmmsrc2/mem128[32 to 47]  + xmmsrc2/mem128[48 to 63]
xmmdest[96 to 111]  = xmmsrc2/mem128[64 to 79]  + xmmsrc2/mem128[80 to 95]
xmmdest[112 to 127] = xmmsrc2/mem128[96 to 111] + xmmsrc2/mem128[112 to 127]
  • vphaddw does not simply extend the 128-bit version in the intuitive way. Instead, it mixes up computations as follows (where SRC1 is YMMsrc1 and SRC2 is YMMsrc2/mem256):

vphaddw ymmdest, ymmsrc1, ymmsrc2/mem256

; compute the following

ymmdest[0 to 15]    = SRC1[16 to 31]   + SRC1[0 to 15]
ymmdest[16 to 31]   = SRC1[48 to 63]   + SRC1[32 to 47]
ymmdest[32 to 47]   = SRC1[80 to 95]   + SRC1[64 to 79]
ymmdest[48 to 63]   = SRC1[112 to 127] + SRC1[96 to 111]
ymmdest[64 to 79]   = SRC2[16 to 31]   + SRC2[0 to 15]
ymmdest[80 to 95]   = SRC2[48 to 63]   + SRC2[32 to 47]
ymmdest[96 to 111]  = SRC2[80 to 95]   + SRC2[64 to 79]
ymmdest[112 to 127] = SRC2[112 to 127] + SRC2[96 to 111]
ymmdest[128 to 143] = SRC1[144 to 159] + SRC1[128 to 143]
ymmdest[144 to 159] = SRC1[176 to 191] + SRC1[160 to 175]
ymmdest[160 to 175] = SRC1[208 to 223] + SRC1[192 to 207]
ymmdest[176 to 191] = SRC1[240 to 255] + SRC1[224 to 239]
ymmdest[192 to 207] = SRC2[144 to 159] + SRC2[128 to 143]
ymmdest[208 to 223] = SRC2[176 to 191] + SRC2[160 to 175]
ymmdest[224 to 239] = SRC2[208 to 223] + SRC2[192 to 207]
ymmdest[240 to 255] = SRC2[240 to 255] + SRC2[224 to 239]

Double-Word–Sized Horizontal Additions

  • The phaddd instruction has the following syntax:

phaddd xmmdest, xmmsrc/mem128

; computes the following

temp[0 to 31]   = xmmdest[0 to 31]        + xmmdest[32 to 63]
temp[32 to 63]  = xmmdest[64 to 95]       + xmmdest[96 to 127]
temp[64 to 95]  = xmmsrc/mem128[0 to 31]  + xmmsrc/mem128[32 to 63]
temp[96 to 127] = xmmsrc/mem128[64 to 95] + xmmsrc/mem128[96 to 127]
xmmdest = temp
  • The 128-bit vphaddd instruction has this syntax:

vphaddd xmmdest, xmmsrc1, xmmsrc2/mem128

; computes the following

xmmdest[0 to 31]     = xmmsrc1[0 to 31]         + xmmsrc1[32 to 63]
xmmdest[32 to 63]    = xmmsrc1[64 to 95]        + xmmsrc1[96 to 127]
xmmdest[64 to 95]    = xmmsrc2/mem128[0 to 31]  + xmmsrc2/mem128[32 to 63]
xmmdest[96 to 127]   = xmmsrc2/mem128[64 to 95] + xmmsrc2/mem128[96 to 127]
(ymmdest[128 to 255] = 0)
  • The 256-bit vphaddd instruction has the following syntax:

vphaddd ymmdest, ymmsrc1, ymmsrc2/mem256

; computes the following

ymmdest[0 to 31]    = ymmsrc1[32 to 63]          + ymmsrc1[0 to 31]
ymmdest[32 to 63]   = ymmsrc1[96 to 127]         + ymmsrc1[64 to 95]
ymmdest[64 to 95]   = ymmsrc2/mem256[32 to 63]   + ymmsrc2/mem256[0 to 31]
ymmdest[96 to 127]  = ymmsrc2/mem256[96 to 127]  + ymmsrc2/mem256[64 to 95]
ymmdest[128 to 159] = ymmsrc1[160 to 191]        + ymmsrc1[128 to 159]
ymmdest[160 to 191] = ymmsrc1[224 to 255]        + ymmsrc1[192 to 223]
ymmdest[192 to 223] = ymmsrc2/mem256[160 to 191] + ymmsrc2/mem256[128 to 159]
ymmdest[224 to 255] = ymmsrc2/mem256[224 to 255] + ymmsrc2/mem256[192 to 223]
  • The (v)phaddsw instructions take the following forms:

phaddsw xmmdest, xmmsrc/mem128
vphaddsw xmmdest, xmmsrc1, xmmsrc2/mem128
vphaddsw ymmdest, ymmsrc1, ymmsrc2/mem256
  • The (v)phaddsw instruction (horizontal signed integer add with saturate, word) is a slightly different form of (v)phaddw: rather than storing only the LO bits into the result in the destination lane, this instruction saturates the result.

SIMD Integer Subtraction

  • Like the SIMD addition instructions, the SIMD subtraction instructions do not affect any flags; any carry, borrow, overflow, or underflow information is lost. These instructions subtract the second source operand from the first source operand and store the result into the destination operand.

SIMD Integer Subtraction Instructions
  • Likewise, there is a set of signed and unsigned byte and word saturating subtraction instructions. For the signed instructions, the byte-sized instructions saturate positive overflow to 7Fh (+127) and negative underflow to 80h (–128). The word-sized instructions saturate to 7FFFh (+32,767) and 8000h (–32,768). The unsigned saturation instructions saturate to 0FFFFh (+65,535) and 0.

SIMD Integer Saturating Subtraction Instructions

SIMD Integer Multiplication

SIMD 16-Bit Packed Integer Multiplication Instructions
  • The (v)pmullw instruction multiplies the 16-bit values in corresponding lanes of its two source operands and stores the LO word of each product into the corresponding destination lane. Because the LO word of the product is the same for signed and unsigned operands, this instruction is applicable to both.

  • The (v)pmulhw instruction computes the product of two signed word values and stores the HO word of the result into the destination lanes.

  • For unsigned operands, (v)pmulhuw performs the same task. By executing both (v)pmullw and (v)pmulh(u)w with the same operands, you can compute the full 32-bit result of a 16×16-bit multiplication.
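The two-instruction trick for recovering the full product can be sketched in Python for a single lane (a simulation of the signed pmullw/pmulhw semantics):

```python
def mul16_lo_hi(a, b):
    # Signed 16x16 multiply, as pmullw/pmulhw would see one lane:
    # return (LO word, HO word) of the 32-bit product.
    p = (a * b) & 0xFFFFFFFF   # two's-complement bit pattern of the product
    return p & 0xFFFF, (p >> 16) & 0xFFFF

lo, hi = mul16_lo_hi(-2, 3)    # product = -6 = 0xFFFFFFFA
print(hex(lo), hex(hi))        # 0xfffa 0xffff
print(hex((hi << 16) | lo))    # 0xfffffffa -- the reassembled 32-bit result
```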

SIMD 32- and 64-Bit Packed Integer Multiplication Instructions
  • At some point along the way, Intel introduced (v)pmuldq and (v)pmuludq to perform signed and unsigned 32×32-bit multiplications, producing a 64-bit result.

pmuldq xmmdest, xmm/mem128
vpmuldq xmmdest, xmmsrc1, xmm/mem128
vpmuldq ymmdest, ymmsrc1, ymm/mem256

pmuludq xmmdest, xmm/mem128
vpmuludq xmmdest, xmmsrc1, xmm/mem128
vpmuludq ymmdest, ymmsrc1, ymm/mem256
  • The 128-bit variants multiply the double words appearing in lanes 0 and 2 and store the 64-bit results into qword lanes 0 and 1. On CPUs with AVX registers, pmuldq and pmuludq do not affect the HO 128 bits of the YMM register. The vpmuldq and vpmuludq instructions zero-extend the result to 256 bits.

  • The 256-bit variants multiply the double words appearing in lanes 0, 2, 4, and 6, producing 64-bit results that they store in qword lanes 0, 1, 2, and 3.

  • The pclmulqdq instruction performs a carry-less multiplication of two qword values, producing a 128-bit result.

pclmulqdq xmmdest, xmm/mem128, imm8
vpclmulqdq xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
  • These instructions multiply a pair of qword values found in XMMdest and XMMsrc and leave the 128-bit result in XMMdest. The imm8 operand specifies which qwords to use as the source operands.

  • As usual, pclmulqdq leaves the HO 128 bits of the corresponding YMM destination register unchanged, while vpclmulqdq zeroes those bits.
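Carry-less multiplication replaces each addition in a long multiplication with XOR. A Python sketch of the core operation (the lane selection done by the imm8 operand is omitted):

```python
def clmul64(a, b):
    # Carry-less multiply of two 64-bit values -> 128-bit result:
    # XOR together shifted copies of a, one for each set bit of b.
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result

print(hex(clmul64(0b101, 0b11)))  # 0xf (101 XOR 1010 = 1111)
```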

SIMD Integer Averages

  • The (v)pavgb and (v)pavgw instructions compute the average of two sets of bytes or words. These instructions sum the value in the byte or word lanes of their source and destination operands, divide the result by 2, round the results, and leave the averaged results sitting in the destination operand lanes.

pavgb xmmdest, xmm/mem128
vpavgb xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgb ymmdest, ymmsrc1, ymmsrc2/mem256

pavgw xmmdest, xmm/mem128
vpavgw xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgw ymmdest, ymmsrc1, ymmsrc2/mem256
  • The 128-bit pavgb and vpavgb instructions compute 16 byte-sized averages (for the 16 lanes in the source and destination operands). The 256-bit variant of the vpavgb instruction computes 32 byte-sized averages.

  • The 128-bit pavgw and vpavgw instructions compute eight word-sized averages (for the eight lanes in the source and destination operands). The 256-bit variant of the vpavgw instruction computes 16 word-sized averages.
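The rounded average works out to (x + y + 1) >> 1 per lane, which rounds halves upward rather than truncating. A Python sketch for byte lanes:

```python
def pavgb(a, b):
    # Unsigned byte average with rounding; the intermediate sum cannot
    # overflow here because Python integers are unbounded.
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]

print(pavgb([0, 1, 254], [1, 2, 255]))  # [1, 2, 255]
```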

SIMD Integer Minimum and Maximum

  • These instructions scan the lanes of a pair of 128- or 256-bit operands and copy the maximum or minimum value from that lane to the same lane in the destination operand.

  • The general syntax is as follows:

pmxxyz xmmdest, xmmsrc/mem128
vpmxxyz xmmdest, xmmsrc1, xmmsrc2/mem128
vpmxxyz ymmdest, ymmsrc1, ymmsrc2/mem256

SIMD Integer Absolute Value

  • The SSE/AVX instruction set extensions provide three sets of instructions for computing the absolute values of signed byte, word, and double-word integers: (v)pabsb, (v)pabsw, and (v)pabsd

pabsb xmmdest, xmmsrc/mem128
vpabsb xmmdest, xmmsrc/mem128
vpabsb ymmdest, ymmsrc/mem256

pabsw xmmdest, xmmsrc/mem128
vpabsw xmmdest, xmmsrc/mem128
vpabsw ymmdest, ymmsrc/mem256

pabsd xmmdest, xmmsrc/mem128
vpabsd xmmdest, xmmsrc/mem128
vpabsd ymmdest, ymmsrc/mem256

SIMD Integer Sign Adjustment Instructions

  • The (v)psignb, (v)psignw, and (v)psignd instructions apply the sign found in a source lane to the corresponding destination lane.

psignb xmmdest, xmmsrc/mem128
vpsignb xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignb ymmdest, ymmsrc1, ymmsrc2/mem256

psignw xmmdest, xmmsrc/mem128
vpsignw xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignw ymmdest, ymmsrc1, ymmsrc2/mem256

psignd xmmdest, xmmsrc/mem128
vpsignd xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignd ymmdest, ymmsrc1, ymmsrc2/mem256

SSE Compare-for-Equality Instructions

pcmpeqb xmmdest, xmmsrc/mem128 ; Compares 16 bytes
pcmpeqw xmmdest, xmmsrc/mem128 ; Compares 8 words
pcmpeqd xmmdest, xmmsrc/mem128 ; Compares 4 dwords
pcmpeqq xmmdest, xmmsrc/mem128 ; Compares 2 qwords

; computes -

; xmmdest[lane] = xmmdest[lane] == xmmsrc/mem128[lane]

; where lane varies from 0 to 15 for pcmpeqb, 0 to 7 for pcmpeqw, 0 to 3 for pcmpeqd,
; and 0 to 1 for pcmpeqq.
  • The == operator produces a value of all 1 bits if the two values in the same lane are equal; it produces all 0 bits if the values are not equal.
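A Python sketch of the byte-lane comparison producing the all-1s/all-0s masks:

```python
def pcmpeqb(dest, src):
    # Per-lane equality test: 0xFF (all 1 bits) if equal, 0x00 otherwise.
    return [0xFF if d == s else 0x00 for d, s in zip(dest, src)]

print(pcmpeqb([1, 2, 3], [1, 9, 3]))  # [255, 0, 255]
```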

SSE Compare-for-Greater-Than Instructions

pcmpgtb xmmdest, xmmsrc/mem128 ; Compares 16 bytes
pcmpgtw xmmdest, xmmsrc/mem128 ; Compares 8 words
pcmpgtd xmmdest, xmmsrc/mem128 ; Compares 4 dwords
pcmpgtq xmmdest, xmmsrc/mem128 ; Compares 2 qwords

; computes -

; xmmdest[lane] = xmmdest[lane] > xmmsrc/mem128[lane]

; where lane is the same as for the compare-for-equality instructions and
; the > operator produces a value of all 1 bits if the signed integer in
; the XMMdest lane is greater than the signed value in the corresponding
; XMMsrc/MEM128 lane.

AVX Comparison Instructions

  • The 128-bit variants of these instructions have the following syntax:

vpcmpeqb xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 16 bytes
vpcmpeqw xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 8 words
vpcmpeqd xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 4 dwords
vpcmpeqq xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 2 qwords

vpcmpgtb xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 16 bytes
vpcmpgtw xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 8 words
vpcmpgtd xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 4 dwords
vpcmpgtq xmmdest, xmmsrc1, xmmsrc2/mem128 ; Compares 2 qwords

; computes - 

; xmmdest[lane] = xmmsrc1[lane] == xmmsrc2/mem128[lane]
; xmmdest[lane] = xmmsrc1[lane] > xmmsrc2/mem128[lane]
  • The 256-bit variants of these instructions have the following syntax:

vpcmpeqb ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 32 bytes
vpcmpeqw ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 16 words
vpcmpeqd ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 8 dwords
vpcmpeqq ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 4 qwords

vpcmpgtb ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 32 bytes
vpcmpgtw ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 16 words
vpcmpgtd ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 8 dwords
vpcmpgtq ymmdest, ymmsrc1, ymmsrc2/mem256 ; Compares 4 qwords

; computes - 

; ymmdest[lane] = ymmsrc1[lane] == ymmsrc2/mem256[lane]
; ymmdest[lane] = ymmsrc1[lane] > ymmsrc2/mem256[lane]

The (v)pmovmskb Instructions

  • The (v)pmovmskb instruction extracts the HO bit from all the bytes in an XMM or YMM register and stores the 16 or 32 bits (respectively) into a general-purpose register. These instructions set all HO bits of the general-purpose register to 0.

pmovmskb reg, xmmsrc
vpmovmskb reg, xmmsrc
vpmovmskb reg, ymmsrc
  • The 128-bit instructions fill only bits 0 through 15 of the destination register (zeroing out all other bits). The 256-bit form of the vpmovmskb instruction fills bits 0 through 31 of the destination register (zeroing out HO bits if you specify a 64-bit register).
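A Python sketch of the mask extraction (byte lane i supplies bit i of the result; shown with 4 lanes for brevity):

```python
def pmovmskb(xmm_bytes):
    # Gather the HO bit of each byte into an integer bit mask.
    mask = 0
    for i, b in enumerate(xmm_bytes):
        mask |= ((b >> 7) & 1) << i
    return mask

v = [0x80, 0x00, 0xFF, 0x01]  # 4 of the 16 byte lanes
print(bin(pmovmskb(v)))       # 0b101
```

This pairs naturally with the pcmpeq* instructions: compare, extract the mask, then test or scan the mask with ordinary integer instructions.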

Integer Conversions

Packed Zero-Extension Instructions

SSE4.1 and AVX Packed Zero-Extension Instructions
AVX2 Packed Zero-Extension Instructions

Packed Sign-Extension Instructions

SSE Packed Sign-Extension Instructions
AVX Packed Sign-Extension Instructions

Packed Sign Extension with Saturation

SSE Packed Sign-Extension with Saturation Instructions
  • The saturate operation checks its operand to see if the value exceeds the range of the result (–128 to +127 for signed bytes, 0 to 255 for unsigned bytes, –32,768 to +32,767 for signed words, and 0 to 65,535 for unsigned words).

  • When saturating to a byte, if the signed source value is less than –128, byte saturation sets the value to –128. When saturating to a word, if the signed source value is less than –32,768, signed saturation sets the value to –32,768. Similarly, if a signed byte or word value exceeds +127 or +32,767, then saturation replaces the value with +127 or +32,767, respectively.

  • For unsigned operations, saturation limits the value to +255 (for bytes) or +65,535 (for words). Unsigned values are never less than 0, so unsigned saturation clips values to only +255 or +65,535.

  • AVX-capable CPUs provide 128-bit variants of these instructions that support three operands: two source operands and an independent destination operand.

vpacksswb xmmdest, xmmsrc1, xmmsrc2/mem128
vpackuswb xmmdest, xmmsrc1, xmmsrc2/mem128
vpackssdw xmmdest, xmmsrc1, xmmsrc2/mem128
vpackusdw xmmdest, xmmsrc1, xmmsrc2/mem128
  • Also, the SSE instructions do not modify the upper bits of the YMM register (if present on the CPU), whereas the AVX instructions store 0s into the upper YMM register bits.
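The pack-with-saturation step can be sketched in Python for packsswb (signed words packed into signed bytes; shown with two words per source for brevity, where the real instruction packs eight from each):

```python
def packsswb(dest_words, src_words):
    # Pack signed words into signed bytes, saturating each value
    # to the range -128..+127; dest words fill the LO byte lanes.
    sat8 = lambda v: max(-128, min(127, v))
    return [sat8(w) for w in dest_words + src_words]

print(packsswb([300, -5], [-300, 7]))  # [127, -5, -128, 7]
```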

AVX Packed Sign-Extension with Saturation Instructions

SIMD Floating-Point Instructions

Arithmetic Instructions

  • The 128-bit SSE packed floating-point instructions have the following generic syntax (where instr is one of the floating-point instructions in the table below):

instrps xmmdest, xmmsrc/mem128
instrpd xmmdest, xmmsrc/mem128
  • The packed single (*ps) instructions perform four single-precision floating-point operations simultaneously. The packed double (*pd) instructions perform two double-precision floating-point operations simultaneously. They compute:

xmmdest[lane] = xmmdest[lane] op xmmsrc/mem128[lane]

; where lane varies from 0 to 3 for packed single-precision instructions
; and from 0 to 1 for packed double-precision instructions. op represents the
; operation (such as addition or subtraction)
  • The 128-bit AVX packed floating-point instructions have this syntax:

vinstrps xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrpd xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrps xmmdest, xmmsrc/mem128           ; For monadic operations
vinstrpd xmmdest, xmmsrc/mem128           ; For monadic operations

; they compute
;
; xmmdest[lane] = xmmsrc1[lane] op xmmsrc2/mem128[lane]
;
; where op corresponds to the operation associated with the specific instruction
  • These 128-bit AVX instructions clear the HO bits of the underlying YMMdest register.

  • The 256-bit AVX packed floating-point instructions have this syntax:

vinstrps ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrpd ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrps ymmdest, ymmsrc/mem256           ; For monadic operations
vinstrpd ymmdest, ymmsrc/mem256           ; For monadic operations

; These instructions compute
; 
; ymmdest[lane] = ymmsrc1[lane] op ymmsrc2/mem256[lane]
;
; where op corresponds to the operation associated with the specific instruction
  • The SSE/AVX instruction set extensions also include floating-point horizontal addition and subtraction instructions.

haddps xmmdest, xmmsrc/mem128
vhaddps xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddps ymmdest, ymmsrc1, ymmsrc2/mem256
haddpd xmmdest, xmmsrc/mem128
vhaddpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddpd ymmdest, ymmsrc1, ymmsrc2/mem256

hsubps xmmdest, xmmsrc/mem128
vhsubps xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubps ymmdest, ymmsrc1, ymmsrc2/mem256
hsubpd xmmdest, xmmsrc/mem128
vhsubpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubpd ymmdest, ymmsrc1, ymmsrc2/mem256

SSE and AVX Comparisons

  • There are two sets of basic floating-point comparisons: (v)cmpps, which compares a set of packed single-precision values, and (v)cmppd, which compares a set of packed double-precision values. These instructions use an imm8 operand whose value specifies the type of comparison.

cmpps xmmdest, xmmsrc/mem128, imm8
vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

cmppd xmmdest, xmmsrc/mem128, imm8
vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
  • There are 32 possible comparisons, selected by the imm8 operand.

imm8 Values for cmpps and cmppd Instructions
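A few of the imm8 predicates can be sketched in Python (an illustrative subset of the 32 encodings; the full table also distinguishes ordered/unordered and signaling/non-signaling forms):

```python
import math

PREDICATES = {
    0: lambda a, b: a == b,                          # EQ
    1: lambda a, b: a < b,                           # LT
    2: lambda a, b: a <= b,                          # LE
    3: lambda a, b: math.isnan(a) or math.isnan(b),  # UNORD
}

def cmpps(dest, src, imm8):
    # Each lane becomes all 1 bits (0FFFFFFFFh) or all 0 bits.
    pred = PREDICATES[imm8]
    return [0xFFFFFFFF if pred(d, s) else 0 for d, s in zip(dest, src)]

print(cmpps([1.0, 2.0], [2.0, 2.0], 1))  # [4294967295, 0]
```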

Instruction Synonyms

  • MASM supports the use of certain synonyms so you don’t have to memorize the 32 encodings.

AVX Extended Comparisons

  • The AVX versions of these instructions allow three operands: a destination XMM or YMM register, a source XMM or YMM register, and a second source that is an XMM or YMM register or a 128-bit or 256-bit memory location.

vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
  • The 128-bit vcmpps instruction compares the four single-precision floating-point values in each lane of the XMMsrc1 register against the values in the corresponding XMMsrc2/mem128 lanes and stores the true (all 1 bits) or false (all 0 bits) result into the corresponding lane of the XMMdest register.

  • The 256-bit vcmpps instruction compares the eight single-precision floating-point values in each lane of the YMMsrc1 register against the values in the corresponding YMMsrc2/mem256 lanes and stores the true or false result into the corresponding lane of the YMMdest register.

  • The vcmppd instructions compare the double-precision values in the two lanes (128-bit version) or four lanes (256-bit version) and store the result into the corresponding lane of the destination register.

  • The AVX instructions provide synonyms that eliminate the need to memorize the 32 imm8 values.

AVX Packed Compare Instructions
  • The vcmpfalse* instructions always set the destination lanes to false (all 0 bits), and the vcmptrue* instructions always set the destination lanes to true (all 1 bits).

The (v)movmskps, (v)movmskpd Instructions

  • The movmskps and movmskpd instructions extract the sign bits from their packed single- and double-precision floating-point source operands and store these bits into the low-order bits of a general-purpose register (2, 4, or 8 bits, depending on the instruction and operand size).

movmskps reg, xmmsrc
movmskpd reg, xmmsrc
vmovmskps reg, xmmsrc
vmovmskps reg, ymmsrc
vmovmskpd reg, xmmsrc
vmovmskpd reg, ymmsrc
  • The movmskps instruction extracts the sign bits from the four single-precision floating-point values in the XMM source register and copies these bits to the LO 4 bits of the destination register.

  • The movmskpd instruction copies the sign bits from the two double-precision floating-point values in the source XMM register to bits 0 and 1 of the destination register.

  • The vmovmskps instruction extracts the sign bits from the four or eight single-precision floating-point values in the XMM or YMM source register and copies these bits to the LO 4 or 8 bits of the destination register.

  • The vmovmskpd instruction copies the sign bits from the four double-precision floating-point values in the source YMM register to bits 0 to 3 of the destination register.

vmovmskpd operation

Floating-Point Conversion Instructions

  • The table below contains various SIMD floating-point conversion instructions:

Dynamically Testing For A CPU Feature

; Generic print procedure and dynamically
; selecting CPU features.


        option  casemap:none

nl          =       10



; SSE4.2 feature flags (in ECX):

SSE42       =       00180000h       ;Bits 19 & 20
AVXSupport  =       10000000h       ;Bit 28

; CPUID bits (EAX=7, EBX register)

AVX2Support  =      20h             ;Bit 5 = AVX2
  
            
            .data
            align   qword
print       qword   choosePrint     ;Pointer to print function

; Floating-point values for testing purposes:

fp1         real8   1.0
fp2         real8   2.0
fp3         real8   3.0
fp4         real8   4.0
fp5         real8   5.0
            

            .code
            externdef printf:proc
            
;***************************************************************
;            
; print-
;
;  "Quick" form of printf that allows the format string to
; follow the call in the code stream. Supports up to five
; additional parameters in RDX, R8, R9, R10, and R11.
;
; This function saves all the Microsoft-ABI volatile,
; parameter, and return result registers so that code
; can call it without worrying about any registers being
; modified (this code assumes that Windows ABI treats
; YMM4..YMM15 as non-volatile).
;
; Of course, this code assumes that AVX instructions are
; available on the CPU.

; Allows up to 5 arguments in:
;
;  RDX-  Arg #1 
;  R8-   Arg #2  
;  R9-   Arg #3  
;  R10-  Arg #4
;  R11-  Arg #5
;
; Note that you must pass floating-point values in
; these registers, as well. The printf function
; expects real values in the integer registers. 
; 
;
; There are two versions of this function, one that
; will run on CPUs without AVX capabilities (no YMM
; registers) and one that will run on CPUs that
; have AVX capabilities (YMM registers). The difference
; between the two is which registers they preserve
; (print_SSE only preserves XMM registers and will
; run properly on CPUs that don't have YMM register
; support; print_AVX will preserve the volatile YMM
; registers on CPUs with AVX support).

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:
 
choosePrint proc
            push    rax     ;Preserve registers that get
            push    rbx     ; tweaked by CPUID
            push    rcx
            push    rdx
            
            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ;Test bit 28 for AVX
            jnz     doAVXPrint
            
            lea     rax, print_SSE  ;From now on, call
            mov     print, rax      ;print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE
            
doAVXPrint: lea     rax, print_AVX  ;From now on, call
            mov     print, rax      ;print_AVX directly
            
; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

; Version of print that will preserve volatile
; AVX registers (YMM0..YMM3):

print_AVX   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
;  calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11
            
; YMM0..YMM3 are considered volatile, so preserve them

            sub     rsp, 128
            vmovdqu ymmword ptr [rsp+000], ymm0
            vmovdqu ymmword ptr [rsp+032], ymm1
            vmovdqu ymmword ptr [rsp+064], ymm2
            vmovdqu ymmword ptr [rsp+096], ymm3
            
            
            push    rbp

returnAdrs  textequ <[rbp+200]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16
            
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs
            
; To handle more than three arguments (four counting
; RCX), printf needs data passed on the stack. However,
; the caller of print can't pass data on the stack, so
; we use R10 and R11 as extra parameters (they could
; contain junk, but pass them along just in case):

            mov     [rsp+32], r10
            mov     [rsp+40], r11
            call    printf
            
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
            
            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx
            
            leave
            vmovdqu ymm0, ymmword ptr [rsp+000]
            vmovdqu ymm1, ymmword ptr [rsp+032]
            vmovdqu ymm2, ymmword ptr [rsp+064]
            vmovdqu ymm3, ymmword ptr [rsp+096]
            add     rsp, 128
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_AVX   endp


; Version that will run on CPUs without
; AVX support and will preserve the
; volatile SSE registers (XMM0..XMM3):

print_SSE   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
;  calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11
            
; XMM0..XMM3 are considered volatile, so preserve them

            sub     rsp, 64
            movdqu  xmmword ptr [rsp+00], xmm0
            movdqu  xmmword ptr [rsp+16], xmm1
            movdqu  xmmword ptr [rsp+32], xmm2
            movdqu  xmmword ptr [rsp+48], xmm3
            
            
            push    rbp

returnAdrs  textequ <[rbp+136]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16
            
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs
            
; To handle more than three arguments (four counting
; RCX), printf needs data passed on the stack. However,
; the caller of print can't pass data on the stack, so
; we use R10 and R11 as extra parameters (they could
; contain junk, but pass them along just in case):

            mov     [rsp+32], r10
            mov     [rsp+40], r11
            call    printf
            
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
            
            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx
            
            leave
            movdqu  xmm0, xmmword ptr [rsp+00]
            movdqu  xmm1, xmmword ptr [rsp+16]
            movdqu  xmm2, xmmword ptr [rsp+32]
            movdqu  xmm3, xmmword ptr [rsp+48]
            add     rsp, 64
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_SSE   endp            

;***************************************************************
;            
; Here is the "asmMain" function.
        
            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ;Shadow storage

; Trivial example, no arguments:

            call    print
            byte    "Hello, World!", nl, 0
            
; Simple example with integer arguments:

            mov     rdx, 1          ;Argument #1 for printf
            mov     r8, 2           ;Argument #2 for printf
            mov     r9, 3           ;Argument #3 for printf
            mov     r10, 4          ;Argument #4 for printf
            mov     r11, 5          ;Argument #5 for printf
            call    print
            byte    "Arg 1=%d, Arg2=%d, Arg3=%d "
            byte    "Arg 4=%d, Arg5=%d", nl, 0
            
; Demonstration of floating-point operands. Note that
; args 1, 2, and 3 must be passed in rdx, r8, and
; r9 (despite Microsoft ABI requirements). You'll have
; to load parameters 4 and 5 into R10 and R11.

            mov     rdx, qword ptr fp1
            mov     r8,  qword ptr fp2
            mov     r9,  qword ptr fp3
            mov     r10, qword ptr fp4
            mov     r11, qword ptr fp5
            call    print
            byte    "Arg1=%6.1f, Arg2=%6.1f, Arg3=%6.1f "
            byte    "Arg4=%6.1f, Arg5=%6.1f ", nl, 0
                         
             
allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ;Returns to caller
asmMain     endp
            end
