Arithmetic

Chapter 6

Sign And Zero-Extension Instructions

Several arithmetic operations require sign- or zero-extended values before the operation.

For general sign-extension operations, the x86-64 provides two extensions of the mov instruction, movsx and movsxd:

movsxd dest, source ; If dest is 64 bits and source is 32 bits
movsx dest, source ; For all other operand combinations

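To zero-extend a 32-bit register into its 64-bit counterpart, move it to itself; writing a 32-bit register always clears the upper 32 bits of the corresponding 64-bit register. To zero-extend an 8- or 16-bit source, use movzx (the source operand of movzx must be 8 or 16 bits wide):

mov eax, eax ; Zero-extends EAX into RAX
movzx eax, al ; Zero-extends AL into EAX (and therefore into RAX)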
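
To make the difference between the two kinds of extension concrete, here is a minimal sketch that extends the same byte both ways. The register choices are arbitrary and only for illustration.

```asm
        mov     al, -1          ; AL = 0FFh
        movsx   ebx, al         ; Sign-extend:  EBX = 0FFFFFFFFh (-1)
        movzx   ecx, al         ; Zero-extend:  ECX = 000000FFh (255)
        movsxd  rdx, ebx        ; 32 -> 64 bits: RDX = 0FFFFFFFFFFFFFFFFh (-1)
```

Sign extension preserves the value of a signed operand, while zero extension preserves the value of an unsigned one; choosing the wrong one changes the number whenever the HO bit of the source is set.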

The mul and imul Instructions

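The single-operand mul and imul instructions produce a double-width product and are used mainly for multiprecision work; the multi-operand (generic) imul forms are intended for general-purpose multiplication. There are no multi-operand mul forms.
The single-operand forms always multiply the accumulator (AL, AX, EAX, or RAX) by the operand and leave the product in AX, DX:AX, EDX:EAX, or RDX:RAX: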

mul reg8 ; Returns AX
mul reg16 ; Returns DX:AX
mul reg32 ; Returns EDX:EAX
mul reg64 ; Returns RDX:RAX
mul mem8 ; Returns AX
mul mem16 ; Returns DX:AX
mul mem32 ; Returns EDX:EAX
mul mem64 ; Returns RDX:RAX

imul reg8 ; Returns AX
imul reg16 ; Returns DX:AX
imul reg32 ; Returns EDX:EAX
imul reg64 ; Returns RDX:RAX
imul mem8 ; Returns AX
imul mem16 ; Returns DX:AX
imul mem32 ; Returns EDX:EAX
imul mem64 ; Returns RDX:RAX

The multi-operand (generic) imul forms are not a complete replacement for single-operand mul and imul; in addition to the number of operands, several differences exist:

There isn’t an 8×8-bit multi-operand imul instruction available.
The generic imul instruction does not produce a 2×n-bit result, but truncates the result to n bits. That is, a 16×16-bit multiplication produces a 16-bit result. Likewise, a 32×32-bit multiplication produces a 32-bit result. These instructions set the carry and overflow flags if the result does not fit into the destination register.
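
The truncation and flag behavior can be sketched as follows. The operand values are chosen purely for illustration, and the two- and three-operand imul forms shown here (imul reg, reg/mem and imul reg, reg/mem, constant) are the standard encodings rather than something listed above.

```asm
        mov     eax, 100000h    ; EAX = 2^20
        imul    eax, eax        ; 2^40 does not fit in 32 bits:
                                ; EAX = 0 (LO 32 bits), CF = OF = 1
        seto    bl              ; BL = 1, the product was truncated

        mov     ecx, 7
        imul    edx, ecx, 10    ; EDX = ECX * 10 = 70, CF = OF = 0
```

If the full 2×n-bit product matters, the single-operand forms are the ones to use.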

The div and idiv Instructions

The x86-64 divide instructions perform a 128/64-bit division, a 64/32-bit division, a 32/16-bit division, or a 16/8-bit division. There is no variant of the div or idiv instructions that allows you to divide a value by a constant. If you want to divide a value by a constant, you need to create a memory object (preferably in the .const section) that is initialized with the constant.

        .const
ten dword 10
 .
 .
 .
 div ten ; Divides EDX:EAX by 10

The div instruction is an unsigned division operation; idiv is its signed counterpart. Both take a single register or memory operand:

div reg8
div reg16
div reg32
div reg64

div mem8
div mem16
div mem32
div mem64

idiv reg8
idiv reg16
idiv reg32
idiv reg64

idiv mem8
idiv mem16
idiv mem32
idiv mem64

If the operand is an 8-bit operand, div divides the AX register by the operand, leaving the quotient in AL and the remainder (modulo) in AH. If the operand is a 16-bit quantity, the div instruction divides the 32-bit quantity in DX:AX by the operand, leaving the quotient in AX and the remainder in DX. With 32-bit operands, div divides the 64-bit value in EDX:EAX by the operand, leaving the quotient in EAX and the remainder in EDX. Finally, with 64-bit operands, div divides the 128-bit value in RDX:RAX by the operand, leaving the quotient in RAX and the remainder in RDX.
You cannot, on the x86-64, simply divide one unsigned 8-bit value by another. If the denominator is an 8-bit value, the numerator must be a 16-bit value. If you need to divide one unsigned 8-bit value by another, you must zero-extend the numerator to 16 bits by loading the numerator into the AL register and then moving 0 into the AH register.
When dealing with signed integer values, you will need to sign-extend AL into AX, AX into DX:AX, EAX into EDX:EAX, or RAX into RDX:RAX before executing idiv. To do so, use the cbw, cwd, cdq, or cqo instructions, respectively.
The x86-64’s divide instructions have one other issue: they can raise a fatal error. First, of course, you can attempt to divide a value by 0. Another problem is that the quotient may be too large to fit into the RAX, EAX, AX, or AL register. For example, the 16/8-bit division 8000h/2 produces the quotient 4000h with a remainder of 0, and 4000h will not fit into 8 bits. If this happens, or you attempt to divide by 0, the x86-64 will generate a division exception or integer overflow exception. This usually means your program will crash. If this happens to you, chances are you didn’t sign- or zero-extend your numerator before executing the division operation. Because this error may crash your program, be very careful about the values you select when using division.
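
As a sketch of the extension rules just described, the following fragment divides one 8-bit value by another, first unsigned and then signed. All variable names and initial values are hypothetical and exist only for this example.

```asm
            .data
num8        byte    200         ; Unsigned numerator (hypothetical)
den8        byte    7           ; Unsigned denominator (hypothetical)
quot8       byte    ?
snum8       sbyte   -100        ; Signed numerator (hypothetical)
sden8       sbyte   7           ; Signed denominator (hypothetical)
squot8      sbyte   ?

            .code

; Unsigned: zero-extend the numerator into AX first.

            movzx   ax, num8    ; AX = 200 (AH = 0)
            div     den8        ; AL = 28 (quotient), AH = 4 (remainder)
            mov     quot8, al

; Signed: sign-extend the numerator into AX first.

            mov     al, snum8
            cbw                 ; AX = -100
            idiv    sden8       ; AL = -14 (quotient), AH = -2 (remainder)
            mov     squot8, al
```

Skipping the extension step leaves whatever happened to be in AH as the HO byte of the numerator, which is the usual cause of an unexpected division exception.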

The cmp Instruction

The setcc Instructions

The setcc (set on condition) instructions set a single-byte operand (register or memory) to 0 or 1 depending on the values in the FLAGS register. These instructions store a 0 in the corresponding operand if the condition is false, and they store a 1 in the 8-bit operand if the condition is true.

setcc reg8
setcc mem8

The cmp instruction works synergistically with the setcc instructions. Immediately after a cmp operation, the processor flags provide information concerning the relative values of those operands. They allow you to see if one operand is less than, equal to, or greater than the other.
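
For instance, a single cmp can feed several setcc instructions, because setcc does not modify the flags. The sketch below uses setl, setb, and sete, three of the standard condition variants (the full list of conditions is not reproduced here), and the destination registers are arbitrary.

```asm
        cmp     eax, ebx
        setl    cl              ; CL = 1 if EAX < EBX as signed values
        setb    dl              ; DL = 1 if EAX < EBX as unsigned values
        sete    bl              ; BL = 1 if EAX = EBX
```

The signed and unsigned results differ whenever exactly one of the operands has its HO bit set, so the choice of condition has to match how the program interprets the data.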

The test Instruction

The test instruction computes the logical AND of its two operands and sets the condition code flags based on the result; it does not, however, store the result of the logical AND back into the destination operand.

test operand1, operand2

The test instruction sets the zero flag if the result of the logical AND operation is 0. It sets the sign flag if the HO bit of the result contains a 1. The test instruction always clears the carry and overflow flags.
The test instruction can also check whether all the bits in a specified set of bits contain 0. The instruction test al, 0fh sets the zero flag if and only if the LO 4 bits of AL all contain 0.
One important use of the test instruction is to check whether a register contains 0. The instruction test reg, reg, where both operands are the same register, will logically AND that register with itself. If the register contains 0, the result is 0 and the CPU will set the zero flag. However, if the register contains a nonzero value, logically ANDing that value with itself produces that same nonzero value, so the CPU clears the zero flag.

test eax, eax
setz bl         ; BL is set to 1 if EAX contains 0

One major failing of the test instruction is that immediate (constant) operands can be no larger than 32 bits.
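
One way around the immediate-size limit, sketched here under the assumption that RCX is free to use as scratch, is to load the wide mask into a register with mov (which does accept a 64-bit constant) and test against that register:

```asm
        mov     rcx, 100000000h ; Mask for bit 32 (too wide for a test immediate)
        test    rax, rcx        ; ZF = 1 if bit 32 of RAX is 0
        setnz   bl              ; BL = 1 if bit 32 of RAX is set
```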

Arithmetic Expressions

; x = y + z:

        mov eax, y
        add eax, z
        mov x, eax

; x = y - z:

        mov eax, y
        sub eax, z
        mov x, eax
        
; x = y * z; (unsigned):

        mov eax, y
        mul z ; Don't forget this wipes out EDX
        mov x, eax

; x = y * z; (signed):

        mov eax, y
        imul eax, z ; Does not affect EDX!
        mov x, eax

; x = y div z; (unsigned div):

        mov eax, y
        xor edx, edx ; Zero-extend EAX into EDX
        div z
        mov x, eax

; x = y idiv z; (signed div):

        mov eax, y
        cdq ; Sign-extend EAX into EDX
        idiv z
        mov x, eax

; x = y % z; (unsigned remainder):

        mov eax, y
        xor edx, edx ; Zero-extend EAX into EDX
        div z
        mov x, edx ; Note that remainder is in EDX

; x = y % z; (signed remainder):

        mov eax, y
        cdq ; Sign-extend EAX into EDX
        idiv z
        mov x, edx ; Remainder is in EDX
        
; var1 = -var2;

        mov eax, var2
        neg eax
        mov var1, eax

Commutative Operators

Assembly Tricks

Multiplying without mul or imul

In general, shifting an operand to the left n bits multiplies it by 2^n. You can multiply any value by a constant by using a series of shifts and additions or shifts and subtractions. For example, to multiply the AX register by 10, you need only multiply it by 8 and then add two times the original value. That is, 10 × AX = 8 × AX + 2 × AX.

shl ax, 1 ; Multiply AX by 2
mov bx, ax ; Save 2 * AX for later
shl ax, 2 ; Multiply AX by 8 (*4 really,
 ; but AX contains *2)
add ax, bx ; Add in AX * 2 to AX * 8 to get AX * 10

You can also use subtraction with shifts to perform a multiplication operation.

mov ebx, eax ; Save EAX * 1
shl eax, 3 ; EAX = EAX * 8
sub eax, ebx ; EAX * 8 - EAX * 1 is EAX * 7

You can also use the lea instruction to compute certain products. The trick is to use the scaled-index addressing modes.

lea eax, [ecx][ecx] ; EAX = ECX * 2
lea eax, [eax][eax * 2] ; EAX = EAX * 3
lea eax, [eax * 4] ; EAX = EAX * 4
lea eax, [ebx][ebx * 4] ; EAX = EBX * 5
lea eax, [eax * 8] ; EAX = EAX * 8
lea eax, [edx][edx * 8] ; EAX = EDX * 9
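
Multipliers that lea cannot produce directly can often be built by combining it with one more instruction. The following sketch (register choices are arbitrary) computes ECX × 10 and ECX × 7 this way:

```asm
; EAX = ECX * 10, as (ECX * 5) * 2

        lea     eax, [ecx][ecx * 4]     ; EAX = ECX * 5
        add     eax, eax                ; EAX = ECX * 10

; EAX = ECX * 7, as (ECX * 8) - ECX

        lea     eax, [ecx * 8]          ; EAX = ECX * 8
        sub     eax, ecx                ; EAX = ECX * 7
```

Unlike the shift-based sequences, the lea steps leave the source register and the flags untouched.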

Implementing Modulo-N Counters with AND

If you want to implement a counter variable that counts up to 2^n – 1 and then resets to 0, increment it and then AND it with the mask 2^n – 1 (n_bits in the following code):

inc CounterVar
and CounterVar, n_bits
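
As a concrete instance, here is a sketch of a modulo-16 counter (n = 4, so the mask is 2^4 – 1 = 0Fh). The variable's size and initial value are assumptions made for the example.

```asm
            .data
CounterVar  dword   0           ; Counts 0, 1, ..., 15, 0, 1, ...

            .code

            inc     CounterVar
            and     CounterVar, 0Fh     ; 16 (10h) wraps back to 0
```

The trick works only when the modulus is a power of 2; other moduli need a compare and reset, or a division.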

Floating Point Arithmetic

FPU Registers

The x87 FPUs add 14 registers to the x86-64: eight floating-point data registers, a control register, a status register, a tag register, an instruction pointer, a data pointer, and an opcode register.
The data registers handle all floating-point calculations. The control register contains bits that let you decide how the FPU handles certain degenerate cases, such as rounding of inexact results. The status register is similar to the FLAGS register. The tag register contains several groups of bits that describe the state of each of the data registers. The instruction pointer, data pointer, and opcode registers contain state information about the last floating-point instruction executed.

FPU Data Registers

The FPUs provide eight 80-bit data registers organized as a stack. MASM refers to these registers as ST(0), ST(1), . . . ST(7).
The biggest difference between the FPU register set and the x86-64 register set is the stack organization. On the x86-64 CPU, the AX register is always the AX register, no matter what happens. On the FPU, however, the register set is an eight-element stack of 80-bit floating-point values.
ST(0) refers to the item on the top of stack, ST(1) refers to the next item on the stack, and so on. Many floating-point instructions push and pop items on the stack; therefore, ST(1) will refer to the previous contents of ST(0) after you push something onto the stack.
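
The renumbering is easiest to see in a short trace. This sketch uses the fld and fstp instructions described later in this chapter; the constants and the result variable are hypothetical, and the FPU stack is assumed to have at least two free registers on entry.

```asm
            .const
one         real8   1.0
two         real8   2.0

            .data
result      real8   ?

            .code

            fld     one         ; Push: ST(0) = 1.0
            fld     two         ; Push: ST(0) = 2.0, ST(1) = 1.0
            fstp    result      ; result = 2.0; pop: ST(0) = 1.0
            fstp    st(0)       ; Pop and discard 1.0; stack depth is as on entry
```

The value 1.0 never moved in the physical register file; only the top-of-stack pointer changed, which is why the same value is called ST(0), then ST(1), then ST(0) again.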

FPU Control Register

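Bits 10 and 11 of the FPU control register provide rounding control: %00 rounds to nearest (or even), %01 rounds down (toward negative infinity), %10 rounds up (toward positive infinity), and %11 truncates (toward 0).

Bits 8 and 9 of the control register specify the precision during computation: %00 selects 24 bits, %10 selects 53 bits, and %11 selects 64 bits (%01 is reserved). This capability is provided to allow compatibility with older software as required by the IEEE 754 standard.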

Bits 0 to 5 are the exception masks. These are similar to the interrupt enable bit in the x86-64’s FLAGS register. If one of these bits contains a 1, the corresponding condition is ignored by the FPU. However, if a bit contains 0 and the corresponding condition occurs, the FPU immediately generates an interrupt so the program can handle the degenerate condition.
Bit 0 corresponds to an invalid operation error, which generally occurs as the result of a programming error.
Bit 1 masks the denormalized interrupt that occurs whenever you try to manipulate denormalized values. Denormalized exceptions occur when you load arbitrary extended-precision values into the FPU or work with very small numbers just beyond the range of the FPU’s capabilities.
Bit 2 masks the zero-divide exception. If this bit contains 0, the FPU will generate an interrupt if you attempt to divide a nonzero value by 0. If you do not enable the zero-divide exception, the FPU will produce a signed infinity (rather than trapping) whenever you divide a nonzero value by 0. It’s probably a good idea to enable this exception by programming a 0 into this bit.
Bit 3 masks the overflow exception. The FPU will raise the overflow exception if a calculation overflows or if you attempt to store a value that is too large to fit into the destination operand.
Bit 4, if set, masks the underflow exception. Underflow occurs when the result is too small to fit in the destination operand. Like overflow, this exception can occur whenever you store a small extended-precision value into a smaller variable (single or double precision) or when the result of a computation is too small for extended precision.
Bit 5 controls whether the precision exception can occur. A precision exception occurs whenever the FPU produces an imprecise result, generally the result of an internal rounding operation.
Bits 6 and 7, and 12 to 15, in the control register are currently undefined and reserved for future use.
The FPU provides two instructions, fldcw (load control word) and fstcw (store control word), that let you load and store the contents of the control register, respectively. The single operand to these instructions must be a 16-bit memory location. The fldcw instruction loads the control register from the specified memory location. fstcw stores the control register into the specified memory location.

fldcw mem16
fstcw mem16

; Example: set rounding control to truncate and precision to 24 bits

        .data
fcw16 word ?

        .
        .
        .
        
        fstcw fcw16
        mov ax, fcw16
        and ax, 0f0ffh ; Clears bits 8-11
        or ax, 0c00h ; Rounding control = %11, Precision = %00
        mov fcw16, ax
        fldcw fcw16
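
Because the control register affects every later FPU instruction, code that changes it usually restores the original value afterward. The following is a minimal sketch of that save, modify, restore pattern. The variable names are hypothetical, and it assumes a value is already sitting in ST(0) and that the fistp instruction (covered later in this chapter) is used for the conversion.

```asm
            .data
savedCW     word    ?           ; Original control word (hypothetical name)
truncCW     word    ?           ; Modified control word (hypothetical name)
intResult   dword   ?

            .code

            fstcw   savedCW     ; Remember the current settings
            mov     ax, savedCW
            or      ax, 0c00h   ; Rounding control = %11 (truncate)
            mov     truncCW, ax
            fldcw   truncCW

            fistp   intResult   ; Convert ST(0) by truncation and pop it

            fldcw   savedCW     ; Put the original rounding mode back
```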

FPU Status Register

The 16-bit FPU status register provides the status of the FPU at the instant you read it.

Bits 0 through 5 are the exception flags. These bits appear in the same order as the exception masks in the control register. If the corresponding condition exists, the bit is set. These bits are independent of the exception masks in the control register.
Bit 6 indicates a stack fault. A stack fault occurs whenever a stack overflow or underflow occurs. When this bit is set, the C1 condition code bit determines whether there was a stack overflow (C1 = 1) or stack underflow (C1 = 0) condition.
Bit 7 of the status register is set if any error condition bit is set. It is the logical OR of bits 0 through 5. A program can test this bit to quickly determine if an error condition exists.
Bits 8, 9, 10, and 14 are the coprocessor condition code bits.
Bits 11 to 13 of the FPU status register provide the register number of the top of stack. During computations, the FPU adds (modulo 8) the logical register numbers supplied by the programmer to these 3 bits to determine the physical register number at runtime.
Bit 15 of the status register is the busy bit. It is set whenever the FPU is busy. This bit is a historical artifact from the days when the FPU was a separate chip.
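
To inspect these fields from code, one option is the fstsw ax instruction, which copies the status register into AX (this instruction is not described above; it is the standard x87 way to read the status word). The sketch below checks the zero-divide flag and extracts the top-of-stack field; the choice of BL for the result is arbitrary.

```asm
        fstsw   ax              ; AX = FPU status register
        test    ax, 4           ; Bit 2 = zero-divide exception flag
        setnz   bl              ; BL = 1 if a zero divide has been recorded

        shr     ax, 11          ; Move bits 11-13 down to bits 0-2
        and     ax, 7           ; AX = physical register number of the top of stack
```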

FPU Data Types

The FPU supports seven data types: three integer types, a packed decimal type, and three floating-point types.

The integer types support 16-, 32-, and 64-bit integers, although it is often faster to do the integer arithmetic by using the integer unit of the CPU. The packed decimal type provides an 18-digit signed decimal (BCD) integer. The primary purpose of the BCD format is to convert between strings and floating-point values. The remaining three data types are the 32-, 64-, and 80-bit floating-point data types.
The FPU generally stores values in a normalized format. The HO bit of the mantissa is always 1 when a floating-point number is normalized. In the 32- and 64-bit floating-point formats, the FPU does not actually store this bit; the FPU always assumes that it is 1. Therefore, 32- and 64-bit floating point numbers are always normalized. In the extended-precision 80-bit floating-point format, the FPU does not assume that the HO bit of the mantissa is 1; the HO bit of the mantissa appears as part of the string of bits.
Normalized values provide the greatest precision for a given number of bits. However, many values very close to 0 cannot be represented in normalized form in the 80-bit format. To encode them, the FPUs support a special 80-bit form, known as denormalized values, in which the HO bit of the mantissa is 0.
Denormalized values allow the FPU to encode very small values it cannot encode using normalized values, but denormalized values offer fewer bits of precision than normalized values. Therefore, using denormalized values in a computation may introduce slight inaccuracy.

FPU Instruction Set

The fld Instruction

The fld instruction loads a 32-, 64-, or 80-bit floating-point value onto the stack. This instruction converts 32- and 64-bit operands to an 80-bit extended precision value before pushing the value onto the floating-point stack.

fld st(1)
fld real4_variable
fld real8_variable
fld real10_variable
fld real8 ptr [rbx]

There is no way to directly load a 32-bit integer register onto the floating point stack, even if that register contains a real4 value. To do so, you must first store the integer register into a memory location, and then push that memory location onto the FPU stack by using the fld instruction.

mov tempReal4, eax ; Save real4 value in EAX to memory
fld tempReal4 ; Push that value onto the FPU stack

The fst and fstp Instructions

The fst and fstp instructions copy the value on the top of the floating-point stack to another floating-point register or to a 32-, 64-, or (fstp only) 80-bit memory variable. By incrementing the TOS pointer in the status register after accessing the data in ST(0), the fstp instruction pops the value off the top of stack when moving it to the destination location. If the destination operand is a floating-point register, the FPU stores the value at the specified register number before popping the data off the top of stack.
Executing an fstp st(0) instruction effectively pops the data off the top of stack with no data transfer.

fst real4_variable
fst real8_variable
fst realArray[rbx * 8]
fst st(2)
fstp st(1)

The fxch Instruction

The fxch instruction exchanges the value on the top of stack with one of the other FPU registers. This instruction takes two forms: one with a single FPU register as an operand and the second without any operands.
Many FPU instructions (for example, fsqrt) operate only on the top of the register stack. If you want to perform such an operation on a value that is not on top, you can use the fxch instruction to swap that register with TOS, perform the desired operation, and then use fxch to swap the TOS with the original register.

fxch st(2)
fsqrt
fxch st(2)

The fild Instruction

The fild (integer load) instruction converts a 16-, 32-, or 64-bit two’s complement integer to the 80-bit extended-precision format and pushes the result onto the stack. This instruction always expects a single operand: the address of a word, double-word, or quad-word integer variable.

fild word_variable
fild dword_val[rcx * 4]
fild qword_variable
fild sqword ptr [rbx]

The fist, fistp, and fisttp Instructions

The fist, fistp, and fisttp instructions convert the 80-bit extended-precision variable on the top of stack to a 16-, 32-, or (fistp/fisttp only) 64-bit integer and store the result away into the memory variable specified by the single operand.
The fist instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating-point register stack. The fistp (Store Integer and Pop) and fisttp (Store Integer with Truncation and Pop) instructions pop the value off the floating point register stack after storing the converted value.

fist word_var[rbx * 2]
fist dword_var
fisttp dword_var
fistp qword_var
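
The practical difference between fist/fistp and fisttp is the rounding applied during conversion: fist and fistp follow the rounding control bits in the control register, while fisttp always truncates. The sketch below assumes the rounding control is at its usual default of round to nearest; the variable names are hypothetical. Note that fisttp was introduced with SSE3, so it is unavailable on older CPUs.

```asm
            .const
val         real8   2.75

            .data
rounded     dword   ?
truncated   dword   ?

            .code

            fld     val         ; ST(0) = 2.75
            fist    rounded     ; rounded = 3 (round to nearest); ST(0) unchanged
            fisttp  truncated   ; truncated = 2; pops ST(0)
```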

The fbld and fbstp Instructions

The fbld and fbstp instructions load and store 80-bit BCD values. The fbld instruction converts a BCD value to its 80-bit extended-precision equivalent and pushes the result onto the stack.
The fbstp instruction pops the extended-precision real value on TOS, converts it to an 80-bit BCD value (rounding according to the bits in the floating-point control register), and stores the converted result at the address specified by the destination memory operand. There is no fbst instruction.

; Assuming fewer than eight items on the stack, the following
; code sequence is equivalent to an fbst instruction:

 fld st(0)
 fbstp tbyte_var
 
; The following example easily converts an 80-bit BCD value to
; a 64-bit integer:

 fbld tbyte_var
 fistp qword_var

The fadd, faddp, and fiadd Instructions

The fadd instruction, with no operands, is a synonym for faddp. The faddp instruction (also with no operands) pops the two values on the top of stack, adds them, and pushes their sum back onto the stack.
With two FPU register operands, the fadd instruction behaves like the x86-64’s add instruction: it adds the value in the source register operand to the value in the destination register operand. One of the register operands must be ST(0).
The faddp instruction with two operands adds ST(0) (which must always be the source operand) to the destination operand and then pops ST(0). The destination operand must be one of the other FPU registers.
With a memory operand, fadd adds a 32- or 64-bit floating-point variable to the value in ST(0). This instruction converts the 32- or 64-bit operand to an 80-bit extended-precision value before performing the addition. Note that this instruction does not allow an 80-bit memory operand.
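
The fiadd instruction named in the heading is not described above. As a brief sketch based on its standard x87 behavior, it takes a 16- or 32-bit integer memory operand, converts it to extended precision, and adds it to ST(0). The variable names and values here are hypothetical.

```asm
            .const
half        real8   0.5

            .data
count       dword   3
sum         real8   ?

            .code

            fld     half        ; ST(0) = 0.5
            fiadd   count       ; ST(0) = 0.5 + 3 = 3.5
            fstp    sum         ; sum = 3.5; pops ST(0)
```

This avoids a separate fild followed by faddp when one of the addends is already an integer in memory.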

; Demonstration of various forms of fadd

        option  casemap:none

nl          =       10


            .const
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtAdd1     byte    "fadd: st0:%f", nl, 0
fmtAdd2     byte    "faddp: st0:%f", nl, 0
fmtAdd3     byte    "fadd st(1), st(0): st0:%f, st1:%f", nl, 0
fmtAdd4     byte    "fadd st(0), st(1): st0:%f, st1:%f", nl, 0
fmtAdd5     byte    "faddp st(1), st(0): st0:%f", nl, 0
fmtAdd6     byte    "fadd mem: st0:%f", nl, 0

zero        real8   0.0
one         real8   1.0
two         real8   2.0
minusTwo    real8   -2.0
        
            .data
st0         real8   0.0
st1         real8   0.0
        
            .code
            externdef printf:proc
            
; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            call    printf
            add     rsp, 40
            ret
printFP     endp

; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage
        
; Demonstrate various fadd instructions:

            mov     rax, qword ptr one
            mov     qword ptr st1, rax
            mov     rax, qword ptr minusTwo
            mov     qword ptr st0, rax
            lea     rcx, fmtSt0St1
            call    printFP

; fadd (same as faddp)
 
            fld     one
            fld     minusTwo
            fadd                    ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtAdd1
            call    printFP            
            
; faddp:
 
            fld     one
            fld     minusTwo
            faddp                   ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtAdd2
            call    printFP            
            
; fadd st(1), st(0)
 
            fld     one
            fld     minusTwo
            fadd    st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtAdd3
            call    printFP            
            
; fadd st(0), st(1)
 
            fld     one
            fld     minusTwo
            fadd    st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtAdd4
            call    printFP            
            
; faddp st(1), st(0)
 
            fld     one
            fld     minusTwo
            faddp   st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtAdd5
            call    printFP            
            
; fadd mem64
 
            fld     one
            fadd    two
            fstp    st0
            
            lea     rcx, fmtAdd6
            call    printFP            
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The fsub, fsubp, fsubr, fsubrp, fisub, and fisubr Instructions

With no operands, fsub is the same as fsubp (without operands). With no operands, the fsubp instruction pops ST(0) and ST(1) from the register stack, computes ST(1) – ST(0), and then pushes the difference back onto the stack. The fsubr and fsubrp instructions (reverse subtraction) operate in an identical fashion except they compute ST(0) – ST(1).
With two register operands (destination, source), the fsub instruction computes destination = destination – source. One of the two registers must be ST(0). With two registers as operands, the fsubp also computes destination = destination – source, and then it pops ST(0) off the stack after computing the difference. For the fsubp instruction, the source operand must be ST(0).
With two register operands, the fsubr and fsubrp instructions work in a similar fashion to fsub and fsubp, except they compute destination = source – destination.
The fsub mem32, fsub mem64, fsubr mem32, and fsubr mem64 instructions accept a 32- or 64-bit memory operand. They convert the memory operand to an 80-bit extended-precision value and subtract this from ST(0) (fsub) or subtract ST(0) from this value (fsubr) and store the result back into ST(0).

; Demonstration of various forms of fsub/fsubr

        option  casemap:none

nl          =       10


            .const
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtSub1     byte    "fsub: st0:%f", nl, 0
fmtSub2     byte    "fsubp: st0:%f", nl, 0
fmtSub3     byte    "fsub st(1), st(0): st0:%f, st1:%f", nl, 0
fmtSub4     byte    "fsub st(0), st(1): st0:%f, st1:%f", nl, 0
fmtSub5     byte    "fsubp st(1), st(0): st0:%f", nl, 0
fmtSub6     byte    "fsub mem: st0:%f", nl, 0
fmtSub7     byte    "fsubr st(1), st(0): st0:%f, st1:%f", nl, 0
fmtSub8     byte    "fsubr st(0), st(1): st0:%f, st1:%f", nl, 0
fmtSub9     byte    "fsubrp st(1), st(0): st0:%f", nl, 0
fmtSub10    byte    "fsubr mem: st0:%f", nl, 0

zero        real8   0.0
three       real8   3.0
minusTwo    real8   -2.0
        
            .data
st0         real8   0.0
st1         real8   0.0
        
            .code
            externdef printf:proc

; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            call    printf
            add     rsp, 40
            ret
printFP     endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage
        
; Demonstrate various fsub instructions:

            mov     rax, qword ptr three
            mov     qword ptr st1, rax
            mov     rax, qword ptr minusTwo
            mov     qword ptr st0, rax
            lea     rcx, fmtSt0St1
            call    printFP

; fsub (same as fsubp)
 
            fld     three
            fld     minusTwo
            fsub                    ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtSub1
            call    printFP            
            
; fsubp:
 
            fld     three
            fld     minusTwo
            fsubp                   ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtSub2
            call    printFP            
            
; fsub st(1), st(0)
 
            fld     three
            fld     minusTwo
            fsub    st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtSub3
            call    printFP            
            
; fsub st(0), st(1)
 
            fld     three
            fld     minusTwo
            fsub    st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtSub4
            call    printFP            
            
; fsubp st(1), st(0)
 
            fld     three
            fld     minusTwo
            fsubp   st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtSub5
            call    printFP            
            
; fsub mem64
 
            fld     three
            fsub    minusTwo
            fstp    st0
            
            lea     rcx, fmtSub6
            call    printFP            
            
	
	
; fsubr st(1), st(0)
 
            fld     three
            fld     minusTwo
            fsubr   st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtSub7
            call    printFP            
            
; fsubr st(0), st(1)
 
            fld     three
            fld     minusTwo
            fsubr   st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtSub8
            call    printFP            
            
; fsubrp st(1), st(0)
 
            fld     three
            fld     minusTwo
            fsubrp  st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtSub9
            call    printFP            
            
; fsubr mem64
 
            fld     three
            fsubr   minusTwo
            fstp    st0
            
            lea     rcx, fmtSub10
            call    printFP            
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The fmul, fmulp, and fimul Instructions

The fmul and fmulp instructions multiply two floating-point values. The fimul instruction multiplies an integer and a floating-point value.
With no operands, fmul is a synonym for fmulp. The fmulp instruction, with no operands, will pop ST(0) and ST(1), multiply these values, and push their product back onto the stack. The fmul instructions with two register operands compute destination = destination × source. One of the registers (source or destination) must be ST(0).

; Demonstration of various forms of fmul

        option  casemap:none

nl          =       10


            .const
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtMul1     byte    "fmul: st0:%f", nl, 0
fmtMul2     byte    "fmulp: st0:%f", nl, 0
fmtMul3     byte    "fmul st(1), st(0): st0:%f, st1:%f", nl, 0
fmtMul4     byte    "fmul st(0), st(1): st0:%f, st1:%f", nl, 0
fmtMul5     byte    "fmulp st(1), st(0): st0:%f", nl, 0
fmtMul6     byte    "fmul mem: st0:%f", nl, 0

zero        real8   0.0
three       real8   3.0
minusTwo    real8   -2.0
        
            .data
st0         real8   0.0
st1         real8   0.0
        
            .code
            externdef printf:proc
            
; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            call    printf
            add     rsp, 40
            ret
printFP     endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage
        
; Demonstrate various fmul instructions:

            mov     rax, qword ptr three
            mov     qword ptr st1, rax
            mov     rax, qword ptr minusTwo
            mov     qword ptr st0, rax
            lea     rcx, fmtSt0St1
            call    printFP

; fmul (same as fmulp)
 
            fld     three
            fld     minusTwo
            fmul                    ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtMul1
            call    printFP            
            
; fmulp:
 
            fld     three
            fld     minusTwo
            fmulp                   ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtMul2
            call    printFP            
            
; fmul st(1), st(0)
 
            fld     three
            fld     minusTwo
            fmul    st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtMul3
            call    printFP            
            
; fmul st(0), st(1)
 
            fld     three
            fld     minusTwo
            fmul    st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtMul4
            call    printFP            
            
; fmulp st(1), st(0)
 
            fld     three
            fld     minusTwo
            fmulp   st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtMul5
            call    printFP            
            
; fmul mem64
 
            fld     three
            fmul    minusTwo
            fstp    st0
            
            lea     rcx, fmtMul6
            call    printFP            
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The fdiv, fdivp, fdivr, fdivrp, fidiv, and fidivr Instructions

With no operands, the fdiv instruction is a synonym for fdivp. The fdivp instruction with no operands computes ST(1) = ST(1) / ST(0). The fdivr and fdivrp instructions work in a similar fashion to fdiv and fdivp except that they compute ST(0) / ST(1) rather than ST(1) / ST(0).
With two register operands, these instructions compute the following quotients:

fdiv st(0), st(i) ; st(0) = st(0)/st(i)
fdiv st(i), st(0) ; st(i) = st(i)/st(0)
fdivp st(i), st(0) ; st(i) = st(i)/st(0) then pop st0
fdivr st(0), st(i) ; st(0) = st(i)/st(0)
fdivr st(i), st(0) ; st(i) = st(0)/st(i)
fdivrp st(i), st(0) ; st(i) = st(0)/st(i) then pop st0

The fdivp and fdivrp instructions also pop ST(0) after performing the division operation. The value for i in these two instructions is computed before popping ST(0).

; Demonstration of various forms of fdiv/fdivr

        option  casemap:none

nl          =       10


            .const
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtDiv1     byte    "fdiv: st0:%f", nl, 0
fmtDiv2     byte    "fdivp: st0:%f", nl, 0
fmtDiv3     byte    "fdiv st(1), st(0): st0:%f, st1:%f", nl, 0
fmtDiv4     byte    "fdiv st(0), st(1): st0:%f, st1:%f", nl, 0
fmtDiv5     byte    "fdivp st(1), st(0): st0:%f", nl, 0
fmtDiv6     byte    "fdiv mem: st0:%f", nl, 0
fmtDiv7     byte    "fdivr st(1), st(0): st0:%f, st1:%f", nl, 0
fmtDiv8     byte    "fdivr st(0), st(1): st0:%f, st1:%f", nl, 0
fmtDiv9     byte    "fdivrp st(1), st(0): st0:%f", nl, 0
fmtDiv10    byte    "fdivr mem: st0:%f", nl, 0

three       real8   3.0
minusTwo    real8   -2.0
        
            .data
st0         real8   0.0
st1         real8   0.0
        
            .code
            externdef printf:proc
            
; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            call    printf
            add     rsp, 40
            ret
printFP     endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage
        
; Demonstrate various fdiv instructions:

            mov     rax, qword ptr three
            mov     qword ptr st1, rax
            mov     rax, qword ptr minusTwo
            mov     qword ptr st0, rax
            lea     rcx, fmtSt0St1
            call    printFP

; fdiv (same as fdivp)
 
            fld     three
            fld     minusTwo
            fdiv                    ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtDiv1
            call    printFP            
            
; fdivp:
 
            fld     three
            fld     minusTwo
            fdivp                   ;Pops st(0)!
            fstp    st0
            
            lea     rcx, fmtDiv2
            call    printFP            
            
; fdiv st(1), st(0)
 
            fld     three
            fld     minusTwo
            fdiv    st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtDiv3
            call    printFP            
            
; fdiv st(0), st(1)
 
            fld     three
            fld     minusTwo
            fdiv    st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtDiv4
            call    printFP            
            
; fdivp st(1), st(0)
 
            fld     three
            fld     minusTwo
            fdivp   st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtDiv5
            call    printFP            
            
; fdiv mem64
 
            fld     three
            fdiv    minusTwo
            fstp    st0
            
            lea     rcx, fmtDiv6
            call    printFP            
            
            
            
; fdivr st(1), st(0)
 
            fld     three
            fld     minusTwo
            fdivr   st(1), st(0)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtDiv7
            call    printFP            
            
; fdivr st(0), st(1)
 
            fld     three
            fld     minusTwo
            fdivr   st(0), st(1)
            fstp    st0
            fstp    st1
            
            lea     rcx, fmtDiv8
            call    printFP            
            
; fdivrp st(1), st(0)
 
            fld     three
            fld     minusTwo
            fdivrp  st(1), st(0)
            fstp    st0
            
            lea     rcx, fmtDiv9
            call    printFP            
            
; fdivr mem64
 
            fld     three
            fdivr   minusTwo
            fstp    st0
            
            lea     rcx, fmtDiv10
            call    printFP            
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The fprem and fprem1 Instructions

The fprem and fprem1 instructions compute a partial remainder (a value that may require additional computation to produce the actual remainder). You should always use fprem1 in new software; therefore, we will discuss only fprem1 here, although you use fprem in an identical fashion.
fprem1 computes the partial remainder of ST(0) / ST(1). If the difference between the exponents of ST(0) and ST(1) is less than 64, fprem1 can compute the exact remainder in one operation. Otherwise, you will have to execute fprem1 two or more times to get the correct remainder value.
The C2 condition code bit determines when the computation is complete.
fprem1 does not pop the two operands off the stack; it leaves the partial remainder in ST(0) and the original divisor in ST(1) in case you need to compute another partial remainder to complete the result.
The fprem1 instruction sets the stack exception flag if there aren’t two values on the top of stack. It sets the underflow and denormal exception bits if the result is too small. It sets the invalid operation bit if the values on TOS are inappropriate for this operation. It sets the C2 condition code bit if the partial remainder operation is not complete (or on stack underflow). Finally, it loads C1, C3, and C0 with bits 0, 1, and 2 of the quotient, respectively.

; Compute z = x % y:
 
        fld y
        fld x

repeatLp:

        fprem1
        fstsw ax ; Get condition code bits into AX
        and ah, 4 ; See if C2 is set (status word bit 10 = bit 2 of AH)
        jnz repeatLp ; Repeat until C2 is clear
        fstp z ; Store away the remainder
        fstp st(0) ; Pop old y value

The frndint Instruction

The frndint instruction rounds the value on TOS to the nearest integer by using the rounding algorithm specified in the control register.
This instruction sets the stack exception flag if there is no value on the TOS (it will also clear C1 in this case). It sets the precision and denormal exception bits if a loss of precision occurred. It sets the invalid operation flag if the value on the TOS is not a valid number.

The fabs Instruction

fabs computes the absolute value of ST(0) by clearing the mantissa sign bit of ST(0). It sets the stack exception bit and invalid operation bits if the stack is empty.

; Compute x = sqrt(abs(x)):
    fld x
    fabs
    fsqrt
    fstp x

The fchs Instruction

fchs changes the sign of ST(0)’s value by inverting the mantissa sign bit (this is the floating-point negation instruction). It sets the stack exception bit and invalid operation bits if the stack is empty.

; Compute x = -x if x is positive, x = x if x is negative.
; That is, force x to be a negative value.

         fld x
         fabs
         fchs
         fstp x

The fcom, fcomp, and fcompp Instructions

The fcom, fcomp, and fcompp instructions compare ST(0) to the specified operand and set the corresponding FPU condition code bits based on the result of the comparison.
With no operands, fcom, fcomp, and fcompp compare ST(0) against ST(1) and set the FPU flags accordingly. In addition, fcomp pops ST(0) off the stack, and fcompp pops both ST(0) and ST(1) off the stack.
With a single-register operand, fcom and fcomp compare ST(0) against the specified register. fcomp also pops ST(0) after the comparison.
These instructions set C2 if the two operands are not comparable. These instructions set the stack fault bit if there aren’t two items on the top of the register stack. They set the denormalized exception bit if either or both operands are denormalized. They set the invalid operation flag if either or both operands are NaNs. These instructions always clear the C1 condition code.
Because all x86-64 CPUs support the fcomi and fcomip instructions (described in the next section), you should consider using those instructions instead: they spare you from having to store the FPU status word into AX and then copy AH into the FLAGS register before testing the condition.

; Demonstration of fcom instructions

        option  casemap:none

nl          =       10


            .const
fcomFmt     byte    "fcom %f < %f is %d", nl, 0
fcomFmt2    byte    "fcom(2) %f < %f is %d", nl, 0
fcomFmt3    byte    "fcom st(1) %f < %f is %d", nl, 0
fcomFmt4    byte    "fcom st(1) (2) %f < %f is %d", nl, 0
fcomFmt5    byte    "fcom mem %f < %f is %d", nl, 0
fcomFmt6    byte    "fcom mem (2) %f < %f is %d", nl, 0
fcompFmt    byte    "fcomp %f < %f is %d", nl, 0
fcompFmt2   byte    "fcomp (2) %f < %f is %d", nl, 0
fcompFmt3   byte    "fcomp st(1) %f < %f is %d", nl, 0
fcompFmt4   byte    "fcomp st(1) (2) %f < %f is %d", nl, 0
fcompFmt5   byte    "fcomp mem %f < %f is %d", nl, 0
fcompFmt6   byte    "fcomp mem (2) %f < %f is %d", nl, 0
fcomppFmt   byte    "fcompp %f < %f is %d", nl, 0
fcomppFmt2  byte    "fcompp (2) %f < %f is %d", nl, 0

three       real8   3.0
zero        real8   0.0
minusTwo    real8   -2.0
        
            .data
st0         real8   ?
st1         real8   ?
        
            .code
            externdef printf:proc
            
; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            movzx   r9, al
            call    printf
            add     rsp, 40
            ret
printFP     endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage

; fcom demo
        
            xor     eax, eax
            fld     three
            fld     zero
            fcom
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomFmt
            call    printFP
            
; fcom demo 2
        
            xor     eax, eax
            fld     zero
            fld     three
            fcom
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomFmt2
            call    printFP

; fcom st(i) demo
        
            xor     eax, eax
            fld     three
            fld     zero
            fcom    st(1)
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomFmt3
            call    printFP
            
; fcom st(i) demo 2
        
            xor     eax, eax
            fld     zero
            fld     three
            fcom    st(1)
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomFmt4
            call    printFP
            
; fcom mem64 demo
        
            xor     eax, eax
            fld     three           ;Never on stack so
            fstp    st1             ; copy for output
            fld     zero
            fcom    three
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            lea     rcx, fcomFmt5
            call    printFP
                        
; fcom mem64 demo 2
        
            xor     eax, eax
            fld     zero            ;Never on stack so
            fstp    st1             ; copy for output
            fld     three
            fcom    zero
            fstsw   ax
            sahf
            setb    al
            fstp    st0
            lea     rcx, fcomFmt6
            call    printFP
                        
; fcomp demo
        
            xor     eax, eax
            fld     zero
            fld     three
            fst     st0             ;Because this gets popped
            fcomp
            fstsw   ax
            sahf
            setb    al
            fstp    st1
            lea     rcx, fcompFmt
            call    printFP
                        
; fcomp demo 2
        
            xor     eax, eax
            fld     three
            fld     zero
            fst     st0             ;Because this gets popped
            fcomp
            fstsw   ax
            sahf
            setb    al
            fstp    st1
            lea     rcx, fcompFmt2
            call    printFP
                        
; fcomp demo 3
        
            xor     eax, eax
            fld     zero
            fld     three
            fst     st0             ;Because this gets popped
            fcomp   st(1)
            fstsw   ax
            sahf
            setb    al
            fstp    st1
            lea     rcx, fcompFmt3
            call    printFP
                        
; fcomp demo 4
        
            xor     eax, eax
            fld     three
            fld     zero
            fst     st0             ;Because this gets popped
            fcomp   st(1)
            fstsw   ax
            sahf
            setb    al
            fstp    st1
            lea     rcx, fcompFmt4
            call    printFP
                        
; fcomp demo 5
        
            xor     eax, eax
            fld     three
            fstp    st1
            fld     zero
            fst     st0             ;Because this gets popped
            fcomp   three
            fstsw   ax
            sahf
            setb    al
            lea     rcx, fcompFmt5
            call    printFP
                        
; fcomp demo 6
        
            xor     eax, eax
            fld     zero
            fstp    st1
            fld     three
            fst     st0             ;Because this gets popped
            fcomp   zero
            fstsw   ax
            sahf
            setb    al
            lea     rcx, fcompFmt6
            call    printFP
                        
; fcompp demo
        
            xor     eax, eax
            fld     zero
            fst     st1             ;Because this gets popped
            fld     three
            fst     st0             ;Because this gets popped
            fcompp  
            fstsw   ax
            sahf
            setb    al
            lea     rcx, fcomppFmt
            call    printFP
                        
; fcompp demo 2
        
            xor     eax, eax
            fld     three
            fst     st1             ;Because this gets popped
            fld     zero
            fst     st0             ;Because this gets popped
            fcompp  
            fstsw   ax
            sahf
            setb    al
            lea     rcx, fcomppFmt2
            call    printFP
                        
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The fcomi and fcomip Instructions

The fcomi and fcomip instructions compare ST(0) to the specified operand and set the corresponding FLAGS condition code bits based on the result of the comparison.

; Demonstration of fcomi and fcomip instructions

        option  casemap:none

nl          =       10


            .const
fcomiFmt    byte    "fcomi %f < %f is %d", nl, 0
fcomiFmt2   byte    "fcomi(2) %f < %f is %d", nl, 0
fcomipFmt   byte    "fcomip %f < %f is %d", nl, 0
fcomipFmt2  byte    "fcomip (2) %f < %f is %d", nl, 0

three       real8   3.0
zero        real8   0.0
minusTwo    real8   -2.0
        
            .data
st0         real8   ?
st1         real8   ?
        
            .code
            externdef printf:proc
            
; printFP- Prints values of st0 and (possibly) st1.
;          Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40
            
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            movzx   r9, al
            call    printf
            add     rsp, 40
            ret
printFP     endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48   ;Shadow storage

; Test to see if 0 < 3
; Note: st(0) contains zero, st(1) contains three
        
            xor     eax, eax
            fld     three
            fld     zero
            fcomi   st(0), st(1)
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomiFmt
            call    printFP
            
; Test to see if 3 < 0
; Note: st(0) contains three, st(1) contains zero
        
            xor     eax, eax
            fld     zero
            fld     three
            fcomi   st(0), st(1)
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomiFmt2
            call    printFP
                        
; Test to see if 3 < 0
; Note: st(0) contains three, st(1) contains zero
        
            xor     eax, eax
            fld     zero
            fld     three
            fst     st0             ;Because this gets popped
            fcomip   st(0), st(1)
            setb    al
            fstp    st1
            lea     rcx, fcomipFmt
            call    printFP
                        
; Test to see if 0 < 3
; Note: st(0) contains zero, st(1) contains three
        
            xor     eax, eax
            fld     three
            fld     zero
            fst     st0             ;Because this gets popped
            fcomip   st(0), st(1)
            setb    al
            fstp    st1
            lea     rcx, fcomipFmt2
            call    printFP
                        
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The ftst Instruction

The ftst instruction compares the value in ST(0) against 0.0.
This instruction does not differentiate –0.0 from +0.0.
If the value in ST(0) is either of these values, ftst will set C3 to denote equality.

ftst
fstsw ax
sahf
sete al ; Set AL to 1 if TOS = 0.0

Constant Instructions

The FPU provides several instructions that let you load commonly used constants onto the FPU’s register stack.
These instructions set the stack fault, invalid operation, and C1 flags if a stack overflow occurs.

fldz ; Pushes +0.0
fld1 ; Pushes +1.0
fldpi ; Pushes pi (3.14159...)
fldl2t ; Pushes log2(10)
fldl2e ; Pushes log2(e)
fldlg2 ; Pushes log10(2)
fldln2 ; Pushes ln(2)

The f2xm1 Instruction

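f2xm1 computes 2^x - 1, where x is the value in ST(0).
The value in ST(0) must be in the range –1.0 to +1.0. If ST(0) is out of range, f2xm1 generates an undefined result but raises no exceptions.
The computed value replaces the value in ST(0).
Here’s an example computing 10^i using the identity 10^i = 2^(i × log2(10)). It is valid only when i × log2(10) stays within the –1.0 to +1.0 range.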

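fld i       ; ST(0) = i
fldl2t      ; ST(0) = log2(10), ST(1) = i
fmul        ; ST(0) = i * log2(10)
f2xm1       ; ST(0) = 2^(i * log2(10)) - 1 = 10^i - 1
fld1        ; ST(0) = 1.0, ST(1) = 10^i - 1
fadd        ; ST(0) = 10^i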

The fsin, fcos, and fsincos Instructions

These instructions pop the value off the top of the register stack and compute the sine, cosine, or both, and push the result(s) back onto the stack.
The fsincos instruction pushes the sine followed by the cosine of the original operand; hence, it leaves cos(ST(0)) in ST(0) and sin(ST(0)) in ST(1).
These instructions assume ST(0) specifies an angle in radians, and this angle must be in the range –2^63 to +2^63.
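As a rough illustration, the following fragment obtains both the sine and cosine of one angle with a single fsincos. The variable names (angle, sinVal, cosVal) are made up for this sketch, and it assumes the angle is already in radians and within the valid range.

```asm
            .data
angle       real8   0.5             ; Hypothetical input, in radians
sinVal      real8   ?
cosVal      real8   ?

            .code
; Inside a procedure such as asmMain:

            fld     angle           ; ST(0) = angle
            fsincos                 ; ST(0) = cos(angle), ST(1) = sin(angle)
            fstp    cosVal          ; Store and pop the cosine
            fstp    sinVal          ; Store and pop the sine
```

Because the cosine ends up on top of the stack, it is stored first; both fstp instructions leave the FPU stack as it was before the fragment ran.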

The fptan Instruction

fptan computes the tangent of ST(0), replaces ST(0) with this value, and then pushes 1.0 onto the stack.

The fpatan Instruction

fpatan expects two values on the top of stack. It pops them, computes atan(ST(1) / ST(0)), and pushes the result. The resulting value is the arctangent of that ratio, expressed in radians.
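A minimal sketch of fpatan used as a two-argument arctangent might look like the following. The names yCoord, xCoord, and theta are hypothetical; the load order matters because the instruction divides ST(1) by ST(0).

```asm
            .data
yCoord      real8   1.0             ; Hypothetical numerator (y)
xCoord      real8   1.0             ; Hypothetical denominator (x)
theta       real8   ?

            .code
; Inside a procedure such as asmMain:

            fld     yCoord          ; ST(0) = y
            fld     xCoord          ; ST(0) = x, ST(1) = y
            fpatan                  ; ST(0) = atan(y / x); both inputs are popped
            fstp    theta           ; theta = angle in radians
```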

The fyl2x Instruction

The fyl2x instruction computes ST(1) × log2(ST(0)). The instruction itself has no operands, but expects two operands on the FPU stack in ST(1) and ST(0).
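One common use of fyl2x is computing a logarithm in another base by loading a suitable multiplier into ST(1). The sketch below assumes a positive input and uses hypothetical variable names (xVal, log10x); it relies on log10(x) = log10(2) × log2(x).

```asm
            .data
xVal        real8   1000.0          ; Hypothetical input, must be > 0
log10x      real8   ?

            .code
; Inside a procedure such as asmMain:

            fldlg2                  ; ST(0) = log10(2)
            fld     xVal            ; ST(0) = x, ST(1) = log10(2)
            fyl2x                   ; ST(0) = log10(2) * log2(x) = log10(x)
            fstp    log10x          ; Store the result and pop it
```

Swapping fldlg2 for fldln2 would give the natural logarithm by the same pattern.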

The fyl2xp1 Instruction

fyl2xp1 computes ST(1) × log2(ST(0) + 1.0) from two operands on the FPU stack.

The finit and fninit Instructions

The finit and fninit instructions initialize the FPU for proper operation.
Your code should execute one of these instructions before executing any other FPU instructions. They initialize the control register to 37Fh, the status register to 0, and the tag word to 0FFFFh.
The difference between finit and fninit is that finit first checks for any pending floating-point exceptions before initializing the FPU; fninit does not.

finit
fninit

The fldcw and fstcw Instructions

The fldcw and fstcw instructions load the control word from a memory location (fldcw) or store the control word to a 16-bit memory location (fstcw).
When you use fldcw to turn on one of the exceptions, if the corresponding exception flag is set when you enable that exception, the FPU will generate an immediate interrupt before the CPU executes the next instruction. Therefore, you should use fclex to clear any pending interrupts before changing the FPU exception enable bits.

fldcw mem16
fstcw mem16
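The control word also holds the rounding-control field (bits 10 and 11) that frndint consults. As a sketch, the fragment below saves the current control word, switches rounding to truncation (both bits set), rounds a value, and then restores the original setting. The variable names are hypothetical.

```asm
            .data
oldCW       word    ?               ; Saved control word
truncCW     word    ?               ; Modified copy with RC = 11b
fpVal       real8   2.75            ; Hypothetical value to truncate

            .code
; Inside a procedure such as asmMain:

            fstcw   oldCW           ; Save the current control word
            mov     ax, oldCW
            or      ax, 0C00h       ; Set bits 10 and 11: round toward zero
            mov     truncCW, ax
            fldcw   truncCW         ; Activate truncation

            fld     fpVal
            frndint                 ; Rounds using the mode just selected
            fstp    fpVal

            fldcw   oldCW           ; Restore the caller's rounding mode
```

Restoring the saved word matters because other code generally expects the default round-to-nearest mode.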

The fclex and fnclex Instructions

The fclex and fnclex instructions clear all exception bits, the stack fault bit, and the busy flag in the FPU status register.
The difference between these instructions is the same as that between finit and fninit: fclex first checks for pending floating-point exceptions.

fclex
fnclex

The fstsw and fnstsw Instructions

These instructions store the FPU status word into a 16-bit memory location or the AX register:

fstsw ax
fnstsw ax
fstsw mem16
fnstsw mem16

These instructions are unusual in the sense that they can copy an FPU value into one of the x86-64 general-purpose registers (specifically, AX). The purpose is to allow the CPU to easily test the condition code register with the sahf instruction. The difference between fstsw and fnstsw is the same as that for fclex and fnclex.

Converting Floating-Point Expressions to Assembly Language

The FPU uses postfix notation (also called reverse Polish notation, or RPN) for arithmetic operations.
Postfix notation, as opposed to standard infix notation, places the operands before the operator. For example, the infix expression (y + z) * 2 becomes y z + 2 * in postfix.

Postfix is a convenient notation because it’s easy to translate into FPU instructions: each operand becomes a load and each operator becomes an arithmetic instruction. Another advantage of postfix notation is that it doesn’t require any parentheses.

Examples in assembly:

; y z + 2 *

    fld y
    fld z
    fadd
    fld const2 ; const2 real8 2.0 in .data section
    fmul

; y 2 * a b + -

    fld y
    fld const2 ; const2 real8 2.0 in .data section
    fmul
    fld a
    fld b
    fadd
    fsub

; a b + c d + *

    fld a
    fld b
    fadd
    fld c
    fld d
    fadd
    fmul

Conversion Example with Steps

[Figure: postfix to assembly conversion example with steps]
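One way to walk through a conversion is sketched below for the infix expression (aVal - bVal) / (cVal + dVal). All variable names are hypothetical. The expression is first rewritten in postfix as aVal bVal - cVal dVal + /, and then each token is translated in order, with the comments tracking the FPU stack (top of stack listed first).

```asm
            .data
aVal        real8   9.0             ; Hypothetical operands
bVal        real8   1.0
cVal        real8   3.0
dVal        real8   1.0
result      real8   ?

            .code
; Inside a procedure such as asmMain:
; Postfix: aVal bVal - cVal dVal + /

            fld     aVal            ; Stack: a
            fld     bVal            ; Stack: b, a
            fsub                    ; Stack: (a - b)
            fld     cVal            ; Stack: c, (a - b)
            fld     dVal            ; Stack: d, c, (a - b)
            fadd                    ; Stack: (c + d), (a - b)
            fdiv                    ; Stack: (a - b) / (c + d)
            fstp    result          ; Store the quotient; stack is empty again
```

The no-operand forms of fsub and fdiv compute ST(1) op ST(0) and pop, which is why the operand pushed first ends up on the left of each operator.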

SSE Floating-Point Arithmetic

Most modern programs favor the use of the SSE (and later) registers and instructions for floating-point operations over the x87 FPU, using only those x87 operations available exclusively on the x87.
The SSE instruction set supports two floating-point data types: 32-bit single-precision (Intel calls these scalar single operations) and 64-bit double-precision values (Intel calls these scalar double operations).
The SSE does not support the 80-bit extended-precision floating-point data types of the x87 FPU. If you need the extended-precision format, you’ll have to use the x87 FPU.

SSE MXCSR Register

The SSE MXCSR register is a 32-bit status and control register that controls SSE floating-point operations. Bits 16 to 31 are reserved and currently have no meaning.

Access to the SSE MXCSR register is via the following two instructions:

ldmxcsr mem32
stmxcsr mem32

The ldmxcsr instruction loads the MXCSR register from the specified 32-bit memory location. The stmxcsr instruction stores the current contents of the MXCSR register to the specified memory location.
By far, the most common use of these two instructions is to set the rounding mode. In typical programs using the SSE floating-point instructions, it is common to switch between the round-to-nearest and round-to-zero (truncate) modes.
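As a sketch of that pattern, the fragment below switches the SSE rounding mode to truncation and later restores it. It relies on the rounding-control field occupying bits 13 and 14 of MXCSR, with both bits set selecting round-to-zero. The variable names are hypothetical.

```asm
            .data
savedCSR    dword   ?               ; Original MXCSR contents
truncCSR    dword   ?               ; Copy with rounding control = 11b

            .code
; Inside a procedure such as asmMain:

            stmxcsr savedCSR        ; Save the current MXCSR
            mov     eax, savedCSR
            or      eax, 6000h      ; Set bits 13 and 14: round toward zero
            mov     truncCSR, eax
            ldmxcsr truncCSR        ; Activate truncation

; ... SSE operations that should truncate go here ...

            ldmxcsr savedCSR        ; Restore the original rounding mode
```

Both instructions accept only a memory operand, which is why the value takes a round trip through EAX and a second variable.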

SSE Floating-Point Move Instructions

The SSE instruction set provides two instructions to move floating-point values between XMM registers and memory: movss (move scalar single) and movsd (move scalar double).

movss xmmn, mem32
movss mem32, xmmn

movsd xmmn, mem64
movsd mem64, xmmn

Like the mov instruction for the general-purpose registers, the movss and movsd instructions move data between an appropriate memory location and one of the 16 XMM registers (XMM0 to XMM15).
For maximum performance, movss memory operands should appear at a double-word-aligned memory address, and movsd memory operands should appear at a quad-word-aligned memory address.
In addition to the movss and movsd instructions that move floating-point values between XMM registers or XMM registers and memory, you’ll find a couple of other SSE move instructions useful that move data between XMM and general-purpose registers, movd and movq:

movd reg32, xmmn
movd xmmn, reg32

movq reg64, xmmn
movq xmmn, reg64

These instructions also have a form that allows a source memory operand. However, you should use movss and movsd to move floating-point variables into XMM registers.
The movq and movd instructions are especially useful for copying XMM registers into 64-bit general-purpose registers prior to a call to printf().
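A minimal sketch of that use follows, under the same Windows x64 varargs convention as the demo programs in this chapter, where printf expects the double for its second argument to be available in RDX. The format string and variable names are hypothetical, and the fragment assumes the enclosing procedure has already reserved shadow storage and aligned the stack.

```asm
            .const
fmtStr      byte    "value = %f", 10, 0

            .data
fpVal       real8   1.5             ; Hypothetical value to print

            .code
            externdef printf:proc

; Inside a procedure that has set up its stack frame:

            movsd   xmm1, fpVal     ; Load the double into XMM1
            movq    rdx, xmm1       ; Copy the same bits into RDX for varargs
            lea     rcx, fmtStr     ; First argument: format string
            call    printf
```

The movq copies the bit pattern unchanged; no integer conversion takes place.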

SSE Floating-Point Arithmetic Instructions

;; The addsx, subsx, mulsx, and divsx instructions perform 
;; the expected floating-point arithmetic operations.
addss xmmn, xmmn
addss xmmn, mem32

addsd xmmn, xmmn
addsd xmmn, mem64

subss xmmn, xmmn
subss xmmn, mem32

subsd xmmn, xmmn
subsd xmmn, mem64

mulss xmmn, xmmn
mulss xmmn, mem32

mulsd xmmn, xmmn
mulsd xmmn, mem64

divss xmmn, xmmn
divss xmmn, mem32

divsd xmmn, xmmn
divsd xmmn, mem64

;; The minsx instructions compute the minimum value 
;; of the two operands, storing the minimum value into 
;; the destination (first) operand.
minss xmmn, xmmn
minss xmmn, mem32

minsd xmmn, xmmn
minsd xmmn, mem64


;; The maxsx instructions do the same thing, but 
;; compute the maximum of the two operands.
maxss xmmn, xmmn
maxss xmmn, mem32

maxsd xmmn, xmmn
maxsd xmmn, mem64

;; The sqrtsx instructions compute the square root of 
;; the source (second) operand and store the result  
;; into the destination (first) operand.
sqrtss xmmn, xmmn
sqrtss xmmn, mem32

sqrtsd xmmn, xmmn
sqrtsd xmmn, mem64

;; The rcpss instruction computes an approximate reciprocal of the 
;; source, storing the result into the destination. SSE provides 
;; only a single-precision form.
rcpss xmmn, xmmn
rcpss xmmn, mem32

;; The rsqrtsx instructions compute the reciprocal of the square root.
rsqrtss xmmn, xmmn
rsqrtss xmmn, mem32
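To show how these scalar instructions combine, here is a short sketch that computes the length of a hypotenuse, sqrt(a*a + b*b), entirely in XMM registers. The variable names (sideA, sideB, hypot) are made up for this illustration.

```asm
            .data
sideA       real8   3.0             ; Hypothetical inputs
sideB       real8   4.0
hypot       real8   ?

            .code
; Inside a procedure such as asmMain:

            movsd   xmm0, sideA     ; XMM0 = a
            mulsd   xmm0, xmm0      ; XMM0 = a * a
            movsd   xmm1, sideB     ; XMM1 = b
            mulsd   xmm1, xmm1      ; XMM1 = b * b
            addsd   xmm0, xmm1      ; XMM0 = a*a + b*b
            sqrtsd  xmm0, xmm0      ; XMM0 = sqrt(a*a + b*b)
            movsd   hypot, xmm0     ; Store the result
```

Unlike the x87 stack model, every instruction names its registers explicitly, so intermediate values stay wherever you put them.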

SSE Floating-Point Comparisons

The SSE floating-point comparisons work quite a bit differently from the integer and x87 FPU compare instructions. Rather than having a single generic instruction that sets flags, the SSE provides a set of condition-specific comparison instructions that store true (all 1 bits) or false (all 0 bits) into the destination operand.

cmpss xmmn, xmmm/mem32, imm8
cmpsd xmmn, xmmm/mem64, imm8

cmpeqss xmmn, xmmm/mem32
cmpltss xmmn, xmmm/mem32
cmpless xmmn, xmmm/mem32
cmpunordss xmmn, xmmm/mem32
cmpneqss xmmn, xmmm/mem32
cmpnltss xmmn, xmmm/mem32
cmpnless xmmn, xmmm/mem32
cmpordss xmmn, xmmm/mem32

cmpeqsd xmmn, xmmm/mem64
cmpltsd xmmn, xmmm/mem64
cmplesd xmmn, xmmm/mem64
cmpunordsd xmmn, xmmm/mem64
cmpneqsd xmmn, xmmm/mem64
cmpnltsd xmmn, xmmm/mem64
cmpnlesd xmmn, xmmm/mem64
cmpordsd xmmn, xmmm/mem64

The instructions without the third (immediate) operand are special pseudo-ops MASM provides that automatically supply the appropriate third operand. You can use the nlt form for ge and nle form for gt, assuming the operands are ordered.
The unordered comparison returns true if either (or both) operands are unordered. Likewise, the ordered comparison returns true if both operands are ordered.
As noted, these instructions leave 0 or all 1 bits in the destination register to represent false or true.
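Because the result is a mask rather than a set of flags, one simple way to branch on it is to copy the mask into a general-purpose register and test it. The sketch below takes that approach; the variable names and the notLess label are hypothetical.

```asm
            .data
xVal        real8   1.0             ; Hypothetical operands
yVal        real8   2.0

            .code
; Inside a procedure such as asmMain:

            movsd   xmm0, xVal      ; XMM0 = xVal
            cmpltsd xmm0, yVal      ; Low 64 bits = all 1s if xVal < yVal, else 0
            movq    rax, xmm0       ; Copy the mask into RAX
            test    rax, rax        ; Zero means the comparison was false
            jz      notLess

; ... code that runs only when xVal < yVal ...

notLess:
```

Note that the comparison overwrites the destination register, so reload xVal afterward if the original value is still needed.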

SSE Floating-Point Conversions
