Arithmetic
Chapter 6
Sign And Zero-Extension Instructions
Several arithmetic operations require sign- or zero-extended values before the operation.

For general sign-extension operations, the x86-64 provides two extensions of the mov instruction, movsx and movsxd:
movsxd dest, source ; If dest is 64 bits and source is 32 bits
movsx dest, source ; For all other operand combinations
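To zero-extend a 32-bit register into its 64-bit counterpart, move it to itself; writing a 32-bit register always clears the upper 32 bits of the 64-bit register. For 8- and 16-bit sources, use movzx:
mov eax, eax ; Zero-extends EAX into RAX
movzx eax, al ; Zero-extends AL into EAX (and therefore RAX)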
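As a rough illustration of how the extension forms behave on concrete values, consider the following sketch. The register choices and constants are illustrative assumptions, not part of the original notes:

```masm
mov al, -5 ; AL = 0FBh
movsx eax, al ; EAX = 0FFFFFFFBh (-5 as a 32-bit value)
movsxd rcx, eax ; RCX = 0FFFFFFFFFFFFFFFBh (-5 as a 64-bit value)
mov bl, 0FBh ; Same bit pattern, treated as unsigned (251)
movzx edx, bl ; EDX = 000000FBh; the upper half of RDX is also cleared
```

The same byte, 0FBh, becomes -5 or 251 depending only on which extension instruction widens it.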
The mul and imul Instructions
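The single-operand mul and imul instructions are intended for multiprecision work; the generic multi-operand imul instructions (there is no multi-operand mul) are for general use.
For the single-operand forms, the x86-64 always uses the accumulator (AL, AX, EAX, or RAX) as the implied source operand and stores the double-width product in the accumulator and its extension register: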
mul reg8 ; Returns AX
mul reg16 ; Returns DX:AX
mul reg32 ; Returns EDX:EAX
mul reg64 ; Returns RDX:RAX
mul mem8 ; Returns AX
mul mem16 ; Returns DX:AX
mul mem32 ; Returns EDX:EAX
mul mem64 ; Returns RDX:RAX
imul reg8 ; Returns AX
imul reg16 ; Returns DX:AX
imul reg32 ; Returns EDX:EAX
imul reg64 ; Returns RDX:RAX
imul mem8 ; Returns AX
imul mem16 ; Returns DX:AX
imul mem32 ; Returns EDX:EAX
imul mem64 ; Returns RDX:RAX

However, the generic imul is not a complete replacement for these two instructions; in addition to the number of operands, several differences exist.
There isn’t an 8×8-bit multi-operand imul instruction available.
The generic imul instruction does not produce a 2×n-bit result, but truncates the result to n bits. That is, a 16×16-bit multiplication produces a 16-bit result. Likewise, a 32×32-bit multiplication produces a 32-bit result. These instructions set the carry and overflow flags if the result does not fit into the destination register.
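The truncation and flag behavior just described might be used as in the following sketch. The variable names i32 and j32 and the label productOverflow are hypothetical, chosen only for illustration:

```masm
; Assumes i32 and j32 are sdword variables (hypothetical names)
mov eax, i32
imul eax, j32 ; EAX = EAX * j32, truncated to 32 bits
jo productOverflow ; OF (and CF) are set if the product did not fit
imul ecx, eax, 10 ; Three-operand form: ECX = EAX * 10
jo productOverflow
; ... continue with the 32-bit product in ECX
productOverflow:
; Handle the out-of-range product here
```

Checking the overflow flag right after each multiply is one way to detect the cases where the n-bit result is not the true product.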
The div and idiv Instructions
The x86-64 divide instructions perform a 128/64-bit division, a 64/32-bit division, a 32/16-bit division, or a 16/8-bit division. There is no variant of the div or idiv instructions that allows you to divide a value by a constant. If you want to divide a value by a constant, you need to load the constant into a register or create a memory object (preferably in the .const section) that is initialized with the constant.
.const
ten dword 10
.
.
.
div ten ; Divides EDX:EAX by 10
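Because div also accepts a register operand, the constant can be loaded into a register instead of memory. A minimal sketch, with the values chosen purely for illustration:

```masm
mov eax, 1234 ; Dividend (low 32 bits)
xor edx, edx ; Zero-extend EAX into EDX:EAX
mov ecx, 10 ; Load the constant divisor into a register
div ecx ; EAX = 123 (quotient), EDX = 4 (remainder)
```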
The div instruction is an unsigned division operation.
div reg8
div reg16
div reg32
div reg64
div mem8
div mem16
div mem32
div mem64
idiv reg8
idiv reg16
idiv reg32
idiv reg64
idiv mem8
idiv mem16
idiv mem32
idiv mem64
If the operand is an 8-bit operand, div divides the AX register by the operand, leaving the quotient in AL and the remainder (modulo) in AH. If the operand is a 16-bit quantity, the div instruction divides the 32-bit quantity in DX:AX by the operand, leaving the quotient in AX and the remainder in DX. With 32-bit operands, div divides the 64-bit value in EDX:EAX by the operand, leaving the quotient in EAX and the remainder in EDX. Finally, with 64-bit operands, div divides the 128-bit value in RDX:RAX by the operand, leaving the quotient in RAX and the remainder in RDX.
You cannot, on the x86-64, simply divide one unsigned 8-bit value by another. If the denominator is an 8-bit value, the numerator must be a 16-bit value. If you need to divide one unsigned 8-bit value by another, you must zero-extend the numerator to 16 bits by loading the numerator into the AL register and then moving 0 into the AH register.
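The zero-extension step described above could look like the following sketch; num8 and den8 are hypothetical byte variables:

```masm
; Assumes num8 and den8 are byte variables (hypothetical names)
mov al, num8
mov ah, 0 ; Zero-extend AL into AX
div den8 ; AL = num8 / den8, AH = num8 mod den8

; The same zero-extension in a single instruction:
movzx ax, num8
div den8
```

With AH cleared, the quotient can never exceed 255, so the only remaining way for this division to fault is a divisor of 0.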
When dealing with signed integer values, you will need to sign-extend AL into AX, AX into DX, EAX into EDX, or RAX into RDX before executing idiv. To do so, use the cbw, cwd, cdq, or cqo instructions.
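A small sketch of the sign-extension step before idiv follows. The constants are illustrative, and a64 and b64 are hypothetical sqword variables:

```masm
mov al, -7
cbw ; AX = 0FFF9h (-7)
mov bl, 2
idiv bl ; AL = -3 (quotient), AH = -1 (remainder)

; Assumes a64 and b64 are sqword variables (hypothetical names)
mov rax, a64
cqo ; Sign-extend RAX into RDX:RAX
idiv b64 ; RAX = quotient, RDX = remainder
```

Note that idiv truncates the quotient toward 0, so the remainder takes the sign of the dividend.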
The x86-64’s divide instructions have one other issue: they can raise a fatal error. First, of course, you can attempt to divide a value by 0. Another problem is that the quotient may be too large to fit into the RAX, EAX, AX, or AL register. For example, the 16/8-bit division 8000h/2 produces the quotient 4000h with a remainder of 0. 4000h will not fit into 8 bits. In either case, the x86-64 generates a divide error exception, which usually means your program will crash. If this happens to you, chances are you didn’t sign- or zero-extend your numerator before executing the division operation. Because this error may cause your program to crash, you should be very careful about the values you select when using division.
The cmp Instruction
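One defensive pattern, sketched here under the assumption that dividend32 and divisor32 are dword variables and that the labels are free to use (all names are hypothetical), is to test the divisor before dividing:

```masm
; Assumes dividend32 and divisor32 are dword variables (hypothetical names)
mov ecx, divisor32
test ecx, ecx
jz divideByZero ; Skip the div if the divisor is 0
mov eax, dividend32
xor edx, edx ; EDX = 0, so the quotient always fits in EAX
div ecx ; EAX = quotient, EDX = remainder
jmp divDone
divideByZero:
; Report the error or substitute a default result here
divDone:
```

For signed division with a properly sign-extended dividend, the one case that still overflows is the most negative value divided by -1, which would need a separate check.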


The setcc Instructions
The setcc (set on condition) instructions set a single-byte operand (register or memory) to 0 or 1 depending on the values in the FLAGS register. These instructions store a 0 in the corresponding operand if the condition is false, and they store a 1 in the 8-bit operand if the condition is true.
setcc reg8
setcc mem8

The cmp instruction works synergistically with the setcc instructions. Immediately after a cmp operation, the processor flags provide information concerning the relative values of those operands. They allow you to see if one operand is less than, equal to, or greater than the other.
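As a sketch of cmp and setcc working together, the following fragment records three comparison results as 0/1 bytes; x32 and y32 are hypothetical dword variables:

```masm
; Assumes x32 and y32 are dword variables (hypothetical names)
mov eax, x32
cmp eax, y32
setl bl ; BL = 1 if x32 < y32 (signed comparison)
setb cl ; CL = 1 if x32 < y32 (unsigned comparison)
sete dl ; DL = 1 if x32 = y32
movzx ebx, bl ; Widen the 0/1 result to 32 bits if needed
```

The setcc instructions do not modify the flags, so several of them can follow a single cmp.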


The test Instruction
The test instruction computes the logical AND of its two operands and sets the condition code flags based on the result; it does not, however, store the result of the logical AND back into the destination operand.
test operand1, operand2
The test instruction sets the zero flag if the result of the logical AND operation is 0. It sets the sign flag if the HO bit of the result contains a 1. The test instruction always clears the carry and overflow flags.
The test instruction can also check whether all the bits in a specified set of bits contain 0. The instruction test al, 0fh sets the zero flag if and only if the LO 4 bits of AL all contain 0.
One important use of the test instruction is to check whether a register contains 0. The instruction test reg, reg, where both operands are the same register, will logically AND that register with itself. If the register contains 0, the result is 0 and the CPU will set the zero flag. However, if the register contains a nonzero value, logically ANDing that value with itself produces that same nonzero value, so the CPU clears the zero flag.
test eax, eax
setz bl ; BL is set to 1 if EAX contains 0
One major failing of the test instruction is that immediate (constant) operands can be no larger than 32 bits.
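A common workaround for the 32-bit immediate limit is to load the wide mask into a register first. The mask value and register choices in this sketch are illustrative assumptions:

```masm
mov rcx, 8000000000000001h ; 64-bit mask: HO bit and LO bit of a qword
test rax, rcx ; AND RAX with the mask, setting flags only
setnz bl ; BL = 1 if either masked bit is set in RAX
```

Only mov accepts a full 64-bit immediate, which is why the mask goes through RCX here.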
Arithmetic Expressions
; x = y + z:
mov eax, y
add eax, z
mov x, eax
; x = y - z:
mov eax, y
sub eax, z
mov x, eax
; x = y * z; (unsigned):
mov eax, y
mul z ; Don't forget this wipes out EDX
mov x, eax
; x = y * z; (signed):
mov eax, y
imul eax, z ; Does not affect EDX!
mov x, eax
; x = y div z; (unsigned div):
mov eax, y
xor edx, edx ; Zero-extend EAX into EDX
div z
mov x, eax
; x = y idiv z; (signed div):
mov eax, y
cdq ; Sign-extend EAX into EDX
idiv z
mov x, eax
; x = y % z; (unsigned remainder):
mov eax, y
xor edx, edx ; Zero-extend EAX into EDX
div z
mov x, edx ; Note that remainder is in EDX
; x = y % z; (signed remainder):
mov eax, y
cdq ; Sign-extend EAX into EDX
idiv z
mov x, edx ; Remainder is in EDX
; var1 = -var2;
mov eax, var2
neg eax
mov var1, eax
Commutative Operators


Assembly Tricks
Multiplying without mul or imul
In general, shifting an operand to the left n bits multiplies it by 2^n. You can multiply any value by a constant by using a series of shifts and additions or shifts and subtractions. For example, to multiply the AX register by 10, you need only multiply it by 8 and then add two times the original value. That is, 10 × AX = 8 × AX + 2 × AX.
shl ax, 1 ; Multiply AX by 2
mov bx, ax ; Save 2 * AX for later
shl ax, 2 ; Multiply AX by 8 (*4 really,
; but AX contains *2)
add ax, bx ; Add in AX * 2 to AX * 8 to get AX * 10
You can also use subtraction with shifts to perform a multiplication operation.
mov ebx, eax ; Save EAX * 1
shl eax, 3 ; EAX = EAX * 8
sub eax, ebx ; EAX * 8 - EAX * 1 is EAX * 7
You can also use the lea instruction to compute certain products. The trick is to use the scaled-index addressing modes.
lea eax, [ecx][ecx] ; EAX = ECX * 2
lea eax, [eax][eax * 2] ; EAX = EAX * 3
lea eax, [eax * 4] ; EAX = EAX * 4
lea eax, [ebx][ebx * 4] ; EAX = EBX * 5
lea eax, [eax * 8] ; EAX = EAX * 8
lea eax, [edx][edx * 8] ; EAX = EDX * 9
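The lea forms can be combined with an add or sub to reach other constants. The following sketch uses 64-bit address registers, which is my own choice rather than something the notes prescribe:

```masm
; Multiply by 10: (ECX * 5) * 2
lea eax, [rcx][rcx * 4] ; EAX = ECX * 5
add eax, eax ; EAX = ECX * 10

; Multiply by 7: (ECX * 8) - ECX
lea eax, [rcx * 8] ; EAX = ECX * 8
sub eax, ecx ; EAX = ECX * 7
```

Unlike shl, add, and sub, lea does not modify the flags.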
Implementing Modulo-N Counters with AND
If you want to implement a counter variable that counts up to 2^n – 1 and then resets to 0, use the following code, where n_bits is a mask with the LO n bits set (that is, the value 2^n – 1):
inc CounterVar
and CounterVar, n_bits
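For a concrete instance, here is a sketch with n = 4, assuming CounterVar is a dword variable (a hypothetical declaration):

```masm
; Assumes CounterVar is a dword variable (hypothetical declaration)
n_bits = 0fh ; Mask for n = 4: 2^4 - 1 = 15
inc CounterVar
and CounterVar, n_bits ; 15 + 1 = 10h, and 10h AND 0Fh = 0, so the counter wraps
```

This works only when the modulus is a power of 2; other moduli need a compare and reset instead.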
Floating Point Arithmetic
FPU Registers
The x87 FPUs add 14 registers to the x86-64: eight floating-point data registers, a control register, a status register, a tag register, an instruction pointer, a data pointer, and an opcode register.
The data registers handle all floating-point calculations. The control register contains bits that let you decide how the FPU handles certain degenerate cases, such as rounding of inexact computations. The status register is similar to the FLAGS register. The tag register contains several groups of bits that describe the state of each of the data registers. The instruction pointer, data pointer, and opcode registers contain state information about the last floating-point instruction executed.
FPU Data Registers
The FPUs provide eight 80-bit data registers organized as a stack. MASM refers to these registers as ST(0), ST(1), . . . ST(7).
The biggest difference between the FPU register set and the x86-64 register set is the stack organization. On the x86-64 CPU, the AX register is always the AX register, no matter what happens. On the FPU, however, the register set is an eight-element stack of 80-bit floating-point values.
ST(0) refers to the item on the top of stack, ST(1) refers to the next item on the stack, and so on. Many floating-point instructions push and pop items on the stack; therefore, ST(1) will refer to the previous contents of ST(0) after you push something onto the stack.
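The renumbering can be seen with the constant-load instructions fld1, fldz, and fldpi. This sketch assumes at least three free slots on the FPU stack:

```masm
fld1 ; ST(0) = 1.0
fldz ; ST(0) = 0.0, ST(1) = 1.0
fldpi ; ST(0) = pi, ST(1) = 0.0, ST(2) = 1.0
fstp st(0) ; Pop: ST(0) = 0.0, ST(1) = 1.0
fstp st(0) ; Pop: ST(0) = 1.0
fstp st(0) ; Pop: the stack is back to its starting depth
```

Each push moves every existing value one register number higher, and each pop moves them back down.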

FPU Control Register

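Bits 10 and 11 of the FPU control register provide rounding control: 00 rounds to nearest or even (the default), 01 rounds down, 10 rounds up, and 11 truncates toward 0.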

Bits 8 and 9 of the control register specify the precision during computation: 00 selects 24 bits, 10 selects 53 bits, and 11 selects 64 bits (01 is reserved). This capability is provided to allow compatibility with older software as required by the IEEE 754 standard.

Bits 0 to 5 are the exception masks. These are similar to the interrupt enable bit in the x86-64’s FLAGS register. If one of these bits contains a 1, the corresponding condition is ignored by the FPU. However, if a bit contains 0 and the corresponding condition occurs, then the FPU immediately generates an interrupt so the program can handle the degenerate condition.
Bit 0 corresponds to an invalid operation error, which generally occurs as the result of a programming error.
Bit 1 masks the denormalized interrupt that occurs whenever you try to manipulate denormalized values. Denormalized exceptions occur when you load arbitrary extended-precision values into the FPU or work with very small numbers just beyond the range of the FPU’s capabilities.
Bit 2 masks the zero-divide exception. If this bit contains 0, the FPU will generate an interrupt if you attempt to divide a nonzero value by 0. If you do not enable the zero-divide exception, the FPU will produce an infinity (with the appropriate sign) whenever you divide a nonzero value by 0. It’s probably a good idea to enable this exception by programming a 0 into this bit.
Bit 3 masks the overflow exception. The FPU will raise the overflow exception if a calculation overflows or if you attempt to store a value that is too large to fit into the destination operand.
Bit 4, if set, masks the underflow exception. Underflow occurs when the result is too small to fit in the destination operand. Like overflow, this exception can occur whenever you store a small extended-precision value into a smaller variable (single or double precision) or when the result of a computation is too small for extended precision.
Bit 5 controls whether the precision exception can occur. A precision exception occurs whenever the FPU produces an imprecise result, generally the result of an internal rounding operation.
Bits 6 and 7, and 12 to 15, in the control register are currently undefined and reserved for future use.
The FPU provides two instructions, fldcw (load control word) and fstcw (store control word), that let you load and store the contents of the control register, respectively. The single operand to these instructions must be a 16-bit memory location. The fldcw instruction loads the control register from the specified memory location. fstcw stores the control register into the specified memory location.
fldcw mem16
fstcw mem16
;;example
.data
fcw16 word ?
.
.
.
fstcw fcw16
mov ax, fcw16
and ax, 0f0ffh ; Clears bits 8-11
or ax, 0c00h ; Rounding control = %11, Precision = %00
mov fcw16, ax
fldcw fcw16
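The same read-modify-write pattern can unmask an exception. The following sketch enables the zero-divide exception, assuming fcw16 is a word variable as in the example above; clearing stale flags first with fclex is a precaution I have added, not a step from the notes:

```masm
fstcw fcw16 ; Save the current control word
and fcw16, 0fffbh ; Clear bit 2: unmask the zero-divide exception
fclex ; Clear any stale exception flags before unmasking
fldcw fcw16 ; Load the modified control word
```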
FPU Status Register
The 16-bit FPU status register provides the status of the FPU at the instant you read it.

Bits 0 through 5 are the exception flags. These bits appear in the same order as the exception masks in the control register. If the corresponding condition exists, the bit is set. These bits are independent of the exception masks in the control register.
Bit 6 indicates a stack fault. A stack fault occurs whenever a stack overflow or underflow occurs. When this bit is set, the C1 condition code bit determines whether there was a stack overflow (C1 = 1) or stack underflow (C1 = 0) condition.
Bit 7 of the status register is the exception summary bit. It is set if any unmasked exception flag in bits 0 through 5 is set. A program can test this bit to quickly determine if an error condition exists.
Bits 8, 9, 10, and 14 are the coprocessor condition code bits.
Bits 11 to 13 of the FPU status register provide the register number of the top of stack. During computations, the FPU adds (modulo 8) the logical register numbers supplied by the programmer to these 3 bits to determine the physical register number at runtime.
Bit 15 of the status register is the busy bit. It is set whenever the FPU is busy. This bit is a historical artifact from the days when the FPU was a separate chip.
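A sketch of reading individual fields out of the status register follows, using fstsw to copy it into AX. The choice of BL for the result is arbitrary:

```masm
fstsw ax ; Copy the FPU status register into AX
test ax, 40h ; Bit 6: stack fault?
setnz bl ; BL = 1 if a stack fault is flagged
shr ax, 11 ; Move bits 11-13 down to bits 0-2
and ax, 7 ; AX = physical register number of the top of stack
```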




FPU Data Types
The FPU supports seven data types: three integer types, a packed decimal type, and three floating-point types.

The integer type supports 16-, 32-, and 64-bit integers, although it is often faster to do the integer arithmetic by using the integer unit of the CPU. The packed decimal type provides an 18-digit signed decimal (BCD) integer. The primary purpose of the BCD format is to convert between strings and floating-point values. The remaining three data types are the 32-, 64-, and 80-bit floating-point data types.
The FPU generally stores values in a normalized format. The HO bit of the mantissa is always 1 when a floating-point number is normalized. In the 32- and 64-bit floating-point formats, the FPU does not actually store this bit; the FPU always assumes that it is 1. Therefore, 32- and 64-bit floating point numbers are always normalized. In the extended-precision 80-bit floating-point format, the FPU does not assume that the HO bit of the mantissa is 1; the HO bit of the mantissa appears as part of the string of bits.
Normalized values provide the greatest precision for a given number of bits. However, many values cannot be represented in normalized 80-bit form. These values are very close to 0, and representing them requires a mantissa whose HO bit is 0. The FPUs support a special 80-bit form for them, known as denormalized values.
Denormalized values allow the FPU to encode very small values it cannot encode using normalized values, but denormalized values offer fewer bits of precision than normalized values. Therefore, using denormalized values in a computation may introduce slight inaccuracy.

FPU Instruction Set
The fld Instruction
The fld instruction loads a 32-, 64-, or 80-bit floating-point value onto the stack. This instruction converts 32- and 64-bit operands to an 80-bit extended precision value before pushing the value onto the floating-point stack.
fld st(1)
fld real4_variable
fld real8_variable
fld real10_variable
fld real8 ptr [rbx]
There is no way to directly load a 32-bit integer register onto the floating point stack, even if that register contains a real4 value. To do so, you must first store the integer register into a memory location, and then push that memory location onto the FPU stack by using the fld instruction.
mov tempReal4, eax ; Save real4 value in EAX to memory
fld tempReal4 ; Push that value onto the FPU stack
The fst and fstp Instructions
The fst and fstp instructions copy the value on the top of the floating-point stack to another floating-point register or to a 32-, 64-, or (fstp only) 80-bit memory variable. By incrementing the TOS pointer in the status register after accessing the data in ST(0), the fstp instruction pops the value off the top of stack when moving it to the destination location. If the destination operand is a floating-point register, the FPU stores the value at the specified register number before popping the data off the top of stack.
Executing an fstp st(0) instruction effectively pops the data off the top of stack with no data transfer.
fst real4_variable
fst real8_variable
fst realArray[rbx * 8]
fst st(2)
fstp st(1)
The fxch Instruction
The fxch instruction exchanges the value on the top of stack with one of the other FPU registers. This instruction takes two forms: one with a single FPU register as an operand and the second without any operands.
Many FPU instructions (for example, fsqrt) operate only on the top of the register stack. If you want to perform such an operation on a value that is not on top, you can use the fxch instruction to swap that register with TOS, perform the desired operation, and then use fxch to swap the TOS with the original register.
fxch st(2)
fsqrt
fxch st(2)
The fild Instruction
The fild (integer load) instruction converts a 16-, 32-, or 64-bit two’s complement integer to the 80-bit extended-precision format and pushes the result onto the stack. This instruction always expects a single operand: the address of a word, double-word, or quad-word integer variable.
fild word_variable
fild dword_val[rcx * 4]
fild qword_variable
fild sqword ptr [rbx]
The fist, fistp, and fisttp Instructions
The fist, fistp, and fisttp instructions convert the 80-bit extended-precision value on the top of stack to a 16-, 32-, or (fistp/fisttp only) 64-bit integer and store the result into the memory variable specified by the single operand.
The fist instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating-point register stack. The fistp (Store Integer and Pop) and fisttp (Store Integer with Truncation and Pop) instructions pop the value off the floating-point register stack after storing the converted value. The fist and fistp instructions round according to the rounding control bits in the control register; fisttp always truncates toward 0.
fist word_var[rbx * 2]
fist dword_var
fisttp dword_var
fistp qword_var
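The difference between rounding and truncation shows up in a sketch like the following. It assumes the default round-to-nearest control word, a CPU with SSE3 (which introduced fisttp), and the hypothetical declarations listed in the comments:

```masm
; Assumes these declarations (hypothetical names):
; val real8 2.7
; rounded sdword ?
; chopped sdword ?
fld val ; ST(0) = 2.7
fist rounded ; rounded = 3 with the default round-to-nearest setting
fisttp chopped ; chopped = 2 (always truncates), then pops ST(0)
```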
The fbld and fbstp Instructions
The fbld and fbstp instructions load and store 80-bit BCD values. The fbld instruction converts a BCD value to its 80-bit extended-precision equivalent and pushes the result onto the stack.
The fbstp instruction pops the extended-precision real value on TOS, converts it to an 80-bit BCD value (rounding according to the bits in the floating-point control register), and stores the converted result at the address specified by the destination memory operand. There is no fbst instruction.
; Assuming fewer than eight items on the stack, the following
; code sequence is equivalent to an fbst instruction:
fld st(0)
fbstp tbyte_var
; The following example easily converts an 80-bit BCD value to
; a 64-bit integer:
fbld tbyte_var
fistp qword_var
The fadd, faddp, and fiadd Instructions
The fadd instruction, with no operands, is a synonym for faddp. The faddp instruction (also with no operands) pops the two values on the top of stack, adds them, and pushes their sum back onto the stack.
The next two forms of the fadd instruction, those with two FPU register operands, behave like the x86-64’s add instruction. They add the value in the source register operand to the value in the destination register operand. One of the register operands must be ST(0).
The faddp instruction with two operands adds ST(0) (which must always be the source operand) to the destination operand and then pops ST(0). The destination operand must be one of the other FPU registers.
The last two forms, fadd with a memory operand, add a 32- or 64-bit floating-point variable to the value in ST(0). These instructions convert the 32- or 64-bit operand to an 80-bit extended-precision value before performing the addition. Note that fadd does not allow an 80-bit memory operand.
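The fiadd instruction named in the heading works like fadd with a memory operand, except that the operand is a 16- or 32-bit integer. A minimal sketch, with hypothetical variable names:

```masm
; Assumes x and total are real8 variables and count is an sdword
; variable (hypothetical names)
fld x ; ST(0) = x
fiadd count ; ST(0) = x + count (count is converted to extended precision)
fstp total ; Store the sum and pop
```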
; Demonstration of various forms of fadd
option casemap:none
nl = 10
.const
fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0
fmtAdd1 byte "fadd: st0:%f", nl, 0
fmtAdd2 byte "faddp: st0:%f", nl, 0
fmtAdd3 byte "fadd st(1), st(0): st0:%f, st1:%f", nl, 0
fmtAdd4 byte "fadd st(0), st(1): st0:%f, st1:%f", nl, 0
fmtAdd5 byte "faddp st(1), st(0): st0:%f", nl, 0
fmtAdd6 byte "fadd mem: st0:%f", nl, 0
zero real8 0.0
one real8 1.0
two real8 2.0
minusTwo real8 -2.0
.data
st0 real8 0.0
st1 real8 0.0
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; Demonstrate various fadd instructions:
mov rax, qword ptr one
mov qword ptr st1, rax
mov rax, qword ptr minusTwo
mov qword ptr st0, rax
lea rcx, fmtSt0St1
call printFP
; fadd (same as faddp)
fld one
fld minusTwo
fadd ;Pops st(0)!
fstp st0
lea rcx, fmtAdd1
call printFP
; faddp:
fld one
fld minusTwo
faddp ;Pops st(0)!
fstp st0
lea rcx, fmtAdd2
call printFP
; fadd st(1), st(0)
fld one
fld minusTwo
fadd st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtAdd3
call printFP
; fadd st(0), st(1)
fld one
fld minusTwo
fadd st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtAdd4
call printFP
; faddp st(1), st(0)
fld one
fld minusTwo
faddp st(1), st(0)
fstp st0
lea rcx, fmtAdd5
call printFP
; fadd mem64
fld one
fadd two
fstp st0
lea rcx, fmtAdd6
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The fsub, fsubp, fsubr, fsubrp, fisub, and fisubr Instructions
With no operands, fsub is the same as fsubp (without operands). With no operands, the fsubp instruction pops ST(0) and ST(1) from the register stack, computes ST(1) – ST(0), and then pushes the difference back onto the stack. The fsubr and fsubrp instructions (reverse subtraction) operate in an identical fashion except they compute ST(0) – ST(1).
With two register operands (destination, source), the fsub instruction computes destination = destination – source. One of the two registers must be ST(0). With two registers as operands, the fsubp also computes destination = destination – source, and then it pops ST(0) off the stack after computing the difference. For the fsubp instruction, the source operand must be ST(0).
With two register operands, the fsubr and fsubrp instructions work in a similar fashion to fsub and fsubp, except they compute destination = source – destination.
The fsub mem32, fsub mem64, fsubr mem32, and fsubr mem64 instructions accept a 32- or 64-bit memory operand. They convert the memory operand to an 80-bit extended-precision value and subtract this from ST(0) (fsub) or subtract ST(0) from this value (fsubr) and store the result back into ST(0).
; Demonstration of various forms of fsub/fsubr
option casemap:none
nl = 10
.const
fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0
fmtSub1 byte "fsub: st0:%f", nl, 0
fmtSub2 byte "fsubp: st0:%f", nl, 0
fmtSub3 byte "fsub st(1), st(0): st0:%f, st1:%f", nl, 0
fmtSub4 byte "fsub st(0), st(1): st0:%f, st1:%f", nl, 0
fmtSub5 byte "fsubp st(1), st(0): st0:%f", nl, 0
fmtSub6 byte "fsub mem: st0:%f", nl, 0
fmtSub7 byte "fsubr st(1), st(0): st0:%f, st1:%f", nl, 0
fmtSub8 byte "fsubr st(0), st(1): st0:%f, st1:%f", nl, 0
fmtSub9 byte "fsubrp st(1), st(0): st0:%f", nl, 0
fmtSub10 byte "fsubr mem: st0:%f", nl, 0
zero real8 0.0
three real8 3.0
minusTwo real8 -2.0
.data
st0 real8 0.0
st1 real8 0.0
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; Demonstrate various fsub instructions:
mov rax, qword ptr three
mov qword ptr st1, rax
mov rax, qword ptr minusTwo
mov qword ptr st0, rax
lea rcx, fmtSt0St1
call printFP
; fsub (same as fsubp)
fld three
fld minusTwo
fsub ;Pops st(0)!
fstp st0
lea rcx, fmtSub1
call printFP
; fsubp:
fld three
fld minusTwo
fsubp ;Pops st(0)!
fstp st0
lea rcx, fmtSub2
call printFP
; fsub st(1), st(0)
fld three
fld minusTwo
fsub st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtSub3
call printFP
; fsub st(0), st(1)
fld three
fld minusTwo
fsub st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtSub4
call printFP
; fsubp st(1), st(0)
fld three
fld minusTwo
fsubp st(1), st(0)
fstp st0
lea rcx, fmtSub5
call printFP
; fsub mem64
fld three
fsub minusTwo
fstp st0
lea rcx, fmtSub6
call printFP
; fsubr st(1), st(0)
fld three
fld minusTwo
fsubr st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtSub7
call printFP
; fsubr st(0), st(1)
fld three
fld minusTwo
fsubr st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtSub8
call printFP
; fsubrp st(1), st(0)
fld three
fld minusTwo
fsubrp st(1), st(0)
fstp st0
lea rcx, fmtSub9
call printFP
; fsubr mem64
fld three
fsubr minusTwo
fstp st0
lea rcx, fmtSub10
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The fmul, fmulp, and fimul Instructions
The fmul and fmulp instructions multiply two floating-point values. The fimul instruction multiplies an integer and a floating-point value.
With no operands, fmul is a synonym for fmulp. The fmulp instruction, with no operands, will pop ST(0) and ST(1), multiply these values, and push their product back onto the stack. The fmul instructions with two register operands compute destination = destination × source. One of the registers (source or destination) must be ST(0).
; Demonstration of various forms of fmul
option casemap:none
nl = 10
.const
fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0
fmtMul1 byte "fmul: st0:%f", nl, 0
fmtMul2 byte "fmulp: st0:%f", nl, 0
fmtMul3 byte "fmul st(1), st(0): st0:%f, st1:%f", nl, 0
fmtMul4 byte "fmul st(0), st(1): st0:%f, st1:%f", nl, 0
fmtMul5 byte "fmulp st(1), st(0): st0:%f", nl, 0
fmtMul6 byte "fmul mem: st0:%f", nl, 0
zero real8 0.0
three real8 3.0
minusTwo real8 -2.0
.data
st0 real8 0.0
st1 real8 0.0
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; Demonstrate various fmul instructions:
mov rax, qword ptr three
mov qword ptr st1, rax
mov rax, qword ptr minusTwo
mov qword ptr st0, rax
lea rcx, fmtSt0St1
call printFP
; fmul (same as fmulp)
fld three
fld minusTwo
fmul ;Pops st(0)!
fstp st0
lea rcx, fmtMul1
call printFP
; fmulp:
fld three
fld minusTwo
fmulp ;Pops st(0)!
fstp st0
lea rcx, fmtMul2
call printFP
; fmul st(1), st(0)
fld three
fld minusTwo
fmul st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtMul3
call printFP
; fmul st(0), st(1)
fld three
fld minusTwo
fmul st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtMul4
call printFP
; fmulp st(1), st(0)
fld three
fld minusTwo
fmulp st(1), st(0)
fstp st0
lea rcx, fmtMul5
call printFP
; fmul mem64
fld three
fmul minusTwo
fstp st0
lea rcx, fmtMul6
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The fdiv, fdivp, fdivr, fdivrp, fidiv, and fidivr Instructions
With no operands, the fdiv instruction is a synonym for fdivp. The fdivp instruction with no operands computes ST(1) = ST(1) / ST(0). The fdivr and fdivrp instructions work in a similar fashion to fdiv and fdivp except that they compute ST(0) / ST(1) rather than ST(1) / ST(0).
With two register operands, these instructions compute the following quotients:
fdiv st(0), st(i) ; st(0) = st(0)/st(i)
fdiv st(i), st(0) ; st(i) = st(i)/st(0)
fdivp st(i), st(0) ; st(i) = st(i)/st(0) then pop st0
fdivr st(0), st(i) ; st(0) = st(i)/st(0)
fdivr st(i), st(0) ; st(i) = st(0)/st(i)
fdivrp st(i), st(0) ; st(i) = st(0)/st(i) then pop st0
The fdivp and fdivrp instructions also pop ST(0) after performing the division operation. The value for i in these two instructions is computed before popping ST(0).
; Demonstration of various forms of fdiv/fdivr
option casemap:none
nl = 10
.const
fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0
fmtDiv1 byte "fdiv: st0:%f", nl, 0
fmtDiv2 byte "fdivp: st0:%f", nl, 0
fmtDiv3 byte "fdiv st(1), st(0): st0:%f, st1:%f", nl, 0
fmtDiv4 byte "fdiv st(0), st(1): st0:%f, st1:%f", nl, 0
fmtDiv5 byte "fdivp st(1), st(0): st0:%f", nl, 0
fmtDiv6 byte "fdiv mem: st0:%f", nl, 0
fmtDiv7 byte "fdivr st(1), st(0): st0:%f, st1:%f", nl, 0
fmtDiv8 byte "fdivr st(0), st(1): st0:%f, st1:%f", nl, 0
fmtDiv9 byte "fdivrp st(1), st(0): st0:%f", nl, 0
fmtDiv10 byte "fdivr mem: st0:%f", nl, 0
three real8 3.0
minusTwo real8 -2.0
.data
st0 real8 0.0
st1 real8 0.0
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; Demonstrate various fdiv instructions:
mov rax, qword ptr three
mov qword ptr st1, rax
mov rax, qword ptr minusTwo
mov qword ptr st0, rax
lea rcx, fmtSt0St1
call printFP
; fdiv (same as fdivp)
fld three
fld minusTwo
fdiv ;Pops st(0)!
fstp st0
lea rcx, fmtDiv1
call printFP
; fdivp:
fld three
fld minusTwo
fdivp ;Pops st(0)!
fstp st0
lea rcx, fmtDiv2
call printFP
; fdiv st(1), st(0)
fld three
fld minusTwo
fdiv st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtDiv3
call printFP
; fdiv st(0), st(1)
fld three
fld minusTwo
fdiv st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtDiv4
call printFP
; fdivp st(1), st(0)
fld three
fld minusTwo
fdivp st(1), st(0)
fstp st0
lea rcx, fmtDiv5
call printFP
; fdiv mem64
fld three
fdiv minusTwo
fstp st0
lea rcx, fmtDiv6
call printFP
; fdivr st(1), st(0)
fld three
fld minusTwo
fdivr st(1), st(0)
fstp st0
fstp st1
lea rcx, fmtDiv7
call printFP
; fdivr st(0), st(1)
fld three
fld minusTwo
fdivr st(0), st(1)
fstp st0
fstp st1
lea rcx, fmtDiv8
call printFP
; fdivrp st(1), st(0)
fld three
fld minusTwo
fdivrp st(1), st(0)
fstp st0
lea rcx, fmtDiv9
call printFP
; fdivr mem64
fld three
fdivr minusTwo
fstp st0
lea rcx, fmtDiv10
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The fprem and fprem1 Instructions
The fprem and fprem1 instructions compute a partial remainder (a value that may require additional computation to produce the actual remainder). You should always use fprem1 in new software; therefore, we will discuss only fprem1 here, although you use fprem in an identical fashion.
fprem1 computes the partial remainder of ST(0) / ST(1). If the difference between the exponents of ST(0) and ST(1) is less than 64, fprem1 can compute the exact remainder in one operation. Otherwise, you will have to execute fprem1 two or more times to get the correct remainder value.
The C2 condition code bit determines when the computation is complete.
fprem1 does not pop the two operands off the stack; it leaves the partial remainder in ST(0) and the original divisor in ST(1) in case you need to compute another partial remainder to complete the result.
The fprem1 instruction sets the stack exception flag if there aren’t two values on the top of stack. It sets the underflow and denormal exception bits if the result is too small. It sets the invalid operation bit if the values on TOS are inappropriate for this operation. It sets the C2 condition code bit if the partial remainder operation is not complete (or on stack underflow). Finally, it loads C1, C3, and C0 with bits 0, 1, and 2 of the quotient, respectively.
; Compute z = x % y:
fld y
fld x
repeatLp:
fprem1
fstsw ax ; Get condition code bits into AX
test ah, 100b ; See if C2 is set (status word bit 10 = bit 2 of AH)
jnz repeatLp ; Repeat until C2 is clear
fstp z ; Store away the remainder
fstp st(0) ; Pop old y value
The frndint Instruction
The frndint instruction rounds the value on TOS to the nearest integer by using the rounding algorithm specified in the control register.
This instruction sets the stack exception flag if there is no value on the TOS (it will also clear C1 in this case). It sets the precision and denormal exception bits if a loss of precision occurred. It sets the invalid operation flag if the value on the TOS is not a valid number.
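As a minimal sketch of how the rounding mode affects frndint, the following sequence temporarily selects truncation (rounding control bits 10 and 11 of the control word set to 11b), rounds a value, and then restores the original setting. It relies on the fstcw and fldcw instructions covered later in these notes. The variables savedCW, truncCW, and x are hypothetical names chosen for illustration.

```asm
; Assumed (hypothetical) declarations:
;   savedCW  word  ?
;   truncCW  word  ?
;   x        real8 ?

        fstcw   savedCW         ; Save the current control word
        mov     ax, savedCW
        or      ax, 0C00h       ; Rounding control (bits 10-11) = 11b: truncate
        mov     truncCW, ax
        fldcw   truncCW         ; Activate truncation
        fld     x
        frndint                 ; ST(0) = x rounded toward zero
        fstp    x
        fldcw   savedCW         ; Restore the original rounding mode
```

Saving and restoring the control word keeps the change local, so code that runs afterward still sees the rounding mode it expects.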
The fabs Instruction
fabs computes the absolute value of ST(0) by clearing the mantissa sign bit of ST(0). It sets the stack exception bit and invalid operation bits if the stack is empty.
; Compute x = sqrt(abs(x)):
fld x
fabs
fsqrt
fstp x
The fchs Instruction
fchs changes the sign of ST(0)’s value by inverting the mantissa sign bit (this is the floating-point negation instruction). It sets the stack exception bit and invalid operation bits if the stack is empty.
; Compute x = -x if x is positive, x = x if x is negative.
; That is, force x to be a negative value.
fld x
fabs
fchs
fstp x
The fcom, fcomp, and fcompp Instructions
The fcom, fcomp, and fcompp instructions compare ST(0) to the specified operand and set the corresponding FPU condition code bits based on the result of the comparison.
With no operands, fcom, fcomp, and fcompp compare ST(0) against ST(1) and set the FPU flags accordingly. In addition, fcomp pops ST(0) off the stack, and fcompp pops both ST(0) and ST(1) off the stack.
With a single-register operand, fcom and fcomp compare ST(0) against the specified register; with a 32- or 64-bit memory operand, they compare ST(0) against the value in memory. fcomp also pops ST(0) after the comparison.
These instructions set C2 if the two operands are not comparable. These instructions set the stack fault bit if there aren’t two items on the top of the register stack. They set the denormalized exception bit if either or both operands are denormalized. They set the invalid operation flag if either or both operands are NaNs. These instructions always clear the C1 condition code.
Because all x86-64 CPUs support the fcomi and fcomip instructions (described in the next section), you should consider using those instructions, as they spare you from having to store the FPU status word into AX and then copy AH into the FLAGS register before testing the condition.
; Demonstration of fcom instructions
option casemap:none
nl = 10
.const
fcomFmt byte "fcom %f < %f is %d", nl, 0
fcomFmt2 byte "fcom(2) %f < %f is %d", nl, 0
fcomFmt3 byte "fcom st(1) %f < %f is %d", nl, 0
fcomFmt4 byte "fcom st(1) (2) %f < %f is %d", nl, 0
fcomFmt5 byte "fcom mem %f < %f is %d", nl, 0
fcomFmt6 byte "fcom mem (2) %f < %f is %d", nl, 0
fcompFmt byte "fcomp %f < %f is %d", nl, 0
fcompFmt2 byte "fcomp (2) %f < %f is %d", nl, 0
fcompFmt3 byte "fcomp st(1) %f < %f is %d", nl, 0
fcompFmt4 byte "fcomp st(1) (2) %f < %f is %d", nl, 0
fcompFmt5 byte "fcomp mem %f < %f is %d", nl, 0
fcompFmt6 byte "fcomp mem (2) %f < %f is %d", nl, 0
fcomppFmt byte "fcompp %f < %f is %d", nl, 0
fcomppFmt2 byte "fcompp (2) %f < %f is %d", nl, 0
three real8 3.0
zero real8 0.0
minusTwo real8 -2.0
.data
st0 real8 ?
st1 real8 ?
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
movzx r9, al
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; fcom demo
xor eax, eax
fld three
fld zero
fcom
fstsw ax
sahf
setb al
fstp st0
fstp st1
lea rcx, fcomFmt
call printFP
; fcom demo 2
xor eax, eax
fld zero
fld three
fcom
fstsw ax
sahf
setb al
fstp st0
fstp st1
lea rcx, fcomFmt2
call printFP
; fcom st(i) demo
xor eax, eax
fld three
fld zero
fcom st(1)
fstsw ax
sahf
setb al
fstp st0
fstp st1
lea rcx, fcomFmt3
call printFP
; fcom st(i) demo 2
xor eax, eax
fld zero
fld three
fcom st(1)
fstsw ax
sahf
setb al
fstp st0
fstp st1
lea rcx, fcomFmt4
call printFP
; fcom mem64 demo
xor eax, eax
fld three ;Never on stack so
fstp st1 ; copy for output
fld zero
fcom three
fstsw ax
sahf
setb al
fstp st0
lea rcx, fcomFmt5
call printFP
; fcom mem64 demo 2
xor eax, eax
fld zero ;Never on stack so
fstp st1 ; copy for output
fld three
fcom zero
fstsw ax
sahf
setb al
fstp st0
lea rcx, fcomFmt6
call printFP
; fcomp demo
xor eax, eax
fld zero
fld three
fst st0 ;Because this gets popped
fcomp
fstsw ax
sahf
setb al
fstp st1
lea rcx, fcompFmt
call printFP
; fcomp demo 2
xor eax, eax
fld three
fld zero
fst st0 ;Because this gets popped
fcomp
fstsw ax
sahf
setb al
fstp st1
lea rcx, fcompFmt2
call printFP
; fcomp demo 3
xor eax, eax
fld zero
fld three
fst st0 ;Because this gets popped
fcomp st(1)
fstsw ax
sahf
setb al
fstp st1
lea rcx, fcompFmt3
call printFP
; fcomp demo 4
xor eax, eax
fld three
fld zero
fst st0 ;Because this gets popped
fcomp st(1)
fstsw ax
sahf
setb al
fstp st1
lea rcx, fcompFmt4
call printFP
; fcomp demo 5
xor eax, eax
fld three
fstp st1
fld zero
fst st0 ;Because this gets popped
fcomp three
fstsw ax
sahf
setb al
lea rcx, fcompFmt5
call printFP
; fcomp demo 6
xor eax, eax
fld zero
fstp st1
fld three
fst st0 ;Because this gets popped
fcomp zero
fstsw ax
sahf
setb al
lea rcx, fcompFmt6
call printFP
; fcompp demo
xor eax, eax
fld zero
fst st1 ;Because this gets popped
fld three
fst st0 ;Because this gets popped
fcompp
fstsw ax
sahf
setb al
lea rcx, fcomppFmt
call printFP
; fcompp demo 2
xor eax, eax
fld three
fst st1 ;Because this gets popped
fld zero
fst st0 ;Because this gets popped
fcompp
fstsw ax
sahf
setb al
lea rcx, fcomppFmt2
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The fcomi and fcomip Instructions
The fcomi and fcomip instructions compare ST(0) to the specified operand and set the corresponding FLAGS condition code bits based on the result of the comparison.
; Demonstration of fcomi and fcomip instructions
option casemap:none
nl = 10
.const
fcomiFmt byte "fcomi %f < %f is %d", nl, 0
fcomiFmt2 byte "fcomi(2) %f < %f is %d", nl, 0
fcomipFmt byte "fcomip %f < %f is %d", nl, 0
fcomipFmt2 byte "fcomip (2) %f < %f is %d", nl, 0
three real8 3.0
zero real8 0.0
minusTwo real8 -2.0
.data
st0 real8 ?
st1 real8 ?
.code
externdef printf:proc
; printFP- Prints values of st0 and (possibly) st1.
; Caller must pass in ptr to fmtStr in RCX.
printFP proc
sub rsp, 40
; For varargs (e.g., printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.
mov rdx, qword ptr st0
mov r8, qword ptr st1
movzx r9, al
call printf
add rsp, 40
ret
printFP endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 48 ;Shadow storage
; Test to see if 0 < 3
; Note: st(0) contains zero, st(1) contains three
xor eax, eax
fld three
fld zero
fcomi st(0), st(1)
setb al
fstp st0
fstp st1
lea rcx, fcomiFmt
call printFP
; Test to see if 3 < 0
; Note: st(0) contains three, st(1) contains zero
xor eax, eax
fld zero
fld three
fcomi st(0), st(1)
setb al
fstp st0
fstp st1
lea rcx, fcomiFmt2
call printFP
; Test to see if 3 < 0
; Note: st(0) contains three, st(1) contains zero
xor eax, eax
fld zero
fld three
fst st0 ;Because this gets popped
fcomip st(0), st(1)
setb al
fstp st1
lea rcx, fcomipFmt
call printFP
; Test to see if 0 < 3
; Note: st(0) contains zero, st(1) contains three
xor eax, eax
fld three
fld zero
fst st0 ;Because this gets popped
fcomip st(0), st(1)
setb al
fstp st1
lea rcx, fcomipFmt2
call printFP
leave
ret ;Returns to caller
asmMain endp
end
The ftst Instruction
The ftst instruction compares the value in ST(0) against 0.0.
This instruction does not differentiate –0.0 from +0.0.
If the value in ST(0) is either of these values, ftst sets C3 to denote equality.
ftst
fstsw ax
sahf
sete al ; Set AL to 1 if TOS = 0.0
Constant Instructions
The FPU provides several instructions that let you load commonly used constants onto the FPU’s register stack.
These instructions set the stack fault, invalid operation, and C1 flags if a stack overflow occurs.
fldz ; Pushes +0.0
fld1 ; Pushes +1.0
fldpi ; Pushes pi (3.14159...)
fldl2t ; Pushes log2(10)
fldl2e ; Pushes log2(e)
fldlg2 ; Pushes log10(2)
fldln2 ; Pushes ln(2)
The f2xm1 Instruction
f2xm1 computes $2^{\mathrm{ST}(0)} - 1$.
The value in ST(0) must be in the range –1.0 to +1.0. If ST(0) is out of range, f2xm1 generates an undefined result but raises no exceptions.
The computed value replaces the value in ST(0).
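Here’s an example computing $10^i$ using the identity $10^i = 2^{i \cdot \log_2 10}$. It is valid only when $i \cdot \log_2 10$ falls within the range f2xm1 accepts.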
fld i
fldl2t
fmul
f2xm1
fld1
fadd
The fsin, fcos, and fsincos Instructions
These instructions pop the value off the top of the register stack and compute the sine, cosine, or both, and push the result(s) back onto the stack.
The fsincos instruction pushes the sine followed by the cosine of the original operand; hence, it leaves cos(ST(0)) in ST(0) and sin(ST(0)) in ST(1).
These instructions assume ST(0) specifies an angle in radians, and this angle must be in the range $-2^{63}$ to $+2^{63}$.
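The following sketch shows fsincos producing both results from one operand and the order in which they come off the stack. The real8 variables angle, sinVal, and cosVal are hypothetical names used only for illustration.

```asm
; Assumed (hypothetical) real8 variables: angle, sinVal, cosVal

        fld     angle           ; ST(0) = angle in radians
        fsincos                 ; ST(0) = cos(angle), ST(1) = sin(angle)
        fstp    cosVal          ; Store and pop the cosine
        fstp    sinVal          ; Store and pop the sine
```

Because the cosine ends up on top, it is stored first; the two fstp instructions also leave the FPU stack as it was before the sequence.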
The fptan Instruction
fptan computes the tangent of ST(0), replaces ST(0) with this value, and then pushes 1.0 onto the stack.
The fpatan Instruction
fpatan expects two values on the top of stack. It pops them and computes $\arctan(\mathrm{ST}(1) / \mathrm{ST}(0))$. The resulting value is the arctangent of the ratio on the stack expressed in radians.
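As an illustrative sketch, the sequence below computes the angle whose tangent is y / x. The numerator must be loaded first so that it ends up in ST(1). The real8 variables y, x, and theta are hypothetical.

```asm
; Assumed (hypothetical) real8 variables: y, x, theta

        fld     y               ; Numerator; becomes ST(1) after the next load
        fld     x               ; ST(0) = x, ST(1) = y
        fpatan                  ; ST(0) = arctan(y / x); one stack slot is freed
        fstp    theta           ; Store the angle (in radians) and pop
```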
The fyl2x Instruction
The fyl2x instruction computes $\mathrm{ST}(1) \times \log_2(\mathrm{ST}(0))$. The instruction itself has no operands, but expects two operands on the FPU stack in ST(1) and ST(0).
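One common use, sketched here under the assumption that x is positive, is computing a logarithm in another base by choosing the multiplier in ST(1). Loading log10(2) with fldlg2 yields log10(x), because log10(x) = log10(2) × log2(x). The real8 variables x and log10x are hypothetical.

```asm
; Assumed (hypothetical) real8 variables: x (must be > 0), log10x

        fldlg2                  ; ST(0) = log10(2)
        fld     x               ; ST(0) = x, ST(1) = log10(2)
        fyl2x                   ; ST(0) = log10(2) * log2(x) = log10(x)
        fstp    log10x          ; Store the result and pop
```

Substituting fldln2 for fldlg2 gives the natural logarithm by the same reasoning.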
The fyl2xp1 Instruction
fyl2xp1 computes $\mathrm{ST}(1) \times \log_2(\mathrm{ST}(0) + 1.0)$ from two operands on the FPU stack.
The finit and fninit Instructions
The finit and fninit instructions initialize the FPU for proper operation.
Your code should execute one of these instructions before executing any other FPU instructions. They initialize the control register to 37Fh, the status register to 0, and the tag word to 0FFFFh.
The difference between finit and fninit is that finit first checks for any pending floating-point exceptions before initializing the FPU; fninit does not.
finit
fninit
The fldcw and fstcw Instructions
The fldcw and fstcw instructions load the control word from a 16-bit memory location (fldcw) or store the control word to a 16-bit memory location (fstcw).
When you use fldcw to enable one of the exceptions, if the corresponding exception flag is already set, the FPU will generate an immediate interrupt before the CPU executes the next instruction. Therefore, you should use fclex to clear any pending exceptions before changing the FPU exception enable bits.
fldcw mem16
fstcw mem16
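The read-modify-write pattern for the control word can be sketched as follows. This example unmasks the zero-divide exception by clearing its mask bit (bit 2 of the control word), after first clearing any pending exception flags with fclex as recommended above. The variable ctrlWord is a hypothetical name.

```asm
; Assumed (hypothetical) declaration:
;   ctrlWord word ?

        fclex                       ; Clear pending exception flags first
        fstcw   ctrlWord            ; Fetch the current control word
        and     ctrlWord, 0FFFBh    ; Clear bit 2 (the zero-divide mask bit)
        fldcw   ctrlWord            ; Zero-divide exceptions are now enabled
```

Going through fstcw first preserves the other control word settings (precision and rounding control) instead of overwriting them with a hardcoded constant.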
The fclex and fnclex Instructions
The fclex and fnclex instructions clear all exception bits, the stack fault bit, and the busy flag in the FPU status register.
The difference between these instructions is the same as that between finit and fninit: fclex first checks for pending floating-point exceptions.
fclex
fnclex
The fstsw and fnstsw Instructions
These instructions store the FPU status word into a 16-bit memory location or the AX register:
fstsw ax
fnstsw ax
fstsw mem16
fnstsw mem16
These instructions are unusual in the sense that they can copy an FPU value into one of the x86-64 general-purpose registers (specifically, AX). The purpose is to allow the CPU to easily test the condition code register with the sahf instruction. The difference between fstsw and fnstsw is the same as that for fclex and fnclex.
Converting Floating-Point Expressions to Assembly Language
The FPU uses postfix notation (also called reverse Polish notation, or RPN) for arithmetic operations.
Postfix notation, as opposed to standard infix notation, places the operands before the operator. For example, the infix expression (y + z) * 2 becomes y z + 2 * in postfix.

As you can see, postfix is a convenient notation because it’s easy to translate this code into FPU instructions. Another advantage to postfix notation is that it doesn’t require any parentheses.

Examples in assembly:
; y z + 2 *
fld y
fld z
fadd
fld const2 ; const2 real8 2.0 in .data section
fmul
; y 2 * a b + -
fld y
fld const2 ; const2 real8 2.0 in .data section
fmul
fld a
fld b
fadd
fsub
; a b + c d + *
fld a
fld b
fadd
fld c
fld d
fadd
fmul
Conversion example with steps:

Postfix to assembly conversion example with steps
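As one possible worked conversion, take the infix expression w = (x + y) * (z - 2.0). First rewrite it in postfix as x y + z 2.0 - *. Then translate left to right: every operand becomes an fld, and every operator becomes the matching no-operand arithmetic instruction. The real8 variables w, x, y, and z are hypothetical; const2 is assumed to hold 2.0 as in the earlier examples.

```asm
; Infix:   w = (x + y) * (z - 2.0)
; Postfix: x y + z 2.0 - *
; Assumed (hypothetical) real8 variables: w, x, y, z; const2 holds 2.0
; Stack comments list ST(0) first.

        fld     x               ; Stack: x
        fld     y               ; Stack: y, x
        fadd                    ; Stack: (x + y)
        fld     z               ; Stack: z, (x + y)
        fld     const2          ; Stack: 2.0, z, (x + y)
        fsub                    ; Stack: (z - 2.0), (x + y)
        fmul                    ; Stack: (x + y) * (z - 2.0)
        fstp    w               ; Store the result; the stack is empty again
```

The no-operand fsub computes ST(1) - ST(0) and pops, which is why the operands can be loaded in the same order in which they appear in the postfix expression.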

SSE Floating-Point Arithmetic
Most modern programs favor the use of the SSE (and later) registers and instructions for floating-point operations over the x87 FPU, using only those x87 operations available exclusively on the x87.
The SSE instruction set supports two floating-point data types: 32-bit single-precision (Intel calls these scalar single operations) and 64-bit double-precision values (Intel calls these scalar double operations).
The SSE does not support the 80-bit extended-precision floating-point data types of the x87 FPU. If you need the extended-precision format, you’ll have to use the x87 FPU.
SSE MXCSR Register
The SSE MXCSR register is a 32-bit status and control register that controls SSE floating-point operations. Bits 16 to 31 are reserved and currently have no meaning.


Access to the SSE MXCSR register is via the following two instructions:
ldmxcsr mem32
stmxcsr mem32
The ldmxcsr instruction loads the MXCSR register from the specified 32-bit memory location. The stmxcsr instruction stores the current contents of the MXCSR register to the specified memory location.
By far, the most common use of these two instructions is to set the rounding mode. In typical programs using the SSE floating-point instructions, it is common to switch between the round-to-nearest and round-to-zero (truncate) modes.
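A minimal sketch of switching the rounding mode follows. The rounding control field occupies bits 13 and 14 of MXCSR (00b is round-to-nearest, 11b is round-to-zero), so setting both bits selects truncation. The variables savedCSR and truncCSR are hypothetical names.

```asm
; Assumed (hypothetical) declarations:
;   savedCSR dword ?
;   truncCSR dword ?

        stmxcsr savedCSR        ; Save the current MXCSR value
        mov     eax, savedCSR
        or      eax, 6000h      ; Rounding control (bits 13-14) = 11b: truncate
        mov     truncCSR, eax
        ldmxcsr truncCSR        ; Switch to round-to-zero

        ; ... SSE code that needs truncation goes here ...

        ldmxcsr savedCSR        ; Restore the original rounding mode
```

Both instructions accept only a memory operand, which is why the value passes through a variable rather than being loaded directly from EAX.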
SSE Floating-Point Move Instructions
The SSE instruction set provides two instructions to move floating-point values between XMM registers and memory: movss (move scalar single) and movsd (move scalar double).
movss xmmn, mem32
movss mem32, xmmn
movsd xmmn, mem64
movsd mem64, xmmn
As with mov and the general-purpose registers, the movss and movsd instructions move data between an appropriate memory location and one of the 16 XMM registers (XMM0 to XMM15).
For maximum performance, movss memory operands should appear at a double-word-aligned memory address, and movsd memory operands should appear at a quad-word-aligned memory address.
In addition to movss and movsd, which move floating-point values between XMM registers or between XMM registers and memory, two other SSE move instructions are useful for moving data between XMM and general-purpose registers: movd and movq.
movd reg32, xmmn
movd xmmn, reg32
movq reg64, xmmn
movq xmmn, reg64
These instructions also have a form that allows a source memory operand. However, you should use movss and movsd to move floating-point variables into XMM registers.
The movq and movd instructions are especially useful for copying XMM registers into 64-bit general-purpose registers prior to a call to printf().
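The printf case can be sketched as follows. Under the Windows x64 calling convention used throughout these notes, a variadic function such as printf expects a floating-point argument to be present in the matching general-purpose register, so the double is copied from XMM1 into RDX. The names fmtStr and x are hypothetical, and the sketch assumes the surrounding procedure has already reserved shadow storage and aligned the stack.

```asm
; Assumed (hypothetical) declarations:
;   fmtStr byte  "value = %f", 10, 0
;   x      real8 ?
; Assumes shadow storage and stack alignment are already set up
; by the enclosing procedure.

        lea     rcx, fmtStr     ; First argument: the format string
        movsd   xmm1, x         ; Second argument as a double in XMM1
        movq    rdx, xmm1       ; Copy the same 64 bits into RDX for printf
        call    printf
```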
SSE Floating-Point Arithmetic Instructions
;; The addsx, subsx, mulsx, and divsx instructions perform
;; the expected floating-point arithmetic operations.
addss xmmn, xmmn
addss xmmn, mem32
addsd xmmn, xmmn
addsd xmmn, mem64
subss xmmn, xmmn
subss xmmn, mem32
subsd xmmn, xmmn
subsd xmmn, mem64
mulss xmmn, xmmn
mulss xmmn, mem32
mulsd xmmn, xmmn
mulsd xmmn, mem64
divss xmmn, xmmn
divss xmmn, mem32
divsd xmmn, xmmn
divsd xmmn, mem64
;; The minsx instructions compute the minimum value
;; of the two operands, storing the minimum value into
;; the destination (first) operand.
minss xmmn, xmmn
minss xmmn, mem32
minsd xmmn, xmmn
minsd xmmn, mem64
;; The maxsx instructions do the same thing, but
;; compute the maximum of the two operands.
maxss xmmn, xmmn
maxss xmmn, mem32
maxsd xmmn, xmmn
maxsd xmmn, mem64
;; The sqrtsx instructions compute the square root of
;; the source (second) operand and store the result
;; into the destination (first) operand.
sqrtss xmmn, xmmn
sqrtss xmmn, mem32
sqrtsd xmmn, xmmn
sqrtsd xmmn, mem64
;; The rcpss instruction computes an approximate reciprocal of the
;; source, storing the result into the destination. Only a
;; single-precision form exists.
rcpss xmmn, xmmn
rcpss xmmn, mem32
;; The rsqrtss instruction computes an approximate reciprocal of the square root.
rsqrtss xmmn, xmmn
rsqrtss xmmn, mem32
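To show how these instructions combine, here is a short sketch that computes the length of a hypotenuse, sqrt(a² + b²), using scalar double operations. The real8 variables sideA, sideB, and hypLen are hypothetical names.

```asm
; Assumed (hypothetical) real8 variables: sideA, sideB, hypLen

        movsd   xmm0, sideA
        mulsd   xmm0, xmm0      ; XMM0 = sideA * sideA
        movsd   xmm1, sideB
        mulsd   xmm1, xmm1      ; XMM1 = sideB * sideB
        addsd   xmm0, xmm1      ; XMM0 = sum of the squares
        sqrtsd  xmm0, xmm0      ; XMM0 = square root of the sum
        movsd   hypLen, xmm0    ; Store the result
```

Unlike the x87 code earlier, there is no register stack to manage: each instruction names its operands explicitly, which makes the translation from an infix expression more direct.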
SSE Floating-Point Comparisons
The SSE floating-point comparisons work quite a bit differently from the integer and x87 FPU compare instructions. Rather than having a single generic instruction that sets flags, the SSE provides a set of condition-specific comparison instructions that store true (all 1 bits) or false (all 0 bits) into the destination operand.
cmpss xmmn, xmmm/mem32, imm8
cmpsd xmmn, xmmm/mem64, imm8
cmpeqss xmmn, xmmm/mem32
cmpltss xmmn, xmmm/mem32
cmpless xmmn, xmmm/mem32
cmpunordss xmmn, xmmm/mem32
cmpneqss xmmn, xmmm/mem32
cmpnltss xmmn, xmmm/mem32
cmpnless xmmn, xmmm/mem32
cmpordss xmmn, xmmm/mem32
cmpeqsd xmmn, xmmm/mem64
cmpltsd xmmn, xmmm/mem64
cmplesd xmmn, xmmm/mem64
cmpunordsd xmmn, xmmm/mem64
cmpneqsd xmmn, xmmm/mem64
cmpnltsd xmmn, xmmm/mem64
cmpnlesd xmmn, xmmm/mem64
cmpordsd xmmn, xmmm/mem64
The instructions without the third (immediate) operand are special pseudo-ops MASM provides that automatically supply the appropriate third operand. You can use the nlt form for ge and nle form for gt, assuming the operands are ordered.
The unordered comparison returns true if either (or both) operands are unordered, that is, if at least one of them is a NaN. Likewise, the ordered comparison returns true if both operands are ordered.
As noted, these instructions leave 0 or all 1 bits in the destination register to represent false or true.
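One way to act on the resulting mask, sketched below, is to copy it into a general-purpose register and test it for zero. The real8 variables x and y and the label xIsLess are hypothetical.

```asm
; Assumed (hypothetical) real8 variables: x, y; hypothetical label: xIsLess

        movsd   xmm0, x
        cmpltsd xmm0, y         ; Low 64 bits of XMM0 = all 1s if x < y, else 0
        movq    rax, xmm0       ; Copy the mask into RAX
        test    rax, rax
        jnz     xIsLess         ; Taken only when x < y

        ; ... code for x >= y, or for unordered operands ...

xIsLess:
```

Note that the comparison overwrites XMM0, so the original value of x must be reloaded if it is needed afterward. The mask can also be used without branching, for example as an operand to bitwise AND instructions.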

SSE Floating-Point Conversions
