Arithmetic
Chapter 6
Last updated
Chapter 6
Last updated
Several arithmetic operations require sign- or zero-extended values before the operation.
For general sign-extension operations, the x86-64 provides an extension of the mov instruction, movsx
In order to zero extend the register, move it to itself or use movzx
Single operand mul and imul are used for multiprecision work and generic multi-operand mul and imul instructions are used for generic use.
x86-64 always assumes the accumulator (AL, AX, EAX, or RAX) is the destination operand.
However, the generic imul is not a complete replacement for these two instructions; in addition to the number of operands, several differences exist.
There isn’t an 8×8-bit multi-operand imul instruction available
The generic imul instruction does not produce a 2×n-bit result, but truncates the result to n bits. That is, a 16×16-bit multiplication produces a 16-bit result. Likewise, a 32×32-bit multiplication produces a 32-bit result. These instructions set the carry and overflow flags if the result does not fit into the destination register.
The x86-64 divide instructions perform a 128/64-bit division, a 64/32-bit division, a 32/16-bit division, or a 16/8-bit division. There is no variant of the div or idiv instructions that allows you to divide a value by a constant. If you want to divide a value by a constant, you need to create a memory object (preferably in the .const section) that is initialized with the constant.
The div instruction is an unsigned division operation.
If the operand is an 8-bit operand, div divides the AX register by the operand, leaving the quotient in AL and the remainder (modulo) in AH. If the operand is a 16-bit quantity, the div instruction divides the 32-bit quantity in DX:AX by the operand, leaving the quotient in AX and the remainder in DX. With 32-bit operands, div divides the 64-bit value in EDX:EAX by the operand, leaving the quotient in EAX and the remainder in EDX. Finally, with 64-bit operands, div divides the 128-bit value in RDX:RAX by the operand, leaving the quotient in RAX and the remainder in RDX.
You cannot, on the x86-64, simply divide one unsigned 8-bit value by another. If the denominator is an 8-bit value, the numerator must be a 16-bit value. If you need to divide one unsigned 8-bit value by another, you must zero-extend the numerator to 16 bits by loading the numerator into the AL register and then moving 0 into the AH register.
When dealing with signed integer values, you will need to sign-extend AL into AX, AX into DX, EAX into EDX, or RAX into RDX before executing idiv. To do so, use the cbw, cwd, cdq, or cqo instructions.
The x86-64’s divide instructions have one other issue: you can get a fatal error when using this instruction. First, of course, you can attempt to divide a value by 0. Another problem is that the quotient may be too large to fit into the RAX, EAX, AX, or AL register. For example, the 16/8-bit division 8000h/2 produces the quotient 4000h with a remainder of 0. 4000h will not fit into 8 bits. If this happens, or you attempt to divide by 0, the x86-64 will generate a division exception or integer overflow exception. This usually means your program will crash. If this happens to you, chances are you didn’t sign- or zero-extend your numerator before executing the division operation. Because this error may cause your program to crash, you should be very careful about the values you select when using division.
The setcc (set on condition) instructions set a single-byte operand (register or memory) to 0 or 1 depending on the values in the FLAGS register. These instructions store a 0 in the corresponding operand if the condition is false, and they store a 1 in the 8-bit operand if the condition is true.
The cmp instruction works synergistically with the setcc instructions. Immediately after a cmp operation, the processor flags provide information concerning the relative values of those operands. They allow you to see if one operand is less than, equal to, or greater than the other.
The test instruction computes the logical AND of its two operands and sets the condition code flags based on the result; it does not, however, store the result of the logical AND back into the destination operand.
The test instruction sets the zero flag if the result of the logical AND operation is 0. It sets the sign flag if the HO bit of the result contains a 1. The test instruction always clears the carry and overflow flags.
The test instruction can also check whether all the bits in a specified set of bits contain 0. The instruction test al, 0fh sets the zero flag if and only if the LO 4 bits of AL all contain 0.
One important use of the test instruction is to check whether a register contains 0. The instruction test reg, reg, where both operands are the same register, will logically AND that register with itself. If the register contains 0, the result is 0 and the CPU will set the zero flag. However, if the register contains a nonzero value, logically ANDing that value with itself produces that same nonzero value, so the CPU clears the zero flag.
One major failing of the test instruction is that immediate (constant) operands can be no larger than 32 bits.
In general, shifting an operand to the left n bits multiplies it by 2n . You can multiply any value by a constant by using a series of shifts and additions or shifts and subtractions. For example, to multiply the AX register by 10, you need only multiply it by 8 and then add two times the original value. That is, 10 × AX = 8 × AX + 2 × AX.
You can also use subtraction with shifts to perform a multiplication operation.
You can also use the lea instruction to compute certain products. The trick is to use the scaled-index addressing modes.
If you want to implement a counter variable that counts up to 2n – 1 and then resets to 0, use the following code:
The x87 FPUs add 14 registers to the x86-64: eight floating-point data registers, a control register, a status register, a tag register, an instruction pointer, a data pointer, and an opcode register.
The data registers handle all floating point calculations, the control register contains bits that let you decide how the FPU handles certain degenerate cases like rounding off of inaccurate computations, the status register is similar to the FLAGS register, the tag register contains several groups of bits that determine the state of the FPU in each of the data registers, the instruction, data pointer, and opcode registers contain certain state information about the last floating-point instruction executed.
The FPUs provide eight 80-bit data registers organized as a stack. MASM refers to these registers as ST(0), ST(1), . . . ST(7).
The biggest difference between the FPU register set and the x86-64 register set is the stack organization. On the x86-64 CPU, the AX register is always the AX register, no matter what happens. On the FPU, however, the register set is an eight-element stack of 80-bit floating-point values.
ST(0) refers to the item on the top of stack, ST(1) refers to the next item on the stack, and so on. Many floating-point instructions push and pop items on the stack; therefore, ST(1) will refer to the previous contents of ST(0) after you push something onto the stack.
Bits 10 and 11 of the FPU control register provide rounding control
Bits 8 and 9 of the control register specify the precision during computation. This capability is provided to allow compatibility with older software as required by the IEEE 754 standard.
Bits 0 to 5 are the exception masks. These are similar to the interrupt enable bit in the x86-64’s FLAGS register. If these bits contain a 1, the corresponding condition is ignored by the FPU. However, if any bit contains 0s, and the corresponding condition occurs, then the FPU immediately generates an interrupt so the program can handle the degenerate condition.
Bit 0 corresponds to an invalid operation error, which generally occurs as the result of a programming error.
Bit 1 masks the denormalized interrupt that occurs whenever you try to manipulate denormalized values. Denormalized exceptions occur when you load arbitrary extended-precision values into the FPU or work with very small numbers just beyond the range of the FPU’s capabilities.
Bit 2 masks the zero-divide exception. If this bit contains 0, the FPU will generate an interrupt if you attempt to divide a nonzero value by 0. If you do not enable the zero-divide exception, the FPU will produce NaN whenever you perform a zero division. It’s probably a good idea to enable this exception by programming a 0 into this bit.
Bit 3 masks the overflow exception. The FPU will raise the overflow exception if a calculation overflows or if you attempt to store a value that is too large to fit into the destination operand.
Bit 4, if set, masks the underflow exception. Underflow occurs when the result is too small to fit in the destination operand. Like overflow, this exception can occur whenever you store a small extended-precision value into a smaller variable (single or double precision) or when the result of a computation is too small for extended precision.
Bit 5 controls whether the precision exception can occur. A precision exception occurs whenever the FPU produces an imprecise result, generally the result of an internal rounding operation.
Bits 6 and 7, and 12 to 15, in the control register are currently undefined and reserved for future use.
The FPU provides two instructions, fldcw (load control word) and fstcw (store control word), that let you load and store the contents of the control register, respectively. The single operand to these instructions must be a 16-bit memory location. The fldcw instruction loads the control register from the specified memory location. fstcw stores the control register into the specified memory location.
The 16-bit FPU status register provides the status of the FPU at the instant you read it
Bits 0 through 5 are the exception flags. These bits appear in the same order as the exception masks in the control register. If the corresponding condition exists, the bit is set. These bits are independent of the exception masks in the control register.
Bit 6 indicates a stack fault. A stack fault occurs whenever a stack overflow or underflow occurs. When this bit is set, the C1 condition code bit determines whether there was a stack overflow (C1 = 1) or stack underflow (C1 = 0) condition
Bit 7 of the status register is set if any error condition bit is set. It is the logical or of bits 0 through 5. A program can test this bit to quickly determine if an error condition exists.
Bits 8, 9, 10, and 14 are the coprocessor condition code bits.
Bits 11 to 13 of the FPU status register provide the register number of the top of stack. During computations, the FPU adds (modulo 8) the logical register numbers supplied by the programmer to these 3 bits to determine the physical register number at runtime.
Bit 15 of the status register is the busy bit. It is set whenever the FPU is busy. This bit is a historical artifact from the days when the FPU was a separate chip.
The FPU supports seven data types: three integer types, a packed decimal type, and three floating-point types.
The integer type supports 16-, 32-, and 64-bit integers, although it is often faster to do the integer arithmetic by using the integer unit of the CPU. The packed decimal type provides an 18-digit signed decimal (BCD) integer. The primary purpose of the BCD format is to convert between strings and floating-point values. The remaining three data types are the 32-, 64-, and 80-bit floating-point data types.
The FPU generally stores values in a normalized format. The HO bit of the mantissa is always 1 when a floating-point number is normalized. In the 32- and 64-bit floating-point formats, the FPU does not actually store this bit; the FPU always assumes that it is 1. Therefore, 32- and 64-bit floating point numbers are always normalized. In the extended-precision 80-bit floating-point format, the FPU does not assume that the HO bit of the mantissa is 1; the HO bit of the mantissa appears as part of the string of bits.
Normalized values provide the greatest precision for a given number of bits. However, many non-normalized values cannot be represented with the 80-bit format. These values are very close to 0 and represent the set of values whose mantissa HO bit is not 0. The FPUs support a special 80-bit form known as denormalized values.
Denormalized values allow the FPU to encode very small values it cannot encode using normalized values, but denormalized values offer fewer bits of precision than normalized values. Therefore, using denormalized values in a computation may introduce slight inaccuracy.
The fld instruction loads a 32-, 64-, or 80-bit floating-point value onto the stack. This instruction converts 32- and 64-bit operands to an 80-bit extended precision value before pushing the value onto the floating-point stack.
There is no way to directly load a 32-bit integer register onto the floating point stack, even if that register contains a real4 value. To do so, you must first store the integer register into a memory location, and then push that memory location onto the FPU stack by using the fld instruction.
The fst and fstp instructions copy the value on the top of the floating-point stack to another floating-point register or to a 32-, 64-, or (fstp only) 80-bit memory variable. By incrementing the TOS pointer in the status register after accessing the data in ST(0), the fstp instruction pops the value off the top of stack when moving it to the destination location. If the destination operand is a floating-point register, the FPU stores the value at the specified register number before popping the data off the top of stack.
Executing an fstp st(0) instruction effectively pops the data off the top of stack with no data transfer.
The fxch instruction exchanges the value on the top of stack with one of the other FPU registers. This instruction takes two forms: one with a single FPU register as an operand and the second without any operands.
Many FPU instructions (for example, fsqrt) operate only on the top of the register stack. If you want to perform such an operation on a value that is not on top, you can use the fxch instruction to swap that register with TOS, perform the desired operation, and then use fxch to swap the TOS with the original register.
The fild (integer load) instruction converts a 16-, 32-, or 64-bit two’s complement integer to the 80-bit extended-precision format and pushes the result onto the stack. This instruction always expects a single operand: the address of a word, double-word, or quad-word integer variable.
The fist, fistp, and fisttp instructions convert the 80-bit extended-precision variable on the top of stack to a 16-, 32-, or (fistp/fistpp only) 64-bit integer and store the result away into the memory variable specified by the single operand.
The fist instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating-point register stack. The fistp (Store Integer and Pop) and fisttp (Store Integer with Truncation and Pop) instructions pop the value off the floating point register stack after storing the converted value.
The fbld and fbstp instructions load and store 80-bit BCD values. The fbld instruction converts a BCD value to its 80-bit extended-precision equivalent and pushes the result onto the stack.
The fbstp instruction pops the extended-precision real value on TOS, converts it to an 80-bit BCD value (rounding according to the bits in the floating-point control register), and stores the converted result at the address specified by the destination memory operand. There is no fbst instruction.
The fadd instruction, with no operands, is a synonym for faddp. The faddp instruction (also with no operands) pops the two values on the top of stack, adds them, and pushes their sum back onto the stack.
The next two forms of the fadd instruction, those with two FPU register operands, behave like the x86-64’s add instruction. They add the value in the source register operand to the value in the destination register operand. One of the register operands must be ST(0).
The faddp instruction with two operands adds ST(0) (which must always be the source operand) to the destination operand and then pops ST(0). The destination operand must be one of the other FPU registers.
The last two forms, fadd with a memory operand, adds a 32- or 64-bit floating-point variable to the value in ST(0). This instruction will convert the 32- or 64-bit operands to an 80-bit extended-precision value before performing the addition. Note that this instruction does not allow an 80-bit memory operand.
With no operands, fsub is the same as fsubp (without operands). With no operands, the fsubp instruction pops ST(0) and ST(1) from the register stack, computes ST(1) – ST(0), and then pushes the difference back onto the stack. The fsubr and fsubrp instructions (reverse subtraction) operate in an identical fashion except they compute ST(0) – ST(1).
With two register operands (destination, source), the fsub instruction computes destination = destination – source. One of the two registers must be ST(0). With two registers as operands, the fsubp also computes destination = destination – source, and then it pops ST(0) off the stack after computing the difference. For the fsubp instruction, the source operand must be ST(0).
With two register operands, the fsubr and fsubrp instructions work in a similar fashion to fsub and fsubp, except they compute destination = source – destination.
The fsub mem32, fsub mem64, fsubr mem32, and fsubr mem64 instructions accept a 32- or 64-bit memory operand. They convert the memory operand to an 80-bit extended-precision value and subtract this from ST(0) (fsub) or subtract ST(0) from this value (fsubr) and store the result back into ST(0).
The fmul and fmulp instructions multiply two floating-point values. The fimul instruction multiples an integer and a floating-point value.
With no operands, fmul is a synonym for fmulp. The fmulp instruction, with no operands, will pop ST(0) and ST(1), multiply these values, and push their product back onto the stack. The fmul instructions with two register operands compute destination = destination × source. One of the registers (source or destination) must be ST(0).
With no operands, the fdiv instruction is a synonym for fdivp. The fdivp instruction with no operands computes ST(1) = ST(1) / ST(0). The fdivr and fdivrp instructions work in a similar fashion to fdiv and fdivp except that they compute ST(0) / ST(1) rather than ST(1) / ST(0).
With two register operands, these instructions compute the following quotients:
The fdivp and fdivrp instructions also pop ST(0) after performing the division operation. The value for i in these two instructions is computed before popping ST(0).
The fprem and fprem1 instructions compute a partial remainder (a value that may require additional computation to produce the actual remainder). You should always use fprem1 in new software; therefore, we will discuss only fprem1 here, although you use fprem in an identical fashion.
fprem1 computes the partial remainder of ST(0) / ST(1). If the difference between the exponents of ST(0) and ST(1) is less than 64, fprem1 can compute the exact remainder in one operation. Otherwise, you will have to execute fprem1 two or more times to get the correct remainder value.
The C2 condition code bit determines when the computation is complete.
fprem1 does not pop the two operands off the stack; it leaves the partial remainder in ST(0) and the original divisor in ST(1) in case you need to compute another partial product to complete the result.
The fprem1 instruction sets the stack exception flag if there aren’t two values on the top of stack. It sets the underflow and denormal exception bits if the result is too small. It sets the invalid operation bit if the values on TOS are inappropriate for this operation. It sets the C2 condition code bit if the partial remainder operation is not complete (or on stack underflow). Finally, it loads C1, C2, and C0 with bits 0, 1, and 2 of the quotient, respectively.
The frndint instruction rounds the value on TOS to the nearest integer by using the rounding algorithm specified in the control register.
This instruction sets the stack exception flag if there is no value on the TOS (it will also clear C1 in this case). It sets the precision and denormal exception bits if a loss of precision occurred. It sets the invalid operation flag if the value on the TOS is not a valid number.
fabs computes the absolute value of ST(0) by clearing the mantissa sign bit of ST(0). It sets the stack exception bit and invalid operation bits if the stack is empty.
fchs changes the sign of ST(0)’s value by inverting the mantissa sign bit (this is the floating-point negation instruction). It sets the stack exception bit and invalid operation bits if the stack is empty.
The fcom, fcomp, and fcompp instructions compare ST(0) to the specified operand and set the corresponding FPU condition code bits based on the result of the comparison.
With no operands, fcom, fcomp, and fcompp compare ST(0) against ST(1) and set the FPU flags accordingly. In addition, fcomp pops ST(0) off the stack, and fcompp pops both ST(0) and ST(1) off the stack.
With a single-register operand, fcom and fcomp compare ST(0) against the specified register. fcomp also pops ST(0) after the comparison.
These instructions set C2 if the two operands are not comparable. These instructions set the stack fault bit if there aren’t two items on the top of the register stack. They set the denormalized exception bit if either or both operands are denormalized. They set the invalid operation flag if either or both operands are NaNs. These instructions always clear the C1 condition code.
Because all x86-64 64-bit CPUs support the fcomi and fcomip instructions (described in the next section), you should consider using those instructions as they spare you from having to store the FPU status word into AX and then copy AH into the FLAGS register before testing the condition.
The fcomi and fcomip instructions compare ST(0) to the specified operand and set the corresponding FLAGS condition code bits based on the result of the comparison.
The ftst instruction compares the value in ST(0) against 0.0.
This instruction does not differentiate –0.0 from +0.0.
If the value in ST(0) is either of these values, ftst will set C3 to denote equality
The FPU provides several instructions that let you load commonly used constants onto the FPU’s register stack.
These instructions set the stack fault, invalid operation, and C1 flags if a stack overflow occurs.
The value in ST(0) must be in the range –1.0 to +1.0. If ST(0) is out of range, f2xm1 generates an undefined result but raises no exceptions.
The computed value replaces the value in ST(0).
These instructions pop the value off the top of the register stack and compute the sine, cosine, or both, and push the result(s) back onto the stack.
The fsincos instruction pushes the sine followed by the cosine of the original operand; hence, it leaves cos(ST(0)) in ST(0) and sin(ST(0)) in ST(1).
fptan computes the tangent of ST(0), replaces ST(0) with this value, and then pushes 1.0 onto the stack.
The finit and fninit instructions initialize the FPU for proper operation.
Your code should execute one of these instructions before executing any other FPU instructions. They initialize the control register to 37Fh, the status register to 0, and the tag word to 0FFFFh.
The difference between finit and fninit is that finit first checks for any pending floating-point exceptions before initializing the FPU; fninit does not.
The fldcw and fstcw instructions load the control word from a memory location (fldcw) or store the control word to a 16-bit memory location (fstcw).
When you use fldcw to turn on one of the exceptions, if the corresponding exception flag is set when you enable that exception, the FPU will generate an immediate interrupt before the CPU executes the next instruction. Therefore, you should use fclex to clear any pending interrupts before changing the FPU exception enable bits.
The fclex and fnclex instructions clear all exception bits, the stack fault bit, and the busy flag in the FPU status register.
The difference between these instructions is the same as that between finit and fninit: fclex first checks for pending floating-point exceptions.\
These instructions store the FPU status word into a 16-bit memory location or the AX register
These instructions are unusual in the sense that they can copy an FPU value into one of the x86-64 general-purpose registers (specifically, AX). The purpose is to allow the CPU to easily test the condition code register with the sahf instruction. The difference between fstsw and fnstsw is the same as that for fclex and fnclex.
The FPU uses postfix notation (also called reverse Polish notation, or RPN) for arithmetic operations.
Postfix notation, as opposed to standard infix notation, places the operands before the operator.
As you can see, postfix is a convenient notation because it’s easy to translate this code into FPU instructions. Another advantage to postfix notation is that it doesn’t require any parentheses.
Examples In assembly:
Conversion Example with steps-
Postfix to assembly conversion example with steps
Most modern programs favor the use of the SSE (and later) registers and instructions for floating-point operations over the x87 FPU, using only those x87 operations available exclusively on the x87.
The SSE instruction set supports two floating-point data types: 32-bit single-precision (Intel calls these scalar single operations) and 64-bit double-precision values (Intel calls these scalar double operations).
The SSE does not support the 80-bit extended-precision floating-point data types of the x87 FPU. If you need the extended-precision format, you’ll have to use the x87 FPU.
The SSE MXCSR register is a 32-bit status and control register that controls SSE floating-point operations. Bits 16 to 32 are reserved and currently have no meaning.
Access to the SSE MXCSR register is via the following two instructions
The ldmxcsr instruction loads the MXCSR register from the specified 32-bit memory location. The stmxcsr instruction stores the current contents of the MXCSR register to the specified memory location.
By far, the most common use of these two instructions is to set the rounding mode. In typical programs using the SSE floating-point instructions, it is common to switch between the round-to-nearest and round-to-zero (truncate) modes.
The SSE instruction set provides two instructions to move floating-point values between XMM registers and memory: movss (move scalar single) and movsd (move scalar double).
As for the standard general-purpose registers, the movss and movsd instructions move data between an appropriate memory location and one of the 16 XMM registers (XMM0 to XMM15).
For maximum performance, movss memory operands should appear at a double-word-aligned memory address, and movsd memory operands should appear at a quad-word-aligned memory address.
In addition to the movss and movsd instructions that move floating-point values between XMM registers or XMM registers and memory, you’ll find a couple of other SSE move instructions useful that move data between XMM and general-purpose registers, movd and movq:
These instructions also have a form that allows a source memory operand. However, you should use movss and movsd to move floating-point variables into XMM registers.
The movq and movd instructions are especially useful for copying XMM registers into 64-bit general-purpose registers prior to a call to printf().
The SSE floating-point comparisons work quite a bit differently from the integer and x87 FPU compare instructions. Rather than having a single generic instruction that sets flags the SSE provides a set of condition-specific comparison instructions that store true (all 1 bits) or false (all 0 bits) into the destination operand.
The instructions without the third (immediate) operand are special pseudo-ops MASM provides that automatically supply the appropriate third operand. You can use the nlt form for ge and nle form for gt, assuming the operands are ordered.
The unordered comparison returns true if either (or both) operands are unordered. Likewise, the ordered comparison returns true if both operands are ordered.
As noted, these instructions leave 0 or all 1 bits in the destination register to represent false or true.
f2xm1 computes
Here’s an example computing 10^i using the identity .
These instructions assume ST(0) specifies an angle in radians, and this angle must be in the range
fpatan expects two values on the top of stack. It pops them and computes . The resulting value is the arctangent of the ratio on the stack expressed in radians.
The fyl2x instruction computes . The instruction itself has no operands, but expects two operands on the FPU stack in ST(1) and ST(0).
fyl2xp1 computes from two operands on the FPU stack.