Advanced Arithmetic
Chapter 8
Extended-Precision Addition

The idea is to break a larger operation into a sequence of smaller ones. Since the x86-64 can add at most 64 bits at a time (using general-purpose registers), the operation must proceed in blocks of 64 bits or fewer. For a 128-bit addition, add the LO quad words with the add instruction, then add the HO quad words with adc (add with carry), which also adds in the carry produced by the LO addition.
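A minimal sketch of this sequence for Z = X + Y, assuming X, Y, and Z are oword variables in the data section (the procedure name add128 is hypothetical):

```asm
        .data
X       oword   ?
Y       oword   ?
Z       oword   ?

        .code
add128  proc
        mov     rax, qword ptr X        ; LO qword of X
        add     rax, qword ptr Y        ; Add LO qwords; carry out lands in CF
        mov     qword ptr Z, rax        ; mov does not change the flags
        mov     rax, qword ptr X[8]     ; HO qword of X
        adc     rax, qword ptr Y[8]     ; Add HO qwords plus the carry
        mov     qword ptr Z[8], rax
        ret
add128  endp
        end
```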

Remember, X, Y, and Z are oword objects (128 bits), and an instruction of the form mov rax, X would attempt to load a 128-bit value into a 64-bit register. To load only the LO 64 bits, use the qword ptr operator to coerce the symbols X, Y, and Z to 64 bits. To load the HO qwords, use address expressions of the form X[8], along with the qword ptr operator, because the x86-64 memory space addresses bytes, and it takes 8 consecutive bytes to form a quad word.
Extended-Precision Subtraction
Just as it does addition, the x86-64 performs multi-byte subtraction the same way you would manually, except it subtracts whole bytes, words, double words, or quad words at a time rather than decimal digits.
You use the sub instruction on the LO byte, word, double word, or quad word and the sbb (subtract with borrow) instruction on the high-order values.
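A minimal sketch of a 128-bit subtraction, Difference = Left - Right, assuming three oword variables with these hypothetical names:

```asm
        .data
Left        oword   ?
Right       oword   ?
Difference  oword   ?

        .code
sub128  proc
        mov     rax, qword ptr Left         ; LO qword
        sub     rax, qword ptr Right        ; Subtract LO qwords; borrow lands in CF
        mov     qword ptr Difference, rax   ; mov does not change the flags
        mov     rax, qword ptr Left[8]      ; HO qword
        sbb     rax, qword ptr Right[8]     ; Subtract HO qwords and the borrow
        mov     qword ptr Difference[8], rax
        ret
sub128  endp
        end
```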
Extended-Precision Comparisons
You need to look at both bytes of a pair of values only if the HO bytes are equal. In all other cases, comparing the HO bytes tells you everything you need to know about the values. This is true for any number of bytes, not just two. The following code compares two signed 128-bit integers by comparing their HO quad words first and comparing their LO quad words only if the HO quad words are equal:
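A minimal sketch of such a comparison, assuming two oword variables named QW1 and QW2 (hypothetical names) and a hypothetical procedure that returns -1, 0, or 1 in EAX. Only the HO quad words carry the sign, so this sketch compares the LO quad words as unsigned values:

```asm
        .data
QW1     oword   ?
QW2     oword   ?

        .code
cmp128  proc
        mov     rax, qword ptr QW1[8]   ; Compare HO qwords as signed values
        cmp     rax, qword ptr QW2[8]
        jg      isGreater
        jl      isLess

        mov     rax, qword ptr QW1      ; HO qwords equal: the LO qwords decide,
        cmp     rax, qword ptr QW2      ; and they carry no sign bit
        ja      isGreater
        jb      isLess

        xor     eax, eax                ; QW1 = QW2
        ret

isGreater:
        mov     eax, 1                  ; QW1 > QW2
        ret

isLess:
        mov     eax, -1                 ; QW1 < QW2
        ret
cmp128  endp
        end
```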
To compare unsigned values, use the ja and jb instructions in place of jg and jl.
To generalize the preceding code for objects larger than 128 bits, start the comparison with the objects' HO quad words and work your way down to their LO quad words, as long as the corresponding quad words are equal.
Extended-Precision Multiplication

The x86-64 does extended-precision multiplication in the same manner except that it works with bytes, words, double words, and quad words rather than digits.
Probably the most important thing to remember when performing an extended-precision multiplication is that you must also perform an extended-precision addition at the same time. Adding up all the partial products requires several additions.

The following code demonstrates how to multiply two 64-bit values (producing a 128-bit result) by using 32-bit instructions. Technically, you can do a 64-bit multiplication with a single instruction, but this example demonstrates a method you can easily extend to 128 bits by using the x86-64 64-bit registers rather than the 32-bit registers.
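A minimal sketch of this approach, assuming qword variables Multiplier and Multiplicand and an oword variable Product (the variable and procedure names are hypothetical). It forms four 32x32-bit partial products and adds them as it goes, keeping the carry out of the third dword so it is not lost:

```asm
        .data
Multiplier      qword   ?
Multiplicand    qword   ?
Product         oword   ?

        .code
mul64x64 proc
; LO dword of Multiplier times both dwords of Multiplicand
        mov     eax, dword ptr Multiplier
        mul     dword ptr Multiplicand          ; EDX:EAX = LO * LO
        mov     dword ptr Product, eax          ; Dword 0 of the result is final
        mov     ecx, edx                        ; Carry into dword 1

        mov     eax, dword ptr Multiplier
        mul     dword ptr Multiplicand[4]       ; EDX:EAX = LO * HO
        add     eax, ecx
        adc     edx, 0
        mov     r8d, eax                        ; Partial sum for dword 1
        mov     ecx, edx                        ; Partial sum for dword 2

; HO dword of Multiplier times both dwords of Multiplicand
        mov     eax, dword ptr Multiplier[4]
        mul     dword ptr Multiplicand          ; EDX:EAX = HO * LO
        add     eax, r8d
        mov     dword ptr Product[4], eax       ; Dword 1 is final (mov keeps CF)
        adc     ecx, edx                        ; Dword 2 partial sum; may carry out
        mov     r9d, 0                          ; mov keeps CF
        adc     r9d, 0                          ; R9D = carry destined for dword 3

        mov     eax, dword ptr Multiplier[4]
        mul     dword ptr Multiplicand[4]       ; EDX:EAX = HO * HO
        add     eax, ecx
        adc     edx, r9d
        mov     dword ptr Product[8], eax
        mov     dword ptr Product[12], edx
        ret
mul64x64 endp
        end
```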
The code works only for unsigned operands. To multiply two signed values, you must note the signs of the operands before the multiplication, take the absolute value of the two operands, do an unsigned multiplication, and then adjust the sign of the resulting product based on the signs of the original operands.
Extended-Precision Division
The general algorithm is as follows:
1. Move the HO quad word of the dividend into RAX and zero-extend it into RDX.
2. Divide by the divisor.
3. Store the value in RAX into the corresponding qword position of the quotient result variable (position of the dividend qword loaded into RAX prior to the division).
4. Load RAX with the next-lower quad word in the dividend, without modifying RDX.
5. Repeat steps 2 to 4 until you've processed all the quad words in the dividend.
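The steps above can be sketched for a 128-bit dividend and a 64-bit divisor as follows. The variable and procedure names are hypothetical, the operands are treated as unsigned, and a divisor of 0 raises a divide exception:

```asm
        .data
Dividend    oword   ?
Divisor     qword   ?
Quotient    oword   ?
Remainder   qword   ?

        .code
div128by64 proc
        mov     rax, qword ptr Dividend[8]  ; HO qword of the dividend
        xor     edx, edx                    ; Zero-extend it into RDX
        div     Divisor                     ; RAX = HO quotient, RDX = remainder
        mov     qword ptr Quotient[8], rax

        mov     rax, qword ptr Dividend     ; Next-lower qword; RDX is untouched
        div     Divisor                     ; Divides RDX:RAX by the divisor
        mov     qword ptr Quotient, rax
        mov     Remainder, rdx              ; Final remainder
        ret
div128by64 endp
        end
```

Leaving RDX alone between the two div instructions is what carries the remainder of the HO division into the LO division.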
Generic N-bit by M-bit Division
To use a divisor larger than 64 bits, you have to implement the division by using a shift-and-subtract strategy, which works but is very slow.

But this method is even easier in binary.

An example of its implementation in MASM:
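A minimal sketch of shift-and-subtract division for unsigned 128-bit operands, assuming four oword variables with hypothetical names. Each pass shifts the next dividend bit into the remainder, and if the divisor fits, subtracts it and records a 1 bit in the quotient:

```asm
        .data
Dividend    oword   ?
Divisor     oword   ?
Quotient    oword   ?
Remainder   oword   ?

        .code
div128  proc
        mov     r8, qword ptr Dividend      ; R9:R8 = dividend, becomes quotient
        mov     r9, qword ptr Dividend[8]
        xor     r10d, r10d                  ; R11:R10 = remainder, starts at 0
        xor     r11d, r11d
        mov     ecx, 128                    ; One pass per dividend bit

repeatLp:
        shl     r8, 1                       ; Shift remainder:dividend left 1 bit
        rcl     r9, 1                       ; as a single 256-bit quantity
        rcl     r10, 1
        rcl     r11, 1

        cmp     r11, qword ptr Divisor[8]   ; Unsigned compare, HO qword first
        ja      goesInto
        jb      tryNext
        cmp     r10, qword ptr Divisor
        jb      tryNext

goesInto:
        sub     r10, qword ptr Divisor      ; Remainder = remainder - divisor
        sbb     r11, qword ptr Divisor[8]
        inc     r8                          ; Bit 0 is 0 after shl, so this sets it

tryNext:
        dec     ecx
        jnz     repeatLp

        mov     qword ptr Quotient, r8
        mov     qword ptr Quotient[8], r9
        mov     qword ptr Remainder, r10
        mov     qword ptr Remainder[8], r11
        ret
div128  endp
        end
```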
To handle division by 0, check the divisor against 0 prior to running this code and return an appropriate error code if the divisor is 0. Dealing with signed values is the same as the earlier division algorithm: note the signs, take the operands' absolute values, do the unsigned division, and then fix the sign afterward.
You can use the following technique to boost the performance of this division by a fair amount. Check whether the divisor fits in 64 bits. Often, even though the divisor is a 128-bit variable, the value itself fits into 64 bits (that is, the HO quad word of Divisor is 0), and you can use the div instruction, which is much faster. The improved algorithm is a bit more complex because you first have to compare the HO quad word against 0, but on average it runs much faster while remaining capable of dividing any pair of values.
Extended-Precision Negation Operations
The neg instruction doesn't provide a generic extended-precision form. However, a negation is equivalent to subtracting a value from 0, so we can easily simulate an extended-precision negation by using the sub and sbb instructions.
The following code provides a simple way to negate a (320-bit) value by subtracting that value from 0, using an extended-precision subtraction:
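A minimal sketch of this negation, assuming the 320-bit value is stored as five quad words, LO quad word first, in a hypothetical variable named Value:

```asm
        .data
Value   qword   5 dup (?)       ; 320-bit value, LO qword first

        .code
neg320  proc
        xor     eax, eax        ; RAX = 0 (flags do not matter yet)
        sub     rax, Value      ; 0 - LO qword; borrow lands in CF
        mov     Value, rax

        mov     eax, 0          ; RAX = 0 again; mov keeps CF, xor would clear it
        sbb     rax, Value[8]
        mov     Value[8], rax

        mov     eax, 0
        sbb     rax, Value[16]
        mov     Value[16], rax

        mov     eax, 0
        sbb     rax, Value[24]
        mov     Value[24], rax

        mov     eax, 0
        sbb     rax, Value[32]  ; HO qword
        mov     Value[32], rax
        ret
neg320  endp
        end
```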
A slightly more efficient way to negate smaller values (128 bits) uses a combination of neg and sbb instructions. This technique uses the fact that neg subtracts its operand from 0. In particular, it sets the flags the same way the sub instruction would if you subtracted the destination value from 0. This code takes the following form (assuming you want to negate the 128-bit value in RDX:RAX):
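A sketch of that sequence, wrapped in a hypothetical procedure that negates the value held in RDX:RAX:

```asm
        .code
neg128  proc
        neg     rdx             ; Negate the HO qword
        neg     rax             ; Negate the LO qword; CF = 1 unless RAX was 0
        sbb     rdx, 0          ; Subtract that borrow from the HO qword
        ret
neg128  endp
        end
```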
Extended-Precision AND Operations
Performing an n-byte AND operation is easy: simply AND the corresponding bytes between the two operands, saving the result.
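A minimal sketch of a 128-bit AND done one quad word at a time, assuming oword variables with the hypothetical names Source1, Source2, and Dest:

```asm
        .data
Source1 oword   ?
Source2 oword   ?
Dest    oword   ?

        .code
and128  proc
        mov     rax, qword ptr Source1      ; LO qwords
        and     rax, qword ptr Source2
        mov     qword ptr Dest, rax

        mov     rax, qword ptr Source1[8]   ; HO qwords
        and     rax, qword ptr Source2[8]
        mov     qword ptr Dest[8], rax
        ret
and128  endp
        end
```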
To extend this technique to any number of qwords, logically AND the corresponding bytes, words, double words, or quad words together in the operands.
This sequence sets the flags according to the value of the last and operation. If you AND the HO quad words last, this sets all but the zero flag correctly. If you need to test the zero flag after this sequence, logically OR the two resulting quad words together (or otherwise compare them both against 0).
You can also use the XMM and YMM registers to perform extended-precision logical operations (up to 256 bits at a time).
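A minimal sketch of a 128-bit AND in a single XMM operation, assuming oword variables with the hypothetical names Source1, Source2, and Dest. It uses movdqu for the loads and the store so the variables need not be 16-byte aligned:

```asm
        .data
Source1 oword   ?
Source2 oword   ?
Dest    oword   ?

        .code
and128xmm proc
        movdqu  xmm0, xmmword ptr Source1   ; Unaligned 128-bit loads
        movdqu  xmm1, xmmword ptr Source2
        pand    xmm0, xmm1                  ; 128-bit bitwise AND
        movdqu  xmmword ptr Dest, xmm0
        ret
and128xmm endp
        end
```

Note that pand does not modify the general-purpose flags, so a zero test still needs separate instructions.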
Extended-Precision OR Operations
Multi-byte logical OR operations are performed in the same way as multi-byte AND operations. You OR the corresponding bytes in the two operands together.
As in the previous example, this does not set the zero flag properly for the entire operation. If you need to test the zero flag after an extended-precision OR, you must logically OR all the resulting quad words together.
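A minimal sketch of a 128-bit OR that also leaves a meaningful zero flag, assuming oword variables with the hypothetical names Source1, Source2, and Dest. The final or combines both result quad words, so ZF is set only if the whole 128-bit result is 0:

```asm
        .data
Source1 oword   ?
Source2 oword   ?
Dest    oword   ?

        .code
or128   proc
        mov     rax, qword ptr Source1      ; LO qwords
        or      rax, qword ptr Source2
        mov     qword ptr Dest, rax

        mov     rdx, qword ptr Source1[8]   ; HO qwords
        or      rdx, qword ptr Source2[8]
        mov     qword ptr Dest[8], rdx

        or      rax, rdx                    ; ZF = 1 only if both halves are 0
        ret                                 ; ret leaves the flags for the caller
or128   endp
        end
```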
Extended-Precision XOR Operations
As with other logical operations, an extended-precision XOR operation exclusive-ORs the corresponding bytes in the two operands to obtain the extended-precision result.
The comment about the zero flag in the previous two sections, as well as the comment about the XMM and YMM registers, applies here.
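A minimal sketch of a 128-bit XOR that updates its destination in place, assuming oword variables with the hypothetical names Source and Dest:

```asm
        .data
Source  oword   ?
Dest    oword   ?

        .code
xor128  proc
        mov     rax, qword ptr Source
        xor     qword ptr Dest, rax         ; LO qwords
        mov     rax, qword ptr Source[8]
        xor     qword ptr Dest[8], rax      ; HO qwords
        ret
xor128  endp
        end
```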
Extended-Precision Shift Operations
Extended-Precision Shift Left

To accomplish this with machine instructions, we must first shift the LO qword to the left (for example, using the shl instruction) and capture the output from bit 63 (conveniently, the carry flag does this for us). We must then shift this bit into the LO bit of the HO qword while simultaneously shifting all the other bits to the left.
For example, to shift the 128-bit quantity in RDX:RAX one position to the left, shift RAX with shl and then rotate the carry into RDX with rcl.
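A sketch of that two-instruction sequence, wrapped in a hypothetical procedure:

```asm
        .code
shl128by1 proc
        shl     rax, 1          ; Bit 63 of RAX moves into CF
        rcl     rdx, 1          ; CF moves into bit 0 of RDX
        ret
shl128by1 endp
        end
```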
To perform a shift left on an operand larger than 128 bits, use additional rcl instructions. An extended-precision shift-left operation always starts with the least-significant quad word, and each succeeding rcl instruction operates on the next-most-significant quad word.
If you need to shift your data by 2 or more bits, you can place the instructions in a loop to repeat them a certain number of times.
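A minimal sketch of such a loop, assuming the value is in RDX:RAX and the bit count is in ECX (this register assignment and the procedure name are assumptions for illustration):

```asm
        .code
shl128  proc
        test    ecx, ecx        ; Nothing to do for a count of 0
        jz      done
shiftLp:
        shl     rax, 1          ; One-bit 128-bit shift left
        rcl     rdx, 1
        dec     ecx             ; Count down the remaining bits
        jnz     shiftLp
done:
        ret
shl128  endp
        end
```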
Extended-Precision Shift Right and Shift Arithmetic Right
You implement shr (shift right) and sar (shift arithmetic right) in a similar way, except you must start at the HO word of the operand and work your way down to the LO word:
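A minimal sketch of both one-bit right shifts on the 128-bit value in RDX:RAX (the procedure names are hypothetical):

```asm
        .code
shr128by1 proc
        shr     rdx, 1          ; HO qword first: 0 enters bit 127, bit 64 goes to CF
        rcr     rax, 1          ; CF enters bit 63 of the LO qword
        ret
shr128by1 endp

sar128by1 proc
        sar     rdx, 1          ; HO qword first: the sign bit is replicated
        rcr     rax, 1          ; CF enters bit 63 of the LO qword
        ret
sar128by1 endp
        end
```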
Efficient Multi-bit Extended-Precision Shifts
The shld and shrd instructions let you efficiently implement extended-precision shifts of several bits.
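A minimal sketch of a multi-bit shift left on a 192-bit value, assuming a hypothetical variable named ShiftMe that holds three quad words (LO first) and a shift count of 1 to 63 in CL:

```asm
        .data
ShiftMe qword   3 dup (?)       ; 192-bit value, LO qword first

        .code
shl192  proc
        mov     rax, ShiftMe[8]
        shld    ShiftMe[16], rax, cl    ; HO qword takes bits from the middle qword
        mov     rax, ShiftMe
        shld    ShiftMe[8], rax, cl     ; Middle qword takes bits from the LO qword
        shl     ShiftMe, cl             ; LO qword is filled with 0s
        ret
shl192  endp
        end
```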


The first shld instruction shifts the bits from ShiftMe[8] into ShiftMe[16] without affecting the value in ShiftMe[8]. The second shld instruction shifts the bits from ShiftMe into ShiftMe[8]. Finally, the shl instruction shifts the LO quad word the appropriate amount.
Extended-Precision Rotate Operations
The rcl and rcr operations extend in a manner similar to shl and shr. For example, to perform 192-bit rcl and rcr operations, use the following instructions:
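A minimal sketch of both one-bit rotates through the carry flag, assuming a hypothetical 192-bit variable named Operand stored as three quad words, LO first:

```asm
        .data
Operand qword   3 dup (?)       ; 192-bit value, LO qword first

        .code
rcl192  proc
        rcl     Operand, 1      ; Start at the LO qword; old CF enters bit 0
        rcl     Operand[8], 1
        rcl     Operand[16], 1  ; Bit 191 ends up in CF
        ret
rcl192  endp

rcr192  proc
        rcr     Operand[16], 1  ; Start at the HO qword; old CF enters bit 191
        rcr     Operand[8], 1
        rcr     Operand, 1      ; Bit 0 ends up in CF
        ret
rcr192  endp
        end
```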
Performing an extended-precision rol or ror operation isn't quite as simple because of the way the incoming bit is processed. You can use the bt, shld, and shrd instructions to implement an extended-precision rol or ror instruction.
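A minimal sketch of a 128-bit rol by 4 bits on the value in RDX:RAX, using R8 as a scratch register (the register choices and the procedure name are assumptions for illustration):

```asm
        .code
rol128by4 proc
        mov     r8, rdx         ; Keep the original HO qword
        shld    rdx, rax, 4     ; HO qword takes the top 4 bits of the LO qword
        shld    rax, r8, 4      ; LO qword takes the top 4 bits of the old HO qword
        bt      rax, 0          ; Optional: copy the last bit rotated around into
        ret                     ; CF, matching what rol reports
rol128by4 endp
        end
```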
Operating on Different-Size Operands
Occasionally, you may need to do a computation on a pair of operands that are not the same size. For example, you may need to add a word and a double word together or subtract a byte value from a word value.
To do so, extend the smaller operand to the size of the larger operand and then operate on two same-size operands. For signed operands, you sign-extend the smaller operand to the same size as the larger operand; for unsigned values, you zero-extend the smaller operand. This works for any operation.
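A minimal sketch of both cases, using hypothetical variables: a signed byte added to a signed double word, and an unsigned word added to an unsigned double word:

```asm
        .data
sByteVal    sbyte   -5
sDwordVal   sdword  100000
sResult     sdword  ?
uWordVal    word    1000
uDwordVal   dword   100000
uResult     dword   ?

        .code
mixedAdd proc
        movsx   eax, sByteVal       ; Sign-extend the signed byte to 32 bits
        add     eax, sDwordVal
        mov     sResult, eax

        movzx   eax, uWordVal       ; Zero-extend the unsigned word to 32 bits
        add     eax, uDwordVal
        mov     uResult, eax
        ret
mixedAdd endp
        end
```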