Advanced Arithmetic
Chapter 8
Extended-Precision Addition

As you can see, the idea is to break a larger operation into a sequence of smaller ones. Since the x86 processor family is capable of adding together at most 64 bits at a time (using general-purpose registers), the operation must proceed in blocks of 64 bits or fewer. Here are the steps:

; example in MASM
.data
X oword ?
Y oword ?
.code
mov rax, qword ptr X ; Add together the LO 64 bits
add rax, qword ptr Y ; of the numbers and store the
mov qword ptr Z, rax ; result into the LO qword of Z
mov rax, qword ptr X[8] ; Add together (with carry) the
adc rax, qword ptr Y[8] ; HO 64 bits and store the result
mov qword ptr Z[8], rax ; into the HO qword of Z
Remember, X, Y, and Z are oword objects (128 bits), and an instruction of the form mov rax, X would attempt to load a 128-bit value into a 64-bit register. To load a 64-bit value, specifically the LO 64 bits, the qword ptr operator coerces symbols X, Y, and Z to 64 bits. To load the HO qwords, you use address expressions of the form X[8], along with the qword ptr operator, because the x86 memory space addresses bytes, and it takes 8 consecutive bytes to form a quad word.
Extended-Precision Subtraction
Just as it does addition, the x86-64 performs multi-byte subtraction the same way you would manually, except it subtracts whole bytes, words, double words, or quad words at a time rather than decimal digits.
You use the sub instruction on the LO byte, word, double word, or quad word and the sbb (subtract with borrow) instruction on the high-order values.
; Example of extended-precision Subtraction
;
; 256-bit subtraction
.data
BigVal1 qword 4 dup (?)
BigVal2 qword 4 dup (?)
BigVal3 qword 4 dup (?)
.
.
.
; Compute BigVal3 := BigVal1 - BigVal2.
; Note: don't need to coerce types of BigVa1, BigVal2, or BigVal3
; because their base types are already qword.
mov rax, BigVal1[0]
sub rax, BigVal2[0]
mov BigVal3[0], rax
mov rax, BigVal1[8]
sbb rax, BigVal2[8]
mov BigVal3[8], rax
mov rax, BigVal1[16]
sbb rax, BigVal2[16]
mov BigVal3[16], rax
mov rax, BigVal1[24]
sbb rax, BigVal2[24]
mov BigVal3[24], rax
Extended-Precision Comparisions
You need to look at both bytes of a pair of values only if the HO bytes are equal. In all other cases, comparing the HO bytes tells you everything you need to know about the values. This is true for any number of bytes, not just two. The following code compares two signed 128-bit integers by comparing their HO quad words first and comparing their LO quad words only if the HO quad words are equal:
; This sequence transfers control to location "IsGreater" if
; QwordValue > QwordValue2. It transfers control to "IsLess" if
; QwordValue < QwordValue2. It falls through to the instruction
; following this sequence if QwordValue = QwordValue2.
; To test for inequality, change the "IsGreater" and "IsLess"
; operands to "NotEqual" in this code.
mov rax, qword ptr QWordValue[8] ; Get HO qword
cmp rax, qword ptr QWordValue2[8]
jg IsGreater
jl IsLess;
mov rax, qword ptr QWordValue[0] ; If HO qwords equal,
cmp rax, qword ptr QWordValue2[0] ; then we must compare
jg IsGreater; ; the LO dwords
jl IsLess;
; Fall through to this point if the two values were equal.
To compare unsigned values, use the ja and jb instructions in place of jg and jl.
To generalize the preceding code for objects larger than 128 bits, start the comparison with the objects’ HO quad words and work your way down to their LO quad words, as long as the corresponding double words are equal.
; comparing two 256-bit values
.data
Big1 qword 4 dup (?)
Big2 qword 4 dup (?)
.
.
.
mov rax, Big1[24]
cmp rax, Big2[24]
jb isLE
ja notLE
mov rax, Big1[16]
cmp rax, Big2[16]
jb isLE
ja notLE
mov rax, Big1[8]
cmp rax, Big2[8]
jb isLE
ja notLE
mov rax, Big1[0]
cmp rax, Big2[0]
jnbe notLE
isLE:
Code to execute if Big1 <= Big2
.
.
.
notLE:
Code to execute if Big1 > Big2
Extended-Precision Multiplication

The x86-64 does extended-precision multiplication in the same manner except that it works with bytes, words, double words, and quad words rather than digits.
Probably the most important thing to remember when performing an extended-precision multiplication is that you must also perform an extended-precision addition at the same time. Adding up all the partial products requires several additions.

Below code demonstrates how to multiply two 64-bit values (producing a 128-bit result) by using 32-bit instructions. Technically, you can do a 64-bit multiplication with a single instruction, but this example demonstrates a method you can easily extend to 128 bits by using the x86-64 64-bit registers rather than the 32-bit registers.
; 128-bit multiplication
option casemap:none
nl = 10
.const
fmtStr1 byte "%d * %d = %I64d (verify:%I64d)", nl, 0
.data
op1 qword 123456789
op2 qword 234567890
product oword ?
product2 oword ?
.code
externdef printf:proc
; mul64-
;
; Multiplies two 64-bit values passed in rdx and rax by
; doing a 64x64-bit multiplication, producing a 128-bit result.
; Algorithm is easily extended to 128x128 bits by switching the
; 32-bit registers for 64-bit registers.
;
; Stores result to location pointed at by R8.
mul64 proc
mp equ <dword ptr [rbp-8]> ;Multiplier
mc equ <dword ptr [rbp-16]> ;Multiplicand
prd equ <dword ptr [r8]> ;Result
push rbp
mov rbp, rsp
sub rsp, 24
push rbx ;Preserve these register values
push rcx
; Save parameters passed in registers:
mov qword ptr mp, rax
mov qword ptr mc, rdx
; Multiply the LO dword of Multiplier times Multiplicand.
mov eax, mp
mul mc ; Multiply LO dwords.
mov prd, eax ; Save LO dword of product.
mov ecx, edx ; Save HO dword of partial product result.
mov eax, mp
mul mc[4] ; Multiply mp(LO) * mc(HO)
add eax, ecx ; Add to the partial product.
adc edx, 0 ; Don't forget the carry!
mov ebx, eax ; Save partial product for now.
mov ecx, edx
; Multiply the HO word of Multiplier with Multiplicand.
mov eax, mp[4] ; Get HO dword of Multiplier.
mul mc ; Multiply by LO word of Multiplicand.
add eax, ebx ; Add to the partial product.
mov prd[4], eax ; Save the partial product.
adc ecx, edx ; Add in the carry!
mov eax, mp[4] ; Multiply the two HO dwords together.
mul mc[4]
add eax, ecx ; Add in partial product.
adc edx, 0 ; Don't forget the carry!
mov prd[8], eax ;Save HO qword of result
mov prd[12], edx
; EDX:EAX contains 64-bit result at this point
pop rcx ; Restore these registers
pop rbx
leave
ret
mul64 endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 64 ;Shadow storage
; Test the mul64 function:
mov rax, op1
mov rdx, op2
lea r8, product
call mul64
; Use a 64-bit multiply to test the result
mov rax, op1
mov rdx, op2
imul rax, rdx
mov qword ptr product2, rax
; Print the results:
lea rcx, fmtStr1
mov rdx, op1
mov r8, op2
mov r9, qword ptr product
mov rax, qword ptr product2
mov [rsp+32], rax
call printf
leave
ret ;Returns to caller
asmMain endp
end
The code works only for unsigned operands. To multiply two signed values, you must note the signs of the operands before the multiplication, take the absolute value of the two operands, do an unsigned multiplication, and then adjust the sign of the resulting product based on the signs of the original operands.
Extended-Precision Division
The general algorithm is as follows:
Move the HO quad word of the dividend into RAX and zero-extend it into RDX.
Divide by the divisor.
Store the value in RAX into the corresponding qword position of the quotient result variable (position of the dividend qword loaded into RAX prior to the division).
Load RAX with the next-lower quad word in the dividend, without modifying RDX.
Repeat steps 2 to 4 until you’ve processed all the quad words in the dividend.
; 256-bit by 64-bit division
option casemap:none
nl = 10
.const
fmtStr1 byte "quotient = "
byte "%08x_%08x_%08x_%08x_%08x_%08x_%08x_%08x"
byte nl, 0
fmtStr2 byte "remainder = %I64x", nl, 0
.data
; op1 is a 256-bit value. Initial values were chosen
; to make it easy to verify result.
op1 oword 2222eeeeccccaaaa8888666644440000h
oword 2222eeeeccccaaaa8888666644440000h
op2 qword 2
result oword 2 dup (0) ;Also 256 bits
remain qword 0
.code
externdef printf:proc
; div256-
; Divides a 256-bit number by a 64-bit number.
;
; Dividend - passed by reference in RCX.
; Divisor - passed in RDX.
;
; Quotient - passed by reference in R8.
; Remainder- passed by reference in R9.
div256 proc
divisor equ <qword ptr [rbp-8]>
dividend equ <qword ptr [rcx]>
quotient equ <qword ptr [r8]>
remainder equ <qword ptr [r9]>
push rbp
mov rbp, rsp
sub rsp, 8
mov divisor, rdx
mov rax, dividend[24] ; Begin div with HO qword
xor rdx, rdx ; Zero extend into RDS
div divisor ; Divide HO word
mov quotient[24], rax ; Save HO result
mov rax, dividend[16] ; Get dividend qword #2
div divisor ; Continue with division
mov quotient[16], rax ; Store away qword #2
mov rax, dividend[8] ; Get dividend qword #1
div divisor ; Continue with division
mov quotient[8], rax ; Store away qword #1
mov rax, dividend[0] ; Get LO dividend qword
div divisor ; Continue with division
mov quotient[0], rax ; Store away LO qword
mov remainder, rdx ; Save away remainder
leave
ret
div256 endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 80 ;Shadow storage
; Test the div256 function:
lea rcx, op1
mov rdx, op2
lea r8, result
lea r9, remain
call div256
; Print the results:
lea rcx, fmtStr1
mov edx, dword ptr result[28]
mov r8d, dword ptr result[24]
mov r9d, dword ptr result[20]
mov eax, dword ptr result[16]
mov [rsp+32], rax
mov eax, dword ptr result[12]
mov [rsp+40], rax
mov eax, dword ptr result[8]
mov [rsp+48], rax
mov eax, dword ptr result[4]
mov [rsp+56], rax
mov eax, dword ptr result[0]
mov [rsp+64], rax
call printf
lea rcx, fmtStr2
mov rdx, remain
call printf
leave
ret ;Returns to caller
asmMain endp
end
Generic N-bit by M-bit Division
To use a divisor larger than 64 bits, you have to implement the division by using a shift-and-subtract strategy, which works but is very slow.

But this method is even easier in binary.

Example of it's implementation in in MASM.
; 128-bit by 128-bit division
option casemap:none
nl = 10
.const
fmtStr1 byte "quotient = "
byte "%08x_%08x_%08x_%08x"
byte nl, 0
fmtStr2 byte "remainder = "
byte "%08x_%08x_%08x_%08x"
byte nl, 0
fmtStr3 byte "quotient (2) = "
byte "%08x_%08x_%08x_%08x"
byte nl, 0
.data
; op1 is a 128-bit value. Initial values were chosen
; to make it easy to verify result.
op1 oword 2222eeeeccccaaaa8888666644440000h
op2 oword 2
op3 oword 11117777666655554444333322220000h
result oword ?
remain oword ?
.code
externdef printf:proc
; div128-
;
; This procedure does a general 128/128 division operation
; using the following algorithm (all variables are assumed
; to be 128-bit objects):
;
; Quotient := Dividend;
; Remainder := 0;
; for i := 1 to NumberBits do
;
; Remainder:Quotient := Remainder:Quotient SHL 1;
; if Remainder >= Divisor then
;
; Remainder := Remainder - Divisor;
; Quotient := Quotient + 1;
;
; endif
; endfor
;
; Data passed:
;
; 128-bit dividend, by reference in RCX
; 128-bit divisor, by reference in RDX
;
; Data returned:
;
; Pointer to 128-bit quotient in R8
; Pointer to 128-bit remainder in R9
div128 proc
remainder equ <[rbp-16]>
dividend equ <[rbp-32]>
quotient equ <[rbp-32]> ;Aliased to dividend
divisor equ <[rbp-48]>
push rbp
mov rbp, rsp
sub rsp, 48
push rax
push rcx
xor rax, rax ;Initialize remainder to 0
mov remainder, rax
mov remainder[8], rax
; Copy the dividend to local storage
mov rax, [rcx]
mov dividend, rax
mov rax, [rcx+8]
mov dividend[8], rax
; Copy the divisor to local storage
mov rax, [rdx]
mov divisor, rax
mov rax, [rdx+8]
mov divisor[8], rax
mov cl, 128 ;Count off bits in cl
; Compute Remainder:Quotient := Remainder:Quotient SHL 1:
repeatLp: shl qword ptr dividend[0], 1 ;256-bit extended
rcl qword ptr dividend[8], 1 ; precision shift
rcl qword ptr remainder[0], 1 ; through remainder
rcl qword ptr remainder[8], 1
; Do a 128-bit comparison to see if the remainder
; is greater than or equal to the divisor.
mov rax, remainder[8]
cmp rax, divisor[8]
ja isGE
jb notGE
mov rax, remainder
cmp rax, divisor
ja isGE
jb notGE
; Remainder := Remainder - Divisor
isGE: mov rax, divisor
sub remainder, rax
mov rax, divisor[8]
sbb remainder[8], rax
; Quotient := Quotient + 1;
add qword ptr quotient, 1
adc qword ptr quotient[8], 0
notGE: dec cl
jnz repeatLp
; Okay, copy the quotient (left in the Dividend variable)
; and the remainder to their return locations.
mov rax, quotient[0]
mov [r8], rax
mov rax, quotient[8]
mov [r8][8], rax
mov rax, remainder[0]
mov [r9], rax
mov rax, remainder[8]
mov [r9][8], rax
pop rcx
pop rax
leave
ret
div128 endp
; Here is the "asmMain" function.
public asmMain
asmMain proc
push rbp
mov rbp, rsp
sub rsp, 64 ;Shadow storage
; Test the div128 function:
lea rcx, op1
lea rdx, op2
lea r8, result
lea r9, remain
call div128
; Print the results:
lea rcx, fmtStr1
mov edx, dword ptr result[12]
mov r8d, dword ptr result[8]
mov r9d, dword ptr result[4]
mov eax, dword ptr result[0]
mov [rsp+32], rax
call printf
lea rcx, fmtStr2
mov edx, dword ptr remain[12]
mov r8d, dword ptr remain[8]
mov r9d, dword ptr remain[4]
mov eax, dword ptr remain[0]
mov [rsp+32], rax
call printf
; Test the div128 function:
lea rcx, op1
lea rdx, op3
lea r8, result
lea r9, remain
call div128
; Print the results:
lea rcx, fmtStr3
mov edx, dword ptr result[12]
mov r8d, dword ptr result[8]
mov r9d, dword ptr result[4]
mov eax, dword ptr result[0]
mov [rsp+32], rax
call printf
lea rcx, fmtStr2
mov edx, dword ptr remain[12]
mov r8d, dword ptr remain[8]
mov r9d, dword ptr remain[4]
mov eax, dword ptr remain[0]
mov [rsp+32], rax
call printf
leave
ret ;Returns to caller
asmMain endp
end
To handle division by 0, check the divisor against 0 prior to running this code and return an appropriate error code if the divisor is 0. Dealing with signed values is the same as the earlier division algorithm: note the signs, take the operands’ absolute values, do the unsigned division, and then fix the sign afterward.
You can use the following technique to boost the performance of this division by a fair amount. Check to see if the divisor variable uses only 32 bits. Often, even though the divisor is a 128-bit variable, the value itself fits into 32 bits (that is, the HO double words of Divisor are 0) and you can use the div instruction, which is much faster. The improved algorithm is a bit more complex because you have to first compare the HO quad words for 0, but on average, it runs much faster while remaining capable of dividing any two pairs of values.
Extended-Precision Negation Operations
The neg instruction doesn’t provide a generic extended-precision form. However, a negation is equivalent to subtracting a value from 0, so we can easily simulate an extended-precision negation by using the sub and sbb instructions.
The following code provides a simple way to negate a (320-bit) value by subtracting that value from 0, using an extended-precision subtraction:
.data
Value qword 5 dup (?) ; 320-bit value
.
.
.
xor rax, rax ; RAX = 0
sub rax, Value
mov Value, rax
mov eax, 0 ; Cannot use XOR here:
sbb rax , Value[8] ; must preserve carry!
mov Value[8], rax
mov eax, 0 ; Zero-extends!
sbb rax, Value[16]
mov Value[16], rax
mov eax, 0
sbb rax, Value[24]
mov Value[24], rax
mov rax, 0
sbb rax, Value[32]
mov Value[32], rax
A slightly more efficient way to negate smaller values (128 bits) uses a combination of neg and sbb instructions. This technique uses the fact that neg subtracts its operand from 0. In particular, it sets the flags the same way the sub instruction would if you subtracted the destination value from 0. This code takes the following form (assuming you want to negate the 128-bit value in RDX:RAX):
neg rdx
neg rax
sbb rdx, 0
Extended-Precision AND Operations
Performing an n-byte AND operation is easy: simply AND the corresponding bytes between the two operands, saving the result.
; 128-bit AND Operation
mov rax, qword ptr source1
and rax, qword ptr source2
mov qword ptr dest, rax
mov rax, qword ptr source1[8]
and rax, qword ptr source2[8]
mov qword ptr dest[8], rax
To extend this technique to any number of qwords, logically AND the corresponding bytes, words, double words, or quad words together in the operands.
This sequence sets the flags according to the value of the last and operation. If you AND the HO quad words last, this sets all but the zero flag correctly. If you need to test the zero flag after this sequence, logically OR the two resulting double words together (or otherwise compare them both against 0).
You can also use the XMM and YMM registers to perform extended-precision logical operations (up to 256 bits at a time).
Extended-Precision OR Operations
Multi-byte logical OR operations are performed in the same way as multibyte AND operations. You OR the corresponding bytes in the two operands together.
; 192-bit OR Operation
mov rax, qword ptr source1
or rax, qword ptr source2
mov qword ptr dest, rax
mov rax, qword ptr source1[8]
or rax, qword ptr source2[8]
mov qword ptr dest[8], rax
mov rax, qword ptr source1[16]
or rax, qword ptr source2[16]
mov qword ptr dest[16], rax
As in the previous example, this does not set the zero flag properly for the entire operation. If you need to test the zero flag after an extended-precision OR, you must logically OR all the resulting double words together.
Extended-Precision XOR Operations
As with other logical operations, extended-precision XOR operations exclusive-ORs the corresponding bytes in the two operands to obtain the extended-precision result.
; 64-bit XOR Operation
mov rax, qword ptr source1
xor rax, qword ptr source2
mov qword ptr dest, rax
mov rax, qword ptr source1[8]
xor rax, qword ptr source2[8]
mov qword ptr dest[8], rax
The comment about the zero flag in the previous two sections, as well as the comment about the XMM and YMM registers, apply here.
Extended-Precision Shift Operations
Extended-Precision Shift Left

To accomplish this with machine instructions, we must first shift the LO qword to the left (for example, using the shl instruction) and capture the output from bit 63 (conveniently, the carry flag does this for us). We must then shift this bit into the LO bit of the HO qword while simultaneously shifting all the other bits to the left.
For example, to shift the 128-bit quantity in RDX:RAX one position to the left
shl rax, 1
rcl rdx, 1
To perform a shift left on an operand larger than 128 bits, use additional rcl instructions. An extended-precision shift-left operation always starts with the least-significant quad word, and each succeeding rcl instruction operates on the next-most-significant double word.
; 192-bit shift left operation
shl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1
If you need to shift your data by 2 or more bits, you can place the instructions in a loop to repeat them a certain number of times.
ShiftLoop:
shl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1
dec cl
jnz ShiftLoop
Extended-Precision Shift Right and Shift Arithmetic Right
You implement shr (shift right) and sar (shift arithmetic right) in a similar way, except you must start at the HO word of the operand and work your way down to the LO word:
; Extended-precision SAR:
sar qword ptr Operand[16], 1
rcr qword ptr Operand[8], 1
rcr qword ptr Operand[0], 1
; Extended-precision SHR:
shr qword ptr Operand[16], 1
rcr qword ptr Operand[8], 1
rcr qword ptr Operand[0], 1
Efficient Multi-bit Extended-Precision Shifts
The shld and shrd instructions let you efficiently implement extended-precision shifts of several bits.
shld Operand1, Operand2, constant
shld Operand1, Operand2, cl
shrd Operand1, Operand2, constant
shrd Operand1, Operand2, cl


.data
ShiftMe qword 012345678h, 90123456h, 78901234h
.
.
.
mov rax, ShiftMe[8]
shld ShiftMe[16], rax, 6
mov rax, ShiftMe[0]
shld ShiftMe[8], rax, 6
shl ShiftMe[0], 6
The first shld instruction shifts the bits from ShiftMe[8] into ShiftMe[16] without affecting the value in ShiftMe[8]. The second shld instruction shifts the bits from ShiftMe into ShiftMe[8]. Finally, the shl instruction shifts the LO double word the appropriate amount.
Extended-Precision Rotate Operations
The rcl and rcr operations extend in a manner similar to shl and shr. For example, to perform 192-bit rcl and rcr operations, use the following instructions:
rcl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1
rcr qword ptr Operand[16], 1
rcr qword ptr Operand[8], 1
rcr qword ptr Operand[0], 1
Performing an extended-precision rol or ror operation isn’t quite as simple because of the way the incoming bit is processed. You can use the bt, shld, and shrd instructions to implement an extended-precision rol or ror instruction.
; Compute (128-bit) rol RDX:RAX, 4:
mov rbx, rdx
shld rdx, rax, 4
shld rax, rbx, 4
bt rbx, 28 ; Set carry flag, if desired
Operating on Different-Size Operands
Occasionally, you may need to do a computation on a pair of operands that are not the same size. For example, you may need to add a word and a double word together or subtract a byte value from a word value.
To do so, extend the smaller operand to the size of the larger operand and then operate on two same-size operands. For signed operands, you sign-extend the smaller operand to the same size as the larger operand; for unsigned values, you zero-extend the smaller operand. This works for any operation.
Last updated