Advanced Arithmetic

Chapter 8

Extended-Precision Addition

As you can see, the idea is to break a larger operation into a sequence of smaller ones. Since the x86 processor family is capable of adding together at most 64 bits at a time (using general-purpose registers), the operation must proceed in blocks of 64 bits or fewer. Here are the steps:

; example in MASM

        .data
        
X oword ?
Y oword ?


        .code
        
        mov rax, qword ptr X ; Add together the LO 64 bits
        add rax, qword ptr Y ; of the numbers and store the
        mov qword ptr Z, rax ; result into the LO qword of Z
        
        mov rax, qword ptr X[8] ; Add together (with carry) the
        adc rax, qword ptr Y[8] ; HO 64 bits and store the result
        mov qword ptr Z[8], rax ; into the HO qword of Z

Remember, X, Y, and Z are oword objects (128 bits), and an instruction of the form mov rax, X would attempt to load a 128-bit value into a 64-bit register. To load a 64-bit value, specifically the LO 64 bits, the qword ptr operator coerces symbols X, Y, and Z to 64 bits. To load the HO qwords, you use address expressions of the form X[8], along with the qword ptr operator, because the x86 memory space addresses bytes, and it takes 8 consecutive bytes to form a quad word.

Extended-Precision Subtraction

Just as it does addition, the x86-64 performs multi-byte subtraction the same way you would manually, except it subtracts whole bytes, words, double words, or quad words at a time rather than decimal digits.
You use the sub instruction on the LO byte, word, double word, or quad word and the sbb (subtract with borrow) instruction on the high-order values.

; Example of extended-precision Subtraction
;
; 256-bit subtraction

                .data
                
BigVal1 qword 4 dup (?)
BigVal2 qword 4 dup (?)
BigVal3 qword 4 dup (?)

                 .
                 .
                 .
                 
; Compute BigVal3 := BigVal1 - BigVal2.

; Note: don't need to coerce types of BigVa1, BigVal2, or BigVal3 
; because their base types are already qword.

                   mov rax, BigVal1[0]
                   sub rax, BigVal2[0]
                   mov BigVal3[0], rax
                   
                   mov rax, BigVal1[8]
                   sbb rax, BigVal2[8]
                   mov BigVal3[8], rax
                   
                   mov rax, BigVal1[16]
                   sbb rax, BigVal2[16]
                   mov BigVal3[16], rax
                   
                   mov rax, BigVal1[24]
                   sbb rax, BigVal2[24]
                   mov BigVal3[24], rax

Extended-Precision Comparisions

You need to look at both bytes of a pair of values only if the HO bytes are equal. In all other cases, comparing the HO bytes tells you everything you need to know about the values. This is true for any number of bytes, not just two. The following code compares two signed 128-bit integers by comparing their HO quad words first and comparing their LO quad words only if the HO quad words are equal:

; This sequence transfers control to location "IsGreater" if
; QwordValue > QwordValue2. It transfers control to "IsLess" if
; QwordValue < QwordValue2. It falls through to the instruction
; following this sequence if QwordValue = QwordValue2.
; To test for inequality, change the "IsGreater" and "IsLess"
; operands to "NotEqual" in this code.


                mov rax, qword ptr QWordValue[8] ; Get HO qword
                cmp rax, qword ptr QWordValue2[8]
                jg IsGreater
                jl IsLess;
                
                mov rax, qword ptr QWordValue[0] ; If HO qwords equal,
                cmp rax, qword ptr QWordValue2[0] ; then we must compare
                jg IsGreater; ; the LO dwords
                jl IsLess;
                
; Fall through to this point if the two values were equal.

To compare unsigned values, use the ja and jb instructions in place of jg and jl.
To generalize the preceding code for objects larger than 128 bits, start the comparison with the objects’ HO quad words and work your way down to their LO quad words, as long as the corresponding double words are equal.

; comparing two 256-bit values

                .data
                
Big1 qword 4 dup (?)
Big2 qword 4 dup (?)
 
 .
 .
 .
 
                mov rax, Big1[24]
                cmp rax, Big2[24]
                jb isLE
                ja notLE

                mov rax, Big1[16]
                cmp rax, Big2[16]
                jb isLE
                ja notLE

                mov rax, Big1[8]
                cmp rax, Big2[8]
                jb isLE
                ja notLE

                mov rax, Big1[0]
                cmp rax, Big2[0]
                jnbe notLE

isLE:

                Code to execute if Big1 <= Big2
 .
 .
 .
 
notLE:

                Code to execute if Big1 > Big2

Extended-Precision Multiplication

The x86-64 does extended-precision multiplication in the same manner except that it works with bytes, words, double words, and quad words rather than digits.
Probably the most important thing to remember when performing an extended-precision multiplication is that you must also perform an extended-precision addition at the same time. Adding up all the partial products requires several additions.

Below code demonstrates how to multiply two 64-bit values (producing a 128-bit result) by using 32-bit instructions. Technically, you can do a 64-bit multiplication with a single instruction, but this example demonstrates a method you can easily extend to 128 bits by using the x86-64 64-bit registers rather than the 32-bit registers.

; 128-bit multiplication

        option  casemap:none

nl          =       10

            .const
fmtStr1     byte    "%d * %d = %I64d (verify:%I64d)", nl, 0
            
            
             .data
op1         qword   123456789
op2         qword   234567890
product     oword   ?           
product2    oword   ?           

            .code
            externdef printf:proc
            
; mul64-
;
;  Multiplies two 64-bit values passed in rdx and rax by
; doing a 64x64-bit multiplication, producing a 128-bit result. 
; Algorithm is easily extended to 128x128 bits by switching the 
; 32-bit registers for 64-bit registers.
;
; Stores result to location pointed at by R8.


mul64       proc
mp          equ     <dword ptr [rbp-8]>     ;Multiplier
mc          equ     <dword ptr [rbp-16]>    ;Multiplicand
prd         equ     <dword ptr [r8]>        ;Result

            push    rbp
            mov     rbp, rsp
            sub     rsp, 24
            
            push    rbx     ;Preserve these register values
            push    rcx
            

; Save parameters passed in registers:

            mov     qword ptr mp, rax
            mov     qword ptr mc, rdx

; Multiply the LO dword of Multiplier times Multiplicand.
                                           
            mov eax, mp
            mul mc          ; Multiply LO dwords.
            mov prd, eax    ; Save LO dword of product.
            mov ecx, edx    ; Save HO dword of partial product result.

            mov eax, mp
            mul mc[4]       ; Multiply mp(LO) * mc(HO)
            add eax, ecx    ; Add to the partial product.
            adc edx, 0      ; Don't forget the carry!
            mov ebx, eax    ; Save partial product for now.
            mov ecx, edx

; Multiply the HO word of Multiplier with Multiplicand.

            mov eax, mp[4]  ; Get HO dword of Multiplier.
            mul mc          ; Multiply by LO word of Multiplicand.
            add eax, ebx    ; Add to the partial product.
            mov prd[4], eax ; Save the partial product.
            adc ecx, edx    ; Add in the carry!

            mov eax, mp[4]  ; Multiply the two HO dwords together.
            mul mc[4]
            add eax, ecx    ; Add in partial product.
            adc edx, 0      ; Don't forget the carry!
            
            mov prd[8], eax ;Save HO qword of result
            mov prd[12], edx
    
; EDX:EAX contains 64-bit result at this point

            pop     rcx     ; Restore these registers
            pop     rbx
            leave
            ret    
mul64       endp

            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ;Shadow storage
            

; Test the mul64 function:

            mov     rax, op1
            mov     rdx, op2
            lea     r8, product
            call    mul64
            
; Use a 64-bit multiply to test the result

            mov     rax, op1
            mov     rdx, op2
            imul    rax, rdx
            mov     qword ptr product2, rax
            
; Print the results:

            lea     rcx, fmtStr1
            mov     rdx, op1
            mov     r8,  op2
            mov     r9,  qword ptr product
            mov     rax, qword ptr product2
            mov     [rsp+32], rax
            call    printf
                
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

The code works only for unsigned operands. To multiply two signed values, you must note the signs of the operands before the multiplication, take the absolute value of the two operands, do an unsigned multiplication, and then adjust the sign of the resulting product based on the signs of the original operands.

Extended-Precision Division

The general algorithm is as follows:

Move the HO quad word of the dividend into RAX and zero-extend it into RDX.
Divide by the divisor.
Store the value in RAX into the corresponding qword position of the quotient result variable (position of the dividend qword loaded into RAX prior to the division).
Load RAX with the next-lower quad word in the dividend, without modifying RDX.
Repeat steps 2 to 4 until you’ve processed all the quad words in the dividend.

; 256-bit by 64-bit division

        option  casemap:none

nl          =       10

            .const
fmtStr1     byte    "quotient  = "
            byte    "%08x_%08x_%08x_%08x_%08x_%08x_%08x_%08x"
            byte    nl, 0
            
fmtStr2     byte    "remainder = %I64x", nl, 0
            
            
             .data

; op1 is a 256-bit value. Initial values were chosen
; to make it easy to verify result.

op1         oword   2222eeeeccccaaaa8888666644440000h
            oword   2222eeeeccccaaaa8888666644440000h
            
op2         qword   2
result      oword   2 dup (0) ;Also 256 bits
remain      qword   0

            .code
            externdef printf:proc
; div256-
;    Divides a 256-bit number by a 64-bit number.
;
; Dividend - passed by reference in RCX.
; Divisor  - passed in RDX.
;
; Quotient - passed by reference in R8.
; Remainder- passed by reference in R9.

div256      proc
divisor     equ     <qword ptr [rbp-8]>
dividend    equ     <qword ptr [rcx]>
quotient    equ     <qword ptr [r8]>
remainder   equ     <qword ptr [r9]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 8
            
            mov     divisor, rdx
            
            mov     rax, dividend[24]  ; Begin div with HO qword
            xor     rdx, rdx           ; Zero extend into RDS
            div     divisor            ; Divide HO word
            mov     quotient[24], rax  ; Save HO result
            
            mov     rax, dividend[16]  ; Get dividend qword #2
            div     divisor            ; Continue with division
            mov     quotient[16], rax  ; Store away qword #2
            
            mov     rax, dividend[8]   ; Get dividend qword #1
            div     divisor            ; Continue with division
            mov     quotient[8], rax   ; Store away qword #1
            
            mov     rax, dividend[0]   ; Get LO dividend qword
            div     divisor            ; Continue with division
            mov     quotient[0], rax   ; Store away LO qword
            
            mov     remainder, rdx     ; Save away remainder
                
            leave
            ret            
div256      endp
            
            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 80         ;Shadow storage
            

; Test the div256 function:

            lea     rcx, op1
            mov     rdx, op2
            lea     r8, result
            lea     r9, remain
            call    div256
                        
; Print the results:

            lea     rcx, fmtStr1
            mov     edx, dword ptr result[28]
            mov     r8d, dword ptr result[24]
            mov     r9d, dword ptr result[20]
            mov     eax, dword ptr result[16]
            mov     [rsp+32], rax
            mov     eax, dword ptr result[12]
            mov     [rsp+40], rax
            mov     eax, dword ptr result[8]
            mov     [rsp+48], rax
            mov     eax, dword ptr result[4]
            mov     [rsp+56], rax
            mov     eax, dword ptr result[0]
            mov     [rsp+64], rax
            call    printf
                

            lea     rcx, fmtStr2
            mov     rdx, remain
            call    printf
                
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

Generic N-bit by M-bit Division

To use a divisor larger than 64 bits, you have to implement the division by using a shift-and-subtract strategy, which works but is very slow.

But this method is even easier in binary.

Example of it's implementation in in MASM.

; 128-bit by 128-bit division

        option  casemap:none

nl          =       10

            .const
fmtStr1     byte    "quotient  = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0
            
fmtStr2     byte    "remainder = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0
            
fmtStr3     byte    "quotient (2)  = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0
            
            
             .data

; op1 is a 128-bit value. Initial values were chosen
; to make it easy to verify result.

op1         oword   2222eeeeccccaaaa8888666644440000h            
op2         oword   2
op3         oword   11117777666655554444333322220000h
result      oword   ?
remain      oword   ?

            .code
            externdef printf:proc
            
; div128-
;
; This procedure does a general 128/128 division operation 
; using the following algorithm (all variables are assumed 
; to be 128-bit objects):
;
; Quotient := Dividend;
; Remainder := 0;
; for i := 1 to NumberBits do
; 
;  Remainder:Quotient := Remainder:Quotient SHL 1;
;  if Remainder >= Divisor then
; 
;      Remainder := Remainder - Divisor;
;      Quotient := Quotient + 1;
; 
;  endif
; endfor
;
; Data passed:
;
; 128-bit dividend, by reference in RCX
; 128-bit divisor, by reference in RDX
;
; Data returned:
;
; Pointer to 128-bit quotient in R8
; Pointer to 128-bit remainder in R9 

div128      proc
remainder   equ     <[rbp-16]>
dividend    equ     <[rbp-32]>
quotient    equ     <[rbp-32]>      ;Aliased to dividend
divisor     equ     <[rbp-48]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 48
            
            push    rax
            push    rcx
            
            xor     rax, rax        ;Initialize remainder to 0
            mov     remainder, rax
            mov     remainder[8], rax
            
; Copy the dividend to local storage

            mov     rax, [rcx]
            mov     dividend, rax
            mov     rax, [rcx+8]
            mov     dividend[8], rax
            
; Copy the divisor to local storage

            mov     rax, [rdx]
            mov     divisor, rax
            mov     rax, [rdx+8]
            mov     divisor[8], rax
            
            
            mov     cl, 128           ;Count off bits in cl

; Compute Remainder:Quotient := Remainder:Quotient SHL 1:

repeatLp:   shl     qword ptr dividend[0], 1  ;256-bit extended
            rcl     qword ptr dividend[8], 1  ; precision shift
            rcl     qword ptr remainder[0], 1 ; through remainder
            rcl     qword ptr remainder[8], 1       

; Do a 128-bit comparison to see if the remainder
; is greater than or equal to the divisor.

            mov     rax, remainder[8]
            cmp     rax, divisor[8]
            ja      isGE
            jb      notGE
            
            mov     rax, remainder
            cmp     rax, divisor
            ja      isGE
            jb      notGE
            
; Remainder := Remainder - Divisor

isGE:       mov     rax, divisor
            sub     remainder, rax
            mov     rax, divisor[8]
            sbb     remainder[8], rax

; Quotient := Quotient + 1;


            add     qword ptr quotient, 1
            adc     qword ptr quotient[8], 0

notGE:      dec     cl
            jnz     repeatLp


; Okay, copy the quotient (left in the Dividend variable)
; and the remainder to their return locations.
    
            mov     rax, quotient[0]
            mov     [r8], rax
            mov     rax, quotient[8]
            mov     [r8][8], rax
            
            mov     rax, remainder[0]
            mov     [r9], rax
            mov     rax, remainder[8]
            mov     [r9][8], rax


            pop     rcx
            pop     rax
            leave
            ret
                        
div128      endp

            
; Here is the "asmMain" function.

        
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ;Shadow storage
            

; Test the div128 function:

            lea     rcx, op1
            lea     rdx, op2
            lea     r8, result
            lea     r9, remain
            call    div128
                        
; Print the results:

            lea     rcx, fmtStr1
            mov     edx, dword ptr result[12]
            mov     r8d, dword ptr result[8]
            mov     r9d, dword ptr result[4]
            mov     eax, dword ptr result[0]
            mov     [rsp+32], rax
            call    printf
                

            lea     rcx, fmtStr2
            mov     edx, dword ptr remain[12]
            mov     r8d, dword ptr remain[8]
            mov     r9d, dword ptr remain[4]
            mov     eax, dword ptr remain[0]
            mov     [rsp+32], rax
            call    printf
                
            
; Test the div128 function:

            lea     rcx, op1
            lea     rdx, op3
            lea     r8, result
            lea     r9, remain
            call    div128
                        
; Print the results:

            lea     rcx, fmtStr3
            mov     edx, dword ptr result[12]
            mov     r8d, dword ptr result[8]
            mov     r9d, dword ptr result[4]
            mov     eax, dword ptr result[0]
            mov     [rsp+32], rax
            call    printf
                

            lea     rcx, fmtStr2
            mov     edx, dword ptr remain[12]
            mov     r8d, dword ptr remain[8]
            mov     r9d, dword ptr remain[4]
            mov     eax, dword ptr remain[0]
            mov     [rsp+32], rax
            call    printf
            
            leave
            ret     ;Returns to caller
        
asmMain     endp
            end

To handle division by 0, check the divisor against 0 prior to running this code and return an appropriate error code if the divisor is 0. Dealing with signed values is the same as the earlier division algorithm: note the signs, take the operands’ absolute values, do the unsigned division, and then fix the sign afterward.
You can use the following technique to boost the performance of this division by a fair amount. Check to see if the divisor variable uses only 32 bits. Often, even though the divisor is a 128-bit variable, the value itself fits into 32 bits (that is, the HO double words of Divisor are 0) and you can use the div instruction, which is much faster. The improved algorithm is a bit more complex because you have to first compare the HO quad words for 0, but on average, it runs much faster while remaining capable of dividing any two pairs of values.

Extended-Precision Negation Operations

The neg instruction doesn’t provide a generic extended-precision form. However, a negation is equivalent to subtracting a value from 0, so we can easily simulate an extended-precision negation by using the sub and sbb instructions.
The following code provides a simple way to negate a (320-bit) value by subtracting that value from 0, using an extended-precision subtraction:

        .data
Value qword 5 dup (?) ; 320-bit value
 
 .
 .
 .
 
        xor rax, rax ; RAX = 0
        sub rax, Value
        mov Value, rax

        mov eax, 0 ; Cannot use XOR here:
        sbb rax , Value[8] ; must preserve carry!
        mov Value[8], rax

        mov eax, 0 ; Zero-extends!
        sbb rax, Value[16]
        mov Value[16], rax

        mov eax, 0
        sbb rax, Value[24]
        mov Value[24], rax

        mov rax, 0
        sbb rax, Value[32]
        mov Value[32], rax

A slightly more efficient way to negate smaller values (128 bits) uses a combination of neg and sbb instructions. This technique uses the fact that neg subtracts its operand from 0. In particular, it sets the flags the same way the sub instruction would if you subtracted the destination value from 0. This code takes the following form (assuming you want to negate the 128-bit value in RDX:RAX):

neg rdx
neg rax
sbb rdx, 0

Extended-Precision AND Operations

Performing an n-byte AND operation is easy: simply AND the corresponding bytes between the two operands, saving the result.

; 128-bit AND Operation

mov rax, qword ptr source1
and rax, qword ptr source2
mov qword ptr dest, rax

mov rax, qword ptr source1[8]
and rax, qword ptr source2[8]
mov qword ptr dest[8], rax

To extend this technique to any number of qwords, logically AND the corresponding bytes, words, double words, or quad words together in the operands.
This sequence sets the flags according to the value of the last and operation. If you AND the HO quad words last, this sets all but the zero flag correctly. If you need to test the zero flag after this sequence, logically OR the two resulting double words together (or otherwise compare them both against 0).
You can also use the XMM and YMM registers to perform extended-precision logical operations (up to 256 bits at a time).

Extended-Precision OR Operations

Multi-byte logical OR operations are performed in the same way as multibyte AND operations. You OR the corresponding bytes in the two operands together.

; 192-bit OR Operation

mov rax, qword ptr source1
or rax, qword ptr source2
mov qword ptr dest, rax

mov rax, qword ptr source1[8]
or rax, qword ptr source2[8]
mov qword ptr dest[8], rax

mov rax, qword ptr source1[16]
or rax, qword ptr source2[16]
mov qword ptr dest[16], rax

As in the previous example, this does not set the zero flag properly for the entire operation. If you need to test the zero flag after an extended-precision OR, you must logically OR all the resulting double words together.

Extended-Precision XOR Operations

As with other logical operations, extended-precision XOR operations exclusive-ORs the corresponding bytes in the two operands to obtain the extended-precision result.

; 64-bit XOR Operation

mov rax, qword ptr source1
xor rax, qword ptr source2
mov qword ptr dest, rax

mov rax, qword ptr source1[8]
xor rax, qword ptr source2[8]
mov qword ptr dest[8], rax

The comment about the zero flag in the previous two sections, as well as the comment about the XMM and YMM registers, apply here.

Extended-Precision Shift Operations

Extended-Precision Shift Left

To accomplish this with machine instructions, we must first shift the LO qword to the left (for example, using the shl instruction) and capture the output from bit 63 (conveniently, the carry flag does this for us). We must then shift this bit into the LO bit of the HO qword while simultaneously shifting all the other bits to the left.
For example, to shift the 128-bit quantity in RDX:RAX one position to the left

shl rax, 1
rcl rdx, 1

To perform a shift left on an operand larger than 128 bits, use additional rcl instructions. An extended-precision shift-left operation always starts with the least-significant quad word, and each succeeding rcl instruction operates on the next-most-significant double word.

; 192-bit shift left operation
shl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1

If you need to shift your data by 2 or more bits, you can place the instructions in a loop to repeat them a certain number of times.

ShiftLoop:
         shl qword ptr Operand[0], 1
         rcl qword ptr Operand[8], 1
         rcl qword ptr Operand[16], 1
         dec cl
         jnz ShiftLoop

Extended-Precision Shift Right and Shift Arithmetic Right

You implement shr (shift right) and sar (shift arithmetic right) in a similar way, except you must start at the HO word of the operand and work your way down to the LO word:

; Extended-precision SAR:

         sar qword ptr Operand[16], 1
         rcr qword ptr Operand[8], 1
         rcr qword ptr Operand[0], 1
         
; Extended-precision SHR:

         shr qword ptr Operand[16], 1
         rcr qword ptr Operand[8], 1
         rcr qword ptr Operand[0], 1

Efficient Multi-bit Extended-Precision Shifts

The shld and shrd instructions let you efficiently implement extended-precision shifts of several bits.

shld Operand1, Operand2, constant
shld Operand1, Operand2, cl
shrd Operand1, Operand2, constant
shrd Operand1, Operand2, cl

            .data
ShiftMe qword 012345678h, 90123456h, 78901234h
 
 .
 .
 .
 
            mov rax, ShiftMe[8]
            shld ShiftMe[16], rax, 6
            mov rax, ShiftMe[0] 
            shld ShiftMe[8], rax, 6
            shl ShiftMe[0], 6

The first shld instruction shifts the bits from ShiftMe[8] into ShiftMe[16] without affecting the value in ShiftMe[8]. The second shld instruction shifts the bits from ShiftMe into ShiftMe[8]. Finally, the shl instruction shifts the LO double word the appropriate amount.

Extended-Precision Rotate Operations

The rcl and rcr operations extend in a manner similar to shl and shr. For example, to perform 192-bit rcl and rcr operations, use the following instructions:

rcl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1

rcr qword ptr Operand[16], 1
rcr qword ptr Operand[8], 1
rcr qword ptr Operand[0], 1

Performing an extended-precision rol or ror operation isn’t quite as simple because of the way the incoming bit is processed. You can use the bt, shld, and shrd instructions to implement an extended-precision rol or ror instruction.

; Compute (128-bit) rol RDX:RAX, 4:

    mov rbx, rdx
    shld rdx, rax, 4
    shld rax, rbx, 4
    bt rbx, 28        ; Set carry flag, if desired

Operating on Different-Size Operands

Occasionally, you may need to do a computation on a pair of operands that are not the same size. For example, you may need to add a word and a double word together or subtract a byte value from a word value.
To do so, extend the smaller operand to the size of the larger operand and then operate on two same-size operands. For signed operands, you sign-extend the smaller operand to the same size as the larger operand; for unsigned values, you zero-extend the smaller operand. This works for any operation.

PreviousLow Level Control Structures NextNumeric Conversion

Last updated 28 days ago