String Instructions

Chapter 14

  • A string is a collection of values stored in contiguous memory locations. The x86-64 CPUs can process four types of strings: byte strings, word strings, double-word strings, and quad-word strings.

String Instructions

  • All members of the x86-64 family support five string instructions: movsx, cmpsx, scasx, lodsx, and stosx. (x = b, w, d, or q for byte, word, double word, or quad word)

  • The string instructions operate on blocks of memory. For example, the movs instruction moves a sequence of bytes from one memory location to another, the cmps instruction compares two blocks of memory, and the scas instruction scans a block of memory for a particular value.

  • The source and destination blocks are not provided as explicit operands, however. Instead, the string instructions use specific registers as operands:

    • RSI (source index) register

    • RDI (destination index) register

    • RCX (count) register

    • AL, AX, EAX, and RAX registers

    • The direction flag in the FLAGS register

  • For example, the movs (move string) instruction copies RCX elements from the source address specified by RSI to the destination address specified by RDI. Likewise, the cmps instruction compares the string pointed at by RSI, of length RCX, to the string pointed at by RDI.

The rep, repe, repz, and the repnz and repne Prefixes

  • The repeat prefixes tell the x86-64 to do a multi-byte string operation specifically, to repeat a string operation up to RCX times.

rep prefix:
    rep movsx (x is b, w, d, or q)
    rep stosx
    
repe prefix: (Note: repz is a synonym for repe)
    repe cmpsx
    repe scasx

repne prefix: (Note: repnz is a synonym for repne)
    repne cmpsx
    repne scasx
  • The repe prefix says to “repeat this operation while the comparison is equal, or up to the number of times specified by RCX (whichever condition fails first).”

  • The repne prefix’s action is “repeat this operation while the comparison is not equal, or up to the number of times specified by RCX.” As it turns out, you’ll use repe for most character string comparisons; repne is used mainly with the scasx instructions to locate a specific character within a string (such as a zero-terminating byte).

The Direction Flag

  • The direction flag in the FLAGS register controls how the CPU processes strings. If the direction flag is clear, the CPU increments RSI and RDI after operating on each string element. For example, executing movs will move the byte, word, double word, or quad word at RSI to RDI and then increment RSI and RDI by 1, 2, 4, or 8.

  • When specifying the rep prefix before this instruction, the CPU increments RSI and RDI for each element in the string (the count in RCX specifies the number of elements). At completion, the RSI and RDI registers will be pointing at the first item beyond the strings.

  • You can change the direction flag’s value by using the cld (clear direction flag) and std (set direction flag) instructions.

The movs Instruction

  • The movsb (move string, bytes) instruction fetches the byte at address RSI, stores it at address RDI, and then increments or decrements the RSI and RDI registers by 1.

  • If the rep prefix is present, the CPU checks RCX to see whether it contains 0. If not, it moves the byte from RSI to RDI and decrements the RCX register. This process repeats until RCX becomes 0. If RCX contains 0 upon initial execution, the movsb instruction will not copy any data bytes.

movsb
movsw
movsd
movsq
rep movsb
rep movsw
rep movsd
rep movsq
  • If you’ve set the direction flag before executing a movsq, movsb, movsw, or movsd instruction, the CPU decrements the RSI and RDI registers after moving each string element. This means that the RSI and RDI registers must point at the last element of their respective strings before executing a movsb, movsw, movsd, or movsq instruction.

CharArray1 byte ?
CharArray2 byte 384 dup (?)
 .
 .
 .
        cld
        lea rsi, CharArray1
        lea rdi, CharArray2
        mov rcx, lengthof(CharArray2);
        rep movsb
  • When the CPU executes the movsb instruction, it does the following:

    1. Copies the byte at RSI (CharArray1) to the byte pointed at by RDI (CharArray2).

    2. Increments RSI and RDI, and decrements RCX by 1. Now the RSI register points at CharArray1 + 1 (which is the address of CharArray2), and the RDI register points at CharArray2 + 1.

    3. Copies the byte pointed at by RSI to the byte pointed at by RDI. However, this is the byte originally copied from location CharArray1. So, the movsb instruction copies the value originally in location CharArray1 to both CharArray2 and CharArray2 + 1.

    4. Again increments RSI and RDI, and decrements RCX.

    5. Copies the byte from location CharArray1 + 2 (CharArray2 + 1) to location CharArray2 + 2. Once again, this is the value that originally appeared in location CharArray1.

  • If you really want to move one array into another when they overlap like this, you should move each element of the source string to the destination string, starting at the end of the two strings

  • On many computer systems, the movsq instruction provides about the fastest way to copy bulk data from one location to another. While there are, arguably, faster ways to copy data on certain CPUs, ultimately the memory bus performance is the limiting factor, and the CPUs are generally much faster than the memory bus. Therefore writing fancy code to improve memory-to-memory transfers is probably a waste of time.

  • On later processors, it may be more efficient to use movsb to copy the specified number of bytes.

The cmps Instruction

  • The cmps instruction compares two strings. The CPU compares the value referenced by RDI to the value pointed at by RSI.

  • RCX contains the number of elements in the source string when using the repe or repne prefix to compare entire strings.

cmpsb
cmpsw
cmpsd
cmpsq

repe cmpsb
repe cmpsw
repe cmpsd
repe cmpsq

repne cmpsb
repne cmpsw
repne cmpsd
repne cmpsq
  • After the execution of repne cmps, either the RCX register is 0 (in which case the two strings are totally different), or the RCX contains the number of elements compared in the two strings until a match is found.

  • While this form of the cmps instruction isn’t particularly useful for comparing strings, it is useful for locating the first pair of matching items in a couple of byte, word, or double-word arrays.

Comparing Character Strings

  • We compare corresponding elements until encountering a character that doesn’t match or the end of the shorter string. If a pair of corresponding characters does not match, compare the two strings based on that single character. If the two strings match up to the length of the shorter string, compare their length.

  • The two strings are equal if and only if their lengths are equal and each corresponding pair of characters in the two strings is identical. The length of a string affects the comparison only if the two strings are identical up to the length of the shorter string.

  • For (ASCII) character strings, use the cmpsb instruction in the following manner:

    1. Clear the direction flag.

    2. Load the RCX register with the length of the smaller string.

    3. Point the RSI and RDI registers at the first characters in the two strings you want to compare.

    4. Use the repe prefix with the cmpsb instruction to compare the strings on a byte-by-byte basis. Even if the strings contain an even number of characters, you cannot use the cmpsw or cmpsd instructions, because they do not compare strings in lexicographical order.

    5. If the two strings are equal, compare their lengths.

        cld
        mov rsi, AdrsStr1
        mov rdi, AdrsStr2
        mov rcx, LengthSrc
        cmp rcx, LengthDest
        jbe srcIsShorter ; Put the length of the
                         ; shorter string in RCX
        mov rcx, LengthDest

srcIsShorter:
   repe cmpsb
        jnz notEq ; If equal to the length of the
                  ; shorter string, cmp lengths
        mov rcx, LengthSrc
        cmp rcx, LengthDest

notEq: 

Comparing Extended-Precision Integers

  • You can also use the cmps instruction to compare multi-word integer values

  • Unlike with character strings, we cannot compare integer strings by using lexicographical ordering. When comparing strings, we compare the characters from the least significant byte to the most significant byte

  • When comparing integers, we must compare the values from the most significant byte, word, or double word down to the least significant.

        std
        lea rsi, SourceInteger[3 * 8]
        lea rdi, DestInteger[3 * 8]
        mov rcx, 4
repe cmpsq
        cld
  • This code compares the integers from their most significant qword down to the least significant qword. The cmpsq instruction finishes when the two values are unequal or upon decrementing RCX to 0.

The scas Instruction

  • The scas (scan string) instruction is used to search for a particular element within a string.

  • Unlike the movs and cmps instructions, scas requires only a destination string (pointed at by RDI). The source operand is the value in the AL (scasb), AX (scasw), EAX (scasd), or RAX (scasq) register. The scas instruction compares the value in the accumulator (AL, AX, EAX, or RAX) against the value pointed at by RDI and then increments (or decrements) RDI by 1, 2, 4, or 8.

scasb
scasw
scasd
scasq

repe scasb
repe scasw
repe scasd
repe scasq

repne scasb
repne scasw
repne scasd
repne scasq
  • repe scas scans through the string while the value in the accumulator is equal to the string operand, and repne scas scans through the string while the accumulator is not equal to the string operand.

The stos Instruction

  • The stos instruction stores the value in the accumulator at the location specified by RDI. After storing the value, the CPU increments or decrements RDI depending on the state of the direction flag.

stosb
stosw
stosd
stosq

rep stosb
rep stosw
rep stosd
rep stosq
  • Although the stos instruction has many uses, its primary use is to initialize arrays and strings to a constant value. . For example, if you have a 256-byte array that you want to clear out with 0s

        cld
        lea rdi, DestArray
        mov rcx, 32 ; 32 quad words = 256 bytes
        xor rax, rax ; Zero out RAX
rep stosq
  • The stosb instruction stores the value in the AL register into the specified memory location(s), stosw stores the AX register into the specified memory location(s), stosd stores EAX into the specified location(s), and stosq stores RAX into the specified location(s). With the rep prefix, this process repeats the number of times specified by the RCX register.

The lods Instruction

  • The lods instruction copies the byte, word, double word, or quad word pointed at by RSI into the AL, AX, EAX, or RAX register, after which it increments or decrements the RSI register by 1, 2, 4, or 8.

  • Use lods to fetch bytes (lodsb), words (lodsw), double words (lodsd), or quad words (lodsq) from memory for further processing.

lodsb
lodsw
lodsd
lodsq

rep lodsb
rep lodsw
rep lodsd
rep lodsq
  • The repeat prefixes appear here simply because they are allowed. They’re not very useful, but they are allowed.

Building Complex String Functions from lods and stos

        mov rsi, StringAddress   ; Load string address into RSI
        mov rdi, rsi             ; Also point RDI here
        mov rcx, stringLength    ; Presumably, this was precomputed
        jrcxz skipUC             ; Don't do anything if length is 0
rpt:
        lodsb                    ; Get the next character in the string
        cmp al, 'A'
        jb notUpper
        cmp al, 'Z'
        ja notUpper
        or al, 20h               ; Convert to lowercase
notUpper:
        stosb                    ; Store converted char into string
        dec rcx
        jnz rpt                  ; Zero flag is set when RCX is 0
skipUC:

SIMD String Instructions

  • The four SSE4.2 instructions that process text and string fragments are as follows:

    • PCMPESTRI - Packed compare explicit-length strings, return index

    • PCMPESTRM - Packed compare explicit-length strings, return mask

    • PCMPISTRI - Packed compare implicit-length strings, return index

    • PCMPISTRM - Packed compare implicit-length strings, return mask

  • Implicit-length strings use a sentinel (trailing) byte to mark the end of the string, specifically, a zero-terminating byte (or word, in the case of Unicode characters). Explicit-length strings are those for which you supply a string length.

pcmpXstrY xmmsrc1, xmmsrc2/memsrc2, imm8
vpcmpXstrY xmmsrc1, xmmsrc2/memsrc2, imm8
; where X is E or I, and Y is I or M. Both forms use 128-bit operands

The immediate Operand (imm8)

  • Bits 0 and 1 of the immediate operand specify the size and type of the string elements.

  • Bit 0 specifies word (Unicode) or byte (ASCII) operands. Bit 1 specifies whether the operands are signed or unsigned.

Packed Compare imm8 Bits 0 and 1
  • Bits 2 and 3 of the immediate operand specify how the instruction will compare the two strings.

Packed Compare imm8 Bits 2 and 3
  • Equal each (10b) is probably the easiest comparison to understand. The packed compare instruction will compare each corresponding character in the string (up to the length of the string) and set a Boolean flag for the result of the comparison of each byte or word in the string.

Equal each aggregate comparison operation
  • The equal any comparison compares each byte in the second source operand to see whether it is any of the characters found in the first source operand (XMMsrc2/memsrc2). For example, if XMMsrc1 contains the string abcdefABCDEF (and four 0 bytes), and XMMsrc2/memsrc2 contains 12AF89C0, the resulting comparison would yield 00101100b (1s in the character positions corresponding to the A, F, and C characters).

  • The equal ordered comparison searches for each occurrence of the string in XMMsrc1 that can be found in the XMMsrc2/memsrc2 operand. For example, if the XMMsrc2/memsrc2 operand contains the string never need shine and the XMMsrc1 operand has the string ne (padded with 0s), then the equal ordered comparison produces the vector 0100000001000001b. This is similar to the strstr() function in the C Standard Library.

  • The ranges comparison aggregate operation breaks the entries in the XMMsrc1 operand into pairs (at even and odd indexes in the register). The first element (byte or word) specifies a lower bound, and the second entry specifies an upper bound. The XMMsrc1 register supports up to eight byte ranges or four word ranges. This aggregate operation compares each character in the XMMsrc2/memsrc2 operand against each of these ranges and stores true in the resultant vector if the character is within one of the specified ranges (inclusive) and false if it is outside all of these ranges.

  • Bits 4 and 5 of the immediate operand specify the result polarity

  • Bit 6 of the immediate operand specifies the instruction result

  • The (v)pcmpestrm and (v)pcmpistrm instructions produce a bit-mask result and store it into the XMM0 register. If bit 6 of the imm8 operand contains a 0, these two instructions pack this bit mask into 8 or 16 bits and store them into the LO 8 (or 16) bits of XMM0, zero-extending that value through the upper bits of XMM0. If imm8 bit 6 contains a 1, these instructions will store the bit mask (all 1 bits per byte or word) throughout the XMM0 register.

  • The (v)pcmpestri and (v)pcmpistri instructions produce an index result and return this value in the ECX register. If bit 6 of the imm8 operand contains a 0, these two instructions return the index of the LO set bit in the result bit mask. If bit 6 of the imm8 operand is 1, these instructions return the index of the highest-order set bit in the resultant bit mask. If there are no set bits in the result bit mask, these instructions return 16 (for byte comparisons) or 8 (for word comparisons) in the ECX register.

Comparison Result When Source 1 and Source 2 Are Valid or Invalid
  • All eight instructions—(v)pcmpestri, (v)pcmpistri, (v)pcmpestrm, and (v)pcmpistrm—clear the carry flag if all of the bits in the (internal) result bit map are 0 (no comparison); these instructions set the carry flag if there is at least 1 bit set in the bit map

  • The zero flag indicates whether the src2 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the zero flag is set if EDX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the zero flag is set if XMMsrc2/memsrc2 contains a null character.

  • The sign flag indicates whether the src1 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the sign flag is set if EAX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the zero flag is set if XMMsrc1 contains a null character.

  • The overflow flag contains the setting for bit 0 of the result bit map (that is, whether the first character of the source string was a match

Last updated