String Instructions
Chapter 14
A string is a collection of values stored in contiguous memory locations. The x86-64 CPUs can process four types of strings: byte strings, word strings, double-word strings, and quad-word strings.
String Instructions
All members of the x86-64 family support five string instructions: movsx, cmpsx, scasx, lodsx, and stosx. (x = b, w, d, or q for byte, word, double word, or quad word)
The string instructions operate on blocks of memory. For example, the movs instruction moves a sequence of bytes from one memory location to another, the cmps instruction compares two blocks of memory, and the scas instruction scans a block of memory for a particular value.
The source and destination blocks are not provided as explicit operands, however. Instead, the string instructions use specific registers as operands:
RSI (source index) register
RDI (destination index) register
RCX (count) register
AL, AX, EAX, and RAX registers
The direction flag in the FLAGS register
For example, the movs (move string) instruction copies RCX elements from the source address specified by RSI to the destination address specified by RDI. Likewise, the cmps instruction compares the string pointed at by RSI, of length RCX, to the string pointed at by RDI.
The rep, repe, repz, and the repnz and repne Prefixes
The repeat prefixes tell the x86-64 to do a multi-byte string operation specifically, to repeat a string operation up to RCX times.
rep prefix:
rep movsx (x is b, w, d, or q)
rep stosx
repe prefix: (Note: repz is a synonym for repe)
repe cmpsx
repe scasx
repne prefix: (Note: repnz is a synonym for repne)
repne cmpsx
repne scasx
The repe prefix says to “repeat this operation while the comparison is equal, or up to the number of times specified by RCX (whichever condition fails first).”
The repne prefix’s action is “repeat this operation while the comparison is not equal, or up to the number of times specified by RCX.” As it turns out, you’ll use repe for most character string comparisons; repne is used mainly with the scasx instructions to locate a specific character within a string (such as a zero-terminating byte).
The Direction Flag
The direction flag in the FLAGS register controls how the CPU processes strings. If the direction flag is clear, the CPU increments RSI and RDI after operating on each string element. For example, executing movs will move the byte, word, double word, or quad word at RSI to RDI and then increment RSI and RDI by 1, 2, 4, or 8.
When specifying the rep prefix before this instruction, the CPU increments RSI and RDI for each element in the string (the count in RCX specifies the number of elements). At completion, the RSI and RDI registers will be pointing at the first item beyond the strings.
You can change the direction flag’s value by using the cld (clear direction flag) and std (set direction flag) instructions.
The movs Instruction
The movsb (move string, bytes) instruction fetches the byte at address RSI, stores it at address RDI, and then increments or decrements the RSI and RDI registers by 1.
If the rep prefix is present, the CPU checks RCX to see whether it contains 0. If not, it moves the byte from RSI to RDI and decrements the RCX register. This process repeats until RCX becomes 0. If RCX contains 0 upon initial execution, the movsb instruction will not copy any data bytes.
movsb
movsw
movsd
movsq
rep movsb
rep movsw
rep movsd
rep movsq
If you’ve set the direction flag before executing a movsq, movsb, movsw, or movsd instruction, the CPU decrements the RSI and RDI registers after moving each string element. This means that the RSI and RDI registers must point at the last element of their respective strings before executing a movsb, movsw, movsd, or movsq instruction.
CharArray1 byte ?
CharArray2 byte 384 dup (?)
.
.
.
cld
lea rsi, CharArray1
lea rdi, CharArray2
mov rcx, lengthof(CharArray2);
rep movsb
When the CPU executes the movsb instruction, it does the following:
Copies the byte at RSI (CharArray1) to the byte pointed at by RDI (CharArray2).
Increments RSI and RDI, and decrements RCX by 1. Now the RSI register points at CharArray1 + 1 (which is the address of CharArray2), and the RDI register points at CharArray2 + 1.
Copies the byte pointed at by RSI to the byte pointed at by RDI. However, this is the byte originally copied from location CharArray1. So, the movsb instruction copies the value originally in location CharArray1 to both CharArray2 and CharArray2 + 1.
Again increments RSI and RDI, and decrements RCX.
Copies the byte from location CharArray1 + 2 (CharArray2 + 1) to location CharArray2 + 2. Once again, this is the value that originally appeared in location CharArray1.

If you really want to move one array into another when they overlap like this, you should move each element of the source string to the destination string, starting at the end of the two strings

On many computer systems, the movsq instruction provides about the fastest way to copy bulk data from one location to another. While there are, arguably, faster ways to copy data on certain CPUs, ultimately the memory bus performance is the limiting factor, and the CPUs are generally much faster than the memory bus. Therefore writing fancy code to improve memory-to-memory transfers is probably a waste of time.
On later processors, it may be more efficient to use movsb to copy the specified number of bytes.
The cmps Instruction
The cmps instruction compares two strings. The CPU compares the value referenced by RDI to the value pointed at by RSI.
RCX contains the number of elements in the source string when using the repe or repne prefix to compare entire strings.
cmpsb
cmpsw
cmpsd
cmpsq
repe cmpsb
repe cmpsw
repe cmpsd
repe cmpsq
repne cmpsb
repne cmpsw
repne cmpsd
repne cmpsq
After the execution of repne cmps, either the RCX register is 0 (in which case the two strings are totally different), or the RCX contains the number of elements compared in the two strings until a match is found.
While this form of the cmps instruction isn’t particularly useful for comparing strings, it is useful for locating the first pair of matching items in a couple of byte, word, or double-word arrays.
Comparing Character Strings
We compare corresponding elements until encountering a character that doesn’t match or the end of the shorter string. If a pair of corresponding characters does not match, compare the two strings based on that single character. If the two strings match up to the length of the shorter string, compare their length.
The two strings are equal if and only if their lengths are equal and each corresponding pair of characters in the two strings is identical. The length of a string affects the comparison only if the two strings are identical up to the length of the shorter string.
For (ASCII) character strings, use the cmpsb instruction in the following manner:
Clear the direction flag.
Load the RCX register with the length of the smaller string.
Point the RSI and RDI registers at the first characters in the two strings you want to compare.
Use the repe prefix with the cmpsb instruction to compare the strings on a byte-by-byte basis. Even if the strings contain an even number of characters, you cannot use the cmpsw or cmpsd instructions, because they do not compare strings in lexicographical order.
If the two strings are equal, compare their lengths.
cld
mov rsi, AdrsStr1
mov rdi, AdrsStr2
mov rcx, LengthSrc
cmp rcx, LengthDest
jbe srcIsShorter ; Put the length of the
; shorter string in RCX
mov rcx, LengthDest
srcIsShorter:
repe cmpsb
jnz notEq ; If equal to the length of the
; shorter string, cmp lengths
mov rcx, LengthSrc
cmp rcx, LengthDest
notEq:
Comparing Extended-Precision Integers
You can also use the cmps instruction to compare multi-word integer values
Unlike with character strings, we cannot compare integer strings by using lexicographical ordering. When comparing strings, we compare the characters from the least significant byte to the most significant byte
When comparing integers, we must compare the values from the most significant byte, word, or double word down to the least significant.
std
lea rsi, SourceInteger[3 * 8]
lea rdi, DestInteger[3 * 8]
mov rcx, 4
repe cmpsq
cld
This code compares the integers from their most significant qword down to the least significant qword. The cmpsq instruction finishes when the two values are unequal or upon decrementing RCX to 0.
The scas Instruction
The scas (scan string) instruction is used to search for a particular element within a string.
Unlike the movs and cmps instructions, scas requires only a destination string (pointed at by RDI). The source operand is the value in the AL (scasb), AX (scasw), EAX (scasd), or RAX (scasq) register. The scas instruction compares the value in the accumulator (AL, AX, EAX, or RAX) against the value pointed at by RDI and then increments (or decrements) RDI by 1, 2, 4, or 8.
scasb
scasw
scasd
scasq
repe scasb
repe scasw
repe scasd
repe scasq
repne scasb
repne scasw
repne scasd
repne scasq
repe scas scans through the string while the value in the accumulator is equal to the string operand, and repne scas scans through the string while the accumulator is not equal to the string operand.
The stos Instruction
The stos instruction stores the value in the accumulator at the location specified by RDI. After storing the value, the CPU increments or decrements RDI depending on the state of the direction flag.
stosb
stosw
stosd
stosq
rep stosb
rep stosw
rep stosd
rep stosq
Although the stos instruction has many uses, its primary use is to initialize arrays and strings to a constant value. . For example, if you have a 256-byte array that you want to clear out with 0s
cld
lea rdi, DestArray
mov rcx, 32 ; 32 quad words = 256 bytes
xor rax, rax ; Zero out RAX
rep stosq
The stosb instruction stores the value in the AL register into the specified memory location(s), stosw stores the AX register into the specified memory location(s), stosd stores EAX into the specified location(s), and stosq stores RAX into the specified location(s). With the rep prefix, this process repeats the number of times specified by the RCX register.
The lods Instruction
The lods instruction copies the byte, word, double word, or quad word pointed at by RSI into the AL, AX, EAX, or RAX register, after which it increments or decrements the RSI register by 1, 2, 4, or 8.
Use lods to fetch bytes (lodsb), words (lodsw), double words (lodsd), or quad words (lodsq) from memory for further processing.
lodsb
lodsw
lodsd
lodsq
rep lodsb
rep lodsw
rep lodsd
rep lodsq
The repeat prefixes appear here simply because they are allowed. They’re not very useful, but they are allowed.
Building Complex String Functions from lods and stos
mov rsi, StringAddress ; Load string address into RSI
mov rdi, rsi ; Also point RDI here
mov rcx, stringLength ; Presumably, this was precomputed
jrcxz skipUC ; Don't do anything if length is 0
rpt:
lodsb ; Get the next character in the string
cmp al, 'A'
jb notUpper
cmp al, 'Z'
ja notUpper
or al, 20h ; Convert to lowercase
notUpper:
stosb ; Store converted char into string
dec rcx
jnz rpt ; Zero flag is set when RCX is 0
skipUC:
SIMD String Instructions
The four SSE4.2 instructions that process text and string fragments are as follows:
PCMPESTRI - Packed compare explicit-length strings, return index
PCMPESTRM - Packed compare explicit-length strings, return mask
PCMPISTRI - Packed compare implicit-length strings, return index
PCMPISTRM - Packed compare implicit-length strings, return mask
Implicit-length strings use a sentinel (trailing) byte to mark the end of the string, specifically, a zero-terminating byte (or word, in the case of Unicode characters). Explicit-length strings are those for which you supply a string length.
pcmpXstrY xmmsrc1, xmmsrc2/memsrc2, imm8
vpcmpXstrY xmmsrc1, xmmsrc2/memsrc2, imm8
; where X is E or I, and Y is I or M. Both forms use 128-bit operands
The immediate Operand (imm8)
Bits 0 and 1 of the immediate operand specify the size and type of the string elements.
Bit 0 specifies word (Unicode) or byte (ASCII) operands. Bit 1 specifies whether the operands are signed or unsigned.

Bits 2 and 3 of the immediate operand specify how the instruction will compare the two strings.

Equal each (10b) is probably the easiest comparison to understand. The packed compare instruction will compare each corresponding character in the string (up to the length of the string) and set a Boolean flag for the result of the comparison of each byte or word in the string.

The equal any comparison compares each byte in the second source operand to see whether it is any of the characters found in the first source operand (XMMsrc2/memsrc2). For example, if XMMsrc1 contains the string abcdefABCDEF (and four 0 bytes), and XMMsrc2/memsrc2 contains 12AF89C0, the resulting comparison would yield 00101100b (1s in the character positions corresponding to the A, F, and C characters).
The equal ordered comparison searches for each occurrence of the string in XMMsrc1 that can be found in the XMMsrc2/memsrc2 operand. For example, if the XMMsrc2/memsrc2 operand contains the string never need shine and the XMMsrc1 operand has the string ne (padded with 0s), then the equal ordered comparison produces the vector 0100000001000001b. This is similar to the strstr() function in the C Standard Library.
The ranges comparison aggregate operation breaks the entries in the XMMsrc1 operand into pairs (at even and odd indexes in the register). The first element (byte or word) specifies a lower bound, and the second entry specifies an upper bound. The XMMsrc1 register supports up to eight byte ranges or four word ranges. This aggregate operation compares each character in the XMMsrc2/memsrc2 operand against each of these ranges and stores true in the resultant vector if the character is within one of the specified ranges (inclusive) and false if it is outside all of these ranges.
Bits 4 and 5 of the immediate operand specify the result polarity

Bit 6 of the immediate operand specifies the instruction result

The (v)pcmpestrm and (v)pcmpistrm instructions produce a bit-mask result and store it into the XMM0 register. If bit 6 of the imm8 operand contains a 0, these two instructions pack this bit mask into 8 or 16 bits and store them into the LO 8 (or 16) bits of XMM0, zero-extending that value through the upper bits of XMM0. If imm8 bit 6 contains a 1, these instructions will store the bit mask (all 1 bits per byte or word) throughout the XMM0 register.
The (v)pcmpestri and (v)pcmpistri instructions produce an index result and return this value in the ECX register. If bit 6 of the imm8 operand contains a 0, these two instructions return the index of the LO set bit in the result bit mask. If bit 6 of the imm8 operand is 1, these instructions return the index of the highest-order set bit in the resultant bit mask. If there are no set bits in the result bit mask, these instructions return 16 (for byte comparisons) or 8 (for word comparisons) in the ECX register.

All eight instructions—(v)pcmpestri, (v)pcmpistri, (v)pcmpestrm, and (v)pcmpistrm—clear the carry flag if all of the bits in the (internal) result bit map are 0 (no comparison); these instructions set the carry flag if there is at least 1 bit set in the bit map
The zero flag indicates whether the src2 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the zero flag is set if EDX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the zero flag is set if XMMsrc2/memsrc2 contains a null character.
The sign flag indicates whether the src1 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the sign flag is set if EAX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the zero flag is set if XMMsrc1 contains a null character.
The overflow flag contains the setting for bit 0 of the result bit map (that is, whether the first character of the source string was a match
Last updated