Computer Data Representation and Operations
Chapter 2
Last updated
A bit is the smallest unit of data in a binary computer. A bit represents one of two values: 1 (true) or 0 (false).
A nibble is a collection of 4 bits. With a nibble, we can represent up to 16 distinct values (0000 - 1111).
Nibbles are interesting because it takes exactly 4 bits to represent a single binary-coded decimal (BCD) digit or a single hexadecimal digit.
A byte is a group of 8 bits and is the smallest addressable unit on the x86-64: main memory and I/O addresses are all byte addresses. To access anything smaller, we must read the byte containing the data and eliminate the unwanted bits.
Bit 0 is the LO bit, or least significant bit, and bit 7 is the HO bit, or most significant bit of the byte. We’ll refer to all other bits by their number. Bits 0 to 3 compose the low-order nibble, and bits 4 to 7 form the high-order nibble. Because a byte contains exactly two nibbles, byte values require two hexadecimal digits.
Because a byte contains 8 bits, it can represent 2^8 (256) different values. Generally, we’ll use a byte to represent unsigned numeric values in the range 0 through 255 or signed numbers in the range –128 through +127.
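To make the two interpretations of a byte concrete, here is a minimal Python sketch. The helper names to_signed8 and nibbles are made up for this illustration; the sketch assumes two's complement for the signed reading, which is the form these notes describe for signed integers.

```python
def to_signed8(b):
    """Interpret an 8-bit pattern (0..255) as a two's complement value."""
    b &= 0xFF                      # keep only bits 0..7
    return b - 256 if b & 0x80 else b   # HO bit set means negative

def nibbles(b):
    """Return (high-order nibble, low-order nibble) of a byte."""
    return (b >> 4) & 0xF, b & 0xF

print(to_signed8(0xFF))   # -1
print(to_signed8(0x7F))   # 127
print(nibbles(0xA5))      # (10, 5)
```

The same bit pattern 0FFh reads as 255 unsigned and as –1 signed; only the interpretation changes, not the stored bits.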
Because the x86-64 is a byte-addressable machine, it’s more efficient to manipulate a whole byte than an individual bit or nibble. The byte data type is a partially untyped data type. The only type information associated with a byte object is its size (1 byte). Bytes are also the smallest variable you can create in a MASM program.
A word is a group of 16 bits. As with the byte, bit 0 is the low-order bit. For words, bit 15 is the high-order bit.
With 16 bits, you can represent 2^16 (65,536) values. These could be the values in the range 0 to 65,535 or, as is usually the case, the signed values –32,768 to +32,767, or any other data type with no more than 65,536 values.
The three major uses for words are short signed integer values, short unsigned integer values, and Unicode characters. Unsigned numeric values are represented by the binary value corresponding to the bits in the word. Signed numeric values use the two’s complement form for numeric values.
A double word is exactly what its name indicates: a pair of words. Therefore, a double-word quantity is 32 bits long.
Double words (dwords) can represent all kinds of things. A common item you will represent with a double word is a 32-bit integer value (which allows unsigned numbers in the range 0 to 4,294,967,295 or signed numbers in the range –2,147,483,648 to 2,147,483,647). 32-bit floating-point values also fit into a double word.
Quad-word (64-bit) values are also important because 64-bit integers, pointers, and certain floating-point data types require 64 bits. Likewise, the SSE/MMX instruction set of modern x86-64 processors can manipulate 64-bit values.
Octal-word (128-bit) values are important because the AVX/SSE instruction set can manipulate 128-bit values.
128-bit integer objects can't be directly manipulated using mov, add, and sub, because the x86-64 integer registers process only 64 bits at a time.
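One common way around this limit is to split a 128-bit value into two 64-bit halves and pass a carry between them. The Python sketch below models that idea; the function name add128 and the (low, high) argument order are assumptions for illustration. On x86-64 the usual idiom is an add on the low halves followed by an adc on the high halves, which these notes do not cover, so treat this as background.

```python
MASK64 = (1 << 64) - 1   # all 64 bits set

def add128(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values given as 64-bit (low, high) halves."""
    lo = a_lo + b_lo
    carry = lo >> 64                      # 1 if the low halves overflowed
    lo &= MASK64
    hi = (a_hi + b_hi + carry) & MASK64   # carry out of bit 127 is dropped
    return lo, hi

print(add128(MASK64, 0, 1, 0))   # (0, 1)
```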
There are four primary logical operations -
AND Operation
OR Operation
XOR Operation
NOT Operation
Below is an asm program showcasing these operations -
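As a quick cross-check outside assembly, here is a minimal Python sketch of the four operations. The operand values and the helper not8 are chosen for illustration; not8 masks its result to 8 bits because Python integers are unbounded, whereas the x86-64 not instruction works on a fixed-size operand.

```python
a = 0b1100
b = 0b1010

def not8(x):
    """Bitwise NOT restricted to an 8-bit operand."""
    return ~x & 0xFF

print(format(a & b, "04b"))     # 1000  (AND: 1 only where both bits are 1)
print(format(a | b, "04b"))     # 1110  (OR: 1 where either bit is 1)
print(format(a ^ b, "04b"))     # 0110  (XOR: 1 where the bits differ)
print(format(not8(a), "08b"))   # 11110011  (NOT: every bit inverted)
```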
Another set of logical operations that apply to bit strings is the shift and rotate operations. These two categories can be further broken down into left shifts, left rotates, right shifts, and right rotates.
The shift-left operation moves each bit in a bit string one position to the left. Bit 0 moves into bit position 1, the previous value in bit position 1 moves into bit position 2, and so on. We’ll shift a 0 into bit 0, and the previous value of the high-order bit will become the carry out of this operation.
The shl instruction has the form shl dest, count. The count operand is either the CL register or a constant in the range 0 to n, where n is one less than the number of bits in the destination operand. The dest operand can be either a memory location or a register.
The HO bit shifted out of the operand moves into the carry flag (C). Therefore, after a 1-bit shl you can test for unsigned overflow by checking the carry flag immediately after executing the instruction.
The shl instruction sets the zero flag based on the result (ZF=1 if the result is zero, ZF=0 otherwise). The shl instruction sets the sign flag if the HO bit of the result is 1. If the shift count is 1, then shl sets the overflow flag if the HO bit changes (that is, you shift a 0 into the HO bit when it was previously 1, or shift a 1 in when it was previously 0); the overflow flag is undefined for all other shift counts.
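The flag behavior described above can be modeled in a few lines of Python. This is a sketch, not a definitive emulation: the function name shl, the bits parameter, and the dictionary of flags are all assumptions for illustration, and the overflow flag is reported only for a count of 1 because it is undefined otherwise.

```python
def shl(value, count, bits=8):
    """Model shl on a `bits`-wide operand; returns (result, flags)."""
    mask = (1 << bits) - 1
    value &= mask
    if count == 0:
        return value, {}                 # a zero count leaves the flags alone
    wide = value << count
    result = wide & mask
    flags = {
        "CF": (wide >> bits) & 1,        # last bit shifted out of the HO position
        "ZF": int(result == 0),
        "SF": result >> (bits - 1),      # HO bit of the result
    }
    if count == 1:
        flags["OF"] = flags["CF"] ^ flags["SF"]   # 1 if the HO bit changed
    return result, flags

result, flags = shl(0x81, 1)
print(hex(result), flags["CF"], flags["OF"])   # 0x2 1 1
```

Shifting 81h left by one loses the HO bit into the carry flag, so the 8-bit result (02h) is no longer twice the original value.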
A shift-right operation works the same way, except we’re moving the data in the opposite direction, that is, to the right. The x86-64 provides a shr instruction that shifts the bits of a destination operand to the right.
The shr instruction sets the zero flag based on the result (ZF=1 if the result is zero, ZF=0 otherwise). The shr instruction clears the sign flag (because the HO bit of the result is always 0). If the shift count is 1, shr sets the overflow flag if the HO bit changes (because shr always shifts a 0 into the HO bit, this happens exactly when the HO bit was previously 1); the overflow flag is undefined for all other shift counts.
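A matching Python sketch for the logical shift right follows. As before, the function name shr and the flags dictionary are illustrative assumptions; the carry flag is modeled as the last bit shifted out of bit 0, which is how the instruction is documented to behave.

```python
def shr(value, count, bits=8):
    """Model shr on a `bits`-wide operand; returns (result, flags)."""
    mask = (1 << bits) - 1
    value &= mask
    if count == 0:
        return value, {}                      # a zero count leaves the flags alone
    result = value >> count
    flags = {
        "CF": (value >> (count - 1)) & 1,     # last bit shifted out of bit 0
        "ZF": int(result == 0),
        "SF": 0,                              # a 0 is always shifted into the HO bit
    }
    if count == 1:
        flags["OF"] = value >> (bits - 1)     # original HO bit: set if it changed
    return result, flags

result, flags = shr(0xFE, 1)
print(result, flags["OF"])   # 127 1
```

Read as unsigned, 254 shifted right once gives 127, which is the expected division by 2.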
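A shift left by one position is equivalent to a multiplication by 2, and a shift right is roughly comparable to a division by 2. However, a shift right is equivalent to only an unsigned division by 2: shifting the two’s complement representation of –2 (0FEh) one position to the right yields 127 (7Fh), not –1. To use the shift right as a signed division operator, we must define a third shift operation: arithmetic shift right.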
Arithmetic shift right works just like the normal shift-right operation (a logical shift right) except, instead of shifting a 0 into the high-order bit, an arithmetic shift-right operation copies the HO bit back into itself; that is, during the shift operation, it does not modify the HO bit.
This operation always rounds the numbers to the closest integer that is less than or equal to the actual result.
The sar instruction sets the zero flag based on the result (ZF=1 if the result is zero, and ZF=0 otherwise). The sar instruction sets the sign flag to the HO bit of the result. The overflow flag should always be clear after a sar instruction, as signed overflow is impossible with this operation.
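The rounding behavior is easiest to see on negative values. In the Python sketch below, sar and to_signed are hypothetical helper names; the sketch relies on the fact that Python's >> operator on a negative integer is itself an arithmetic shift.

```python
def to_signed(value, bits=8):
    """Read a `bits`-wide pattern as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def sar(value, count, bits=8):
    """Model sar: shift right while replicating the HO (sign) bit."""
    mask = (1 << bits) - 1
    return (to_signed(value, bits) >> count) & mask

print(to_signed(sar(0xFE, 1)))   # -1   (-2 / 2)
print(to_signed(sar(0xF9, 1)))   # -4   (-7 / 2 = -3.5, rounded down)
print(to_signed(sar(0xFF, 1)))   # -1   (-1 / 2 = -0.5, rounded down)
print(int(-1 / 2))               # 0    (truncating division rounds toward zero)
```

The last two lines show the difference between sar and a truncating division: sar rounds –0.5 down to –1, while truncation gives 0.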
The rotate-left and rotate-right operations behave like the shift-left and shift-right operations, except the bit shifted out from one end is shifted back in at the other end.
The x86-64 provides rol (rotate left) and ror (rotate right) instructions that do these basic operations on their operands.
Unlike the shift instructions, the rotate instructions do not affect the settings of the sign or zero flags. The OF flag is defined only for the 1-bit rotates; it is undefined in all other cases. For left rotates, the OF flag is set to the exclusive-or of the original HO 2 bits. For right rotates, the OF flag is set to the exclusive-or of the HO 2 bits after the rotate.
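Here is a small Python sketch of the two rotates and of the overflow-flag rule for 1-bit rotates. The names rol, ror, and ho2_xor are made up for this illustration, and the count is reduced modulo the operand size as a simplifying assumption.

```python
def rol(value, count, bits=8):
    """Rotate left: bits leaving the HO end re-enter at bit 0."""
    mask = (1 << bits) - 1
    value &= mask
    count %= bits
    return ((value << count) | (value >> (bits - count))) & mask

def ror(value, count, bits=8):
    """Rotate right: bits leaving bit 0 re-enter at the HO end."""
    mask = (1 << bits) - 1
    value &= mask
    count %= bits
    return ((value >> count) | (value << (bits - count))) & mask

def ho2_xor(value, bits=8):
    """Exclusive-or of the two high-order bits of `value`."""
    return ((value >> (bits - 1)) ^ (value >> (bits - 2))) & 1

print(hex(rol(0x81, 1)))          # 0x3
print(hex(ror(0x81, 1)))          # 0xc0
print(ho2_xor(0x81))              # 1  (OF after a 1-bit rol of 81h)
print(ho2_xor(ror(0x81, 1)))      # 0  (OF after a 1-bit ror of 81h)
```

For a left rotate the rule is applied to the original operand; for a right rotate it is applied to the result.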
It is often more convenient for the rotate operation to shift the output bit through the carry and to shift the previous carry value back into the input bit of the shift operation. The x86-64 rcl (rotate through carry left) and rcr (rotate through carry right) instructions achieve this for you.
These instructions take the forms rcl dest, count and rcr dest, count. The count operand is either a constant or the CL register, and the dest operand is a memory location or register. The count operand must be a value that is less than the number of bits in the dest operand.
Unlike the shift instructions, the rotate-through-carry instructions do not affect the settings of the sign or zero flags. The OF flag is defined only for the 1-bit rotates. For left rotates, the OF flag is set to the exclusive OR of the original HO 2 bits (that is, it is set if the HO bit changes). For right rotates, the OF flag is set to the exclusive OR of the resultant HO 2 bits.
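One way to picture rcl and rcr is as an ordinary rotate on a value that is one bit wider than the operand, with the carry flag sitting just above the HO bit. The Python sketch below takes that view; the function names and the (result, carry) return shape are assumptions for illustration.

```python
def rcl(value, carry, count, bits=8):
    """Rotate left through carry; returns (result, new carry)."""
    width = bits + 1                          # operand plus the carry bit
    low_mask = (1 << bits) - 1
    wide = ((carry & 1) << bits) | (value & low_mask)
    count %= width
    wide = ((wide << count) | (wide >> (width - count))) & ((1 << width) - 1)
    return wide & low_mask, wide >> bits

def rcr(value, carry, count, bits=8):
    """Rotate right through carry; returns (result, new carry)."""
    width = bits + 1
    low_mask = (1 << bits) - 1
    wide = ((carry & 1) << bits) | (value & low_mask)
    count %= width
    wide = ((wide >> count) | (wide << (width - count))) & ((1 << width) - 1)
    return wide & low_mask, wide >> bits

print(rcl(0x80, 0, 1))   # (0, 1): the HO bit moves into the carry
print(rcl(0x00, 1, 1))   # (1, 0): the old carry moves into bit 0
print(rcr(0x01, 0, 1))   # (0, 1): bit 0 moves into the carry
print(rcr(0x00, 1, 1))   # (128, 0): the old carry moves into the HO bit
```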
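A good example of packed data is the FLAGS register. This register packs nine important Boolean objects (along with seven important system flags) into a single 16-bit register.
The lahf and sahf instructions provide a convenient way to access the LO 8 bits of the FLAGS register as an 8-bit byte: lahf loads the AH register with the LO 8 bits of FLAGS, and sahf stores AH back into the LO 8 bits of FLAGS.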
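To illustrate what "packed" means here, the Python sketch below unpacks and repacks the flag byte. The bit positions (CF in bit 0, PF in bit 2, AF in bit 4, ZF in bit 6, SF in bit 7, with bit 1 always reading as 1) follow the documented x86 FLAGS layout rather than anything stated in these notes, and the function names are hypothetical.

```python
# Bit position of each condition flag within the LO byte of FLAGS.
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7}

def unpack_flags(ah):
    """Split the byte that lahf would place in AH into named flags."""
    return {name: (ah >> bit) & 1 for name, bit in FLAG_BITS.items()}

def pack_flags(flags):
    """Build the byte that sahf would consume from named flags."""
    ah = 0b00000010                       # bit 1 always reads as 1
    for name, bit in FLAG_BITS.items():
        ah |= (flags.get(name, 0) & 1) << bit
    return ah

print(unpack_flags(0b11000011))   # {'CF': 1, 'PF': 0, 'AF': 0, 'ZF': 1, 'SF': 1}
print(hex(pack_flags({"ZF": 1, "CF": 1})))   # 0x43
```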
The single-precision format uses a one’s complement 24-bit mantissa, an 8-bit excess-127 exponent, and a single sign bit.
The mantissa usually represents a value from 1.0 to just under 2.0. The HO bit of the mantissa is always assumed to be 1 and represents a value just to the left of the binary point. The remaining 23 mantissa bits appear to the right of the binary point.
Because the HO bit of the mantissa is always 1, the single-precision format doesn’t actually store this bit within the 32 bits of the floating-point number. This is known as an implied bit.
Because we are working with binary numbers, each position to the right of the binary point represents a value (0 or 1) times a successive negative power of 2.
Although there is an infinite number of values between 1 and 2, we can represent only about 8 million of them (2^23 = 8,388,608) because we use a 23-bit mantissa (with the implied 24th bit always 1). This is the reason for inaccuracy in floating-point arithmetic: we are limited to a fixed number of bits in computations involving single-precision floating-point values.
The mantissa uses a one’s complement format (more precisely, a sign-magnitude format) rather than two’s complement to represent signed values: the 24-bit value of the mantissa is simply an unsigned binary number, and the sign bit determines whether that value is positive or negative.
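The effect of a fixed-size mantissa can be observed directly. The Python sketch below rounds a value to single precision by packing it with the standard struct module; the helper name to_float32 is an assumption for illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(2 ** 23)                             # 8388608 steps between 1.0 and 2.0
print(to_float32(0.1) == 0.1)              # False: 0.1 is rounded differently
print(to_float32(1.0 + 2 ** -23) > 1.0)    # True: one step above 1.0 fits
print(to_float32(1.0 + 2 ** -25) == 1.0)   # True: too small a step, rounds to 1.0
```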
To represent values outside the range 1.0 to just under 2.0, the exponent portion of the floating-point format comes into play. The floating point format raises 2 to the power specified by the exponent and then multiplies the mantissa by this value. The exponent is 8 bits and is stored in an excess-127 format.
In excess-127 format, the exponent 0 is represented by the value 127 (7Fh), negative exponents are values in the range 0 to 126, and positive exponents are values in the range 128 to 255. To convert an exponent to excess-127 format, add 127 to the exponent value. The use of excess-127 format makes it easier to compare floating-point values.
With a 24-bit mantissa, you will get approximately six and a half (decimal) digits of precision. With an 8-bit excess-127 exponent, the dynamic range of single-precision floating-point numbers is approximately 2^±127, or about 10^±38.
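The three fields can be pulled out of a real single-precision value with a few shifts and masks. This Python sketch uses the standard struct module to obtain the 32-bit pattern; the function names decode_float32 and rebuild are hypothetical, and rebuild handles only normalized values (biased exponent 1 to 254).

```python
import struct

def decode_float32(x):
    """Return (sign, biased exponent, 23 stored mantissa bits) of x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF        # excess-127 exponent
    fraction = bits & 0x7FFFFF          # mantissa bits right of the binary point
    return sign, biased, fraction

def rebuild(sign, biased, fraction):
    """Recompute the value of a normalized number from its fields."""
    mantissa = 1 + fraction / (1 << 23)     # restore the implied HO bit
    return (-1) ** sign * mantissa * 2.0 ** (biased - 127)

print(decode_float32(1.0))                  # (0, 127, 0)
print(decode_float32(-6.5))                 # (1, 129, 5242880)
print(rebuild(*decode_float32(-6.5)))       # -6.5
```

Here –6.5 is stored as –1.625 × 2^2: the exponent 2 becomes 129 after adding 127, and the fraction 0.625 becomes 0.625 × 2^23 = 5,242,880.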
The double-precision format helps overcome the problems of single-precision floating-point.
Using twice the space, the double-precision format has an 11-bit excess-1023 exponent and a 53-bit mantissa (with an implied HO bit of 1) plus a sign bit. This provides a dynamic range of about 10^±308 and 14.5 digits of precision, sufficient for most applications.
Extended-precision format uses 80 bits. Twelve of the additional 16 bits are appended to the mantissa, and 4 of the additional bits are appended to the end of the exponent.
Unlike the single- and double-precision values, the extended-precision format’s mantissa does not have an implied HO bit. Therefore, the extended-precision format provides a 64-bit mantissa, a 15-bit excess-16383 exponent, and a 1-bit sign.
On the x86-64 FPU, all computations are done using the extended precision format. Whenever you load a single- or double-precision value, the FPU automatically converts it to an extended-precision value. Likewise, when you store a single- or double-precision value to memory, the FPU automatically rounds the value down to the appropriate size before storing it.
A normalized floating-point value is one whose HO mantissa bit contains 1. Almost any non-normalized value can be normalized: shift the mantissa bits to the left and decrement the exponent until a 1 appears in the HO bit of the mantissa.
Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision for a computation. If the HO n bits of the mantissa are all 0, the mantissa has that many fewer bits of precision available for computation. Therefore, a floating-point computation will be more accurate if it involves only normalized values.
In two important cases, a floating-point number cannot be normalized. Zero is one of these special cases. In the second case, we have some HO bits in the mantissa that are 0, but the biased exponent is also 0. Rather than disallow certain small values, whose HO mantissa bits and biased exponent are 0, the IEEE standard allows special denormalized values to represent these smaller values.
The IEEE floating-point standard recognizes three special non-numeric values: –infinity, +infinity, and a special not-a-number (NaN).
If the exponent is all 1 bits and the mantissa is all 0 bits, then the value is infinity. The sign bit will be 0 for +infinity, and 1 for –infinity.
If the exponent is all 1 bits and the mantissa is not all 0 bits, then the value is an invalid number (a NaN). NaNs represent illegal operations, such as trying to take the square root of a negative number.
Unordered comparisons occur whenever either operand (or both) is a NaN. As NaNs have an indeterminate value, they cannot be compared. Any attempt to perform an unordered comparison typically results in an exception or some sort of error. Ordered comparisons, on the other hand, involve two operands, neither of which is a NaN.
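The rules for zero, denormalized values, infinities, and NaNs all come down to inspecting the exponent and mantissa fields. The Python sketch below applies them to a raw 32-bit single-precision pattern; the function name classify32 and the returned labels are made up for this illustration.

```python
def classify32(bits):
    """Classify a 32-bit single-precision bit pattern."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF      # biased exponent field
    fraction = bits & 0x7FFFFF          # stored mantissa bits
    if exponent == 0xFF:                # exponent is all 1 bits
        if fraction == 0:
            return "-infinity" if sign else "+infinity"
        return "NaN"
    if exponent == 0:                   # biased exponent is 0
        return "zero" if fraction == 0 else "denormalized"
    return "normalized"

print(classify32(0x7F800000))   # +infinity
print(classify32(0xFF800000))   # -infinity
print(classify32(0x7FC00000))   # NaN
print(classify32(0x00000001))   # denormalized
print(classify32(0x3F800000))   # normalized
```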
MASM provides several data types to support the use of floating-point data in your assembly language programs. MASM floating-point constants allow the following syntax:
An optional + or - symbol, denoting the sign of the mantissa (if this is not present, MASM assumes that the mantissa is positive)
Followed by one or more decimal digits
Followed by a decimal point and zero or more decimal digits
Optionally followed by an e or E, optionally followed by a sign (+ or -) and one or more decimal digits.
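The syntax rules above can be written as a regular expression, which makes it easy to try out candidate constants. This Python sketch follows the list literally (so the decimal point is treated as required); it is an approximation for illustration, not MASM's actual parser, and the names MASM_REAL and is_real_literal are hypothetical.

```python
import re

# sign? digits '.' digits* ( e|E sign? digits )?
MASM_REAL = re.compile(r"[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?")

def is_real_literal(text):
    """True if `text` matches the floating-point syntax listed above."""
    return MASM_REAL.fullmatch(text) is not None

print(is_real_literal("1.234"))      # True
print(is_real_literal("-1.0e+5"))    # True
print(is_real_literal("3."))         # True
print(is_real_literal(".5"))         # False: a digit must precede the point
print(is_real_literal("1e5"))        # False: no decimal point
```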
You use real4 to declare single-precision real values, real8 to declare double-precision floating-point values, and real10 to declare extended-precision floating-point values.
The American Standard Code for Information Interchange (ASCII) character set maps 128 textual characters to the unsigned integer values 0 to 127 (0 to 7Fh).
The ASCII character set is divided into four groups of 32 characters. The first 32 characters, ASCII codes 0 to 1Fh (31), form a special set of nonprinting characters, the control characters. We call them control characters because they perform various printer/display control operations rather than display symbols. Examples include carriage return, which positions the cursor to the left side of the current line of characters; line feed, which moves the cursor down one line on the output device; and backspace, which moves the cursor back one position to the left.
The second group of 32 ASCII character codes contains various punctuation symbols, special characters, and the numeric digits. The notable characters in this group include the space character (ASCII code 20h) and the numeric digits (ASCII codes 30h to 39h).
The third group of 32 ASCII characters contains the uppercase alphabetic characters. The ASCII codes for the characters A to Z lie in the range 41h to 5Ah (65 to 90). Because there are only 26 alphabetic characters, the remaining 6 codes hold various special symbols.
The fourth, and final, group of 32 ASCII character codes represents the lowercase alphabetic symbols, 5 additional special symbols, and another control character (delete). The lowercase character symbols use the ASCII codes 61h to 7Ah.
For the numeric digit characters (ASCII codes 30h to 39h), the LO nibble of the ASCII code is the binary equivalent of the represented number. By stripping away (that is, setting to 0) the HO nibble of a numeric character, you can convert that character code to the corresponding binary representation.
Conversely, you can convert a binary value in the range 0 to 9 to its ASCII character representation by simply setting the HO nibble to 3. You can use the logical AND operation to force the HO bits to 0; likewise, you can use the logical OR operation to force the HO bits to 0011b.
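Both conversions are a single masking operation. Here is a minimal Python sketch; the function names are hypothetical, and the code assumes its input is already a digit character or a value in the range 0 to 9.

```python
def digit_to_value(ch):
    """'0'..'9' -> 0..9 by clearing the HO nibble with AND."""
    return ord(ch) & 0x0F

def value_to_digit(n):
    """0..9 -> '0'..'9' by forcing the HO nibble to 0011b with OR."""
    return chr((n & 0x0F) | 0x30)

print(digit_to_value("7"))   # 7
print(value_to_digit(9))     # 9
```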
MASM provides support for character variables and literals in your assembly language programs. Character literal constants in MASM take one of two forms: a single character surrounded by apostrophes or a single character surrounded by quotes, as follows:
Or use the byte data type -
The problem with ASCII is that it supports only 128 character codes. This is way too small for modern multinational/multilingual applications. Unicode defines 1,112,064 possible characters, encoded using a variable-length character format.
A Unicode code point is an integer value that Unicode associates with a particular character symbol. The convention for Unicode code points is to specify the value in hexadecimal with a preceding U+ prefix; for example, U+0041 is the Unicode code point for the A character.
The Unicode standard defines code points in the range U+000000 to U+10FFFF (10FFFFh is 1,114,111, so the range holds 1,114,112 code points; 1,112,064 of them are available for characters, and the remaining 2,048 code points are reserved for use as surrogates, which are Unicode extensions).
The Unicode standard breaks this range up into 17 multilingual planes, each supporting up to 65,536 code points. The HO two hexadecimal digits of the six-digit code point value specify the multilingual plane, and the remaining four digits specify the character within the plane.
The first multilingual plane, U+000000 to U+00FFFF, roughly corresponds to the original 16-bit Unicode definition; the Unicode standard calls this the Basic Multilingual Plane (BMP). Planes 1 (U+010000 to U+01FFFF), 2 (U+020000 to U+02FFFF), and 14 (U+0E0000 to U+0EFFFF) are supplementary (extension) planes. Unicode reserves planes 3 to 13 for future expansion, and planes 15 and 16 for user-defined character sets.
Obviously, representing Unicode code points outside the BMP requires more than 2 bytes. To reduce memory usage, Unicode uses 2 bytes for the Unicode code points in the BMP, and uses 4 bytes to represent code points outside the BMP. Within the BMP, Unicode reserves the surrogate code points (U+D800–U+DFFF) to specify the 16 planes after the BMP.
UTF-16 uses 16-bit (unsigned) integers to represent Unicode values. To handle scalar values greater than 0FFFFh, UTF-16 uses the surrogate pair scheme to represent values in the range 010000h to 10FFFFh.
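The surrogate pair scheme splits the 20 bits left after subtracting 10000h into two 10-bit halves, one carried in a high surrogate (D800h to DBFFh) and one in a low surrogate (DC00h to DFFFh). The Python sketch below shows the arithmetic; the function names are hypothetical, and these notes do not spell out the bit layout, so it is taken here from the UTF-16 definition.

```python
def to_surrogates(code_point):
    """Encode a code point above U+FFFF as a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = code_point - 0x10000              # 20-bit value
    high = 0xD800 | (v >> 10)             # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)            # lower 10 bits
    return high, low

def from_surrogates(high, low):
    """Decode a surrogate pair back to the original code point."""
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))                # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))    # 0x1f600
```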
UTF-8 encoding is upward compatible with the ASCII character set. In particular, all ASCII characters have a single-byte representation (their original ASCII code, where the HO bit of the byte containing the character contains a 0 bit). If the UTF-8 HO bit is 1, UTF-8 requires additional bytes (1 to 3 additional bytes) to represent the Unicode code point.
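The number of bytes in a UTF-8 sequence can be read from the HO bits of its first byte. The Python sketch below shows that check; the function name utf8_length is hypothetical, and the lead-byte patterns (110xxxxx, 1110xxxx, 11110xxx) come from the UTF-8 definition rather than from these notes.

```python
def utf8_length(first_byte):
    """Return the total length of a UTF-8 sequence from its first byte."""
    if (first_byte & 0x80) == 0x00:
        return 1                          # 0xxxxxxx: plain ASCII
    if (first_byte & 0xE0) == 0xC0:
        return 2                          # 110xxxxx: 1 additional byte
    if (first_byte & 0xF0) == 0xE0:
        return 3                          # 1110xxxx: 2 additional bytes
    if (first_byte & 0xF8) == 0xF0:
        return 4                          # 11110xxx: 3 additional bytes
    raise ValueError("not a valid UTF-8 lead byte")

# U+0041, U+00E9, U+20AC, and U+1F600 need 1, 2, 3, and 4 bytes.
for ch in "A\u00e9\u20ac\U0001F600":
    encoded = ch.encode("utf-8")
    print(hex(ord(ch)), utf8_length(encoded[0]), len(encoded))
```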
Unfortunately, MASM provides almost zero support for Unicode text in a source file. Fortunately, MASM’s macro facilities provide a way for you to create your own Unicode support for strings in MASM.