“The Life of a Data Byte”
Communications of the ACM, December 2020, Vol. 63 No. 12, Pages 38-45
By Jessie Frazelle
“What was required before returning a movie to Blockbuster? Rewinding the tape! The same could be said for the tape used for computers. Programs could not hop around a tape, or randomly access data—they had to read and write in sequential order.”
A byte of data has been stored in a number of different ways through the years as newer, better, and faster storage media have been introduced. A byte is a unit of digital information that most commonly refers to eight bits. A bit is a unit of information that can be expressed as 0 or 1, representing a logical state. Let’s take a brief walk down memory lane to learn about the origins of bits and bytes.
Going back in time to Babbage’s Analytical Engine, you can see that a bit was stored as the position of a mechanical gear or lever. In the case of paper cards, a bit was stored as the presence or absence of a hole in the card at a specific place. For magnetic storage devices, such as tapes and disks, a bit is represented by the polarity of a certain area of the magnetic film. In modern DRAM (dynamic random-access memory), a bit is often represented as two levels of electrical charge stored in a capacitor, a device that stores electrical energy in an electric field. (In the early 1960s, the paper cards used to input programs for IBM mainframes were known as Hollerith cards, named after their inventor Herman Hollerith of the Tabulating Machine Company, which through numerous mergers became what is now known as IBM.)
In June 1956, Werner Buchholz coined the word byte to refer to a group of bits used to encode a single character of text. Let’s address character encoding, starting with ASCII (American Standard Code for Information Interchange). ASCII was based on the English alphabet; therefore, every letter, digit, and symbol (a-z, A-Z, 0-9, +, -, /, ", !, among others) was represented as a seven-bit integer, with the printable characters occupying 32 through 126. This wasn’t very friendly to other languages. To support other languages, Unicode extended ASCII so that each character is assigned a code point, a unique number; for example, a lowercase j is U+006A, where U stands for Unicode and is followed by a hexadecimal number.
UTF-8 is a variable-length encoding built on eight-bit units, allowing every code point from 0 to 127 to be stored in a single byte. This is fine for English characters, but other languages often have characters that are expressed as two or more bytes (up to four in UTF-8). UTF-16 encodes each code point as one or two 16-bit units, and UTF-32 uses a fixed 32 bits per code point. In ASCII every character is a byte, but in Unicode, that’s often not true: a character can be one, two, three, or more bytes. Groups of characters might also be referred to as words, as in a Univac ad calling out “1 kiloword or 12,000 characters.” This article refers throughout to different-sized groupings of bits, because the number of bits in a byte has varied in the past according to the design of the storage medium.
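To make the byte counts concrete, here is a minimal Python sketch that measures how many bytes a few characters take under each encoding. The sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the original article, and the little-endian codec variants are used only so that no byte-order mark inflates the counts.

```python
# Compare how many bytes a character occupies under each Unicode encoding.
# The "-le" codecs are used so no byte-order mark is added to the output.

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Illustrative samples: an ASCII letter, an accented letter,
# a currency sign, and an emoji outside the 16-bit range.
samples = ["j", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The letter j fits in one byte of UTF-8, while the emoji needs four bytes in every encoding, which is the sense in which a Unicode character is not always a single byte.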
This article also travels in time through various storage media, diving into how data has been stored throughout history. By no means does this include every single storage medium ever manufactured, sold, or distributed. This article is meant to be fun and informative but not encyclopedic. It wraps up with a look at the current and future technologies for storage.
To get started, let’s assume we have a byte of data to be stored: the letter j, or as an encoded byte 6a, or in binary 01101010. As we travel through time, this data byte will come into play in some of the storage technologies covered here.
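The three views of this byte can be checked with a short Python sketch. The variable names are illustrative; the only assumption is that the letter is the lowercase ASCII j.

```python
# View the letter j as a code point, a hex byte, and eight bits.
letter = "j"

code_point = ord(letter)               # integer value of the character
hex_form = format(code_point, "02x")   # two hex digits
bin_form = format(code_point, "08b")   # eight binary digits, zero-padded

print(code_point, hex_form, bin_form)  # 106 6a 01101010

# Round trip: rebuild the character from its bit pattern.
restored = bytes([int(bin_form, 2)]).decode("ascii")
print(restored)                        # j
```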
The story begins in 1951 with the Uniservo tape drive for the Univac I computer, the first tape drive made for a commercial computer. The tape was a thin, half-inch-wide strip of nickel-plated phosphor bronze, called Vicalloy, that was 1,200 feet long and weighed three pounds. Our data byte could be stored at a rate of 7,200 characters per second on tape moving at 100 inches per second. At this point in history, you could measure the speed of a storage algorithm by the distance the tape traveled.
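As a back-of-the-envelope illustration of measuring storage by distance, the figures quoted above can be combined in a few lines of Python. This is a rough sketch that uses only the three numbers in the paragraph; it assumes the 7,200 characters-per-second rate holds for the whole reel and ignores start/stop behavior, so the results are estimates rather than documented Uniservo specifications.

```python
# Rough arithmetic from the figures quoted in the text.
CHARS_PER_SECOND = 7_200      # quoted transfer rate
TAPE_SPEED_IN_PER_S = 100     # quoted tape speed, inches per second
TAPE_LENGTH_FT = 1_200        # quoted reel length, feet

# Characters written per inch of tape travel at the quoted rate.
chars_per_inch = CHARS_PER_SECOND / TAPE_SPEED_IN_PER_S

# Length of the reel in inches and time for one uninterrupted pass.
tape_length_in = TAPE_LENGTH_FT * 12
full_pass_seconds = tape_length_in / TAPE_SPEED_IN_PER_S

# Estimated characters transferred in one full pass (an assumption-laden
# estimate, not a published capacity figure).
rough_chars_per_pass = chars_per_inch * tape_length_in

print(chars_per_inch)         # 72.0
print(full_pass_seconds)      # 144.0
print(rough_chars_per_pass)   # 1036800.0
```

Under these assumptions, reaching a byte stored at the far end of the reel means moving all 14,400 inches of tape past the head, which takes a little under two and a half minutes.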
About the Author:
Jessie Frazelle is the cofounder and chief product officer of the Oxide Computer Company. Before that, she worked on various parts of Linux, including containers as well as the Go programming language.