A Proposal for a Universal Free-Market Data Exchange Format

ASCII (American Standard Code for Information Interchange) has long been the predominant code for sending text from one small computer to another in this country. Although it works fairly well for English text if you don't need fancy formatting and typefaces, it does not fill all needs perfectly. Thus over the last few years various people have put forth proposals for new codes to take its place. These are often in some format using words of two or more bytes and are seen as including characters from foreign alphabets as well as the present ASCII set. After seeing one such proposal, I began thinking. Here's what I've come up with so far.

The assumptions used by the proposal I saw included a multi-byte word of a fixed length. If all but the last byte are zero, that last byte is assumed to be ASCII (probably the IBM extended ASCII set). Values above hex 00 00 FF were other character sets, including Chinese, Korean, etc. Each set would be assigned a block of values somewhere between 00 00 FF and FF FF FF.

Wouldn't a world standard in which each "vanilla ASCII" byte is prefixed by several bytes of zeros be wasteful? A little, but not as much as appears at first glance. Data compression is becoming common as part of disk drive packages, and will likely be the norm by the time such an expanded code comes into wide use. Thus multibyte-character-standard files would take up essentially no more space on the disk than ASCII files do now: they'd just compress by a larger factor. Likewise, if whoever you're networking with has a modem with built-in compression compatible with yours, there would be little added transmission overhead. The only time you'd need to send all the unchanging header bits would be if you and the other party could not agree on any sort of compression scheme. Thus the full standard multi-byte format would be a sort of last-resort common denominator.

What about within your computer?
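(An aside before going on: the compression argument above is easy to check empirically. In this sketch, Python's standard zlib module stands in for the disk and modem compressors, and padding each ASCII byte with seven zero bytes is my own mock-up of a fixed-width 64-bit code; none of this is from the proposal itself.)

```python
import zlib

# Ordinary one-byte-per-character ASCII text.
ascii_text = b"The quick brown fox jumps over the lazy dog. " * 200

# The same text in a hypothetical fixed-width code: each ASCII byte
# carried in a 64-bit frame, padded with seven zero header bytes.
framed_text = b"".join(b"\x00" * 7 + bytes([c]) for c in ascii_text)

plain_compressed = zlib.compress(ascii_text, 9)
framed_compressed = zlib.compress(framed_text, 9)

print("raw ASCII: ", len(ascii_text), "-> compressed:", len(plain_compressed))
print("raw framed:", len(framed_text), "-> compressed:", len(framed_compressed))
# The framed file is eight times larger raw, yet it compresses to a
# small fraction of even the uncompressed ASCII size: the constant
# zero headers cost almost nothing once compression is in the loop.
```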
As things stand now, almost every text editor (WordStar, WordPerfect, etc.) has its own format for character manipulation and storage. These are generally incompatible with each other, but most text editors have means for importing and exporting files to other formats, including plain ASCII. There is some inconvenience, but the situation is generally livable. By extension, writers of operating systems could define compression algorithms to be considered standard within a given system. Thus you could feed data to the screen and printer in compressed form, confident that the device (or its driver) knows how to handle it.

The main assumption I'm making that wasn't put forth in the proposals I've seen is that whatever we do, some user will need something that the agency in charge of maintaining the standard couldn't foresee. Thus my proposal for a user-created expandable code set.

In the discussion that follows, the numbers 0-9 and letters A-F are fixed hex digits. An S indicates a digit that identifies a user-defined code set, while an X can be any four-bit pattern. The basic format is a frame of 64 bits, shown here as 16 hex digits. The value of the first digit determines the meanings of subsequent digits within the frame. The predefined frames would be as follows:

    0X XX XX XX XX XX XX XX
    1X XX XX XX XX XX XX XX
    2X XX XX XX XX XX XX XX

Reserved for incorporation of previous standards. For example, 00 XX XX XX XX XX XX XX might convey seven bytes of present-day ASCII, 01 XX XX XX XX XX XX XX seven bytes of EBCDIC, 12 ... multiple frames of Unicode, and so on. These would have to be worked out as part of the official standard.

    3S SS XX XX XX XX XX XX
    4S SS SX XX XX XX XX XX
    5S SS SS XX XX XX XX XX
    6S SS SS SX XX XX XX XX
    7S SS SS SS XX XX XX XX
    8S SS SS SS SX XX XX XX
    9S SS SS SS SS XX XX XX
    AS SS SS SS SS SX XX XX
    BS SS SS SS SS SS XX XX

The user-defined code sets.
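The row layout above can be turned into a small decoder. The sketch below is my own illustration, not part of the proposal: it takes one frame as 16 hex digits and reports which row it falls in, using the pattern visible in the table that a first digit of 3 through B is followed by that many S digits.

```python
def classify_frame(frame_hex):
    """Classify one 64-bit frame given as 16 hex digits.

    Rows 0-2: reserved for previous standards (ASCII, EBCDIC, ...).
    Rows 3-B: user-defined sets; the first digit's value is the
              number of S (set ID) digits that follow it.
    Rows C-E: reserved for the future.
    Row  F:   the anarchy row.
    """
    assert len(frame_hex) == 16
    first = int(frame_hex[0], 16)
    if first <= 0x2:
        return ("previous-standard", frame_hex[1:])
    if first <= 0xB:
        n_s = first                      # digit 3 -> 3 S digits, ..., B -> 11
        return ("user-defined", frame_hex[1:1 + n_s], frame_hex[1 + n_s:])
    if first <= 0xE:
        return ("reserved-future", frame_hex[1:])
    return ("anarchy", frame_hex[1:])

print(classify_frame("3ABC123456789012"))
# -> ('user-defined', 'ABC', '123456789012')
```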
Each user who wanted one would be assigned a code set ID (the S digits above) and would then be free to define the X digits at will. Given a set of S digits, the user who owned that set, not the administering agency, would be the ultimate authority as to what the X digits mean.

    CX XX XX XX XX XX XX XX
    DX XX XX XX XX XX XX XX
    EX XX XX XX XX XX XX XX

Reserved for the future. These are not to be defined until the rest of the system has been in use for five (C row), ten (D row), or twenty (E row) years.

    FX XX XX XX XX XX XX XX

The anarchy row, never to be standardized. If you see one of these in a file or transmission, ask whoever you got it from. Period.

As mentioned above, each code set (defined by a unique set of S digits) would belong to some user who would be the ultimate authority on what the X digits within that set mean. The administering agency would assign sets of S digits to users, sort of like government agencies assigning radio call signs or vehicle license numbers, but would not define the X digits bound to those sets.

For example, if a group of archaeologists wanted to define an ancient Sumerian character set, they would apply for a code set. Once they got their S digits they could then use the affiliated X digits to assign values to the ancient Sumerian characters they wanted to encode. Once they had the set defined to their satisfaction they could publicize it within the scholarly community for other archaeologists to use. The definition information they send out would include graphics images of the characters, suitable for downloading into a printer or screen driver. A user with this information properly loaded could then freely intermix ancient Sumerian characters with English text in a document and post it on a bulletin board. Readers who had that same code set loaded would see the document with the Sumerian characters correctly displayed.
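To make the Sumerian example concrete, here is a sketch of how such a frame might be assembled. The nine-digit set ID and the character number below are purely hypothetical; the only rule carried over from the layout above is that the first digit gives the count of S digits that follow, with the remaining X digits as payload.

```python
def make_frame(set_id_hex, payload_value):
    """Build one 16-hex-digit frame for a user-defined code set.

    The first digit is the number of S digits, which must be 3
    through 11 (hex 3-B); the payload fills the remaining X digits.
    """
    n_s = len(set_id_hex)
    assert 3 <= n_s <= 11
    n_x = 16 - 1 - n_s                   # X digits left after the first digit and the S digits
    assert 0 <= payload_value < 16 ** n_x
    return f"{n_s:X}{set_id_hex.upper()}{payload_value:0{n_x}X}"

# A hypothetical nine-digit set ID, encoding character number 42
# (say, one particular cuneiform sign) in the six remaining X digits:
print(make_frame("123456789", 42))
# -> 912345678900002A
```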
Readers who didn't have that set would get an error message and could then attempt to contact the owner of the set through the administering agency. Some owners of code sets may decide to deposit their definitions with the administering agency to allow the public to more easily use their sets. Others would have enquiries forwarded to them, perhaps treating their sets as shareware. Still others would keep theirs private. It would all be up to the owner of the set.

The method of assigning code sets would vary with the first digit. Since the number of possible sets in the 9, A, or B rows is much greater than the population of the world, I would expect anyone who wanted one to be able to get one for the price of a postcard or an E-mail message. On the other hand, there are only 4096 possible sets in the 3 row. Those I would expect to go to national governments and giant corporations willing to pay large sums for them. The rows in between would mostly go to intermediate users such as universities, printer manufacturers, smaller governments, computer clubs, and so on.

Note that a printer manufacturer could use a set to define not only printable characters, but also printer control, font changes, and formatting codes. The question "What can this printer emulate?" would become "Which code sets does it recognize?" And if the printer recognizes a number of different sets, you can access any and all of them by simply including the appropriate S digits in the data frame. This gives the equivalent of being able to change emulations on the fly. (And I would expect the printer of the future to have its own hard drive, or other means of storing large amounts of font and character data in non-volatile but updatable form. And there should be a standardized downloading language.)

Discussions of other data standards have brought up the issue of whether or not to allow escape codes to shift between different data formats. There are arguments pro and con.
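(The scarcity gradient behind that assignment scheme is easy to quantify: a row whose first digit is d carries d S digits, so it holds 16 to the power d distinct set IDs. The arithmetic below is mine, not the proposal's.)

```python
# Possible code set IDs per row: 16**d for a first digit of d.
for first_digit in (0x3, 0x9, 0xA, 0xB):
    n_sets = 16 ** first_digit
    print(f"row {first_digit:X}: {n_sets:,} possible code sets")
# Row 3 holds only 16**3 = 4096 sets, while row 9 alone holds
# 16**9 = 68,719,476,736, already far more than the world's population.
```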
My proposal would allow them within a code set. If a user with a long set of S digits (and hence few X digits to play with) wanted to escape from the code set format, I would recommend escaping to the F-prefix anarchy row for a predetermined number of frames. For example, a user wanting to include graphics images in a document might define a code word that translates to "The following N frames are a graphics image in XYZ format," followed by N frames of the form FF GG GG GG GG GG GG GG where GG is the graphics information. The receiver would simply discard every FF frame header byte thereafter for the specified number of frames, building the image from what remains.

Something similar could be done for other special forms of data, such as audio embedded in a document. A group of users all on the same hardware platform could even include within a document machine-language subroutines for doing something special to, with, or about the document. If the code word defining the escape specified the hardware, users could tell automatically whether or not they had the required configuration. (This could even be a way to distribute a program as part of its documentation rather than vice versa.)

Should "permanent" escape completely out of the format (no leading FF, no eight-byte frames) be allowed? I feel it should be discouraged, but since nobody can totally anticipate all future needs I think it needs to be allowed. Then some new data set that supersedes this one in some far future could include this one as a subset, with escape possible to the larger set.

-- Tom Digby
written April 1992
minor editing Feb 9, 1995