The LinuxDig.Com Linux Dictionary is currently in Beta. You can help, email Comments or Suggestions here.
Number of Terms : 8142 Number of Definitions : 9135
Unicode
1. The international character set. The United States character set, ASCII, needs only 7 bits to encode text. There are fewer than 100 characters in the English language (26 upper-case letters, 26 lower-case letters, 10 digits, and a handful of punctuation marks). Since 7 bits give 128 combinations, that is sufficient to cover these characters plus a few control codes. However, there are other alphabets, such as Russian, Greek, and Hebrew. Even worse, Far Eastern languages like Chinese, Japanese, and Korean use symbols/ideographs to represent words without a strict alphabet. The Unicode character set was originally built to represent all these characters within a 2-byte (16-bit) format; later versions of the standard extend beyond 16 bits. Roughly 30,000 characters from all the popular languages have been assigned in an internationally agreed-upon format.

Key point: Most computers are built to handle 1-byte characters, and do not like the idea of handling 2 bytes for each character. Therefore, a multi-byte encoding has been designed to store Unicode. It is called "UTF-8". It is widely supported by newer systems, such as Java. Using a "multibyte" rather than a "fixed" encoding means that a variable number of bytes can be used, depending upon how many bits are needed to represent the character.

The key issue here is that every 7-bit ASCII character can also be written in each of the longer forms (so-called "overlong" encodings). The UTF-8 specification now forbids these forms, but a lenient decoder may still accept them. For example, older Microsoft IIS web servers would check for backtracking ("../") attacks. However, an overlong UTF-8 encoding of the backtracks would bypass the IIS checks, but would still be decoded and passed to the filesystem.

Encoding                      Bits      Encoding of '.'
0xxxxxxx                      7 bits    2E
110xxxxx 10xxxxxx             11 bits   C0 AE
1110xxxx 10xxxxxx 10xxxxxx    16 bits   E0 80 AE

See also: encoding
From Hacking-Lexicon
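The three encodings of '.' in the table above can be checked with a short Python sketch. The function `naive_utf8_decode` is a hypothetical, deliberately lenient decoder written for this illustration (it is not the code of IIS or of any real product); it handles only the 1-, 2-, and 3-byte forms shown in the table and performs no overlong check. The helper `strict_accepts` is also an illustrative name; it simply wraps Python's built-in `bytes.decode`, which follows the current rules and rejects overlong forms.

```python
def naive_utf8_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: follows the bit patterns only,
    with no check that the shortest possible form was used."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: 7 payload bits
            cp, n = b, 1
        elif b >> 5 == 0b110:        # 110xxxxx 10xxxxxx: 11 payload bits
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:       # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
            cp, n = b & 0x0F, 3
        else:
            raise ValueError("unsupported lead byte")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for c in data[i + 1:i + n]:
            if c >> 6 != 0b10:       # continuation bytes must look like 10xxxxxx
                raise ValueError("bad continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)


def strict_accepts(data: bytes) -> bool:
    """True if Python's built-in (strict) UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The three rows of the table: shortest form, then two overlong forms.
    for raw in (b"\x2e", b"\xc0\xae", b"\xe0\x80\xae"):
        print(raw.hex(" "), repr(naive_utf8_decode(raw)), strict_accepts(raw))

    # Variable length in practice: a Cyrillic letter takes 2 bytes,
    # a Chinese ideograph takes 3.
    print(len("\u044f".encode("utf-8")), len("\u4e2d".encode("utf-8")))
```

The lenient decoder turns all three byte sequences into '.', while the strict decoder accepts only `2E`. A filter that compares raw bytes against "../" before a lenient decoder runs would therefore miss the overlong forms, which is the bypass described in the entry.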