Sunday, August 23, 2009

Web Technologies - Introductions - III


Browser History
1991–93 – The World Wide Web is born
1993 – NCSA Mosaic is released
1994 – Netscape releases version 0.9 of its browser
1996 – Microsoft releases Internet Explorer 3.0
1998 – Netscape releases its code under an open source license
2000 – Internet Explorer 5 for Mac OS is released
2005 – Mozilla releases a new version of its Firefox browser
Rendering Engine –
- Also known as a layout engine, the rendering engine is the code that determines how the browser displays a document's content and applies the available style information

- The first separate and reusable rendering engine was Gecko



Popular Server Software
1. Apache – the open source Apache HTTP Server, the most widely deployed web server software
2. Internet Information Server (IIS) – Microsoft's web server, bundled with Windows Server operating systems

6.1. Character Sets and Encoding
The first challenge in internationalization is dealing with the staggering number of unique character shapes (called glyphs) that occur in the writing systems of the world. This includes not only alphabets, but also all ideographs (characters that indicate a whole word or concept) for such languages as Chinese, Japanese, and Korean. There are also invisible characters that indicate particular functionality within a word or a line of text, such as characters that indicate that adjacent characters should be joined.
To understand character encoding as it relates to HTML, XHTML, and XML, you must be familiar with some basic terms and concepts.

Character set
A character set is any collection or repertoire of characters that are used together for a particular function. Many character sets have been standardized, such as the familiar ASCII character set that includes 128 characters mostly from the Roman alphabet used in modern English.

Coded character set
When a specific number is assigned to each character in a set, it becomes a coded character set. Each position (or numbered unit) in a coded character set is called a code point (or code position). In Unicode (discussed in more detail later), the code point of the greater-than symbol (>) is 3E in hexadecimal or 62 in decimal. Unicode code points are typically denoted as U+hhhh, where hhhh is a sequence of at least four and sometimes six hexadecimal digits.
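If you have a scripting language handy, code points and the U+hhhh notation are easy to inspect. The following short Python sketch is purely illustrative and is not part of the original text:

# Inspect the Unicode code point of the greater-than symbol (>).
ch = ">"
print(ord(ch))                      # 62   (decimal code point)
print(hex(ord(ch)))                 # 0x3e (hexadecimal)
print("U+{:04X}".format(ord(ch)))   # U+003E, the conventional notation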

Character encoding
Character encoding refers to the way characters and their code points are converted to bytes for use by computers. The character encoding transforms the character stream in a document to a byte stream that is interpreted by user agents and reassembled again as a character stream for the user.
The number of characters available in a character set is limited by the bit depth of its encoding. For example, 8 bits are capable of describing 256 (2^8) unique characters, 16 bits can describe 65,536 (2^16) different characters, and so on.
Many character sets and their encodings have been standardized for worldwide interoperability. The most relevant character set to the Web is the comprehensive Unicode (ISO/IEC 10646-1), which includes more than 50,000 characters from all active modern languages. Unicode is discussed in appropriate detail in the next section.
Web documents may also be encoded with more specialized encodings appropriate to their authoring languages. Some common encodings are listed in Table 6-1. Note that every character in these encodings is also available in Unicode.
Table 6-1. Common character encodings
Encoding – Description
ISO 8859-1 (a.k.a. Latin-1) – Latin characters used in most Western languages (includes ASCII)
ISO 8859-5 – Cyrillic
ISO 8859-6 – Arabic
ISO 8859-7 – Greek
ISO 8859-8 – Hebrew
ISO-2022-JP – Japanese
Shift_JIS – Japanese
EUC-JP – Japanese

HTML 2.0 and 3.0 were based on the 8-bit Latin-1 (ISO 8859-1) character set. Even as HTML 2.0 was being penned, the W3C was aware that 256 characters were not adequate to exchange information on a global scale, and it had its sights set on a super-character set called Unicode. Unfortunately, Unicode wasn't ready for inclusion in an HTML Recommendation until Version 4.0 (1997). Without further ado, it's time to talk Unicode.
6.1.1. Unicode (ISO/IEC 10646-1)
SGML-based markup languages are required to define a document character set that serves as the basis for interpreting characters. The document character set for HTML (4 and 4.01), XHTML, and XML is the Universal Character Set (UCS) , which is a superset of all widely used standard character sets in the world.
The UCS is defined by both the Unicode and ISO/IEC 10646 standards. The code points in Unicode and ISO/IEC 10646 are identical, and the standards are developed in parallel. The difference is that Unicode adds some rules about how characters should be used and serves as a reference for such issues as the bidirectional text algorithm for handling reading direction within text. The Unicode Standard is maintained by the Unicode Consortium (www.unicode.org).

In common practice, and throughout this book, the Universal Character Set is referred to simply as "Unicode."


Because Unicode is the document character set for all (X)HTML documents, numeric character references in web documents will always be interpreted according to Unicode code points, regardless of the document's declared encoding.
6.1.1.1. Unicode code points
Unicode was originally intended to be a 16-bit encoded character set, but it was soon recognized that 65,536 code positions would not be enough, so it was extended to include more than a million available code points (not all of them are assigned, of course) on supplementary planes.
The first 65,536 positions in Unicode (those that can be expressed in 16 bits) are referred to as the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use, such as the Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, among others, as well as mathematical and other miscellaneous characters. Most ideographs are there, too, but due to their large numbers, many have been moved to a Supplementary Ideographic Plane.
Unicode was created with backward compatibility in mind. The first 256 code points in the BMP are identical to the Latin-1 character set, with the first 128 matching the established ASCII standard.
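That backward compatibility is easy to observe. Here is a small, purely illustrative Python sketch showing that ASCII values and Latin-1 byte values line up with Unicode code points:

# The first 128 Unicode code points match ASCII, and the first 256 match Latin-1.
print(ord("A"))                 # 65, the same value as in ASCII
print("é".encode("latin-1"))    # b'\xe9'
print(hex(ord("é")))            # 0xe9 -- the Latin-1 byte value equals the Unicode code point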
6.1.1.2. Unicode encodings
Many character sets have only one encoding method, such as the ISO 8859 series. Unicode, however, may be encoded a number of ways. So although the code points never change, they may be represented by one to four bytes, depending on the character and the encoding form. The encoding forms for Unicode are:

UTF-8
This is an expanding format that uses 1 byte for characters in the ASCII set, 2 bytes for additional character ranges, and 3 bytes for the rest of the BMP. Supplementary planes use 4 bytes. UTF-8 is the recommended Unicode encoding for web documents and other Internet technologies.

UTF-16
Uses 2 bytes for BMP characters and 4 bytes for supplementary characters. UTF-16 is another option for web documents.

UTF-32
Uses 4 bytes for all characters.
So while the code point for the percent sign is U+0025, it would be represented by the byte value 25 in UTF-8, 00 25 in UTF-16, and 00 00 00 25 in UTF-32. There are other things at work in the encoding as well, but this gives you a feel for the difference in encoding forms.
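These byte sequences can be verified directly. The following Python sketch is illustrative only; it encodes the percent sign and one non-ASCII character with each encoding form:

# The percent sign (U+0025) in the three Unicode encoding forms.
ch = "%"
print(ch.encode("utf-8").hex())       # 25
print(ch.encode("utf-16-be").hex())   # 0025
print(ch.encode("utf-32-be").hex())   # 00000025

# A character outside the ASCII range needs two bytes in UTF-8.
print("©".encode("utf-8").hex())      # c2a9 -- the copyright sign, U+00A9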
6.1.1.3. Choosing an encoding
The W3C recommends the UTF-8 encoding for all (X)HTML and XML documents because it can accommodate the greatest number of characters and is well supported by servers. It allows wide-ranging languages to be mixed within a single document.
Not all web documents need to be encoded using UTF-8, however. If you are authoring a document in a language that uses a lot of non-ASCII characters, you may want to choose an encoding that minimizes the need to numerically represent ("escape") these special characters.
Bear in mind, however, that regardless of the encoding, all characters in the document will be interpreted relative to Unicode code points.

For more information on how character sets and character encodings should be handled for web documents, see the W3C's Character Model for the World Wide Web 1.0 Recommendation at www.w3.org/TR/charmod/.




6.1.2. Specifying Character Encoding
The W3C encourages authors to specify the character encoding for all web documents, even those that use the default UTF-8 Unicode encoding, but it is particularly critical if an alternate encoding is used. There are several ways to declare the character encoding for documents: in the HTTP header delivered by the server, in the XML declaration (for XHTML and XML documents only), or in a meta element in the head of the document. This section looks at each method and provides guidelines for their use.
6.1.2.1. HTTP headers
When a server sends a document to a user agent (such as a browser), it also sends information about the document in a portion of the response called the HTTP header. A typical HTTP header looks like this:
HTTP/1.x 200 OK
Date: Mon, 14 Nov 2005 19:45:33 GMT
Server: Apache/2.0.46 (Red Hat)
Accept-Ranges: bytes
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

Notice that one of the bits of information that the server sends along is the Content-Type of the document using a MIME type label. For example, HTML documents are always delivered as type text/html. (The MIME types for XHTML documents aren't as straightforward, as discussed in the sidebar, "Serving XHTML.") The Content-Type entry may also contain the character encoding of the document using the charset parameter, as shown in the example.
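If you want to check what charset a server is actually declaring, you can inspect the response headers yourself. The following Python sketch is only an illustration; the URL is a placeholder, not something referenced in the original text:

# Read the declared character encoding from a server's Content-Type header.
from urllib.request import urlopen

with urlopen("http://www.example.com/") as response:      # placeholder URL
    print(response.headers.get("Content-Type"))            # e.g., text/html; charset=UTF-8
    print(response.headers.get_content_charset())          # e.g., utf-8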
The method for setting up a server with your preferred character encoding varies with different server software, so it is best to consult the server administrator for assistance. For Apache servers, the default character encoding may be set for all documents with the .html extension by adding this line to the .htaccess file:
AddType 'text/html; charset=UTF-8' html

The advantages to setting character encodings in HTTP headers are that the information is easily accessible to user agents and the header information has the highest priority in case of conflict. On the downside, it is not always easy for authors to access the server settings, and it is possible for the default server settings to be changed without the author's knowledge.
It is also possible for the character encoding information to get separated from the document, which is why it is recommended that the character encoding be provided within the document as well, as described by the next two methods.
Serving XHTML
XHTML 1.0 documents may be served as either XML or HTML documents. Although XML is the proper method, many authors choose to deliver XHTML 1.0 files with the text/html MIME type used for HTML documents for reasons of backward compatibility, lack of browser support for XML files, and other problems with XHTML interpretation. When XHTML documents are served in this manner, they are not parsed as XML documents.
XHTML 1.0 files may also be served as XML, and XHTML 1.1 files must always be served as XML. XHTML documents served as XML may use the MIME types application/xhtml+xml, application/xml, or text/xml. The W3C recommends that you use application/xhtml+xml only.
Whether you serve an XHTML document as an HTML or XML file type changes the way you specify the character encoding , as covered in the upcoming "Choosing the declaration method" section.



6.1.2.2. XML declaration
XHTML (and other XML) documents often begin with an XML declaration before the DOCTYPE declaration. The XML declaration is not required, but it may include the encoding of the document, as shown in this example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The XML declaration may be provided even for XHTML documents served as text/html.
Because the default encoding for all XML documents is UTF-8 or UTF-16, encoding information in the XML declaration is not required for these encodings and may be omitted.
In addition, although it is technically correct to include the XML declaration in such documents, Appendix C of the XHTML 1.0 specification, "HTML Compatibility Guidelines," recommends avoiding it, and many authors choose to omit it because of browser-support issues. For example, when Internet Explorer 6 for Windows detects a line of text before the DOCTYPE declaration, it switches to Quirks Mode (see Chapter 9 for details), which can have a damaging effect on how the document's styles are rendered. (This is reportedly fixed in IE 7.) The XML declaration is required only if your document uses an encoding other than UTF-8 or UTF-16 and the encoding has not been set on the server.
6.1.2.3. The meta element
For HTML documents as well as XHTML documents served as text/html, the encoding should always be specified using a meta element in the head of the document. The http-equiv attribute passes information along to the user agent as though it appeared in the HTTP header. Again, the encoding is provided with the charset value, as shown here:

<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>Document Title</title>
</head>

Although the meta element declaring the content type is not a required element in the HTML and XHTML DTDs, it is strongly recommended for the purpose of clearly identifying the character encoding and keeping that information with the document. This is particularly helpful for common text editors (such as BBEdit), which use the meta element to identify the character encoding of the document when opening the document for editing. With this method, all character encodings must be explicitly specified, including UTF-8 and UTF-16.
6.1.2.4. Choosing the declaration method
The declaration method you use depends on the type of document you are authoring and its encoding method.

HTML documents
The encoding should be specified on the server and again in the document with a meta element. This makes sure the encoding is easily accessible and stays with the document should it be saved for later use.

XHTML 1.0 documents served as HTML
The encoding should be specified on the server and again in the document with a meta element. If the encoding is something other than UTF-8 or UTF-16, and the document is likely to be parsed as XML (not just HTML), then also include the encoding in an XML declaration. Be aware that the inclusion of the XML declaration may cause rendering problems for some browsers.

XHTML (1.0 and 1.1) documents served as XML
The encoding should be specified on the server and by using the encoding attribute in the XML declaration. Although not strictly required for UTF-8 and UTF-16 encodings, it doesn't hurt to include it anyway.







6.2. Character References
HTML and XHTML documents are typically authored using the standard ASCII character set (the characters you see printed on the keys of your keyboard). To represent a character that falls outside the ASCII range, you can refer to it by using a character reference. This is known as escaping the character.
Declaring Encoding in Style Sheets
It is also possible to declare the encoding of an external style sheet by including a statement at the beginning of the .css document (it must be the first thing in the file):
@charset "utf-8";

It is important to do this if your style sheet includes non-ASCII characters in property values such as quotation characters used in generated content, font names, and so on.

In HTML and XML documents, some ASCII characters that you intend to be rendered in the browser as part of the text content must be escaped in order not to be interpreted as code by the user agent. For example, the less-than symbol (<) must be escaped in order not to be mistaken as the beginning of an element start tag. Other characters that must be escaped are the greater-than symbol (>), ampersand (&), single quote ('), and double quotation marks ("). In XML documents, all ampersands must be escaped or they won't validate.
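This kind of escaping can also be done programmatically. As a purely illustrative sketch, Python's standard html module replaces the markup-significant ASCII characters with their references:

# Escape markup-significant ASCII characters before placing text in a document.
import html

text = 'if a < b & b > c then say "done"'
print(html.escape(text))
# if a &lt; b &amp; b &gt; c then say &quot;done&quot;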
There are two types of character references: Numeric Character References (NCR) and character entities.
6.2.1. Numeric Character References
A Numeric Character Reference (NCR) refers to the character by its Unicode code point (introduced earlier in this chapter). NCRs are always preceded by &# and end with a ; (semicolon). The numeric value may be provided in decimal or hexadecimal. Hexadecimal values are indicated by an x before the value.
For example, the copyright symbol (©), which occupies the 169th position in Unicode (U+00A9), may be represented by its hexadecimal NCR &#xA9; or its decimal equivalent, &#169;. Decimal values are more common in practice. Note that the zeros at the beginning of the code point may be omitted in the numeric character reference.
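NCRs can also be generated programmatically from a code point. The following Python sketch is only an illustration (the format strings are my own, not from the original text); it also shows the standard xmlcharrefreplace error handler, which emits decimal NCRs for any character that won't fit the target encoding:

# Build numeric character references for the copyright symbol (U+00A9).
ch = "\u00a9"                                        # ©
print("&#{};".format(ord(ch)))                       # &#169;  (decimal NCR)
print("&#x{:X};".format(ord(ch)))                    # &#xA9;  (hexadecimal NCR)

# Encoding to ASCII with xmlcharrefreplace escapes non-ASCII characters automatically.
print("café".encode("ascii", "xmlcharrefreplace"))   # b'caf&#233;'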

Handy charts of every character in the Basic Multilingual Plane are maintained as a labor of love by Jens Brueckmann at his site J-A-B.net. The Unicode code point and decimal/hexadecimal NCRs are provided for every character. It is available at www.j-a-b.net/web/char/char-unicode-bmp.



6.2.2. Character Entities
Character entities use abbreviations or words instead of numbers to represent characters, which may be easier to remember than numeric references. In this sense, entities are merely a convenience. Character entities must be predefined in the DTD of a markup language to be available for use. For example, the copyright symbol may be referred to as &copy;, because that entity has been declared in the DTD. The character entities defined in HTML 4.01 and XHTML are listed in Appendix C (a list of the most common is also provided in Chapter 10). XML defines five character entities for use with all XML languages:

&lt;
Less than (<)

&gt;
Greater than (>)

&amp;
Ampersand (&)

&apos;
Apostrophe (')

&quot;
Quotation mark (")
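
Going the other direction, entities and NCRs can be resolved back to ordinary characters. Here is a minimal, purely illustrative Python sketch (html.unescape knows the HTML named entities as well as numeric references):

# Resolve character entities and NCRs back to ordinary characters.
import html

print(html.unescape("&lt;p&gt; &amp; &copy; &#169; &#xA9;"))
# <p> & © © ©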
6.2.3. Escapes in CSS
It may be necessary to escape a character in a style sheet if the value of a property contains a non-ASCII character. In CSS, the escape mechanism is a backslash followed by the hexadecimal Unicode code point value. The escape is terminated with a space instead of a semicolon. For example, a font name starting with a capital letter C with a cedilla (Ç) needs to be escaped in the style rule, as shown here:
p { font-family: \C7 elikfont; }

When a special character appears in a style attribute value, it is possible to use its NCR, a character entity, or the CSS escape. The CSS escape is recommended because it makes it easier to move the declaration to a style sheet later.
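If you generate style sheets programmatically, the escape described above is simple to build. The helper below is hypothetical (the name css_escape is my own) and is meant only as a sketch of the backslash-plus-hex-plus-space mechanism:

# Hypothetical helper: build a CSS escape (backslash, hex code point, terminating space).
def css_escape(ch):
    return "\\{:X} ".format(ord(ch))

print(css_escape("Ç") + "elikfont")   # \C7 elikfont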









