25 May 2007
Unicode UTF-8 Byte Order Mark
So, I set up this site with the intent of focusing on internationalization and localization as they relate to the web, but I have not done a whole lot of that yet. I have found that this is something I actually know more about, simply because I have developed in a multilingual (English, Chinese, Japanese) environment for a while. To kick this off, I thought I would put up a short blurb about the Unicode UTF-8 Byte Order Mark, otherwise known as the BOM. If you have ever seen ï»¿ prefixing the first line of a file, then you have already been introduced. My thought here is not to discuss the nature of the BOM (you can check out the links below) but to mention some potentially lesser-known facts about its use that developers may run into.
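For the curious, here is a small Python sketch of where the stray characters ï»¿ come from. It assumes the file was saved as UTF-8 with a BOM and then read by something expecting Windows-1252, which is a common way to end up seeing them, though not the only one.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF (U+FEFF encoded as UTF-8).
bom = codecs.BOM_UTF8

# Read as UTF-8, those bytes decode to the single code point U+FEFF.
as_utf8 = bom.decode("utf-8")

# Read as Windows-1252, each byte becomes its own visible character.
as_cp1252 = bom.decode("cp1252")

print(bom.hex())          # the raw bytes in hexadecimal
print(ascii(as_utf8))     # the code point, escaped so it is visible
print(ascii(as_cp1252))   # the three junk characters, escaped
```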
- When saving a file in Notepad, if you save with "Encoding" set to "UTF-8" then you are including the BOM at the beginning of the file even though you cannot see it. Similarly, in Visual Web Developer, if you save the file with encoding and choose "Unicode (UTF-8 with signature) - Codepage 65001" then you are also including the BOM at the beginning of the file.
- Properly using multilingual text on the web requires saving your files in a Unicode encoding. There are usually several options to choose from. Generally, best practice is to leave out the UTF-8 BOM, and I recommend choosing a Unicode encoding that both excludes the BOM and keeps the file size small. For example, in Visual Web Developer, I save files with the "Unicode (UTF-8 without signature) - Codepage 65001" encoding.
- Having said that, I ran into the same case twice on my former blogging platform, where I was forced to include the BOM at the beginning of an ASP file in order for the platform to properly recognize the script as Unicode and correctly process Unicode text. If all else seems to be failing, try adding the BOM and see if that fixes the problem. Still, my recommendation is to exclude the BOM unless it proves absolutely necessary.
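If you are not sure whether an editor slipped a BOM into a file, you can check the first three bytes yourself. Below is a minimal Python sketch of checking for, removing, and adding the UTF-8 BOM. The function names and the `sample.asp` file are made up for illustration, and the demo works on a throwaway temporary file rather than anything real.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def strip_utf8_bom(path):
    """Rewrite the file without a leading BOM. Return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


def add_utf8_bom(path):
    """Prepend a BOM if the file lacks one. Return True if one was added."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + data)
    return True


# Demo on a temporary file (hypothetical name) that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.asp")

    # Python's "utf-8-sig" codec writes the BOM, much like "UTF-8 with signature".
    with open(sample, "w", encoding="utf-8-sig") as f:
        f.write("English 中文 日本語")

    had_bom = has_utf8_bom(sample)        # BOM present after saving with signature
    removed = strip_utf8_bom(sample)      # take it out, the usual recommendation
    has_bom_after_strip = has_utf8_bom(sample)

    readded = add_utf8_bom(sample)        # put it back for the stubborn ASP case
    has_bom_after_add = has_utf8_bom(sample)

    # Reading with "utf-8-sig" skips a BOM if one is present.
    with open(sample, "r", encoding="utf-8-sig") as f:
        text = f.read()
```

In Python, writing with `encoding="utf-8"` leaves the BOM out and `encoding="utf-8-sig"` puts it in, which lines up with the "without signature" and "with signature" choices in Visual Web Developer.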
Check out these links for more information: