Sunday, February 3, 2008

How do I encode URLs for readability in my Sitemap?

We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the table below.

Character          Escape Code
Ampersand      &   &amp;
Single Quote   '   &apos;
Double Quote   "   &quot;
Greater Than   >   &gt;
Less Than      <   &lt;
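
As a rough illustration of the entity escaping described above, here is a minimal Python sketch. It relies on the standard library's xml.sax.saxutils.escape, which handles &, < and > by default; the two quote characters are passed in as extra entities. The helper name entity_escape is made up for this example and is not part of any Sitemap tool.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; the two quote characters
# are supplied as additional replacements.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def entity_escape(value: str) -> str:
    """Replace the five special characters with their XML entity codes."""
    return escape(value, QUOTE_ENTITIES)


# The ampersand in an already URL-escaped address becomes &amp;
print(entity_escape("http://www.example.com/%C3%BCmlat.html&q=name"))
```

Entity escaping should be the last step, applied to a URL that has already been URL-escaped, so that the % signs and the ampersand are each handled exactly once.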

In addition, all URLs (including the URL of your Sitemap) must be URL-escaped and encoded so that the web server on which they are located can read them. However, if you are using any sort of script, tool, or log file to generate your URLs (anything except typing them in by hand), this is usually already done for you. If you submit your Sitemap and you receive an error that Google is unable to find some of your URLs, check that your URLs follow the RFC 3986 standard for URIs, the RFC 3987 standard for IRIs, and the XML standard.

Below is an example of a URL that uses a non-ASCII character (ü), as well as a character that requires entity escaping (&):

  http://www.example.com/ümlat.html&q=name

Below is that same URL, ISO-8859-1 encoded (for hosting on a server that uses that encoding) and URL escaped:

  http://www.example.com/%FCmlat.html&q=name

Below is that same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL escaped:

  http://www.example.com/%C3%BCmlat.html&q=name
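
The two URL-escaped forms above can be reproduced with a short Python sketch using urllib.parse.quote from the standard library. The helper name url_escape_path and its server_encoding parameter are assumptions made for illustration; which encoding is correct depends on how your own server stores its file names.

```python
from urllib.parse import quote


def url_escape_path(path: str, server_encoding: str) -> str:
    """Percent-encode a URL path using the byte encoding the server expects."""
    # safe="/" leaves the path separators unescaped.
    return quote(path, safe="/", encoding=server_encoding)


path = "/ümlat.html"
print(url_escape_path(path, "iso-8859-1"))  # /%FCmlat.html
print(url_escape_path(path, "utf-8"))       # /%C3%BCmlat.html
```

The same character produces one escaped byte under ISO-8859-1 and two under UTF-8, which is why the URL has to match the encoding used by the server that hosts it.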

Below is that same URL, entity escaped:

  http://www.example.com/%C3%BCmlat.html&amp;q=name

The URL of the Sitemap itself can contain only ASCII characters. It can't contain upper ASCII characters, certain control codes, or special characters such as * and {}. If your Sitemap URL contains any of these characters, you'll receive an error when you try to add it.
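
A quick pre-submission check along these lines might look like the Python sketch below. The function name sitemap_url_problems is hypothetical, and the set of rejected special characters is an assumption: the text above names only * and {} as examples, so the real list may be longer.

```python
# Assumption: only the special characters named in the text are listed here.
FORBIDDEN_SPECIALS = "*{}"


def sitemap_url_problems(url: str) -> list[str]:
    """Return a description of each character likely to be rejected."""
    problems = []
    for ch in url:
        code = ord(ch)
        if code > 127:
            # Upper ASCII and any other non-ASCII character.
            problems.append(f"non-ASCII character {ch!r}")
        elif code < 32 or code == 127:
            problems.append(f"control code 0x{code:02X}")
        elif ch in FORBIDDEN_SPECIALS:
            problems.append(f"special character {ch!r}")
    return problems


print(sitemap_url_problems("http://www.example.com/sitemap.xml"))
print(sitemap_url_problems("http://www.example.com/site*map.xml"))
```

An empty list means no problem was detected by this check; it does not guarantee that the Sitemap URL will be accepted.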