Sunday, February 3, 2008

The Sitemap Protocol

The Sitemap Protocol allows you to inform search engines about URLs on your websites that are available for crawling. In its simplest form, a Sitemap that uses the Sitemap Protocol is an XML file that lists URLs for a site. The protocol was written to be highly scalable so it can accommodate sites of any size. It also enables webmasters to include additional information about each URL (when it was last updated; how often it changes; how important it is in relation to other URLs in the site) so that search engines can more intelligently crawl the site.

Sitemaps are particularly beneficial when users can't reach all areas of a website through a browseable interface. (Generally, this is when users are unable to reach certain pages or regions of a site by following links.) For example, any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines.

This document describes the formats for Sitemap files and also explains where you should post your Sitemap files so that search engines can retrieve them.

Please note that the Sitemap Protocol supplements, but does not replace, the crawl-based mechanisms that search engines already use to discover URLs. By submitting a Sitemap (or Sitemaps) to a search engine, you will help that engine's crawlers to do a better job of crawling your site.

Using this protocol does not guarantee that your webpages will be included in search indexes. (Note that using this protocol will not influence the way your pages are ranked by Google.)

Google adheres to Sitemap Protocol 0.9 as defined by sitemaps.org. Sitemaps created for Google using Sitemap Protocol 0.9 are therefore compatible with other search engines that adopt the standards of sitemaps.org.

XML Sitemap Format

The Sitemap Protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.

A sample Sitemap that contains just one URL and uses all optional tags is shown below. The optional tags are <lastmod>, <changefreq>, and <priority>.

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
    </url>
  </urlset>

The Sitemap must:

  • Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  • Include a <url> entry for each URL as a parent XML tag.
  • Include a <loc> child entry for each <url> parent tag.
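
The three rules above can be checked mechanically. The following is a minimal sketch using Python's standard library, assuming the protocol's tag names (urlset, url, loc) and the 0.9 namespace; the function name check_sitemap_structure is hypothetical, and the check is far less thorough than validating against the schema.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def check_sitemap_structure(xml_text):
    """Return a list of problems; an empty list means the basic rules hold."""
    problems = []
    root = ET.fromstring(xml_text)
    # Rule 1: the document is wrapped in a urlset element in the 0.9 namespace.
    if root.tag != "{%s}urlset" % NS:
        problems.append("root element is not <urlset> in the 0.9 namespace")
    # Rules 2 and 3: every url entry carries a non-empty loc child.
    for i, url in enumerate(root.findall("{%s}url" % NS), start=1):
        loc = url.find("{%s}loc" % NS)
        if loc is None or not (loc.text or "").strip():
            problems.append("<url> entry %d has no <loc> child" % i)
    return problems

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><lastmod>2005-01-01</lastmod></url>
</urlset>"""
print(check_sitemap_structure(sample))  # ['<url> entry 2 has no <loc> child']
```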

XML tag definitions

The available XML tags are described below.

<urlset> (required): Encapsulates the file and references the current protocol standard.
<url> (required): Parent tag for each URL entry. The remaining tags are children of this tag.
<loc> (required): URL of the page. This URL must begin with the protocol (such as http) and end with a trailing slash, if your web server requires it. This value must be less than 2048 characters.
<lastmod> (optional): The date of last modification of the file. This date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD.
<changefreq> (optional):

How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers consider this information when making decisions, they may crawl pages marked "hourly" less frequently than that, and they may crawl pages marked "yearly" more frequently than that. It is also likely that crawlers will periodically crawl pages marked "never" so that they can handle unexpected changes to those pages.

<priority> (optional):

The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value has no effect on your pages compared to pages on other sites, and only lets the search engines know which of your pages you deem most important so they can order the crawl of your pages in the way you would most like.

The default priority of a page is 0.5.

Please note that the priority you assign to a page has no influence on the position of your URLs in a search engine's result pages. Search engines use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your more important pages are present in a search index.

Also, please note that assigning a high priority to all of the URLs on your site will not help you. Since the priority is relative, it is only used to select between URLs on your site; the priority of your pages will not be compared to the priority of pages on other sites.
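
As an illustration of the value rules for the two hint tags, here is a small sketch that checks a change frequency against the seven valid values and a priority against the 0.0 to 1.0 range, falling back to the default of 0.5. The function name check_hints is hypothetical.

```python
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}
DEFAULT_PRIORITY = 0.5

def check_hints(changefreq=None, priority=None):
    """Validate the optional changefreq and priority values for one URL entry."""
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        raise ValueError("invalid changefreq: %r" % (changefreq,))
    # An omitted priority is treated as the protocol default.
    if priority is None:
        priority = DEFAULT_PRIORITY
    if not 0.0 <= priority <= 1.0:
        raise ValueError("priority must be between 0.0 and 1.0: %r" % (priority,))
    return changefreq, priority

print(check_hints("monthly", 0.8))  # ('monthly', 0.8)
print(check_hints("weekly"))        # ('weekly', 0.5)
```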

Entity escaping

We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the table below.

  Character       Symbol   Escape Code
  Ampersand       &        &amp;
  Single Quote    '        &apos;
  Double Quote    "        &quot;
  Greater Than    >        &gt;
  Less Than       <        &lt;

In addition, all URLs (including the URL of your Sitemap) must be encoded for readability by the web server on which they are located and URL-escaped. However, if you are using any sort of script, tool, or log file to generate your URLs (anything except typing them in by hand), this is usually already done for you. If you submit your Sitemap and you receive an error that Google is unable to find some of your URLs, check to make sure that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.

Below is an example of a URL that uses a non-ASCII character (ü), as well as a character that requires entity escaping (&):

  http://www.example.com/ümlat.html&q=name

Below is that same URL, ISO-8859-1 encoded (for hosting on a server that uses that encoding) and URL escaped:

  http://www.example.com/%FCmlat.html&q=name

Below is that same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL escaped:

  http://www.example.com/%C3%BCmlat.html&q=name

Below is that same URL, entity escaped:

  http://www.example.com/%C3%BCmlat.html&amp;q=name

Sample XML Sitemap

The following example shows a Sitemap in XML format. The Sitemap in the example contains a small number of URLs, each of which is identified using the <loc> XML tag. In this example, a different set of optional parameters has been provided for each URL.

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
    </url>
    <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
      <changefreq>weekly</changefreq>
    </url>
    <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
      <changefreq>weekly</changefreq>
    </url>
    <url>
      <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
    </url>
    <url>
      <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
      <lastmod>2004-11-23</lastmod>
    </url>
  </urlset>

You can compress your Sitemap files using gzip. Compressing your Sitemap files will reduce your bandwidth requirement. Please note that your uncompressed Sitemap file may not be larger than 10MB.
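
A Sitemap like the sample can be generated and compressed with Python's standard library. The sketch below is one possible approach, not a reference implementation: the function names build_sitemap and compress_sitemap and the dict-based input format are assumptions. ElementTree entity-escapes text values as it serializes them, so raw URLs are passed in.

```python
import gzip
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 10485760  # 10MB limit on the uncompressed file

def build_sitemap(entries):
    """Serialize a list of dicts (loc plus optional tags) to UTF-8 XML bytes."""
    urlset = ET.Element("urlset", xmlns=NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="utf-8")

def compress_sitemap(xml_bytes):
    """Gzip the Sitemap after checking the uncompressed size limit."""
    if len(xml_bytes) > MAX_BYTES:
        raise ValueError("uncompressed Sitemap exceeds 10MB")
    return gzip.compress(xml_bytes)

xml_bytes = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
     "changefreq": "weekly"},
])
gz_bytes = compress_sitemap(xml_bytes)
print(xml_bytes.decode("utf-8"))
```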

Using Sitemap index files (to group multiple sitemap files)

You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes) when uncompressed. These limits help to ensure that your web server does not get bogged down serving very large files.

If you want to list more than 50,000 URLs, you must create multiple Sitemap files. If you anticipate your Sitemap growing beyond 50,000 URLs or 10MB, you should consider creating multiple Sitemap files. If you do provide multiple Sitemaps, you can list them in a Sitemap index file. Sitemap index files may not list more than 1,000 Sitemaps.
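
Splitting a large URL list to respect the 50,000-URL limit might look like the following sketch. The function name split_urls is hypothetical, and the sketch handles only the URL count; the 10MB size limit would need a separate check on each serialized file.

```python
MAX_URLS = 50000

def split_urls(urls, limit=MAX_URLS):
    """Yield successive lists of at most `limit` URLs, one list per Sitemap file."""
    for start in range(0, len(urls), limit):
        yield urls[start:start + limit]

urls = ["http://www.example.com/page%d" % i for i in range(120000)]
chunks = list(split_urls(urls))
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```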

The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file. The Sitemap index file uses the following XML tags:

  • <loc>
  • <lastmod>
  • <sitemap>
  • <sitemapindex>

Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemap index file must be UTF-8 encoded.

Sample XML Sitemap Index

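The following example shows a Sitemap index in XML format. The Sitemap index lists two Sitemaps:

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
    </sitemap>
    <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
    </sitemap>
  </sitemapindex>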

Note: Sitemap URLs, like all values in your XML files, must be entity escaped.
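
A Sitemap index can be produced the same way as a Sitemap. The following sketch assumes a hypothetical function build_sitemap_index that takes (location, lastmod) pairs and enforces the 1,000-Sitemap limit; ElementTree takes care of entity escaping when it serializes the values.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_SITEMAPS = 1000

def build_sitemap_index(sitemaps):
    """Serialize (loc, lastmod or None) pairs to a UTF-8 Sitemap index."""
    if len(sitemaps) > MAX_SITEMAPS:
        raise ValueError("a Sitemap index may not list more than 1,000 Sitemaps")
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc, lastmod in sitemaps:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = loc
        if lastmod:  # lastmod is optional
            ET.SubElement(node, "lastmod").text = lastmod
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(index, encoding="utf-8")

index_bytes = build_sitemap_index([
    ("http://www.example.com/sitemap1.xml.gz", "2004-10-01T18:23:17+00:00"),
    ("http://www.example.com/sitemap2.xml.gz", "2005-01-01"),
])
print(index_bytes.decode("utf-8"))
```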

Sitemap Index XML Tag Definitions

  • The <loc> tag is required and identifies the location of the Sitemap.

  • The <lastmod> tag is optional and identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format.

    By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index; that is, a crawler could retrieve only the Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.

  • The <sitemap> tag encapsulates information about an individual Sitemap.

  • The <sitemapindex> tag encapsulates information about all of the Sitemaps in the file.
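
To illustrate the incremental fetching idea from a crawler's point of view, the sketch below picks out the Sitemaps in an index that were modified on or after a given moment. It is only an illustration of the concept, not how any search engine actually works. The function names are hypothetical, date-only values are treated as midnight UTC (an assumption), and datetime.fromisoformat requires Python 3.7 or later.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_w3c(value):
    """Parse YYYY-MM-DD or YYYY-MM-DDThh:mm:ss+hh:mm into an aware datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a value without a time zone is read as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def sitemaps_modified_since(index_xml, since):
    """Return the loc of every Sitemap with no lastmod or a lastmod >= since."""
    changed = []
    root = ET.fromstring(index_xml)
    for node in root.findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", namespaces=NS)
        lastmod = node.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None or parse_w3c(lastmod.strip()) >= since:
            changed.append(loc.strip())
    return changed

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>"""
since = datetime(2004, 12, 1, tzinfo=timezone.utc)
print(sitemaps_modified_since(index_xml, since))  # ['http://www.example.com/sitemap2.xml.gz']
```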

Location of Sitemap Files

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.gz can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.

If you have the permission to change http://example.org/path/sitemap.gz, it is safe to assume that you also have permission to provide information for URLs with the prefix http://example.org/path/. Examples of URLs considered valid in http://example.com/catalog/sitemap.gz include:

  http://example.com/catalog/show?item=23
  http://example.com/catalog/show?item=233&user=3453

URLs not considered valid in http://example.com/catalog/sitemap.gz include:

  http://example.com/image/show?item=23
  http://example.com/image/show?item=233&user=3453
  https://example.com/catalog/page1.html

URLs that are not considered valid are dropped from further consideration. It is strongly recommended that you place your Sitemap at the root directory of your web server. For example, if your web server is at example.com, then your Sitemap index file would be at http://example.com/sitemap.gz. In certain cases, you may need to produce different Sitemaps for different paths e.g. if security permissions in your organization compartmentalize write access to different directories.
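
The location rule amounts to a prefix test: a URL is in scope when it starts with the directory that holds the Sitemap, protocol included. A minimal sketch follows, with hypothetical function names and a plain string comparison (it does not normalize host-name case or default ports).

```python
def sitemap_scope(sitemap_url):
    """Return the URL prefix that a Sitemap at sitemap_url is allowed to cover."""
    # Drop the file name and keep everything up to the last slash.
    return sitemap_url.rsplit("/", 1)[0] + "/"

def in_scope(url, sitemap_url):
    """True if url may be listed in the Sitemap located at sitemap_url."""
    return url.startswith(sitemap_scope(sitemap_url))

sitemap = "http://example.com/catalog/sitemap.gz"
print(in_scope("http://example.com/catalog/show?item=23", sitemap))  # True
print(in_scope("http://example.com/image/show?item=23", sitemap))    # False
print(in_scope("https://example.com/catalog/page1.html", sitemap))   # False
```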

Validating your Sitemap

Google uses an XML schema to define the elements and attributes that can appear in your Sitemap file. You can download this schema from the links below:

For Sitemaps: http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
For Sitemap index files: http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd

There are a number of tools available to help you validate the structure of your Sitemap based on this schema. You can find a list of XML-related tools at each of the following locations:

http://www.w3.org/XML/Schema#Tools
http://www.xml.com/pub/a/2000/12/13/schematools.html

In order to validate your Sitemap or Sitemap index file against a schema, the XML file will need additional headers. If you're using the Sitemap Generator, these headers are already included. If you are using a different tool for creating your sitemaps, the header in the XML file should look like the examples below.

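Sitemap:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
          http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
      ...
    </url>
  </urlset>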

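Sitemap index file:

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
    <sitemap>
      ...
    </sitemap>
  </sitemapindex>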

Frequently Asked Questions

Q: How do I represent URLs in the Sitemap?

As with all XML files, any data values (including URLs) must use entity escape codes for the following characters: ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>). You should also make sure that all URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard. If you are using a script to generate your URLs, you can generally URL escape them as part of that script. You will still need to entity escape them. For instance, the following Python session entity escapes http://www.example.com/view?widget=3&count>2:

  $ python
  Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
  >>> import xml.sax.saxutils
  >>> xml.sax.saxutils.escape("http://www.example.com/view?widget=3&count>2")

The resulting URL from the example above is:

  http://www.example.com/view?widget=3&amp;count&gt;2

Q: Does it matter which character encoding method I use to generate my Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.

Q: How do I specify time?

Use W3C Datetime encoding for the lastmod timestamps and all other dates and times in this protocol. For example, 2004-09-22T14:12:14+00:00.

This encoding allows you to omit the time portion of the ISO 8601 format; for example, 2004-09-22 is also valid. However, if your site changes frequently, you are encouraged to include the time portion so crawlers have more complete information about your site.
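
In a script, both forms can be produced from a timezone-aware datetime. The sketch below assumes a hypothetical helper named w3c_datetime and normalizes the full form to UTC; a naive datetime should not be passed in, because Python would interpret it as local time.

```python
from datetime import datetime, timezone

def w3c_datetime(moment, include_time=True):
    """Format an aware datetime as a W3C Datetime string."""
    if include_time:
        # Normalize to UTC and drop fractional seconds.
        return moment.astimezone(timezone.utc).replace(microsecond=0).isoformat()
    return moment.date().isoformat()

moment = datetime(2004, 9, 22, 14, 12, 14, tzinfo=timezone.utc)
print(w3c_datetime(moment))                      # 2004-09-22T14:12:14+00:00
print(w3c_datetime(moment, include_time=False))  # 2004-09-22
```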

Q: How do I compute lastmod date?

For static files, this is the actual file update date. You can use the GNU date command to get this date:

  $ date --iso-8601=seconds -u -r /home/foo/www/bar.html
  >> 2004-10-26T08:56:39+00:00

For many dynamic URLs, you may be able to easily compute a lastmod date based on when the underlying data was changed or by using some approximation based on periodic updates (if applicable). Using even an approximate date or timestamp can help crawlers avoid crawling URLs that have not changed. This will reduce the bandwidth and CPU requirements for your web servers.
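
For static files, the same lastmod value can be computed in a generation script from the file's modification time. This is a minimal sketch; the function name file_lastmod is hypothetical, and the demo uses a throwaway temporary file whose modification time is set to a known value.

```python
import os
import tempfile
from datetime import datetime, timezone

def file_lastmod(path):
    """Return the file's modification time as a W3C Datetime string in UTC."""
    mtime = os.path.getmtime(path)
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.replace(microsecond=0).isoformat()

# Demo: create a temporary file and set its modification time explicitly.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
known = datetime(2004, 10, 26, 8, 56, 39, tzinfo=timezone.utc).timestamp()
os.utime(demo_path, (known, known))
demo_lastmod = file_lastmod(demo_path)
os.remove(demo_path)
print(demo_lastmod)  # 2004-10-26T08:56:39+00:00
```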

Q: Where do I place my Sitemap?

It is strongly recommended that you place your Sitemap at the root directory of your web server; that is, place it at http://example.com/sitemap.xml.gz.

In some situations, you may want to produce different Sitemaps for different paths on your site, for example if security permissions in your organization compartmentalize write access to different directories.

We assume that if you have the permission to upload http://example.com/path/sitemap.xml.gz, you also have permission to report metadata under http://example.com/path/.

Q: How big can my Sitemap be?

Sitemaps should be no larger than 10MB (10,485,760 bytes) in length when uncompressed and can contain a maximum of 50,000 URLs. This means that if your site contains more than 50,000 URLs or your Sitemap is bigger than 10MB, you must create multiple Sitemap files and use a Sitemap index file. You should use a Sitemap index file even if you have a small site but plan on growing beyond 50,000 URLs or a file size of 10MB.

Q: My site has tens of millions of URLs; can I somehow submit only those that have changed recently?

You can list the updated URLs in a small number of Sitemaps that change frequently and then use the lastmod tag in your Sitemap index file to identify those Sitemap files. Search engines can then incrementally crawl only the changed Sitemaps.

Q: What happens after I produce my Sitemap?

After you produce your Sitemap, you will need to notify search engines of the Sitemap's location. The search engines that you notify can then retrieve your Sitemap and make the URLs available to their crawlers.

Q: Do URLs in the Sitemap need to be completely specified?

Yes. You need to include the protocol (for instance, http) in your URL. You also need to include a trailing slash in your URL if your web server requires one. For example, http://www.google.com/ is a valid URL for a Sitemap, whereas www.google.com is not.
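
A generation script can reject incompletely specified URLs before writing them out. The following sketch uses urllib.parse.urlsplit; the function name is hypothetical, restricting the protocol to http and https is an assumption, and the length check reflects the requirement that a URL be less than 2048 characters.

```python
from urllib.parse import urlsplit

def is_fully_specified(url):
    """True if url names a protocol and a host and is under 2048 characters."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")  # assumption: only web protocols
        and bool(parts.netloc)
        and len(url) < 2048
    )

print(is_fully_specified("http://www.google.com/"))  # True
print(is_fully_specified("www.google.com"))          # False
```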

Q: My site has both "http" and "https" versions of URLs. Do I need to list both?

No. Please list only one version of a URL in your Sitemaps. Including multiple versions of URLs may result in incomplete crawling of your site.

Q: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
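
If session IDs travel in the query string, they can be stripped before URLs are written to the Sitemap. The sketch below is illustrative only: the parameter names in SESSION_PARAMS are hypothetical and depend entirely on your site, and session IDs embedded in the path are not handled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace with the ones your site actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_ids(url):
    """Return url with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_ids("http://www.example.com/catalog?item=12&sid=abc123"))
# http://www.example.com/catalog?item=12
```

Running every URL through such a function, and then removing duplicates, avoids listing the same page several times under different session IDs.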

Q: Does the position of a URL in a Sitemap influence its use?

No. The position of a URL in the Sitemap has no impact on how it is used or regarded by search engines.

Q: Some of the pages on our site use frames. Should we include the frameset URLs or the URLs of the frame contents?

Please include both URLs.

Q: Can I zip my Sitemaps or do they have to be gzipped?

Please use gzip to compress your Sitemaps.

Q: Will the "priority" hint in the XML Sitemap change the ranking of my pages in search results?

No. The "priority" hint in your Sitemap only indicates the importance of a particular URL relative to other URLs on your own site.

Q: Is there an XML schema that I can validate my XML Sitemap against?

An XML schema is available for Sitemap files at http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd, and a schema for Sitemap index files is available at http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd. You can read more about validating your Sitemap in the "Validating your Sitemap" section above.
