Searchable Dynamic Content With AJAX Crawling
Google’s robots parse HTML with ease; they can pull apart Word documents, PDFs and even images from the far corners of your website. But as far as they’re concerned, AJAX content is invisible.
The Problem With AJAX
The trust problem is trickier. Every website wants to come out first in search results; your website competes with everyone else’s for the top position. Google can’t just give you an API to return your content because some websites use dirty tricks like cloaking to try to rank higher. Search engines can’t trust that you’ll do the right thing.
Google needs a way to let you serve AJAX content to browsers while serving simple HTML to crawlers. In other words, you need the same content in multiple formats.
Two URLs For The Same Content
Let’s start with a simple example. I’m part of an open-source project called Spiffy UI. It’s a Google Web Toolkit (GWT) framework for REST and rapid development. We wanted to show off our framework, so we made SpiffyUI.org using GWT.
Our index.html file is little more than an empty shell; everything the user sees is loaded into it dynamically with AJAX.
The URL in the address bar looks slightly ugly in most browsers. We’ve fixed it up with HTML5, and I’ll show you how later in this article.
When Google’s crawler requests one of our pages, it uses a new URL in which the hash bang (#!) has been replaced with ?_escaped_fragment_=. Using a URL parameter instead of a hash tag is important, because parameters are sent to the server, whereas hash tags are available only to the browser.
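That rewrite is purely mechanical, and it’s easy to sketch. In this illustrative JavaScript (the function name is mine, not part of Google’s scheme or of Spiffy UI), a crawlable #! URL is turned into the escaped-fragment URL the server actually receives:

```javascript
// Sketch of the mapping defined by Google's AJAX crawling scheme:
// the crawler rewrites "#!" URLs into "_escaped_fragment_" URLs.
function toEscapedFragment(url) {
  const i = url.indexOf('#!');
  if (i === -1) {
    return url; // no crawlable fragment to escape
  }
  const base = url.slice(0, i);
  const fragment = url.slice(i + 2);
  // If the URL already has query parameters, append with "&".
  const separator = base.includes('?') ? '&' : '?';
  return base + separator + '_escaped_fragment_=' + encodeURIComponent(fragment);
}
```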
That new URL lets us return the same content in HTML format when Google’s crawler requests it. Confused? Let’s look at how it works, step by step.
Snippets Of HTML
Google still thinks of a website as a set of pages, so we needed to serve our content that way. This was pretty easy with our application, because we have a set of pages, and each one is a separate logical section. The first step was to make the pages bookmarkable.
The hash tag is essentially a value that makes sense in the context of your application. Choose a tag that is logical for the area of your application that it represents, and add it to the hash like this:

http://www.spiffyui.org/#css

You can choose anything you want for your hash tag, but try to keep it readable, because users will be looking at it. We give our hash tags names such as css.
Because you can name the hash tag anything you want, adding the extra bang for Google is easy. Just slide it between the hash and the tag, like this:

http://www.spiffyui.org/#!css
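On the client side, the application reads the tag back out of the URL to decide what to show. Here is a rough sketch with a hypothetical parseTag() helper (Spiffy UI itself does this through GWT rather than hand-written JavaScript):

```javascript
// Illustrative only: parseTag() is a hypothetical helper, not part
// of the Spiffy UI code base.
function parseTag(hash) {
  // Accept both the bookmarkable form ("#css") and the
  // crawlable form ("#!css").
  return hash.replace(/^#!?/, '');
}

// In a browser you would react to hash changes something like this:
// window.addEventListener('hashchange', function () {
//   showPanel(parseTag(window.location.hash));
// });
```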
Serving Up Snippets
The hash tag makes an application bookmarkable, and the bang makes it crawlable. Now Google can ask for special escaped-fragment URLs like so:

http://www.spiffyui.org/?_escaped_fragment_=css
You can implement your server in PHP, Ruby or any other language, as long as it delivers HTML. SpiffyUI.org is a Java application, so we deliver our content with a Java servlet.
The escaped fragment tells us what to serve, and the servlet gives us a place to serve it from. Now we need the actual content.
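Here is a minimal sketch of that look-up table, assuming a Node.js-style server (SpiffyUI.org itself does this in a Java servlet, and any language works as long as it returns HTML). The snippet names and contents are illustrative:

```javascript
// Illustrative look-up table: fragment value -> pre-rendered HTML.
// These entries are examples, not SpiffyUI.org's real content.
const snippets = {
  css: '<h1>CSS</h1><p>Plain HTML for the CSS page.</p>',
  rest: '<h1>REST</h1><p>Plain HTML for the REST page.</p>'
};

function handleEscapedFragment(fragment) {
  // Return the snippet for this fragment, or null so the caller
  // can respond with a 404.
  return Object.prototype.hasOwnProperty.call(snippets, fragment)
    ? snippets[fragment]
    : null;
}
```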
When the crawler asks for the escaped fragment css, we just have to serve a static snippet of HTML containing that page’s content.
The template without any styling looks very plain, but Google just needs the content. Users see our page with all of the styles and dynamic features, while Google gets only the unstyled version.
You can see all of the source code for our SiteMapServlet.java servlet. This servlet is mostly just a look-up table that takes an ID and serves the associated content from somewhere on our server. It’s called SiteMapServlet.java because this class also handles the generation of our site map.
Tying It All Together With A Site Map
Our site map tells the crawler what’s available in our application. Every website should have a site map; AJAX crawling doesn’t work without one.
Site maps are simple XML documents that list the URLs in an application. They can also include data about the priority and update frequency of the app’s pages. Normal entries for site maps look like this:
```xml
<url>
  <loc>http://www.spiffyui.org/</loc>
  <lastmod>2011-07-26</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
```
Our AJAX-crawlable entries look like this:
```xml
<url>
  <loc>http://www.spiffyui.org/#!css</loc>
  <lastmod>2011-07-26</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>
```
The hash bang tells Google that this is an escaped fragment, and the rest works like any other page. You can mix and match AJAX URLs and regular URLs, and you can use only one site map for everything.
You could write your site map by hand, but there are tools that will save you a lot of time. The key is to format the site map well and submit it to Google Webmaster Tools.
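One way to avoid maintaining the entries by hand is to generate them from the same look-up table the server serves snippets from, which is the approach our SiteMapServlet takes. A rough sketch in JavaScript (the helper name, date, and priority values are illustrative):

```javascript
// Illustrative generator for one crawlable site-map entry.
function sitemapEntry(tag, lastmod) {
  return [
    '<url>',
    '  <loc>http://www.spiffyui.org/#!' + tag + '</loc>',
    '  <lastmod>' + lastmod + '</lastmod>',
    '  <changefreq>daily</changefreq>',
    '  <priority>0.8</priority>',
    '</url>'
  ].join('\n');
}
```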
Google Webmaster Tools
Google Webmaster Tools gives you the chance to tell Google about your website. Log in with your Google ID, or create a new account, and then verify your website.
Once you’ve verified your website, you can submit your site map, and Google will start indexing your URLs.
And then you wait. This part is maddening. It took about two weeks for SpiffyUI.org to show up properly in Google Search. I posted to the help forums half a dozen times, thinking it was broken.
There’s no easy way to make sure everything is working, but there are a few tools to help you see what’s going on. The best one is Fetch as Googlebot, which shows you exactly what Google sees when it crawls your website. You can access it in your dashboard in Google Webmaster Tools under “Diagnostics.”
Enter a hash bang URL from your website, and click “Fetch.” Google will tell you whether the fetch has succeeded and, if it has, will show you the content it sees.
If Fetch as Googlebot works as expected, then you’re returning the escaped URLs correctly. But you should check a few more things:
- Validate your site map.
- Manually try the URLs in your site map. Make sure to try the hash-bang and escaped versions.
- Check the Google result for your website by searching for site:www.spiffyui.org.
Making Pretty URLs With HTML5
Twitter leaves the hash bang visible in its URLs, like this (where username is the account’s name):

http://twitter.com/#!/username
This works well for AJAX crawling, but again, it’s slightly ugly. You can make your URLs prettier by integrating HTML5 history.
Spiffy UI uses HTML5 history integration to turn a hash-bang URL like this…

http://www.spiffyui.org/#!css

… into a pretty URL like this:

http://www.spiffyui.org/css
HTML5 history makes this possible. In HTML4, the hash tag is the only part of the URL you can change without forcing the entire page to reload. The HTML5 history API can change the entire URL without refreshing the page, so we can make the URL look any way we want.
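A minimal sketch of the idea in JavaScript (toPrettyUrl() is a hypothetical helper; history.pushState() is the real browser API):

```javascript
// Illustrative only: strip the "#!" to get the pretty form of the URL.
function toPrettyUrl(url) {
  return url.replace('#!', '');
}

// In browsers that support HTML5 history:
// if (window.history && window.history.pushState) {
//   window.history.pushState(null, document.title,
//                            toPrettyUrl(window.location.href));
// }
```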
Earlier, I mentioned cloaking. It is the practice of trying to boost a website’s ranking in search results by showing one set of pages to Google and another to regular browsers. Google doesn’t like cloaking and may remove offending websites from its search index.
Regardless of how it’s detected, cloaking is a bad idea. You might not get caught, but if you do, you’ll be removed from the search index.
Hash Bang Is A Little Ugly, But It Works
I’m an engineer, and my first response to this scheme is “Yuck!” It just feels wrong; we’re warping the purpose of URLs and relying on magic strings. But I understand where Google is coming from; the problem is extremely difficult. Search engines need to get useful information from inherently untrustworthy sources: us.
Thanks to Kristen Riley for help with some of the images in this article.