Spiga

Google crawls HTML forms too

According to a post on the Google Webmaster blog, Googlebot can now crawl links inside HTML forms as well. This means that if you have a web form whose action redirects to another page (or that simply contains links pointing users to other sections of your website, such as versions in other languages), Googlebot will index those pages too. Google says this will let it index even more information. That is true in a case like the following: a big corporate site whose main page has a form for selecting a language. Until now Google had no way of indexing the links in that form unless they were included in a sitemap.

However, this new crawling behavior won't be applied to all websites; for now Google is only testing it on some sites it considers more important. A couple of days ago I noticed that Google had indexed some redirect pages on a website I manage, pages that weren't mentioned anywhere else, and now it makes sense. This website has a purchase form that lets you select the product, version and edition; when the visitor clicks the purchase button, they are sent through a redirect page with parameters to the landing page. The obvious problem is that Google indexed about 30 URLs that differed only in their parameters but contained basically the same text. I solved this by adding the redirect page as a disallow in the robots.txt file.
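To illustrate the robots.txt fix, here is a minimal Python sketch that checks whether a disallow rule covers the parameterized redirect URLs. The path `/redirect.php`, the domain and the query parameters are all made up for the example; the real redirect page on the site isn't named here. It only uses the standard library's `urllib.robotparser`, which applies simple prefix matching and may not behave exactly like Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "/redirect.php" stands in for
# the real redirect page, whose name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /redirect.php
"""


def is_blocked(url, user_agent="Googlebot"):
    """Return True if the rules above forbid user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Every parameter combination shares the same path, so one rule covers all of them.
    print(is_blocked("https://www.example.com/redirect.php?product=a&version=2&edition=pro"))
    print(is_blocked("https://www.example.com/landing.html"))
```

Because the rule matches on the path prefix, a single Disallow line should cover all 30 or so parameter variations at once.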

A couple more things about the fact that Google now crawls forms:
- if a form's action is a JavaScript function that includes URLs, Google will be able to crawl those URLs too (Google has been recognizing links in JavaScript and Flash objects for a while now)
- it would be useful if Google added some sort of setting in Google Webmaster Tools that stops Googlebot from indexing all the forms on your website. Imagine you have a web form with 40 links that you don't want indexed, all of them different - you would have to add each one manually as a disallow in the robots.txt file
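Until such a setting exists, the manual work could be scripted. The sketch below is only one possible approach, not anything Google provides: it scans a page's HTML for form actions and for links placed inside forms, then prints a Disallow line for each path. The class and function names are made up for this example, and it only handles plain site-relative URLs (JavaScript actions are skipped).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FormLinkCollector(HTMLParser):
    """Collects paths from form actions and from <a> links inside forms."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0  # greater than 0 while inside a <form>
        self.paths = []

    def _add(self, url):
        if not url:
            return
        parts = urlparse(url)
        # Keep only ordinary URLs with an absolute path; skip javascript: and the like.
        if parts.scheme in ("", "http", "https") and parts.path.startswith("/"):
            if parts.path not in self.paths:
                self.paths.append(parts.path)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form_depth += 1
            self._add(attrs.get("action"))
        elif tag == "a" and self.form_depth:
            self._add(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1


def disallow_lines(html):
    """Return one robots.txt Disallow line per form-related path found in html."""
    collector = FormLinkCollector()
    collector.feed(html)
    return ["Disallow: " + path for path in collector.paths]


if __name__ == "__main__":
    sample = (
        '<form action="/redirect.php?lang=en">'
        '<a href="/de/">DE</a><a href="/fr/">FR</a>'
        '</form><a href="/about">About</a>'
    )
    print("\n".join(disallow_lines(sample)))
```

Keep in mind that a rule like `Disallow: /de/` blocks everything under that path, so the generated lines should be reviewed before they go into robots.txt.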

Somehow I don't really find this useful, because as a webmaster, if you want something indexed you can simply add that URL to a sitemap.
