Sometimes there is information you just can’t get. This became clear to me last week when there was a bunch of data I wanted to use. The data is available but difficult to compile: it’s all on the Internet, but spread across multiple websites in different forms. If search engines can extract specific data, why not copy what they do?
So I coded up a scraper/crawler fairly quickly and figured I would test it on one of the sites that I manage. Let me say here that this is a templated site from a vendor required by the franchise. I picked the few pieces of information I wanted from the product pages and set the crawl going. No luck; I didn’t get any usable data. Back to the drawing board. Using Chrome to view the page source, I looked at the product listings and very quickly located the problem. There was nothing in the code to tell me which element was the item, the description, the price, etc. The product pages were built with very generic table HTML, and even the CSS classes didn’t let me differentiate what was where.
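To make the problem concrete, here is a minimal sketch of what a scraper sees on a page like that. The table markup is hypothetical (invented for illustration, not taken from any of the sites mentioned), but it mimics the generic structure: plain rows and cells with no classes or attributes naming the fields.

```python
# Sketch: scraping a product table that carries no semantic markup.
# The page content below is a made-up example of "very generic table HTML".
from html.parser import HTMLParser

PAGE = """
<table>
  <tr><td>Widget A</td><td>Reliable widget</td><td>$19.99</td></tr>
  <tr><td>$24.99</td><td>Widget B</td><td>Sturdy widget</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> -- which is all the scraper can see."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = CellCollector()
parser.feed(PAGE)
print(parser.cells)
```

The result is just a flat list of strings. Nothing in the markup says which cell is the name, the description, or the price, and the second row even orders them differently, so any position-based guess the scraper makes will be wrong somewhere.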
All of a sudden the light bulb went on. This is why everyone has been talking about schema.org for the past year or so. If I can’t figure out how to extract the product from the page, how are Google and Bing, which crawl billions of pages, going to figure it out?
So I went to a major classified listing website (a site the company I work for pays thousands of dollars a month to list our inventory on) and found the same thing: there was no easy way to get the product data from the HTML. I was amazed. This was a major company and a major Internet player that gets tens of millions of visits per month.
Opening my scraper back up, I modified the code to get the product data off my website. Then I went to a competitor that uses the same company (same franchise) for their website and ran the scraper. No luck. Looking at the page’s source code, the markup was just a little bit different, but even that small difference was enough to break the crawler.
So in order to retrieve what I figured was easy, basic data, I had to write a separate scraper/crawler for each website. DANG!!! If these sites were using schema markup, it would be simple to get the product data. Now I understand why Google loves sites with schema or microdata/rich snippets. If these major websites and website providers had used any type of microdata, I would have been able to scrape them all with one scraper. Want to get your products into the major search engines? Use the new standard: Schema.org.
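Here is the flip side of the earlier sketch: the same kind of product, but marked up with schema.org microdata (`itemscope`/`itemprop`). The snippet is simplified for illustration (real schema.org markup would nest the price inside an Offer, for example), but it shows the key point: the property names travel with the data, so one generic extractor works on any site that uses the vocabulary.

```python
# Sketch: extracting a product from schema.org microdata.
# The markup below is a simplified, hypothetical example of a marked-up product.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Widget A</span>
  <span itemprop="description">Reliable widget</span>
  <span itemprop="price">$19.99</span>
</div>
"""

class MicrodataExtractor(HTMLParser):
    """Map each itemprop attribute to the text it wraps."""
    def __init__(self):
        super().__init__()
        self.current_prop = None
        self.product = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.current_prop and data.strip():
            self.product[self.current_prop] = data.strip()
            self.current_prop = None

extractor = MicrodataExtractor()
extractor.feed(PAGE)
print(extractor.product)
```

Because the extractor keys off `itemprop` rather than table position or CSS classes, it doesn’t care how each vendor lays out the page; a site that reorders the fields or restyles the template still yields the same labeled dictionary.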
Just as a test to see what would happen, I took a few products and placed them on one of the blogs I run, with very basic schema product code coded into the post. I didn’t post the link anywhere, nor did I promote it in any way. Today the post was on page 2 of Google’s results when searching for almost any product on the page. It’s important to note that the blog I posted the products on is only a few months old* and generally gets only 4-5 visits per day.
If you’ve read my blog before, you probably have a good idea which websites I’m talking about. For those of you who read this post to the end, they were: cars.com, autotrader.com and Cobalt GM websites. If any of these companies read this and would like to hire me to consult, feel free!
* if you visit the blog and notice that some of the posts are older that is because the blog was moved from a different domain about 2 months ago. Does domain age matter? That’s subject to ongoing debate. See: http://dagmarmarketing.com/does-domain-age-and-registration-length-matter-in-seo/