Machine Learning: How does it impact SEO?

[Image: machine learning SEO]

So, in the previous post we discussed what machine learning is. In this post we’ll go over how machine learning is changing the way search engines (more precisely, Google) work, and how they use machine learning (e.g. RankBrain) to deliver the best search results to their audience.

Without claiming this is also how Google itself divides the work, I want to split the impact into two areas:

  • How Google processes your search query and tries to understand intent.
  • How Google designs SERPs that are relevant to the search query.

There’s a lot more to talk about than just these two subjects, but they are the main deal. I’ll explain both by giving you a brief overview of how Google performed each of them before and after RankBrain. Let’s-a-go!

Processing Search Queries

With DMOZ closing recently, we’ve got a throwback to early internet browsing behavior, before search engines were a thing. The more pages that were made and thrown onto the internet, the harder it became to find what you were looking for. So people tried to solve this problem. People tried to ‘organise the internet’.

Categorising pages

Just like people were used to organising everything in those days, they started to gather the most important websites and put them into folders. You have a website about your local soccer team? We’ll put that here:

ALL -> Sports -> Soccer -> Teams -> Europe -> Belgium -> KMSK Deinze

This way DMOZ, at its closing time, had categorized a stunning 3,861,210 websites into 1,031,722 categories across 90 languages. To do this, they had a team of 91,929 editors.

[Image: the DMOZ Sports category]

This became an increasingly hard task, considering the enormous volume of websites going live on the internet every hour of the day. We needed a new, easier way to find the page you were looking for.

Search engines based on query/document matching

[Image: Google processing search queries]

Go ahead, type in anything you want.

Why not let people type in the thing they’re looking for and return all the pages that contain the exact search term?

That’s where search engines started. Matching exact search queries to documents. If I had a document online that has the title ‘Coffee Machine’ and I used the phrase ‘coffee machine’ a lot in the document, it would be a very relevant result for the search query ‘coffee machine’.

There are a lot of different ways to determine the relevance of a document given a search term. Consider just the following possibilities:

  • Keyword Usage: Is the document using this query? How many times does it use it (in absolute / relative terms)?
  • Term Frequency x Inverse Document Frequency (TF*IDF): This method takes into account how common the words in the query are. If we’re looking for ‘great guitars’, the word ‘great’ will be more common, so the word ‘guitars’ will be more important in determining relevance. (See the sketch after this list.)
  • Co-occurrence: Assuming you have a lot of data, you can check which words frequently co-occur with the search query. For example: if a document is about ‘guitar lessons’, it will probably mention ‘chords’, ‘frets’, ‘notes’ and other relevant words. A document containing these co-occurring words (measured across documents) will be considered more relevant.
  • Topic Modeling (e.g. LDA): This is where it gets tough. Notice that co-occurrence doesn’t imply the words are relevant. Topic modeling is a collection of techniques to determine which words are related to each other. For example, the words ‘up’ and ‘down’ are related to each other. They are both related to ‘elevators’, but they are also related, in a totally different way, to ‘manic depression’. Topic modeling uses vectors to determine how words are related. There is an awesome post from 2010 on the Moz blog about LDA and how it correlates with rankings. It also visually explains the previous topics.
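To make the TF*IDF idea a bit more concrete, here’s a minimal sketch that scores a few made-up documents against a query using scikit-learn’s TF-IDF vectorizer and cosine similarity. It’s purely illustrative, not how any particular search engine implements it:

```python
# Minimal sketch: score documents against a query with TF-IDF + cosine similarity.
# The documents and the query are made-up examples; scikit-learn is assumed available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Great guitars for beginners: acoustic and electric models compared.",
    "Guitar lessons covering chords, frets and notes for new players.",
    "A great day out: the best coffee machines reviewed.",
]
query = "great guitars"

# Fit TF-IDF on the document collection; rare words get a higher weight than common ones.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```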

This works great but has two downsides:

  • Exact search query usage: Matching documents to search queries doesn’t take search intent into account. This means that two different search queries with the same intent will get two different sets of results. Also: misspellings are a big issue.
  • Manual topic modeling: The topic modeling used is mostly based on human, non-automated work. That means an enormous amount of work and a lot of editors. (DMOZ, anyone? 😉 )

Search engines using machine learning

What is needed is a machine learning system that learns how words, topics and concepts relate to each other. We need Artificial Intelligence to make search engines understand the questions we are asking so they can give us the correct answer.

I found this great talk by Hang Li (Huawei Technologies), who presented his view on how to use machine learning for advanced query/document matching. The main problem: how do you adapt to natural language (synonyms, misspellings, similar queries <-> same intent, …)?

If you don’t want to watch the full video, the main aspects are here:

Hang speaks about matching the keywords and concepts on different levels:

  • Term: Comparable to query/document matching. If a document uses the term ‘NY’ a lot, it’s probably relevant for the search term ‘NY’.
  • Phrase: Just like before, but on the level of phrases. Term-level matching of ‘hot’ and ‘dog’ will not necessarily give you the documents that are relevant to the phrase ‘hot dog’.
  • Word Sense: This is where it starts to get interesting. On this level of matching, we need to connect similar word senses. The system should know that ‘NY’ is actually ‘New York’, and that someone searching for ‘utube’ is probably looking for ‘YouTube’.
  • Topic: Going further, we should be able to match the topics of the queries being used. If we can link ‘microsoft office’ to ‘powerpoint’, ‘excel’, … and other relevant terms, this gives us an extra layer to determine the relevance of a document. (See the embedding sketch after this list.)
  • Structure: On this level, we should be able to get the intent of the search, no matter how it is formulated. So the structure of the language should be understood. The system should ask: ‘What is/are the most defining part(s) of this search?’
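To show how the word sense and topic levels can be learned from data instead of being curated by hand, here’s a minimal sketch using gensim’s Word2Vec on a toy corpus. The corpus and hyperparameters are made up and far too small to be meaningful; on a real corpus the model would learn that, say, ‘powerpoint’ and ‘excel’ sit close to ‘microsoft office’ in vector space:

```python
# Minimal sketch: learn word vectors from a (toy) corpus and query for related terms.
# The corpus is far too small to produce meaningful vectors; it only shows the mechanics.
from gensim.models import Word2Vec

corpus = [
    ["microsoft", "office", "includes", "powerpoint", "excel", "and", "word"],
    ["powerpoint", "is", "used", "for", "presentations"],
    ["excel", "is", "used", "for", "spreadsheets"],
    ["guitar", "lessons", "cover", "chords", "frets", "and", "notes"],
]

# vector_size, window and min_count are arbitrary choices for this toy example.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

# Words that frequently appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("powerpoint", topn=3))
```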

So here’s how this works from a ‘query understanding’ standpoint (a toy sketch follows the steps):

[Image: search query understanding with machine learning]

  1. The searcher enters the query ‘michael jordan berkele‘, which contains a typo.
  2. On a term level, the spelling error is corrected. So ‘berkele’ is interpreted as ‘berkeley’.
  3. On a phrase level ‘michael jordan’ is identified as being a phrase.
  4. On the sense level there are similar queries like ‘michael i. jordan’ or just ‘michael jordan’.
  5. Importantly, on a topic level, the system recognizes the topic as being ‘machine learning’. If ‘Berkeley’ wasn’t in the query, there would have been confusion on the topic, as ‘Michael Jordan’ is obviously also a very famous former basketball player.
  6. On a structure level it becomes clear that Michael Jordan is the main phrase of importance. It’s not Berkeley.
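As a toy sketch of these steps (the lookup tables below are hand-rolled, hypothetical examples, nothing like the systems Google actually runs), a pipeline in this spirit could look like this:

```python
# Toy sketch of a query-understanding pipeline: spelling correction, phrase detection
# and synonym/sense expansion. All lookup tables are hypothetical, hard-coded examples.
SPELLING = {"berkele": "berkeley", "utube": "youtube"}
PHRASES = {("michael", "jordan"): "michael jordan"}
SENSES = {"michael jordan": ["michael i. jordan", "michael jordan"]}

def understand(query: str) -> dict:
    # Term level: correct obvious misspellings token by token.
    tokens = [SPELLING.get(t, t) for t in query.lower().split()]

    # Phrase level: merge adjacent tokens that form a known phrase.
    phrases, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            phrases.append(PHRASES[pair])
            i += 2
        else:
            phrases.append(tokens[i])
            i += 1

    # Sense level: expand phrases to similar queries / alternative senses.
    expansions = {p: SENSES.get(p, [p]) for p in phrases}
    return {"terms": tokens, "phrases": phrases, "senses": expansions}

print(understand("michael jordan berkele"))
```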

Looking at it from the other side, we have a similar process:

So when both the query and document can be understood on these levels, the system can start matching the search query intent to the most relevant documents. Hang goes further into this process, but this first part explains a lot about the task that’s been given to machine learning.

This process of using machine learning to understand language and search intent has come a long way. Google uses TensorFlow to have machines learn language. Through a massive input of language data, it can build its own knowledge by understanding vectorial correlations between words or phrases. There’s little doubt that this technology is part of RankBrain.

So from a query-processing standpoint, machine learning is helping query/document matching by developing its own understanding of language.

Ranking search results

As said earlier, search engines have two main objectives: First, understanding the search intent to match the right pages. Then, rank all the matched pages so the most useful will be highest in the list.

Once we’ve decided which pages are probably relevant to the searcher’s intent, we’ll have to make a guess at which page is the best one to rank first. And there are a lot of factors being used to do that. But as you might have learned from the previous blog in this series, all these possibilities become too hard to handle correctly for every single search. And that’s where machine learning and things like RankBrain come into play.

So let’s see how we could rank pages.

Pages ranked based on query / document matching

Plain and simple. We let the matching algorithm run and assign scores based on the on-page relevance of the document. The document with the highest score gets ranked first.

Although simple, this is not the best way to do it, as it is an easy system to trick. Once you know how the query/document matching is done, you’ll be able to design a document that is very relevant according to the algorithm, but not to the user.

Pages ranked based on a set of manually weighted factors

The second approach is to add extra factors that help define whether a page will be relevant or not, and then manually set the weight each of these factors should have when ranking the search results (see the sketch after this list). There are a lot of factors:

  • Page level: query / document matching score, links to the page, linking C-blocks to the page, …
  • Domain level: overall topical relevance, links to the domain, quality of content, …
  • Search level: branded search on this topic, …
  • User level: has visited this website before, visits video content regularly, …
  • Device level: what device is used, how’s the internet connection, …
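To make this concrete, here’s a hedged sketch of what such a manually weighted scoring function could look like. The factor names, value ranges and weights below are entirely hypothetical; the point is only that a human fixes the weights once, for every search:

```python
# Sketch of a manually weighted ranking score. Factor names, value ranges and
# weights are hypothetical; the point is that a human sets the weights by hand.
WEIGHTS = {
    "query_document_match": 0.40,
    "links_to_page": 0.25,
    "domain_topical_relevance": 0.20,
    "branded_search_share": 0.10,
    "mobile_friendly": 0.05,
}

def ranking_score(factors: dict) -> float:
    # Each factor is assumed to be normalised to a 0..1 range beforehand.
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

page_a = {"query_document_match": 0.9, "links_to_page": 0.2,
          "domain_topical_relevance": 0.7, "branded_search_share": 0.1,
          "mobile_friendly": 1.0}
page_b = {"query_document_match": 0.6, "links_to_page": 0.8,
          "domain_topical_relevance": 0.9, "branded_search_share": 0.4,
          "mobile_friendly": 1.0}

# Higher score ranks first; the catch is that one set of weights has to fit every search.
print(sorted([("page_a", ranking_score(page_a)), ("page_b", ranking_score(page_b))],
             key=lambda pair: -pair[1]))
```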

The problem is that different searches need different weightings of these factors. And that’s more than any human can manage…

Pages ranked based on machine learning

Not only does Google have the necessary information on query/document matching, incoming links to the domain and the page, overall relevance and power of the domain… It also gathers information on how well the search results are working: it measures click-through rate, bounce rate, and so on.

For example, if you perform a search and get a search results page, there are a couple of things that can happen. Suppose you don’t click the first result. Why in hell would you not click the first result? The list of possible answers is endless.

  • You’ve already visited this domain in the past and didn’t like it.
  • The search result is not relevant to your particular situation.
  • You think this website is for older people.
  • You don’t like the way the meta description is written.

Everything from the user profile (demographics, interests, …) to on- or off-page factors (domain, meta title, …) can be in play. It is too much for a manually updated algorithm to get all these factors right. But given enough data (read: enough searches), a self-learning algorithm could do the job.

It can work its way back from the results (‘Which page did people click and probably have a good user experience on?’) to define how the different algorithm factors should be weighted.
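As a purely illustrative sketch of that idea (not Google’s actual implementation), you could train a simple model on outcome data and read the learned coefficients as factor weights. The features and labels below are random stand-ins for real search and behaviour logs:

```python
# Sketch: learn factor weights from outcome data instead of setting them by hand.
# The feature matrix and labels are random stand-ins for real search/behaviour logs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
factor_names = ["match_score", "links_to_page", "domain_relevance", "prior_visits"]

# Each row is one (query, result) pair; the label says whether the user seemed satisfied
# (e.g. clicked and did not bounce straight back to the results page).
X = rng.random((500, len(factor_names)))
y = (X @ np.array([2.0, 1.0, 1.5, 0.5]) + rng.normal(0, 0.5, 500) > 2.5).astype(int)

model = LogisticRegression().fit(X, y)

# The learned coefficients play the role of the weights a human used to set manually.
for name, weight in zip(factor_names, model.coef_[0]):
    print(f"{name}: {weight:.2f}")
```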

 

Google Fred Update: What just happened?

[Image: Google Fred update: what happened?]

Although the Belgian weather has been improving over the last couple of days, Moz reported some rainy and stormy weather in Google’s rankings.

More and more people got the same feeling, and word is out that there has been a ‘Fred’ update. Google did not confirm this update, and they probably never will.

Let’s get into what actually happened over the last couple of days and is probably still happening. Let me know what you’ve noticed!

[Image: Fred update stormy weather]

What are the signs?

The signs are pretty straightforward:

  • Fluctuations in rankings.
  • Big drop in external links since last Moz Index and corresponding Domain Authority drops.
  • People have reported big organic traffic drops or rises. (I have seen nothing significant myself.)
  • Other reports say some PBNs got de-indexed.

I don’t have sufficient data on the sites that were hit particularly hard to come to a final conclusion on this. Many signs are pointing at an external-link-related problem. Knowing this, you can expect a lot of discussion in the black hat world. And although this guy (‘seoguy81’) is being laughed at on the Black Hat Forum, he might be closer to the cause than any of the other commenters:

[Image: Fred update discussion on the Black Hat Forum]

Of course the changes in TF/CF and PA/DA are not the root of the problem. They’re only metrics. But it is remarkable that these are also fluctuating, meaning that these metrics have a changing factor in common that is also influencing rankings right now.

Seeing these fluctuations in metrics across different SEO tools confirms to me that this is a link-based situation. But it might not be what we expect at first sight.

What might be going on?

It seems to me that there are at least three ways this might be happening:

  • Google has de-indexed a lot of websites.
  • Google has improved and tightened their (automated?) decision-making on de-indexing.
  • Google has new spam flags based on link quality (including over-optimizing anchor texts).

Why do I think so? There are some clues that are very hard to ignore.

Fluctuations in both Google Rankings and Moz Metrics.

Given this example:

[Image: Fred update visibility drop]

We can see that every domain in this example has suffered a loss in indexed links by Moz:

[Image: Fred update external links lost]

And it’s not just this one vertical:

Resulting in some big changes in Domain Authority:

[Image: Fred update Domain Authority changes]

The changes in Domain Authority and MozTrust correlate strongly with the changes in Search Visibility. The number of external links lost after the change, combined with the MozTrust built up by the remaining links, is a very good indicator of what has happened to a domain’s search visibility.

This, for me, is a big clue. My best guess so far is that Google has (automatically) de-indexed a lot of domains. This has cut out part of the benefit from spammy links. Domains now thrive on what’s left of their links, so basically on:

  • Spammy links that haven’t been de-indexed (yet).
  • The quality and volume of their remaining link profile.

The only thing really missing to prove this, are some de-indexed domains. 😀
There were some reports of some PBN’s having been de-indexed, but since I don’t use them, I don’t have any examples. 😉
If you happen to have examples, let me know! (I won’t tell anyone…)

Ranking loss mainly for secondary keywords.

The second thing that catches the eye is that the ranking drops are mainly for secondary keywords, most of the time not even targeted by the on-page content. These rankings seem to have been gained by optimizing the anchor texts of external links.

The remaining question is why they dropped:

  • Were these secondary anchor texts used in the now de-indexed links?
  • Has Google updated its algorithm to be harder on weak correlations between anchor text and on-page content?

I personally think the last one would be strange, considering RankBrain and its ability to understand language. It would be strange, then, to penalize anchor texts based on TF-IDF-like relevance scores.

 

Should URL Structure Follow Site Structure?

It’s a frequently heard piece of advice when talking about URL structures:

Your URL structure should follow your site’s navigation as closely as possible. This way Google understands your site structure better.

For example Kissmetrics states it on its blog:
https://blog.kissmetrics.com/site-structure-enhance-seo/

So does Moz.com when talking about URL Structure best practices:
[Image: Moz.com on URL structure and site navigation]

But is this actually true? Let’s dive into this.

URL Structure vs. Site Structure

So, where does this discussion come from?
It’s something that every slightly organized mind struggles with when deciding on URL structures. Should they match the site’s structure? Should they match the site’s navigation?

URLs basically look like this:

https://www.example.com/example/example/some-keywords
or, technically speaking
https://subdomain.domain.tld/folder/subfolder/page-slug

This format is used because, most of the time, websites are built as one giant folder with a lot of pages, structured within subfolders.

Seems normal so far. But how do we handle pages that could be in more than one (sub)folder?

I know, right?

Can we still use the site structure as guidance for the URL structure?

URL Structures: Best Practices

What should your URL structure look like when you follow the best practices?

  • Readable: Your URL should be readable by both humans and search engines. I don’t see anything changing in this respect.
  • Keywords: Your URL should contain your focus keyword, but you can’t stuff it in there a million times. So what’s the point in having a URL like /shoes/red-shoes/red-shoes-with-white-unicorn-prints?
  • Short: Your URL should be as short as possible. Hmmm…

Ranking Correlations

Bleh, best practices. Let’s check Moz’s ranking correlation study from 2015 for clues:
[Image: URL structure ranking factors]

Folder depth of URL (measured by number of trailing slashes) has a -0.03 Spearman correlation with higher rankings. URL length in characters even has a -0.11 Spearman correlation.

I couldn’t find any other relevant ranking factors. This is a debate that’s very hard to measure. Do folders/subfolders pass Page Authority? We don’t know. I’m not sure how we could measure this correctly to isolate the benefits of having your site structure in your URL structure.

Keep in mind that these are correlations, in the end. This proves no causal effect whatsoever. You could expect that the many non-SEO-optimized websites that have /these-urls-that-are-way-too-long/and-have/many/subfolders are responsible for this…

Search Result

So you’ll just do it for the search result, right?

Breadcrumb trail:
You want your search result to have that shiny breadcrumb trail that ‘shows Google understands your website structure’:
[Image: URL structure breadcrumbs]

Well actually, this has more to do with your breadcrumbs being formatted correctly than it has to do with URL structure:
[Image: breadcrumb versus URL]
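For reference, the breadcrumb markup involved here is schema.org’s BreadcrumbList, embedded in the page as JSON-LD. Here’s a minimal sketch, built as a Python dict for readability, with made-up page names and URLs:

```python
# Sketch: breadcrumb structured data (schema.org BreadcrumbList) serialised as JSON-LD.
# The pages and URLs are hypothetical examples.
import json

breadcrumbs = {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1, "name": "Shoes",
         "item": "https://www.example.com/shoes"},
        {"@type": "ListItem", "position": 2, "name": "Red Shoes",
         "item": "https://www.example.com/shoes/red-shoes"},
    ],
}

# This JSON ends up in a <script type="application/ld+json"> tag in the page's HTML.
print(json.dumps(breadcrumbs, indent=2))
```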

Sitelinks:
Yes, ok, but you still need it so Google can understand your site’s structure enough to provide sitelinks, don’t you?
[Image: sitelinks and URL structure]

Nope:
[Image: koelkaststore.be's sitelinks in the search result]

This is koelkaststore.be’s URL structure in a tree:
[Image: URL structure tree]

How does Google read structure?

Well, I’m wondering too. But from what I see in this example, I’m guessing that internal links used in navigation and throughout the website are way more important than URL structure. Look at their homepage:

[Image: koelkaststore.be homepage with the sitelinks generated by Google marked]

My best guess is a combination of these factors:

  1. Total Number of Internal Links
  2. Breadcrumbs
  3. Other Vertical Linking
  4. Horizontal Linking
  5. Page Depth Measured By Minimum Clicks From Root Domain

The big question is whether URL structure should be added to this list… It could be that ‘Page Depth Measured By Trailing Slashes’ or ‘Folder Depth’ is also a way to read structure, but I’m starting to doubt it.
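As an illustration of the ‘minimum clicks from root domain’ idea, here’s a hedged sketch that computes page depth with a breadth-first search over a hypothetical internal-link graph:

```python
# Sketch: page depth measured as the minimum number of clicks from the homepage,
# computed with a breadth-first search over a (hypothetical) internal-link graph.
from collections import deque

internal_links = {
    "/": ["/fridges", "/freezers", "/contact"],
    "/fridges": ["/fridges/built-in", "/fridges/american"],
    "/freezers": ["/freezers/chest"],
    "/fridges/built-in": ["/contact"],
}

def click_depths(links: dict, root: str = "/") -> dict:
    depths, queue = {root: 0}, deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first time we reach a page is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

print(click_depths(internal_links))
```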

It’s not their XML Sitemap, although there’s some other interesting stuff going on there. 😀

BTW: I assume that when generating sitelinks for the search result, these metrics are also in play:

  • Clickthrough Rate from Search Result
  • Search Volume
  • User Behavior Signals
  • Number of External Links / Referring Domains

Just use the damn structure!

So you’re still willing to use your site structure for SEO reasons?
Then here’s my final argument: You’re wasting your crawl budget.

If, because of your site structure, a page has multiple URLs, you’ll have to use canonical tags to decide which one is the preferred page. This is where many e-commerce sites struggle: which category do we use for a product’s URL? I’ve seen three solutions so far:

  • Root: Just have the product slug right behind the root. example.com/donkey-milk
  • Main Category: Give every product a main category which is the folder for the product. example.com/drinks/donkey-milk
  • Bad Solutions.

TL;DR:

I’m not saying you should NEVER follow your site structure when building your URL structure. If it makes perfect sense to users and search engines and you’re outranking your competitors, by all means, do it. All I’m saying is that there isn’t that much evidence that you should be doing it. There’s even some evidence that you shouldn’t…