Machine Learning: How does it impact SEO?


So, in the previous post we discussed what machine learning is. In this post we’ll go over how machine learning is changing the way search engines (more precisely, Google) work, and how they use it (e.g. RankBrain) to deliver the best search results to their audience.

Without claiming this is also how Google itself splits things up, I want to divide the impact into two subdomains:

  • How Google processes your search query and tries to understand intent.
  • How Google designs SERPs that are relevant to the search query.

There’s a lot more to talk about than just these two subjects, but they’re the main deal. I’ll explain both by giving you a brief overview of how Google performed each of them before and after RankBrain. Let’s-a-go!

Processing Search Queries

With DMOZ closing recently, we got a throwback to early internet browsing behaviour, from before search engines were a thing. The more pages that were made and thrown onto the internet, the harder it became to find what you were looking for. So people tried to solve this problem. People tried to ‘organise the internet’.

Categorising pages

Just like people were used to organising everything in those days, they started to gather the most important websites and put them into folders. You have a website about your local soccer team? We’ll put that here:

ALL -> Sports -> Soccer -> Teams -> Europe -> Belgium -> KMSK Deinze

This way DMOZ, at its closing time, had categorized a stunning 3,861,210 websites into 1,031,722 categories across 90 languages. To do this, it had a team of 91,929 editors.

DMOZ Sports

This became an increasingly hard task, considering the enormous volume of websites going live on the internet every hour of the day. We needed a new, easier way to find the page you were looking for.

Search engines based on query/document matching

google processing search queries

Go ahead, type in anything you want.

Why not let people type in the thing they’re looking for and return all the pages that contain the exact search term?

That’s where search engines started. Matching exact search queries to documents. If I had a document online that has the title ‘Coffee Machine’ and I used the phrase ‘coffee machine’ a lot in the document, it would be a very relevant result for the search query ‘coffee machine’.

There are a lot of different ways to determine the relevance of a document for a given search term. Consider just the following possibilities:

  • Keyword Usage: Does the document use the query terms? How many times does it use them (in absolute / relative terms)?
  • Term Frequency x Inverse Document Frequency (TF*IDF): This method takes into account how common each word in the query is. If we’re looking for ‘great guitars’, the word ‘great’ appears in far more documents than ‘guitars’, so ‘guitars’ will weigh more heavily in determining relevance. (See the sketch right after this list.)
  • Co-occurrence: Assuming you have a lot of data, you can check which words frequently co-occur with the search query. For example: if a document is about ‘guitar lessons’, it will probably mention ‘chords’, ‘frets’, ‘notes’ and other relevant words. A document containing these co-occurring words (measured across documents) will be considered more relevant.
  • Topic Modeling (e.g. LDA): This is where it gets tough. Notice that co-occurrence alone doesn’t tell you how words are related. Topic modeling is a family of techniques for determining which words relate to each other. For example, the words ‘up’ and ‘down’ are related to each other. They are both related to ‘elevators’, but they are also related, in a totally different way, to ‘manic depression’. Topic modeling uses vectors to determine how words are related. There is an awesome post from 2010 on the Moz blog about LDA and how it correlates with rankings. It also visually explains the previous topics.
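To make the first two bullets a bit more tangible, here is a minimal sketch using scikit-learn’s TfidfVectorizer. The documents and the query are made up for illustration; a real search engine obviously works at a completely different scale and with far more signals.

```python
# Minimal sketch: scoring documents against a query with TF-IDF.
# Documents and query are made up for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Great guitars for beginners: acoustic and electric models compared.",
    "Great deals on kitchen appliances, great prices every day.",
    "Guitar lessons covering chords, frets and reading notes.",
]
query = "great guitars"

# Fit TF-IDF on the collection; a rare word like 'guitars' gets a
# higher weight than a common word like 'great'.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Rank the documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")
```

The first document wins because it contains both query words, and ‘guitars’ (rare in this collection) outweighs the very common ‘great’. Note that the third document scores zero here, because it only mentions ‘guitar’, singular. That is exactly the kind of gap the approaches below are meant to close.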

This query/document matching works great, but it has two downsides:

  • Exact search query usage: Matching documents to search queries doesn’t take search intent into account. This means that two different search queries with the same intent will get two different result sets. Also: misspellings are a big issue.
  • Manual topic modeling: The topic modeling used was mostly based on human, non-automated work. That means an enormous amount of work and a lot of editors. (DMOZ, anyone? 😉 )

Search engines using machine learning

What is needed is a machine learning system that learns how words, topics and concepts relate to each other. We need Artificial Intelligence to make search engines understand the questions we are asking so they can give us the correct answer.

I found this great talk by Hang Li (Huawei Technologies), who presented his view on how to use machine learning for advanced query / document matching. The main problem: how do you adapt to natural language (synonyms, misspellings, different queries with the same intent, …)?

If you don’t want to watch the full video, the main aspects are here:

Hang speaks about matching the keywords and concepts on different levels:

  • Term: Comparable to the query/document matching. If a document uses the term ‘NY’ a lot, it’s probably relevant for the search term ‘NY’.
  • Phrase: Just like before but on the level of phrases. Term-level matching ‘hot’ and ‘dog’ will not necessarily give you the documents that are relevant to the phrase ‘hot dog’.
  • Word Sense: This is where it starts to get interesting. On this level of matching, we need to connect similar word senses. The system should know that ‘NY’ actually means ‘New York’, and that someone searching for ‘utube’ is probably looking for ‘YouTube’. (A toy sketch of this kind of normalisation follows this list.)
  • Topic: Going even further, we should be able to match the topics behind the queries. If we can link ‘microsoft office’ to ‘powerpoint’, ‘excel’ and other relevant terms, we get an extra layer for determining the relevance of a document.
  • Structure: On this level, we should be able to get the intent of the search, no matter how it is formulated. So the structure of the language should be understood. The system should ask: ‘What is/are the most defining part(s) of this search?’
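To make the phrase and word-sense levels a bit more concrete, here is a toy sketch. The phrase and synonym tables are hand-made for the example; the whole point of the machine learning approach is that a real system learns these relations from data instead of relying on lists like these.

```python
# Toy sketch of phrase- and word-sense-level normalisation. The tables
# are hand-made for illustration; a real system learns these relations
# from massive amounts of query and document data.
PHRASES = {("hot", "dog"): "hot dog", ("michael", "jordan"): "michael jordan"}
SENSES = {"ny": "new york", "utube": "youtube"}

def normalise(query: str) -> str:
    tokens = query.lower().split()
    merged, i = [], 0
    # Phrase level: merge adjacent tokens that form a known phrase.
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(PHRASES[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Word-sense level: map every token or phrase to a canonical sense.
    return " ".join(SENSES.get(token, token) for token in merged)

print(normalise("NY hot dog"))  # -> new york hot dog
print(normalise("utube"))       # -> youtube
```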

So this is how it works from a ‘query understanding’ standpoint:

search query understanding machine learning

  1. The searcher enters the query ‘michael jordan berkele‘, which contains a typo.
  2. On a term level, the spelling error is corrected. So ‘berkele’ is interpreted as ‘berkeley’.
  3. On a phrase level ‘michael jordan’ is identified as being a phrase.
  4. On the sense level, the query is linked to similar queries like ‘michael i. jordan’ or just ‘michael jordan’.
  5. Importantly, on a topic level, the system recognizes the topic as being ‘machine learning’. If ‘Berkeley’ wasn’t in the query, there would have been confusion about the topic, as ‘Michael Jordan’ is obviously also a very famous former basketball player.
  6. On a structure level it becomes clear that Michael Jordan is the main phrase of importance. It’s not Berkeley.

Looking at it from the other side, we have a similar process:

So when both the query and document can be understood on these levels, the system can start matching the search query intent to the most relevant documents. Hang goes further into this process, but this first part explains a lot about the task that’s been given to machine learning.

This process of using machine learning to understand language and search intent has come a long way. Google uses TensorFlow to have machines learn language. Through a massive input of language data, it can build its own knowledge by understanding vector relations between words or phrases. There’s little doubt that this technology is part of RankBrain.
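As a purely illustrative aside, this is roughly what ‘vector relations between words’ means. The three-dimensional vectors below are invented by hand; real embeddings (word2vec, GloVe, models trained with TensorFlow, …) are learned from huge corpora and have hundreds of dimensions.

```python
# Illustrative only: tiny hand-made word vectors to show how cosine
# similarity expresses relatedness. Real embeddings are learned from
# huge corpora and are much higher-dimensional.
import numpy as np

vectors = {
    "coffee":   np.array([0.9, 0.1, 0.0]),
    "espresso": np.array([0.8, 0.2, 0.1]),
    "guitar":   np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["coffee"], vectors["espresso"]))  # high: closely related
print(cosine(vectors["coffee"], vectors["guitar"]))    # low: barely related
```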

So from a query-processing standpoint, machine learning is helping query/document matching by developing its own understanding of language.

Ranking search results

As said earlier, search engines have two main objectives: first, understand the search intent and match the right pages; then, rank all the matched pages so the most useful ones end up highest in the list.

Once we’ve decided which pages are probably relevant to the searcher’s intent, we have to make a guess at which page is best to rank first. There are a lot of factors used to do that. But as you might have learned from the previous post in this series, all these possibilities become too hard to handle correctly for every single search. And that’s where machine learning and things like RankBrain come into play.

So let’s see how we could rank pages.

Pages ranked based on query / document matching

Plain and simple: we let the matching algorithm run and assign scores based on the on-page relevance of the document. The document with the highest score gets ranked first.

Although simple, this is not the best way to do it, as the system is easy to trick. Once you know how the query / document matching is done, you can design a document that is very relevant according to the algorithm, but not for the user.

Pages ranked based on a set of manually weighted factors

The second option is to add extra factors that help determine whether a page will be relevant or not, and then manually set the weight each of these factors should have in ranking the search results (a toy sketch of this follows below). There are a lot of factors:

  • Page level: query / document matching score, links to the page, linking C-blocks to the page, …
  • Domain level: overall topical relevance, links to the domain, quality of content, …
  • Search level: branded search on this topic, …
  • User level: has visited this website before, visits video content regularly, …
  • Device level: what device is used, how’s the internet connection, …

The problem is that different searches need different factor weightings. And that’s more than any human can handle…
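To illustrate the problem, here is a hypothetical hand-weighted ranking function. The factor names, values and weights are all invented; the point is simply that one fixed set of weights has to serve every imaginable search.

```python
# Hypothetical hand-weighted ranking function. Factor names, values and
# weights are invented for illustration.
WEIGHTS = {
    "matching_score": 0.4,     # page level: query / document match
    "page_links": 0.2,         # page level: links to the page
    "domain_links": 0.2,       # domain level: links to the domain
    "topical_relevance": 0.2,  # domain level: overall topical relevance
}

def rank_score(factors: dict) -> float:
    return sum(WEIGHTS[name] * value for name, value in factors.items())

pages = {
    "page_a": {"matching_score": 0.9, "page_links": 0.2,
               "domain_links": 0.3, "topical_relevance": 0.8},
    "page_b": {"matching_score": 0.6, "page_links": 0.9,
               "domain_links": 0.8, "topical_relevance": 0.7},
}
for url, factors in sorted(pages.items(),
                           key=lambda item: rank_score(item[1]),
                           reverse=True):
    print(url, round(rank_score(factors), 2))
```

Change the type of search (navigational versus informational, for example) and this one set of weights is suddenly wrong. Tuning it by hand for every kind of search is exactly the part no human can keep up with.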

Pages ranked based on machine learning

Not only does Google have the necessary information on query / document matching, incoming links to the domain and the page, and the overall relevance and authority of the domain… it also gathers information on how well the search results themselves are working: click-through rate, bounce rate, and so on.

For example, if you perform a search and get a results page, there are a couple of things that can happen. Suppose you don’t click the first result. Why in hell would you not click the first result? The list of possible answers is endless.

  • You’ve already visited this domain in the past and didn’t like it.
  • The search result is not relevant to your particular situation.
  • You think this website is for older people.
  • You don’t like the way the meta description is written.

Everything from the user profile (demographics, interests, …) to on- or off-page factors (domain, meta title, …) can be in play. It is too much for a manually updated algorithm to get all these factors right. But given enough data (read: enough searches), a self-learning algorithm can do the job.

It can work its way back from the results (‘Which page did people click and probably have a good user experience with?’) to figure out how the different algorithm factors should be weighted.
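As a hedged sketch of that ‘work back from the results’ idea: instead of setting the weights by hand, you let a model learn them from user-behaviour data. Nothing about RankBrain’s internals is public, so take the click data and the simple scikit-learn model below as a conceptual illustration, not as how Google actually does it.

```python
# Conceptual sketch: learning factor weights from (made-up) click data
# instead of setting them by hand. Not how RankBrain works internally.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: matching_score, page_links, domain_links, topical_relevance
X = np.array([
    [0.9, 0.2, 0.3, 0.8],
    [0.6, 0.9, 0.8, 0.7],
    [0.4, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.6, 0.9],
])
# 1 = the result was clicked and the searcher seemed satisfied.
clicked = np.array([1, 1, 0, 1])

model = LogisticRegression().fit(X, clicked)

# The learned coefficients play the role of the hand-set weights above.
factor_names = ["matching_score", "page_links",
                "domain_links", "topical_relevance"]
print(dict(zip(factor_names, model.coef_[0].round(2))))
```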

 

Machine Learning & Digital Marketing: What is Machine Learning?

RankBrain, Programmatic Buying, Artificial Intelligence, Real Time Bidding, Algorithm Updates… Digital marketing these days is all about big words and the math behind them. How is machine learning actually impacting digital marketing?

That’s what I’m exploring in this series on ‘Machine Learning & Digital Marketing’. Although I’m not a machine learning expert, I’m trying to give you insight into how the practice itself is changing the way we do (digital) marketing today and how we will do it in the future. In the next episodes, we’ll be covering SEO, SEA, Media Buying and Analytics. But first, in this intro, let’s take a look at machine learning.

What is machine learning?

First things first! You’ve probably already heard about these 3 terms:

  • Artificial Intelligence
  • Machine Learning
  • Deep Learning

It’s good to know that there’s a difference between those 3 terms. In fact, Nvidia wrote a great blog about this subject. In short:

Artificial Intelligence is human intelligence exhibited by machines. Machine Learning is an approach to achieve artificial intelligence. Deep Learning is a technique for implementing machine learning.

For example, with what is called “narrow AI” we can ask a machine to do a very specific task, like ‘beating a human at chess’ or ‘given a certain word, returning the most relevant page of a website’. Notice how the AI doesn’t need to understand Alexis de Tocqueville’s view on democracy. It doesn’t need to mimic the human brain, just do what is needed to perform the task at hand.

Artificial Intelligence: The art of beating a human at chess

There are lots of ways to make a computer beat a human at chess:

Source: Maarten van den Heuvel @ unsplash.com

  • Ask expert chess players for their strategy and implement it as a combination of ‘If this then that’-rules.
  • Gather data on every chess game between two humans. For every situation, plan out the possible actions and the probability of winning the game for each action. Let the system always choose the action that gives the highest probability of winning the game.

You might have noticed that there’s a problem with these two solutions. If the data the AI is based on is static, the AI becomes very predictable. Even though it might beat humans a few times, once a human figures out the decision-making performed by the AI, it should never win a game again, since the human can then develop a counter-tactic against the ‘highest probability’ choices. The AI will not change its strategy. So we need a new way.

Machine Learning: The art of beating a human at chess again and again

The new way would be something like this:

  • Let the computer play millions of games and gather data on winning probabilities for every action in every possible situation. Make it constantly learn and adjust its own choices, including as many hand-written parameters as you can imagine.

This last part, the machine learning, ensures that the AI will be able to keep beating humans at chess in the future. Keep in mind that it needs to make mistakes and lose games in order to learn how to win them. It will not go on a 100% winning streak starting from its first win. (There is a very good life lesson in this paragraph. 😉 )

To get back to our example: the chess computer will probably learn that the hand-written parameter of ‘randomness’ is important. If it doesn’t want to be perfectly predictable, the AI might sometimes pick the second-highest-probability move to challenge the human’s processing capacity. But excellence lies in the balance: it should not lower its chances of success by too much.
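A toy sketch of that idea, with invented moves and win probabilities (a real engine would estimate these from millions of self-played games):

```python
# Toy sketch of 'pick the highest win probability, with a pinch of
# randomness'. Moves and probabilities are invented.
import random

move_win_probability = {"e4": 0.54, "d4": 0.53, "Nf3": 0.52, "c4": 0.51}

def choose_move(probabilities, second_best_rate=0.1):
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    # Usually play the best move, occasionally the runner-up, so a human
    # opponent cannot simply memorise one counter-tactic.
    if len(ranked) > 1 and random.random() < second_best_rate:
        return ranked[1]
    return ranked[0]

print(choose_move(move_win_probability))
```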

Another example:

Artificial Intelligence: Simulating a game of football

The thing that sparked my interest in AI is gaming, most of all sim(ulation) gaming. For example (and I’m sorry, non-football lovers): the game Football Manager.

Football Manager AI

The amount of hours I played this game…

It essentially mimics the game of football being played in the real world, with excellent precision. The game seems simple:

  • You have a club with a group of players, each of them having their own set of abilities. For example: Scott Davidson, a central defender at the Scottish League Two club Stirling Albion:
    Scott Davidson Stirling
  • When playing a match, Scott is put in a line-up, combined with some high-level strategy decisions that will guide his decision making:
  • In-game, these players are constantly making decisions. For example, Henderson in this case gets the ball and has to decide what to do:

    Artificial Intelligence Football Manager

    Run down the flank? Pass the ball? To whom?

And this is where it becomes interesting: Henderson has to make a decision, which is based on his parameters (Vision, Anticipation, Decisions, …), the tactical guidelines (‘Pass Shorter’, ‘Take No Risk’) and many more factors. Once he has made his decision, the execution of his action is also based on factors like his parameters (Passing, Technique, Dribbling, …), the pitch quality, his fitness level…
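Just to visualise that kind of decision making (and to be clear: this is not how Football Manager is actually coded, every number here is invented), a tiny sketch could look like this:

```python
# Purely illustrative: weighing a player's attributes and tactical
# instructions into one decision. All values are invented.
attributes = {"vision": 12, "decisions": 14, "passing": 13, "dribbling": 9}
instructions = {"pass_shorter": True, "take_no_risk": True}

options = {
    "short_pass":    0.5 * attributes["passing"] + 0.3 * attributes["decisions"],
    "through_ball":  0.6 * attributes["vision"] + 0.2 * attributes["passing"],
    "run_with_ball": 0.7 * attributes["dribbling"],
}
# Tactical guidelines shift the scores before the decision is made.
if instructions["take_no_risk"]:
    options["through_ball"] *= 0.6
    options["run_with_ball"] *= 0.7
if instructions["pass_shorter"]:
    options["short_pass"] *= 1.2

print(max(options, key=options.get))  # -> short_pass, with these numbers
```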

Machine Learning: Keeping the game interesting to play

This would be (and for most people: is 😀 ) a very boring game if there were one tactic that won every game. The thing is that, within certain limitations, the ‘other coaches’ adapt their tactics to what you are doing.

This ensures that you’ll have to keep changing your tactics to keep winning games. It makes for a very frustrating game at times, but in essence it makes the game endlessly playable (and some of us do exactly that…).

So, now that we know what artificial intelligence and machine learning are, what’s this deep learning thing?

Deep Learning: Mimicking neural networks

Then for the truly abstract part of this. What deep learning actually does is very close to how we think our brains work: through neural networks:

Artificial Neural Network

A very simple Artificial Neural Network – Source: Wikipedia.com

There is a certain amount of input, divided over different nodes. This input gets transformed in different hidden layers of nodes. The amazing thing is that the connected nodes pass their ‘transformed input’, together with a weighting of their own input (considering the output), on to the next layer.

Given the rise in processing capacity and the mathematical innovations of recent years, we are capable of doing ‘sort of what the brain does’, on a smaller scale.

Dr. Pete Meyers actually explained this brilliantly simply at MozCon 2016:

The way a neural network works is: We have these [layers of] inputs and we have these [layers of] outputs we want to achieve. […] So we’re trying to put something in between that can model that [input to output]. […] We put in this data to train it, but then the machine itself can handle new inputs that’s never seen before.

So in effect, by letting the machine learn backwards from the output to the input, we create artificial intelligence that processes new input into the desired output. This allows us (bearing in mind the quality of the training data, processor capacity, …) to build better data-processing tools than our minds could consciously build by hand. That’s crazy.
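If you want to see that ‘train on input/output pairs, then handle unseen input’ idea in a few lines, here is a minimal sketch using TensorFlow’s Keras API on toy data. The task (learning roughly y = 2x + 1) is trivial on purpose; real networks differ in scale, architecture and everything else.

```python
# Minimal sketch: a tiny neural network learns a mapping from inputs to
# desired outputs (here roughly y = 2x + 1) and then handles an input
# it has never seen. Purely illustrative.
import numpy as np
import tensorflow as tf

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2 * x + 1  # the pattern the hidden layer has to model

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),  # hidden layer
    tf.keras.layers.Dense(1),                     # output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="mse")
model.fit(x, y, epochs=2000, verbose=0)

# An input the network has never seen before.
print(model.predict(np.array([[5.0]]), verbose=0))  # should be close to 11
```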

And this is, and will be, impacting the world in general and digital marketing in particular. In the next episodes we’ll discuss the impact this has on SEO, SEA, Media Buying and Analytics. If you have any other ideas on this, be sure to let me know!

Google Fred Update: What just happened?


Although the Belgian weather has been improving over the last couple of days, Moz reported some rainy and stormy weather in Google’s rankings.

More and more people got the same feeling, and word is out that there has been a ‘Fred’ update. Google did not confirm this update, and they probably never will.

Let’s get into what actually happened over the last couple of days, and what is probably still happening. Let me know what you’ve noticed!

Fred Update Stormy Weather

What are the signs?

The signs are pretty straightforward:

  • Fluctuations in rankings.
  • A big drop in external links since the last Moz index, and corresponding Domain Authority drops.
  • People have reported big organic traffic drops or rises. (I have seen nothing significant myself.)
  • Other reports say some PBNs got de-indexed.

I don’t have sufficient data on the sites that were hit to come to a final conclusion on this. But many signs point to an external-link-related problem. Knowing this, you can expect a lot of discussion in the black hat world. And although this guy (‘seoguy81’) is being laughed at on the Black Hat Forum, he might be closer to the cause than any of the other commenters:

fred update black hat forum

Of course the changes in TF/CF and PA/DA are not the root of the problem; they’re only metrics. But it is remarkable that they are fluctuating too, which means these metrics have a changing factor in common that is also influencing rankings right now.

Seeing these fluctuations in metrics across different SEO tools confirms to me that this is a link-based situation. But it might not be what we expect at first sight.

What might be going on?

It seems to me that there are at least three ways this might be happening:

  • Google has de-indexed a lot of websites.
  • Google has improved and tightened their (automated?) decision-making on de-indexing.
  • Google has new spam flags based on link quality (including over-optimizing anchor texts).

Why do I think so? There are some clues that are very hard to ignore.

Fluctuations in both Google Rankings and Moz Metrics.

Given this example:

Fred update Visibility Drop

We can see that every domain in this example has suffered a loss in indexed links by Moz:

Fred Update External Links Lost

And it’s not just this one vertical:

Resulting in some big changes in Domain Authority:

fred update domain authority

The Domain Authority and the change in MozTrust correlate strongly with the changes in Search Visibility. The number of external links lost after the change, combined with the MozTrust built up by the remaining links, is a very good indicator of what has happened to a domain’s search visibility.

This, for me, is a big clue. My best guess so far is that Google has (automatically) de-indexed a lot of domains. This has cut out part of the benefit of spammy links. Domains now thrive on what remains of their links, so basically on:

  • Spammy links not de-indexed (yet).
  • Quality and volume of their remaining link profile.

The only thing really missing to prove this is some de-indexed domains. 😀
There were reports of some PBNs having been de-indexed, but since I don’t use them, I don’t have any examples. 😉
If you happen to have examples, let me know! (I won’t tell anyone…)

Ranking loss mainly for secondary keywords.

The second thing that catches the eye is that ranking drops are mainly for secondary keywords, most of the time not even targeted by the on-page content. These rankings seem to have been gained by optimizing the external links’ anchor texts.

The remaining question is why they dropped:

  • Were these secondary anchor texts used in the now de-indexed links?
  • Has Google updated its algorithm to be harder on weak anchor text / on-page content correlations?

Personally, I think the latter would be strange, considering RankBrain and its ability to understand language. It would be odd to then penalize anchor texts based on TF-IDF-like relevance scores.

 

Should URL Structure Follow Site Structure?

It’s a frequently heard piece of advice when talking about URL structures:

Your URL structure should follow your site’s navigation as closely as possible. This way Google understands your site structure better.

For example Kissmetrics states it on its blog:
https://blog.kissmetrics.com/site-structure-enhance-seo/

So does Moz.com when talking about URL Structure best practices:
Moz.com url structure site navigation

But is this actually true? Let’s dive into this.

URL Structure vs. Site Structure

So, where does this discussion come from?
It’s something that every slightly organized mind struggles with when deciding on URL structures. Should the URL structure match the site’s structure? Should it match the site’s navigation?

URLs basically look like this:

https://www.example.com/example/example/some-keywords
or, technically speaking
https://subdomain.domain.tld/folder/subfolder/page-slug

This format is used because websites (most of the time) are built as one giant folder with a lot of pages, structured within subfolders.
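For the technically inclined: Python’s standard urlparse shows exactly which parts we’re talking about, using the placeholder URL from above.

```python
# Small sketch: splitting a URL into host, folders and page slug, and
# counting the folder depth (the number of subfolders before the slug).
from urllib.parse import urlparse

url = "https://subdomain.domain.tld/folder/subfolder/page-slug"
parsed = urlparse(url)

parts = [part for part in parsed.path.split("/") if part]
print("host:", parsed.netloc)           # subdomain.domain.tld
print("folders:", parts[:-1])           # ['folder', 'subfolder']
print("page slug:", parts[-1])          # page-slug
print("folder depth:", len(parts) - 1)  # 2
```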

Seems normal so far. But how do we handle pages that could be in more than one (sub)folder?

I know, right?

Can we still use the site structure as guidance for the URL structure?

URL Structures: Best Practices

What should your URL structure look like when you follow the best practices?

  • Readable: Your URL should be readable by both humans and search engines. Nothing changes in this respect.
  • Keywords: Your URL should contain your focus keyword, but you can’t stuff it in there a million times. So what’s the point of having a URL like /shoes/red-shoes/red-shoes-with-white-unicorn-prints?
  • Short: Your URL should be as short as possible. Hmmm…

Ranking Correlations

Bleh, best practices. Let’s check Moz’s ranking correlation study from 2015 for clues:
URL Structure Ranking Factors

Folder depth of URL (measured by number of trailing slashes) has a -0.03 Spearman correlation with higher rankings. URL length in characters even has a -0.11 Spearman correlation.

I couldn’t find any other relevant ranking factors. This is a debate that’s very hard to measure. Do folders / subfolders pass Page Authority? We don’t know. I’m not sure how we could measure this correctly to isolate the benefits of having your site structure in your URL structure.

Keep in mind that these are correlations in the end; they prove no causal effect whatsoever. You could expect that the many non-SEO-optimized websites with /these-urls-that-are-way-too-long/and-have/many/subfolders are responsible for this…
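If you want a feel for how such a correlation is computed, here is a tiny example with scipy; the numbers are made up and have nothing to do with Moz’s dataset.

```python
# Made-up numbers, only to show how a Spearman rank correlation between
# URL length and ranking position is computed. Not Moz's data.
from scipy.stats import spearmanr

url_length = [35, 80, 52, 120, 44, 95, 60, 28, 70, 110]  # characters
position = [3, 5, 1, 8, 7, 2, 6, 4, 10, 9]               # 1 = best ranking

rho, p_value = spearmanr(url_length, position)
# With position 1 = best, a positive rho here would mean longer URLs
# tend to sit lower in the results.
print(f"Spearman rho: {rho:.2f}, p-value: {p_value:.2f}")
```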

Search Result

So you’ll just do it for the search result, right?

Breadcrumb trail:
You want your search result to have this shiny breadcrumb trail that ‘shows Google understands your website structure’:
URL Structure Breadcrumbs

Well, actually this has more to do with your breadcrumbs being marked up correctly than it has to do with URL structure:
Breadcrumb versus url

Sitelinks:
Yes, OK, but you still need it so Google can understand your site’s structure well enough to provide sitelinks, don’t you?
sitelinks url structure

Nope:
Sitelinks URL structure

This is koelkaststore.be’s URL structure as a tree:
url structure tree

How does Google read structure?

Well, I’m wondering too. But from what I see in this example, I’m guessing that internal links used in navigation and throughout the website are way more important than URL structure. Look at their homepage:

I’ve marked the sitelinks generated by Google

My best guess is a combination of these factors:

  1. Total Number of Internal Links
  2. Breadcrumbs
  3. Other Vertical Linking
  4. Horizontal Linking
  5. Page Depth Measured By Minimum Clicks From Root Domain

The big question is whether URL structure should be added to this list… It could be that ‘Page Depth Measured By Trailing Slashes’ or ‘Folder Depth’ is also a way to read structure, but I’m starting to doubt it.
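‘Page Depth Measured By Minimum Clicks From Root Domain’ is easy to make concrete: it’s a breadth-first search over the internal-link graph. The graph below is made up purely to show the idea.

```python
# Toy sketch: page depth measured in minimum clicks from the homepage,
# computed with a breadth-first search over a made-up internal-link graph.
from collections import deque

internal_links = {
    "/": ["/fridges", "/freezers", "/contact"],
    "/fridges": ["/fridges/built-in", "/fridges/american-style"],
    "/freezers": ["/freezers/upright"],
    "/fridges/built-in": [],
    "/fridges/american-style": [],
    "/freezers/upright": [],
    "/contact": [],
}

def click_depth(graph, root="/"):
    depth, queue = {root: 0}, deque([root])
    while queue:
        page = queue.popleft()
        for linked in graph[page]:
            if linked not in depth:  # first visit = shortest click path
                depth[linked] = depth[page] + 1
                queue.append(linked)
    return depth

print(click_depth(internal_links))
```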

Whatever the exact mix, it’s not their XML sitemap, although there’s some other interesting stuff going on in there. 😀

BTW: I assume that when generating sitelinks for the search result, these metrics are also in play:

  • Clickthrough Rate from Search Result
  • Search Volume
  • User Behavior Signals
  • Number of External Links / Referring Domains

Just use the damn structure!

So you’re still willing to use your site structure for SEO reasons?
Then here’s my final argument: You’re wasting your crawl budget.

If, because of your site structure, a page has multiple URLs, you’ll have to use canonical tags to tell search engines which one is the preferred page. This is where many e-commerce sites struggle: which category do we use for a product’s URL? I’ve seen three solutions so far:

  • Root: Just have the product slug right behind the root. example.com/donkey-milk
  • Main Category: Give every product a main category which is the folder for the product. example.com/drinks/donkey-milk
  • Bad Solutions.

TL;DR:

I’m not saying you should NEVER follow your site structure when building your URL structure. If it makes perfect sense to users and search engines and you’re outranking your competitors, by all means, do it. All I’m saying is that there isn’t that much evidence that you should be doing it. There’s even evidence that you shouldn’t…