So, in the previous post we discussed what Machine Learning is. In this post we’ll go over how machine learning is impacting the way search engines (more precisely Google) work: how are they using machine learning (e.g. RankBrain) to deliver the best search results to their audience?
I’m not claiming Google itself divides things up this way, but I want to split the impact into two subdomains:
- How Google processes your search query and tries to understand intent.
- How Google designs SERPs that are relevant to the search query.
There’s a lot more to talk about than just these two subjects, but they’re the main ones. I’ll explain both by giving you a brief overview of how Google performed each task before and after RankBrain. Let’s-a-go!
Processing Search Queries
With DMOZ closing recently, we’ve got a throwback to the early internet browsing behavior before search engines were a thing. The more pages were made and put on the internet, the harder it became to just find what you were looking for. So people tried to solve this problem. People tried to ‘organise the internet’.
Just as people were used to organising everything in those days, they started to gather the most important websites and put them into folders. You have a website about your local soccer team? We’ll put that here:
ALL -> Sports -> Soccer -> Teams -> Europe -> Belgium -> KMSK Deinze
This way DMOZ had, by the time it closed, categorized a stunning 3.861.210 websites into 1.031.722 categories across 90 languages. To do this, they had a team of 91.929 editors.
This became an increasingly hard task, considering the enormous volume of websites going live on the internet each hour of the day. We needed a new, easier way to find the web page you were looking for.
Search engines based on query/document matching
Why not let people type in the thing they’re looking for and return all the pages that contain the exact search term?
That’s where search engines started. Matching exact search queries to documents. If I had a document online that has the title ‘Coffee Machine’ and I used the phrase ‘coffee machine’ a lot in the document, it would be a very relevant result for the search query ‘coffee machine’.
There are a lot of different ways to determine the relevance of a document considering a search term. Consider just the following possibilities:
- Keyword Usage: Is the document using this query? How many times does it use it (in absolute / relative terms)?
- Term Frequency x Inverse Document Frequency (TF*IDF): This method takes into account the commonality of a word used in the query. If we’re looking for ‘great guitars’, the word ‘great’ will be more common, so the word ‘guitars’ will be more important to determine the relevance.
- Co-occurrence: Assuming you have a lot of data, you could check which words frequently co-occur with the search query. For example: If a document is about ‘guitar lessons’, it will probably mention ‘chords’, ‘frets’, ‘notes’ and other relevant words. A document containing these co-occurring words (measured across documents) will be considered more relevant.
- Topic Modeling (e.g. LDA): This is where it gets tough. Notice that co-occurrence doesn’t imply the words are relevant. Topic modeling is a collection of techniques for determining which words are related to each other. For example, the words ‘up’ and ‘down’ are related to each other. They are both related to ‘elevators’, but they are also related in a totally different way to ‘manic depression’. Topic modeling uses vectors to determine how words are related. There is an awesome post from 2010 on the Moz blog about LDA and how it’s correlated to rankings. It also visually explains the previous topics.
This works great but has two downsides:
- Exact search query usage: Matching documents to search queries doesn’t take search intent into account. This means that two different search queries, having the same intent, will have two different results. Also: misspellings are a big issue.
- Manual topic modeling: The topic modeling used is mostly based on human, non-automated work. This means an enormous amount of work and editors needed. (DMOZ, anyone? 😉 )
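To make the TF*IDF idea and the first downside a bit more tangible, here is a minimal sketch in Python. The three mini-documents, the function names and the smoothed IDF formula are my own assumptions for illustration; real search engines use far more refined variants.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, made up for this example.
DOCS = {
    "shop": "great guitars for sale great prices on guitars",
    "review": "a great coffee machine makes great coffee",
    "travel": "great beaches and great food",
}

def tokenize(text):
    return text.lower().split()

def idf(term, docs):
    # Words that appear in many documents get a low weight.
    # The "+ 1" smoothing is one common variant, chosen here as an assumption.
    containing = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log((1 + len(docs)) / (1 + containing)) + 1

def tf_idf_score(query, doc_text, docs):
    tokens = tokenize(doc_text)
    counts = Counter(tokens)
    score = 0.0
    for term in tokenize(query):
        tf = counts[term] / len(tokens)  # relative term frequency
        score += tf * idf(term, docs)
    return score

def rank(query, docs):
    scores = {name: tf_idf_score(query, text, docs) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("great guitars", DOCS))
print(tf_idf_score("gitars", DOCS["shop"], DOCS))
```

Because ‘great’ appears in every document, ‘guitars’ carries more weight and the guitar shop comes out on top. The misspelled query ‘gitars’ scores zero even for the guitar shop, which is exactly the exact-match problem described above.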
Search engines using machine learning
What is needed is a machine learning system that learns how words, topics and concepts relate to each other. We need Artificial Intelligence to make search engines understand the questions we are asking so they can give us the correct answer.
I’ve found this great talk from Hang Li (Huawei Technologies), who presented his view on how to use machine learning for advanced query / document matching. The main problem being: how to adapt to natural language (synonyms, misspellings, similar queries <-> same intent,…)?
If you don’t want to watch the full video, the main aspects are here:
Hang speaks about matching the keywords and concepts on different levels:
- Term: Comparable to the query/document matching. If a document uses the term ‘NY’ a lot, it’s probably relevant for the search term ‘NY’.
- Phrase: Just like before but on the level of phrases. Term-level matching ‘hot’ and ‘dog’ will not necessarily give you the documents that are relevant to the phrase ‘hot dog’.
- Word Sense: This is where it starts to get interesting. On this level of matching, we need to be connecting similar word senses. The system should know that ‘NY’ is actually ‘New York’, and that someone searching for ‘utube’ probably is looking for ‘YouTube’.
- Topic: Even further we should be able to match the topics of the queries being used. If we can link ‘microsoft office’ to ‘powerpoint’, ‘excel’, … and other relevant terms, this gives us an extra layer to determine relevancy of a document.
- Structure: On this level, we should be able to get the intent of the search, no matter how it is formulated. So the structure of the language should be understood. The system should ask ‘What is/are the most defining part/s of this search?’
So the way this works from a ‘Query Understanding’-standpoint:
- The searcher enters the query ‘michael jordan berkele’, which contains a typo.
- On a term level, the spelling error is corrected. So ‘berkele’ is interpreted as ‘berkeley’.
- On a phrase level ‘michael jordan’ is identified as being a phrase.
- On the sense level there are similar queries like ‘michael l. jordan’ or just ‘michael jordan’.
- Importantly, on a topic level, the system recognizes the topic as being ‘machine learning’. If ‘Berkeley’ wasn’t in the query, there would have been confusion on the topic as ‘Michael Jordan’ is obviously also a very famous former basketball player.
- On a structure level it becomes clear that Michael Jordan is the main phrase of importance. It’s not Berkeley.
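The steps above can be sketched as a toy pipeline. This is not Hang Li’s system and certainly not Google’s: the vocabulary, the phrase list and the topic hints are hand-written, hypothetical stand-ins for knowledge that a real system would learn from data. The sense level is left out because it would need a log of similar queries.

```python
import difflib

# All data below is hypothetical and hand-written for this sketch.
VOCABULARY = ["michael", "jordan", "berkeley", "basketball", "bulls"]
KNOWN_PHRASES = [("michael", "jordan")]
TOPIC_HINTS = {
    "berkeley": "machine learning",
    "bulls": "basketball",
    "basketball": "basketball",
}

def correct_terms(query):
    # Term level: replace each word by its closest vocabulary entry, if any.
    corrected = []
    for term in query.lower().split():
        match = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
        corrected.append(match[0] if match else term)
    return corrected

def find_phrases(terms):
    # Phrase level: look for known two-word phrases.
    return [
        " ".join(terms[i:i + 2])
        for i in range(len(terms) - 1)
        if (terms[i], terms[i + 1]) in KNOWN_PHRASES
    ]

def guess_topic(terms):
    # Topic level: the first disambiguating word decides the topic.
    for term in terms:
        if term in TOPIC_HINTS:
            return TOPIC_HINTS[term]
    return None  # ambiguous: nothing in the query settles the topic

def understand(query):
    terms = correct_terms(query)
    phrases = find_phrases(terms)
    return {
        "terms": terms,
        "phrases": phrases,
        "topic": guess_topic(terms),
        # Structure level, crudely: treat the first known phrase as the core.
        "main_phrase": phrases[0] if phrases else None,
    }

print(understand("michael jordan berkele"))
print(understand("michael jordan"))
```

With ‘berkele’ in the query, the typo is corrected and the topic resolves to ‘machine learning’. Without it, the topic stays undecided, which mirrors the basketball confusion mentioned above.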
Looking at it from the other side, the document, we have a similar process.
So when both the query and document can be understood on these levels, the system can start matching the search query intent to the most relevant documents. Hang goes further into this process, but this first part explains a lot about the task that’s been given to machine learning.
This process of using machine learning to understand language and search intent has come a long way. Google uses TensorFlow to let machines learn language. Fed with a massive amount of language data, such a system can build its own knowledge by learning vector relationships between words or phrases. There’s little doubt that this technology is part of RankBrain.
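To give a feel for what ‘relationships between word vectors’ means, here is a small sketch. The three-dimensional vectors are typed in by hand and purely hypothetical. A real system learns vectors with hundreds of dimensions from huge amounts of text, and nothing here reflects how RankBrain actually works.

```python
import math

# Hand-made, hypothetical vectors. Real ones are learned from text.
VECTORS = {
    "ny": [0.9, 0.1, 0.0],
    "new york": [0.85, 0.15, 0.05],
    "guitar": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors):
    # Return the other entry whose vector is closest to the given word.
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("ny", VECTORS))
```

Words whose vectors point in nearly the same direction are treated as close in meaning, so ‘ny’ lands next to ‘new york’ and far away from ‘guitar’.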
So from a query-processing standpoint, machine learning is helping query/document matching by developing its own understanding of language.
Ranking search results
As said earlier, search engines have two main objectives: first, understanding the search intent to match the right pages; then, ranking all the matched pages so the most useful ones are highest in the list.
Once we’ve decided which pages are probably relevant to the searcher’s intent, we have to guess which page is best to rank first. A lot of factors are used to do that. But as you might have learned from the previous post in this series, all these possibilities become too hard to handle correctly for every search. And that’s where machine learning and stuff like RankBrain come into play.
So let’s see how we could rank pages.
Pages ranked based on query / document matching
Plain and simple. We let the matching algorithm run and assign scores based on the on-page relevance of each document. The document with the highest score gets ranked first.
Although simple, this is not the best way to do this as it is an easy-to-trick system. Once you know how the query / document matching is done, you’ll be able to design a document that is very relevant according to the algorithm, but not for the user.
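A tiny sketch shows how easy the trick is. The scoring function simply counts exact occurrences of the query, which is a deliberately naive stand-in (my assumption) for on-page matching, and both documents are made up.

```python
def keyword_score(query, document):
    # Naive relevance: how often does the exact query occur in the document?
    return document.lower().count(query.lower())

# Two hypothetical documents: one useful, one written to game the score.
helpful = "Our guide compares each coffee machine on price, taste and ease of cleaning."
stuffed = "coffee machine coffee machine best coffee machine cheap coffee machine"

ranking = sorted(
    [helpful, stuffed],
    key=lambda doc: keyword_score("coffee machine", doc),
    reverse=True,
)
print(ranking[0])
```

The stuffed page wins, even though the guide is clearly the better result for a human reader.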
Pages ranked based on a set of manually weighted factors
The second approach is to add extra factors that help determine whether a page is relevant, and then manually set the weight each factor should have when ranking the search results. There are a lot of factors:
- Page level: query / document matching score, links to the page, linking C-blocks to the page, …
- Domain level: overall topical relevance, links to the domain, quality of content, …
- Search level: branded search on this topic, …
- User level: has visited this website before, visits video content regularly, …
- Device level: what device is used, how’s the internet connection, …
Problem is, different searches will need different weighting in factors. And that’s more than any man can do…
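As a rough illustration, manual weighting boils down to a weighted sum. The factor names, values and weights below are hypothetical and scaled to 0..1 for convenience. They only show how much the ranking depends on the weights someone picked by hand.

```python
# Hypothetical factor values per page, each scaled to the range 0..1.
PAGES = {
    "page_a": {"match_score": 0.9, "page_links": 0.2, "domain_relevance": 0.4},
    "page_b": {"match_score": 0.6, "page_links": 0.8, "domain_relevance": 0.7},
}

# Two hand-picked weightings, for example for two different kinds of search.
BALANCED = {"match_score": 0.5, "page_links": 0.3, "domain_relevance": 0.2}
MATCH_HEAVY = {"match_score": 0.9, "page_links": 0.05, "domain_relevance": 0.05}

def weighted_score(factors, weights):
    # Sum of weight * factor value over all weighted factors.
    return sum(weights[name] * factors[name] for name in weights)

def rank_pages(pages, weights):
    return sorted(pages, key=lambda p: weighted_score(pages[p], weights), reverse=True)

print(rank_pages(PAGES, BALANCED))
print(rank_pages(PAGES, MATCH_HEAVY))
```

The same two pages swap places depending on the weighting, and somebody would have to choose the right weighting for every type of search.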
Pages ranked based on machine learning
Google doesn’t just have the necessary information on query / document matching, incoming links to the domain and the page, and the overall relevance and power of the domain. It also gathers information on how well the search results are working: it measures click-through rate, bounce rate, etc.
For example, if you perform a search and get a search results page, there are a couple of things that can happen. Suppose you don’t click the first result. Why in hell would you not click the first result? The list of possible answers is endless.
- You’ve already visited this domain in the past and didn’t like it.
- The search result is not relevant to your particular situation.
- You think this website is for older people.
- You don’t like the way the meta description is written.
Everything from user profile (demographics, interests, …) to on- or off-page factors (domain, meta title, …) can be in play. It is too much for a manually updated algorithm to get all these factors right. But given enough data (that is, enough searches), a self-learning algorithm could do the job.
It can work its way back from the results (‘What is the page that people clicked and probably had a good user experience?’) to define how the different algorithm factors should be weighted.
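To sketch what ‘working back from the results’ could look like, here is a very small logistic regression trained on a made-up click log. This is an assumption-heavy toy and not a description of RankBrain: the two factors, the six log entries and the training settings are all hypothetical.

```python
import math

# Hypothetical click log: (match_score, page_links) of a shown result,
# and 1 if the searcher clicked it and seemed satisfied, else 0.
CLICK_LOG = [
    ((0.9, 0.1), 1),
    ((0.8, 0.3), 1),
    ((0.7, 0.9), 1),
    ((0.2, 0.9), 0),
    ((0.3, 0.7), 0),
    ((0.1, 0.2), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    # Estimated probability of a satisfied click.
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def learn_weights(log, steps=2000, rate=0.5):
    # Plain stochastic gradient descent on the logistic loss.
    weights = [0.0] * len(log[0][0])
    bias = 0.0
    for _ in range(steps):
        for features, clicked in log:
            error = predict(weights, bias, features) - clicked
            weights = [w - rate * error * x for w, x in zip(weights, features)]
            bias -= rate * error
    return weights, bias

weights, bias = learn_weights(CLICK_LOG)
print(weights)
```

In this made-up log, satisfied clicks follow the match score rather than the number of links, so the learned weight for `match_score` ends up larger than the one for `page_links`. Nobody had to set that weighting by hand; it came out of the click data.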