Yesterday OpenAI announced their "GPTBot". The information around it is still limited, but there is a lot of speculation on what this could mean.
I believe we are seeing the first steps towards the upcoming GPT-5 and its training on web data, and this might turn out to be very important for SEOs to keep an eye on in the future.
What is GPTBot?
GPTBot is an AI-powered web crawler developed by OpenAI. It functions similarly to Googlebot, which is Google's web crawler.
According to OpenAI, the main purpose of GPTBot is to collect web data that will be used to train future AI models.
GPTBot is designed to browse various websites on the internet and gather information that can be used to improve AI models and enhance their capabilities.
On the more technical side, the user agent for GPTBot is "GPTBot/1.0" and its full user-agent string is "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)". It will be arriving from one of the following IP ranges:
- 20.15.240.64/28
- 20.15.240.80/28
- 20.15.240.96/28
- 20.15.240.176/28
- 20.15.241.0/28
- 20.15.242.128/28
- 20.15.242.144/28
- 20.15.242.192/28
- 40.83.2.64/28
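If you want to check in your server logs whether a request that claims to be GPTBot really comes from one of the ranges above, a minimal sketch using Python's standard `ipaddress` module could look like this. The function name `is_gptbot_request` is my own (hypothetical), and the ranges are simply copied from the list above, so they will need updating if OpenAI changes them.

```python
import ipaddress

# IP ranges copied from the list above; assumed to be current, may change over time.
GPTBOT_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "20.15.240.64/28",
        "20.15.240.80/28",
        "20.15.240.96/28",
        "20.15.240.176/28",
        "20.15.241.0/28",
        "20.15.242.128/28",
        "20.15.242.144/28",
        "20.15.242.192/28",
        "40.83.2.64/28",
    )
]


def is_gptbot_request(user_agent: str, remote_ip: str) -> bool:
    """Return True if the user agent mentions GPTBot AND the IP is in a listed range."""
    if "GPTBot" not in user_agent:
        return False
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IP address at all, so it cannot be a genuine GPTBot request.
        return False
    return any(address in network for network in GPTBOT_RANGES)
```

Checking both the user agent and the IP matters because a user-agent string is trivial to fake, while the source IP is much harder to spoof.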
When GPTBot accesses a website, it follows a set of guidelines to ensure that it only crawls pages that are suitable for training AI models. According to OpenAI, this excludes "sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies".
How can GPTBot be an opportunity for SEOs?
I already get asked rather frequently how a company can optimise the answers and results that LLMs like ChatGPT are producing.
This is a natural progression as more and more people use ChatGPT as their new search engine for more advanced queries. I also covered this aspect in 'Google vs ChatGPT: The end of Google as we know it?'
As of today, it is not possible to affect the results in ChatGPT, as its different models are producing answers based on training data up until September 2021.
But you might be able to affect how future models will answer different queries by creating content that the upcoming models are going to be trained on. And this is where GPTBot comes into the picture.
I (and others) suspect that GPTBot will be let loose on the internet to collect training data for GPT-5 (the next major model expected from OpenAI, following its current GPT-4).
It might also be that future LLMs/AI models will not be as static in their training data as the current versions, but will be able to browse the internet more fluidly to get the latest knowledge into their dataset. Or (a bit like when you use a ChatGPT plugin today) they will be able to decide to browse the internet when an answer might concern something that happened after their training-data cut-off date.
Is this different from Googlebot and Google Bard?
If we compare my forecast for the future of LLMs and GPTBot with how Google does things with Googlebot and Google Bard, there are a lot of similarities.
To some extent you could say that Google already has a training dataset for Google Bard: it's the whole internet as indexed by its Googlebot. So to some degree Google is already in the position that I expect the AI models and ChatGPT to move towards.
Today, there is still no clear SEO strategy to follow in order to get featured in Google Bard (or Search Generative Experience / SGE as we also call it). The best educated guess is currently to follow the same principles as when you want to be featured in a Google Snippet in the SERPs.
And this again means optimising towards the good old Googlebot. If ChatGPT (or Bing powered by ChatGPT) is going to take market share at some point, we might see SEOs start paying more attention to GPTBot and the inner workings of the LLM/ChatGPT algorithms.
Even if ChatGPT is not going to win market share, there might be more specialised use cases where some companies see their target audience use the AI/chat platforms heavily and want to affect the outcome. Those companies might focus on SEO specialised in this.
Background and frequently asked questions
Background and frequent asked questions
Can the answers in ChatGPT be affected with SEO?
As of now, it is not possible to directly affect the answers in ChatGPT. The models generate responses based on the training data they have been provided and cannot be influenced by external sources.
What is Googlebot?
Googlebot is a web crawling robot or a software agent employed by Google to automatically discover and retrieve web pages. It is responsible for scouring the internet, visiting websites, and collecting information about them to index in Google's search engine.
Googlebot follows links on websites and analyzes the content of these pages, including text, images, and links. It helps update Google's search index by continuously crawling websites, allowing users to find the most relevant and recent information when conducting searches on Google.
The behavior of Googlebot is governed by specific guidelines set by Google to ensure fair and efficient scanning of web pages while respecting website owners' preferences and instructions through a robots.txt file.
What is training data?
Training data for models like ChatGPT, including GPT-4, is a comprehensive and diverse collection of text that represents human language. It's used to train the model to understand and generate text in a way that mimics human-like writing and thinking.
The training data for ChatGPT consists of a large corpus of text gathered from various sources. This can include books, websites, Wikipedia, and other textual content available on the internet such as Reddit.
The text corpus is designed to be diverse and wide-ranging, covering numerous subjects, styles, and domains. This helps the model learn the nuances of human language, including grammar, syntax, semantics, and context.
In some cases, models like ChatGPT may undergo supervised fine-tuning using human-generated examples. This involves providing the model with specific input-output pairs to guide its responses in particular directions or styles.
Did ChatGPT not have access to the internet before GPTBot?
No, and it still does not unless you work with plugins. OpenAI experimented a bit with letting ChatGPT run a Bing search and read the results, but that feature was temporarily disabled again. GPTBot might be a way to open up for this again, while also giving websites the opportunity to opt out of being crawled and used in ChatGPT (which would make it more than just a tool for training future models).
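The opt-out works through robots.txt: OpenAI documents that a `User-agent: GPTBot` group with `Disallow: /` blocks the crawler from the whole site. As a rough sketch, you can test what your own robots.txt rules would mean for GPTBot with Python's standard `urllib.robotparser`. The helper name `can_fetch` and the `example.com` URLs are placeholders of mine, and the rules below are only an example, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for illustration): block GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt content and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot is blocked everywhere, while Googlebot falls back to the "*" group.
gptbot_allowed = can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/blog/")
googlebot_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "https://example.com/blog/")
```

If you only want to keep GPTBot out of parts of a site, the same check works with a narrower `Disallow` path in place of `/`.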
How is GPTBot providing opportunities for SEOs?
I see two main areas.
- Optimizing Answers for LLMs: With the rise of large language models like ChatGPT, companies/SEOs are interested in optimizing content to influence the answers these models produce.
- Potential Influence on Future Models: SEOs may be able to affect how future models like GPT-5 respond to queries by creating content that GPTBot will crawl and use for training the models.