Reddit business model in the AI era!
How generative AI is fundamentally changing the business model for user-generated platforms!
With Reddit's IPO, the most interesting aspect has little to do with its core business and much more to do with how AI is reshaping the whole business model landscape for user-generated platforms.
Let me give you some insight into how social media (and, more generally, user-generated platforms) business models might change.
Disclaimer: what you'll read ahead is not financial advice, and it has nothing to do with the company's potential valuation in the near future. Instead, I dissect the potential trend of an entire industry (user-generated platforms) in the generative AI era!
Here is how Reddit's business model looked before the explosion of generative AI:
For 2023, the company reported over $800 million in revenue and a net loss of about $90 million.
Most of that revenue came from advertising.
Yet things are quickly shifting, as generative AI is taking the social media industry by storm. Why?
Let's take a couple of steps back and then move forward with an overview of how the landscape might change in the coming years.
How does Reddit work?
Reddit is unique in its ability to enable user-generated content at scale and yet keep this content highly relevant to the niches and small communities that make up the platform.
As we'll see, this is critical and extremely valuable for the generative AI industry, as Reddit has a multi-layered system based on the following:
Subreddits and Moderators: Communities on Reddit, called subreddits, are curated collections of conversations led by volunteer moderators who define the purpose, create rules, and maintain the community's focus. In December 2023, there were over 60,000 daily active moderators.
Rules Enforcement: Moderators enforce rules to keep conversations aligned with the community's purpose. Automated tools assist in content removal and spam detection. Reddit also has a site-wide Content Policy enforced by employees.
Posts: Community members share content in various formats like text, images, videos, and links. Posts serve as conversation prompts within communities.
Conversations: Every post initiates comment threads where members engage in conversations, sharing feedback, ideas, and responses. Comment threads can grow infinitely and branch in multiple directions.
Votes: Community members anonymously upvote or downvote posts and comments, collectively deciding their visibility based on a calculation called the "Hot Score."
Search: Reddit offers comprehensive search functionalities allowing users to find information within posts, comments, communities, and media. Machine learning is employed to enhance search ranking models.
Karma: Karma reflects a user's contribution to the platform through upvotes received on their content. It incentivizes quality contributions and builds trust within communities.
Reddit Gold and Contributor Program: Reddit Gold, awarded for quality content, can be exchanged for participation in the Contributor Program, enabling users to earn real money for their contributions.
Reddit Avatars: Users can create cartoon versions of themselves, known as Reddit Avatars, to express their digital identity. Avatars deepen engagement and offer customization options, some of which are exclusive to Reddit Premium users.
Reddit Premium: A subscription service offering benefits like ad-free browsing, access to exclusive features, and customization options for Reddit Avatars.
These filters are critical as they enable Reddit users to keep posting as "avatars" (without disclosing their real identities) while preventing AI from taking over the platform with fake, AI-based accounts.
Indeed, Reddit's community rules are strict, and each community's standing is highly dependent on them. This is a massive filter against future AI-generated bots, which are a significant threat to other social media platforms.
Yet while platforms like Google, Meta, and X (Twitter) will need to tackle this via authentication, Reddit might be able to handle it with its existing community-based, multi-layered moderation system.
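To make the "Hot Score" mentioned under Votes a bit more concrete, here is a minimal sketch modeled on the ranking formula from the version of Reddit's code that was once published as open source. Whether today's production "Hot Score" still works this way is an assumption on my part, and the function and constant names are mine. The idea: net votes count on a logarithmic scale, while newer posts get a steadily growing time bonus.

```python
from datetime import datetime, timedelta, timezone
from math import log10

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
REFERENCE_OFFSET = 1134028003  # fixed reference timestamp used by the legacy formula
DECAY_SECONDS = 45000          # 12.5 hours of recency is worth one order of magnitude of votes


def hot_score(upvotes: int, downvotes: int, posted_at: datetime) -> float:
    """Legacy-style hot score: log-scaled net votes plus a recency term.

    Assumption: this mirrors the historical open-source formula, not
    necessarily what Reddit runs in production today.
    """
    net = upvotes - downvotes
    magnitude = log10(max(abs(net), 1))  # the first 10 votes weigh as much as the next 90
    sign = (net > 0) - (net < 0)         # +1, 0, or -1
    seconds = (posted_at - UNIX_EPOCH).total_seconds() - REFERENCE_OFFSET
    return round(sign * magnitude + seconds / DECAY_SECONDS, 7)


posted = datetime(2024, 3, 21, 12, 0, tzinfo=timezone.utc)
older_popular = hot_score(100, 0, posted)
newer_smaller = hot_score(10, 0, posted + timedelta(seconds=DECAY_SECONDS))
```

In this sketch, a post with 10 net upvotes published 12.5 hours later ranks level with an older post that has 100 net upvotes, which is why fresh conversations keep surfacing.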
The whole multi-layered moderation system is based on a few pillars:
Site-wide Content Policy: Governed by Reddit, Inc., the Content Policy sets fundamental rules for all users, prohibiting behaviors such as harassment, violence, and sharing private information without consent. Violations result in consequences ranging from warnings to bans, as per severity.
Community Moderation: Each subreddit has its own set of rules created and enforced by volunteer moderators, allowing communities to maintain their culture and values. Moderators have the authority to remove content that violates community rules without Reddit's intervention.
User Empowerment: Reddit's system encourages user participation in content ranking through voting. Both upvotes and downvotes are crucial in shaping community culture and consensus. The voting system turns every user into a content moderator, ensuring community-driven content curation.
Transparency: Reddit publishes bi-annual Transparency Reports, providing insights and metrics on removed content, ensuring accountability and transparency in platform governance.
Scalability: Reddit believes in the scalability of self-moderation at the community level, where moderators and users collaborate to enforce rules tailored to the community's unique needs, fostering an environment conducive to discussion and engagement.
Reddit's Struggling (At Scale) Ads Machine
Reddit reported $804.03 million in revenue in 2023 and $90.82 million in net losses, compared to $666.7 million in revenue and $158.55 million in net losses in 2022.
And this is on top of an advertising machine that reported over 73 million daily active users globally, of which over 36 million are in the US.
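To put the year-over-year figures above in perspective, here is a quick back-of-the-envelope calculation. It uses only the revenue and net loss numbers quoted in this section; the variable names are mine.

```python
# Figures quoted above, in millions of USD.
revenue = {2022: 666.70, 2023: 804.03}
net_loss = {2022: 158.55, 2023: 90.82}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


revenue_growth = pct_change(revenue[2022], revenue[2023])
net_loss_change = pct_change(net_loss[2022], net_loss[2023])
# Net loss as a share of revenue, per year.
loss_margin = {year: net_loss[year] / revenue[year] * 100 for year in revenue}

print(f"Revenue growth: {revenue_growth:.1f}%")         # 20.6%
print(f"Net loss change: {net_loss_change:.1f}%")       # -42.7%
print(f"Loss margin 2022: {loss_margin[2022]:.1f}%")    # 23.8%
print(f"Loss margin 2023: {loss_margin[2023]:.1f}%")    # 11.3%
```

So revenue grew by roughly a fifth while the net loss shrank by more than 40%, yet the company still lost about 11 cents for every dollar of revenue in 2023.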
Shouldn't the Ads business be enough for Reddit to keep scaling its operations? Unfortunately not...
The advertising business has become a winner-take-all market. And the winner is not Reddit...
That is clearly shown in Reddit's ARPU numbers (for now).
By Q4 2023, Reddit reported $3.42 ARPU globally, $5.51 in the US, and $1.34 in the Rest of the World.
For some context, Facebook reported a worldwide ARPU of $13.12 in the same period. In the US & Canada, it was $68.44; in Europe, it was $23.14; in Asia-Pacific, $5.52; and in the rest of the world, it was $4.50.
And Meta achieves this on a base of over 3 billion users globally, of which over 270 million are in the US & Canada!
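The size of that ARPU gap is easier to see as a ratio. The snippet below only divides the Q4 2023 figures quoted above. Keep in mind that the regions are not perfectly like-for-like (Reddit reports "US", Facebook reports "US & Canada"), so treat the multiples as rough.

```python
# Q4 2023 ARPU figures quoted above, in USD.
reddit_arpu = {"global": 3.42, "us": 5.51, "rest_of_world": 1.34}
facebook_arpu = {
    "worldwide": 13.12,
    "us_canada": 68.44,
    "europe": 23.14,
    "asia_pacific": 5.52,
    "rest_of_world": 4.50,
}

# How many times more Facebook earns per user than Reddit.
global_gap = facebook_arpu["worldwide"] / reddit_arpu["global"]
us_gap = facebook_arpu["us_canada"] / reddit_arpu["us"]  # US vs. US & Canada: approximate
row_gap = facebook_arpu["rest_of_world"] / reddit_arpu["rest_of_world"]

print(f"Global: {global_gap:.1f}x")         # 3.8x
print(f"US: {us_gap:.1f}x")                 # 12.4x
print(f"Rest of world: {row_gap:.1f}x")     # 3.4x
```

The widest gap is in the most valuable ad market: in North America, Facebook makes over twelve times more per user than Reddit does in the US.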
Meanwhile, Reddit has somehow managed to create an advertising machine with a wild user base, often made up of quite opinionated users who can express themselves freely inside their small communities (and behind their avatars).
This is a massive strength for Reddit as a social media platform. Yet, it is also an enormous weakness for an advertising business model, where brands look for reliability and the ability to control their brand's perceptions.
This makes advertising on Reddit still too risky.
However, this is how Reddit's advertising machine has managed to pull things off:
Ad Server for Delivery and Auction: Utilizes a horizontally scaled-out Ad Server service for real-time ad delivery globally. Ad Server runs a second-price auction to select ads for Redditors based on various factors like bid type, value, and Redditor engagement probability.
Advertiser Experience Service: Allows advertisers to define campaign parameters, create ad creatives, and provides reporting and analytics on ad performance. Integrated with ad delivery and auction systems for seamless campaign management.
Advertiser Creative Services: Introduced the "AI Headline Generator" to help advertisers quickly create multiple Reddit-friendly ad headlines using generative AI, enabling efficient ad campaign management.
Audience Reach: Offers multiple audience reach options including interest, community, keyword, geography, custom audience, and reengagement using the Reddit Pixel. Assisted by internal usage data and Redditor engagement model for effective targeting.
Measurement: Measures various aspects of ad campaign performance such as reach, impressions, clicks, conversions, and incrementality. Supports both first-party (Reddit Brand Lift, Reddit Conversion Lift) and third-party measurement options.
Brand Suitability: Provides multiple brand safety capabilities including Global (always-on) and Custom options. Global capabilities ensure ads run on safe product surfaces within Reddit properties, while Custom capabilities offer additional control through community exclusions, custom negative keywords, and filtering based on contextual intelligence services.
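To illustrate the second-price auction mentioned under the Ad Server, here is a minimal textbook sketch. Reddit's actual scoring is not public, so everything here is an assumption: I rank ads by bid multiplied by a predicted engagement probability, and the winner pays just enough to keep beating the runner-up (a quality-weighted second price). The class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AdBid:
    advertiser: str
    bid: float       # max the advertiser will pay per engagement (USD)
    p_engage: float  # predicted probability that the Redditor engages (assumed model output)

    @property
    def score(self) -> float:
        # Expected value of showing this ad once.
        return self.bid * self.p_engage


def run_second_price_auction(
    bids: Sequence[AdBid], reserve: float = 0.0
) -> Optional[Tuple[AdBid, float]]:
    """Return (winner, price per engagement), or None if nobody qualifies."""
    eligible = [b for b in bids if b.bid >= reserve and b.p_engage > 0]
    if not eligible:
        return None
    ranked = sorted(eligible, key=lambda b: b.score, reverse=True)
    winner = ranked[0]
    runner_up_score = ranked[1].score if len(ranked) > 1 else 0.0
    # Lowest bid that would still have beaten the runner-up's score,
    # never below the reserve and never above the winner's own bid.
    price = max(reserve, runner_up_score / winner.p_engage)
    return winner, price


bids = [
    AdBid("A", bid=2.00, p_engage=0.05),
    AdBid("B", bid=1.50, p_engage=0.08),
    AdBid("C", bid=3.00, p_engage=0.02),
]
winner, price = run_second_price_auction(bids)
```

In this example, advertiser B wins despite not having the highest bid, because its ad is the most likely to be engaged with, and it pays about $1.25 per engagement rather than its full $1.50 bid.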
But if scaling its advertising machine is so hard (considering that it won't be easy for Reddit to scale its user base without breaking the current equilibrium), what can Reddit do to scale its monetization efforts?
The Generative AI Gold Rush
Ahead of its IPO, Reddit emphasized a multi-chapter growth strategy in which data licensing (for generative AI) will play a critical role in Reddit's future business model.
Why does generative AI matter so much to Reddit, and vice versa?
In the last couple of years, I've been explaining this in what I defined as a new "AI Business Ecosystem."
In the latest earnings release, Jensen Huang, CEO and co-founder of NVIDIA, highlighted three key paradigm shifts, which we're looking at right now, and that are fundamentally shaking the whole software industry from within:
Paradigm Shift 1: The move from general-purpose to accelerated computing will dramatically improve energy efficiency and cost by 20x, along with a step-change improvement in speed.
Paradigm Shift 2: Generative AI as a new fundamental way of doing software and a new way of computing, thus redefining the whole cloud industry (from retrieval to inference).
Paradigm Shift 3: A whole new industry from hardware to software; for the first time, a data center is not just about computing data and storing data; there is a new type of data center, which is about AI generation.
Companies like Microsoft, Google, and Amazon have all been building their own AI supercomputers to process massive amounts of data.
Where do we find the intersection of generative AI and social media?
Primarily at two levels:
Pre-training: Given the massive amount of data needed for an LLM to become valuable and general-purpose.
Inference: Given the need to tie these LLMs to proper context based on real-time data and content, pulled in on the fly through a RAG (retrieval-augmented generation) architecture capable of making AI-search-based answers more accurate, relevant, and timely.
Let me articulate the above.
Why does user-generated content matter in the pre-training stage of Generative AI?
The premise of the current generative AI race is that you can leverage massive computing resources by feeding billions and billions of data points into a transformer architecture, which processes an enormous amount of unlabeled data in an unsupervised pre-training process!
Once pre-training is done, this is what you get: as the model's scale increases, performance improves across tasks and new capabilities are unlocked.
If transformers need a massive amount of data, the core question to ask is:
Where is this data coming from?
While we don't know how much data major LLM companies, like OpenAI, took, we know that they leveraged sources like Wikipedia, Common Crawl (a free web archive consisting of petabytes of data collected since 2008), and Reddit.
It's interesting to note, in that respect, that Sam Altman, CEO of OpenAI, holds a significant stake in Reddit, giving him about 8.7% of the total outstanding shares and 9.2% of the voting power in the company.
Why does Reddit's data matter at an inference level?
When a tool like Perplexity AI gives you back an answer, a good chunk of what its AI can offer on the fly comes from pairing the underlying LLM with a relevant source of information retrieved from a stored database, in what's known as a RAG (retrieval-augmented generation) architecture.
This system is critical for now as it enables answer engines to offer relevant answers to users' queries.
This contrasts with pre-training, where you need a massive amount of data upfront.
A RAG architecture will need a continuous flow of real-time data, which can be accessed on the fly to ground the underlying LLM at inference time.
In short, when a user asks a question on a tool like Perplexity AI, the answer will be a combination of the knowledge built into the foundational LLMs leveraged by Perplexity and real-time data plugged in on the fly to make the answer as relevant as possible.
In that respect, a company like Perplexity might want access to Reddit's (or perhaps X's) APIs to plug that user-generated content in on the fly and make the answer as relevant, fresh, and accurate as possible.
That's where user-generated data becomes so valuable in the context of LLM inference.
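The RAG flow described above can be sketched in a few lines. This is a toy illustration under loud assumptions: the posts are made up, retrieval is simple keyword overlap (real answer engines typically use embeddings and a vector index), and no actual LLM or Reddit API is called. The function names are hypothetical.

```python
import re
from typing import Dict, List

Post = Dict[str, str]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, posts: List[Post], k: int = 2) -> List[Post]:
    """Return the k posts sharing the most words with the query (toy retriever)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(p["text"])), p) for p in posts]
    scored = [(s, p) for s, p in scored if s > 0]        # drop unrelated posts
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort on overlap only
    return [p for _, p in scored[:k]]


def build_prompt(query: str, retrieved: List[Post]) -> str:
    """Assemble the context-grounded prompt that would be sent to an LLM."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Made-up stand-ins for fresh user-generated content.
posts = [
    {"source": "r/espresso", "text": "The best budget espresso machine I have owned is a used lever model."},
    {"source": "r/running", "text": "Marathon training plans should include a weekly long run."},
    {"source": "r/coffee", "text": "For espresso on a budget, spend more on the grinder than on the machine."},
]
query = "best budget espresso machine"
retrieved = retrieve(query, posts)
prompt = build_prompt(query, retrieved)
```

The point of the sketch is the shape of the pipeline: the LLM's general knowledge stays fixed, while the retrieved posts (here, the two coffee-related ones) are swapped in at query time. That is why a steady flow of fresh content is worth paying for.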
Now that we have tackled these two aspects, let's examine a possible deal structure for Reddit in the AI era.
How will social media platforms make money in the future?
Authentication: to ensure that users are not AI bots, which is critical to the survival of social media platforms.
Data licensing: to provide AI companies with massive data for pre-training their LLMs as a one-time or multi-year license deal.
Per-inference consumption: to provide a continuous flow of real-time data via APIs to RAG-based AI systems, which can then provide real-time, relevant, and timely answers to users' queries.
Indeed, while for other major social media/user-generated platforms (primarily Meta's Facebook/Instagram, X/Twitter, and Google's search) authentication might become a primary route to prevent AI bots from taking over these user-generated platforms, for Reddit, things look a bit different.
As I explained above, Reddit's multi-layered, highly community-centric content policy system might help it fend off, at least initially, a lot of AI bots, thus giving it a head start on monetizing its user-generated content.
What will an AI deal look like for Reddit?
Here, the deal will happen at two data layers:
Pre-training licensing: A first data layer will be used for pre-training the LLM, which will probably become a multi-year agreement to access a massive corpus of data.
Real-time access: Data will be used on the fly (e.g., via RAG) to incorporate real-time information into custom chatbot answers. Here, the deal might be based on consumption.
What will Reddit's AI monetization look like?
Based on the above, monetization might take two forms.
The pre-training deal: The AI company gets access to massive amounts of data for LLMs' foundational training. This might be a multi-million to potentially billion-dollar, multi-year licensing deal, depending on the platform's ability to value its data, the amount of data required by the generative AI company, and how frequently that data needs to be refreshed for pre-training.
Inference deal: The inference deal might be a consumption and cap-based deal. Here too, depending on the frequency of access to APIs and consumption of real-time data, this might range anywhere from hundreds of dollars a year to millions (as LLM-based search scales up, the value of inference deals might go up as well).
Let me expand on these points.
Data licensing for pre-training
In the data licensing deal, what will matter is:
Reddit's Valuable Data Corpus:
Reddit possesses one of the internet's largest collections of authentic and continually updated human-generated experiences.
Recognizing the increasing importance of data in various applications, Reddit believes its information is crucial for meeting consumer needs and organizational priorities.
Differentiated Solution:
Reddit offers a unique solution for organizations seeking real-time human perspectives, whether it's for product feedback or market sentiment analysis.
The platform's genuine and authentic interactions provide valuable insights for a wide range of applications.
Data Advantage and Intellectual Property:
Reddit's data advantage and intellectual property are expected to remain integral in training future Large Language Models (LLMs).
The platform's data constantly grows and regenerates through user interactions, contributing to the evolution of AI technology.
Third-Party Data Licensing:
Reddit is in the early stages of allowing third parties to license access to historical and real-time data from its platform.
This includes offering Data API Access, enabling customers to access real-time data streams for various topics like sports, movies, news, and fashion.
Reddit's data also plays a foundational role in training AI models, with its vast corpus of conversational data expected to continue shaping and improving LLMs over time.
How much has Reddit made already from AI data licensing?
Reddit is already exploring opportunities to monetize its data through licensing arrangements with third parties.
In January 2024, Reddit entered into data licensing agreements with a total contract value of $203.0 million.
Contract terms range from two to three years.
Reddit expects to recognize a minimum of $66.4 million in revenue for the year ending December 31, 2024.
Reddit's platform generates and grows data continuously as users interact with communities.
Reddit sees its growing data as valuable for training large language models (LLMs) and as an additional revenue stream.
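To size these licensing numbers against the rest of the business, here is a quick calculation using only figures quoted in this issue. The even split across contract years is my simplifying assumption, not something Reddit has disclosed.

```python
# Figures quoted in this issue, in millions of USD.
total_contract_value = 203.0  # data licensing agreements signed in January 2024
min_revenue_2024 = 66.4       # minimum expected to be recognized in 2024
revenue_2023 = 804.03         # total 2023 revenue

# Share of the contract value expected to land in 2024.
share_recognized_2024 = min_revenue_2024 / total_contract_value * 100

# Assumption: value spread evenly over the contract term (two to three years).
annual_if_two_years = total_contract_value / 2
annual_if_three_years = total_contract_value / 3

# The 2024 minimum compared with last year's total revenue.
share_of_2023_revenue = min_revenue_2024 / revenue_2023 * 100

print(f"Recognized in 2024: {share_recognized_2024:.1f}% of contract value")   # 32.7%
print(f"Even split: ${annual_if_three_years:.1f}M to ${annual_if_two_years:.1f}M a year")
print(f"Versus 2023 revenue: {share_of_2023_revenue:.1f}%")                    # 8.3%
```

In other words, the 2024 minimum alone is equivalent to roughly 8% of everything Reddit earned in 2023, from a revenue line that barely existed a year earlier.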
What about inferencing?
Again, imagine a scenario in which Reddit makes its real-time user-generated content available to AI-based search tools, like Perplexity AI.
How much would a scaled-up AI-based search solution be willing to pay for inference-based consumption of user-generated content?
That will be the basis of the deal, which will work more like a SaaS/consumption-based enterprise deal with fixed subscription revenue, capped limits, and consumption-based monetization.
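Here is a rough sketch of how such a deal could be billed. Every number and name below is hypothetical: I'm assuming a fixed monthly subscription that includes a call allowance, a per-1,000-calls price beyond it, and a hard cap on how many calls get served at all.

```python
from typing import Tuple


def monthly_inference_bill(
    api_calls: int,
    base_fee: float,          # fixed subscription fee (USD)
    included_calls: int,      # calls covered by the subscription
    price_per_1k_calls: float,  # overage price per 1,000 calls (USD)
    hard_cap_calls: int,      # calls beyond this limit are not served
) -> Tuple[int, float]:
    """Return (calls served, amount billed) for one month."""
    served = min(api_calls, hard_cap_calls)
    overage = max(0, served - included_calls)
    bill = base_fee + overage / 1000 * price_per_1k_calls
    return served, bill


# Hypothetical contract terms, for illustration only.
terms = dict(
    base_fee=10_000.0,
    included_calls=1_000_000,
    price_per_1k_calls=2.0,
    hard_cap_calls=5_000_000,
)

light_month = monthly_inference_bill(500_000, **terms)    # stays within the subscription
busy_month = monthly_inference_bill(3_000_000, **terms)   # pays for 2M overage calls
capped_month = monthly_inference_bill(8_000_000, **terms) # only 5M calls are served
```

Under these made-up terms, the light month costs the flat $10,000, the busy month $14,000, and the capped month tops out at $18,000. The platform's upside grows with the AI tool's usage, which is the whole appeal of a consumption-based model.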
Recap: What Did You Learn In This Issue?
Reddit's Financials: In 2023, Reddit reported revenue exceeding $800 million, primarily driven by advertising, but with a net loss of about $90 million.
Unique Platform Structure: Reddit's ecosystem relies on subreddits, moderation, and user engagement, making it conducive to AI applications.
Challenges in Advertising: Despite its large user base, Reddit faces hurdles in scaling its advertising business due to user-generated content and brand safety concerns.
Generative AI's Role: Generative AI is poised to reshape Reddit's business model, with a focus on data licensing becoming increasingly significant.
Intersection of AI and Social Media: The integration of generative AI and social media occurs notably in the pre-training and inference stages, leveraging Reddit's vast data resources.
Data Licensing Opportunities: Reddit's data holds immense value for training large language models (LLMs), leading to potential multi-million to billion-dollar licensing deals.
Exploration of Data Licensing: Reddit is actively exploring data licensing agreements, projecting revenue of over $66 million in 2024 alone.
Inference-based Revenue: Inference-based consumption of real-time user-generated content represents another avenue for revenue, structured as a SaaS/consumption-based deal.
Ownership Structure Implications: Reddit's ownership ties, including significant shares held by OpenAI's CEO, hint at potential collaborations and ventures in AI.
With ♥️ Gennaro, FourWeekMBA