rADAr

Can Reddit predict reality?

A deep dive into the hidden patterns between online communities and real-world events.

Research Overview

When Online Meets Offline

Every day, 50 million users log on to Reddit to discuss a variety of topics, ranging from politics to finance and even sports. Is this digital chatter merely noise, or does it contain meaningful information about the real world?

Initial Hypothesis

Social media platforms have become a primary arena for public discourse. Reddit uses a community-driven structure organized into topic-specific forums called "subreddits". It provides a unique perspective into how people discuss real-world events: unlike other platforms, Reddit's open nature allows us to observe cross-community interactions, i.e., how discussions in one subreddit spill over into others.

In this study, we analyze the interaction volume and sentiment between subreddits over time using the SNAP dataset. Rather than counting the raw number of posts, we measure how frequently communities engage with one another, the overall sentiment (positive or negative tone) of each interaction, and more subject-specific metrics. These metrics reveal the intensity of cross-community discourse and, as we will demonstrate, correlate strongly with real-world events.

Our hypotheses are the following. First, if Reddit truly reflects reality, we should observe measurable spikes in subreddit interactions coinciding with major real-world events. Second, real-world events can be predicted using only sentiment on Reddit. We test these across three domains: sports, finance and politics.

Empirical Evidence

Below are two examples from our dataset demonstrating the correlation between Reddit activity and major real-world events.

[Figure: Sports Community Dynamics. r/NFL cross-subreddit interactions (2014-2017), shown as weekly interactions with Super Bowl weeks highlighted.]

Sports: The Super Bowl Effect

The NFL subreddit shows a consistent pattern: every February, interaction volume with other subreddits spikes. These peaks correspond precisely to Super Bowl weekends (XLIX in 2015, 50 in 2016, and LI in 2017). The regularity of this pattern demonstrates that Reddit engagement directly mirrors the periodicity of major sporting events. Out of all the sporting events, the Super Bowl is the only one that makes the interaction volume spike every year.

[Figure: Political Discourse Intensity. r/politics cross-subreddit interactions (2016), shown as daily interactions with Election Day marked.]

Politics: The 2016 Election

The r/politics subreddit data from 2016 reveals a striking spike on November 9th - the day after Donald Trump was elected president. This single-day anomaly represents a nearly 10x increase in cross-subreddit interactions, reflecting the magnitude of public response to this political event.
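As a rough illustration of how a figure such as "nearly 10x" could be computed, the sketch below divides one day's interaction count by the median of the surrounding days. The choice of the median as baseline is an assumption (the source does not say how the baseline was defined), and the counts are invented toy values, not data from the study.

```python
from statistics import median

def spike_factor(daily_counts, day):
    """Ratio of one day's count to the median count of all other days."""
    baseline = median(c for d, c in daily_counts.items() if d != day)
    return daily_counts[day] / baseline

# Hypothetical daily cross-subreddit interaction counts (toy values).
toy_counts = {
    "2016-11-06": 10,
    "2016-11-07": 12,
    "2016-11-08": 14,
    "2016-11-09": 110,
    "2016-11-10": 11,
}

factor = spike_factor(toy_counts, "2016-11-09")
print(f"spike factor: {factor:.1f}x")
```

A median baseline is less sensitive to the spike itself than a mean would be, which is why it is used here.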

What These Patterns Tell Us

These two examples illustrate a fundamental insight: Reddit activity is not random noise. The platform captures genuine shifts in public attention. When 100 million viewers tune in to the Super Bowl or when a major political event unfolds, the digital footprint on Reddit reflects these real-world dynamics.

This correlation raises an important question that we explore in the following sections: if Reddit so clearly reflects reality, could it also anticipate it? Do changes in post volume or sentiment precede events or do they simply follow them?

Structure of Our Analysis

Our analysis is divided into three parts. We begin with a General Analysis of the Reddit ecosystem: examining dataset characteristics, temporal patterns, and the overall landscape of cross-subreddit interactions that form the foundation of our study.

From there, we dive into two carefully chosen domains: Sports and Finance. This pairing is deliberate. These domains occupy opposite ends of a crucial spectrum: the relationship between public opinion and real-world outcomes.

Financial Markets, especially cryptocurrencies, are inherently sentiment-driven. Prices move because people believe they will move. When sentiment on Reddit shifts, it is likely shared by traders, and the market responds accordingly.

Sports, by contrast, remain stubbornly immune to public opinion. No amount of fan enthusiasm on r/nfl can influence the outcome of a game. Touchdowns are scored on the field, not in comment threads. Reddit can only observe and react; it cannot predict or influence.

From this contrast emerges a clear conclusion: if Reddit sentiment can anticipate market movements but only react to sports outcomes, we learn when and where online discourse carries genuine predictive power.

The Big Picture

General Context

Before diving into specific domains, we explored the dataset as a whole. How do online communities interact, and what signals do they emit during global events?

The Dataset: Reddit Hyperlink Network

Our analysis is built upon the Reddit Hyperlink Network dataset, which maps interactions between subreddits from January 2014 to April 2017. Rather than tracking individual users, it focuses on community-level dynamics, i.e., how information and sentiment flow through hyperlinks between communities.

40k+ communities
40 months of data
LIWC emotion tracking

Each hyperlink in the dataset comes with rich NLP features including sentiment scores and emotional markers from the LIWC framework. This allows us to measure not just how much communities interact, but how they feel when they do. The dataset captures inter-community signals, standardized emotional tracking, and spans 40 months, although it is limited to public subreddits and relies on automated sentiment analysis.
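To make the community-level view concrete, here is a minimal sketch of how weekly interaction counts for one subreddit could be derived from hyperlink records. The tab-separated layout, the column names, and the timestamp format are assumptions made for illustration, and the rows are invented; check them against the actual dataset files before reuse.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; column names and timestamp format are assumed.
SAMPLE_TSV = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tTIMESTAMP\tLINK_SENTIMENT\n"
    "nfl\tsports\t2016-02-08 20:15:00\t1\n"
    "panthers\tnfl\t2016-02-09 01:30:00\t-1\n"
    "politics\tnews\t2016-02-08 09:00:00\t1\n"
    "nfl\tsports\t2016-02-16 12:00:00\t1\n"
)

def weekly_interactions(tsv_text, subreddit):
    """Count hyperlinks per ISO (year, week) where `subreddit` is source or target."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if subreddit in (row["SOURCE_SUBREDDIT"], row["TARGET_SUBREDDIT"]):
            ts = datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S")
            iso = ts.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return dict(counts)

nfl_weekly = weekly_interactions(SAMPLE_TSV, "nfl")
print(nfl_weekly)
```

Counting both incoming and outgoing links is one possible definition of "interaction"; restricting to outgoing links only would be an equally reasonable choice.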

With this foundation, we asked: can Reddit's collective voice tell us something meaningful about the world?

Case Study

Collective Emotion: The 2015 Paris Attacks

On November 13th, 2015, a series of coordinated terrorist attacks struck Paris. News of the tragedy spread across the globe, and Reddit became an immediate outlet for collective grief, fear, and solidarity.

Emotional Signals During Crisis

Dampened Baseline

Immediate Spike

Anxiety and religion-related discourse surged within 24 hours, reflecting the shock and search for meaning.

Prolonged Mourning

While anger died down quickly, sadness remained high for weeks.

Gradual Normalization

By late November, most emotional baselines reverted, except religion-related keywords which stayed elevated.

This pattern demonstrates Reddit's capacity to act as a real-time emotional barometer during crises.

But emotions aren't just reactive; they also fuel conflict. We turned our attention to political polarization.

Political Analysis

Measuring Digital Polarization

The 2016 US Presidential Election was one of the most divisive in modern history. By analyzing interactions between "Left-leaning" and "Right-leaning" subreddits, we can measure the intensity of online polarization.

1 Positive Sentiment Volume

This visualization isolates the volume of interactions with purely positive sentiment between left-wing and right-wing subreddits.

It provides a baseline for "friendly" or constructive cross-partisan engagement, allowing us to see how much interaction is driven by positive outreach versus conflict.

2 Radicalization Trend

By tracking the average weight of positive and negative sentiments, we can identify "peaks" of extreme discourse.

As positive and negative averages diverge from the neutral baseline, discourse becomes more emotionally charged. The Right became less negative as the election approached, which is consistent with the electoral results.

3 Interaction Initiative Ratio

Defining the Initiative Ratio as (Left → Right posts) / (Total cross-partisan volume):

  • Value > 0.5: The Left is initiating or responding more actively
  • Value < 0.5: The Right is initiating or responding more actively

Significant spikes correlate with major events like Election Day. After the election, the Right became significantly more active.
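The Initiative Ratio defined above can be written as a small function. This sketch assumes that the total cross-partisan volume is the sum of Left → Right and Right → Left posts, which the definition implies but does not state outright; the function name and the example counts are illustrative.

```python
def initiative_ratio(left_to_right, right_to_left):
    """Share of cross-partisan posts that go from Left to Right.

    Returns None when there is no cross-partisan activity at all,
    since the ratio is undefined in that case.
    """
    total = left_to_right + right_to_left  # assumed definition of total volume
    if total == 0:
        return None
    return left_to_right / total

# Toy counts for one time window (hypothetical values).
ratio = initiative_ratio(left_to_right=30, right_to_left=10)
side = "Left" if ratio > 0.5 else "Right"
print(f"ratio={ratio:.2f}, more active side: {side}")
```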

Dataset Limitations

Understanding the Boundaries

While Reddit data offers valuable insights, it comes with inherent limitations. These examples illustrate where the dataset falls short.

1 Incomplete Coverage

The Eurovision 2016 spike is clear, but that's the only data point we have. The dataset only covers 2014-2017, so we can't verify if this pattern repeats annually.

Limitation: Single occurrence makes it impossible to confirm whether this is a recurring pattern or a one-time anomaly.

2 Inverted Signals

Contrary to expectations, the chart shows dips instead of spikes around major SpaceX events. When the Falcon 9 crashed, LIWC "space" vocabulary actually decreased.

Limitation: The metric captures something different than expected. Perhaps attention shifts away from technical vocabulary during dramatic events.

3 Misleading Correlation

The NBA data shows playoff-aligned spikes, but the second major peak falls outside the playoff window entirely.

Limitation: Off-season drama (trades, controversies) can generate spikes unrelated to games. Correlation does not imply causation.

These examples remind us that Reddit data requires careful interpretation. Signals are not always what they seem, and context is essential.

First Analysis

And what about sports?

Exploring the emotional landscape of NFL communities on Reddit.

"Serious sport has nothing to do with fair play. It is bound up with hatred, jealousy, boastfulness, disregard of all rules and sadistic pleasure in witnessing violence. In other words, it is war minus the shooting."
- George Orwell

Beyond the perhaps over-dramatic tone of this quote, it conveys the emotional, passion-inducing and even sometimes violent nature of sports and the communities built around them. This is precisely why we chose sports as one of our case studies to correlate real-world events and Reddit sentiment.

Why the NFL?

We picked the NFL and its teams specifically because it is by far the most popular sport on Reddit. All 32 NFL team subreddits rank in the top 350 most active subreddits in our dataset. We also observed massive peaks in activity around the Super Bowl each year, as shown in the introduction.

32 teams in our dataset
Top 350 most active subreddits

Winners vs. Losers: The Vocabulary Divide

Now that we have both a topic and the subreddits associated with it, let's look at the most important part of sports (irrespective of what your parents might have told you as a child): winning and losing!

We defined losing and winning teams as the 10 teams with respectively the lowest or highest winning percentage across the 2015 and 2016 NFL seasons. While the mean sentiment in subreddits of winning and losing teams is essentially the same, the main difference we found is in the vocabulary used.
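The winner/loser grouping described above can be sketched as a ranking by winning percentage. The record format, the team names, and the handling of ties (ignored here) are assumptions for illustration; the numbers are toy values.

```python
def split_by_win_pct(records, n=10):
    """Return (winners, losers): the n teams with the highest and lowest
    winning percentage. `records` maps team -> (wins, games_played)."""
    ranked = sorted(
        records,
        key=lambda team: records[team][0] / records[team][1],
        reverse=True,
    )
    return ranked[:n], ranked[-n:]

# Hypothetical two-season records (wins, games played).
toy_records = {
    "team_a": (25, 32),
    "team_b": (16, 32),
    "team_c": (8, 32),
    "team_d": (20, 32),
}

winners, losers = split_by_win_pct(toy_records, n=1)
print(winners, losers)
```

With the real 32 teams and n=10, the two groups do not overlap; for fewer than 2n teams they would, so n should be chosen accordingly.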

LIWC Analysis: Winning vs. Losing Teams

Linguistic Inquiry and Word Count categories with highest relative frequency

About LIWC: Linguistic Inquiry and Word Count is a transparent text analysis program that counts words in psychologically meaningful categories. Although it is hard to extract clear conclusions from such a graph, we can still notice a real difference in the vocabulary used by the fans of losing and winning teams.

Digging Deeper: External Dataset

Despite the (extremely sad) fact that there weren't enough posts in the subreddits of NFL teams in the SNAP dataset to do a rigorous sentiment analysis, the LIWC plot gave us a clear indication that we could dig deeper.

To do so, we used an external dataset from Cornell with many more posts, but limited to the subreddits of 11 NFL teams. With a much higher number of posts, we chose to focus on the most exciting days in a sports fan's life: Gamedays!

How Do Fans Feel When Their Team Wins or Loses?

We extracted the average sentiment across all Reddit posts on the day before, the day of, and the day after an NFL game for the 11 teams in our dataset. Is there a pattern that we could observe?
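A minimal sketch of this three-day window, assuming each post is reduced to a (date, sentiment score) pair. The scoring method, the data layout, and the example values are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean

def sentiment_around_game(posts, game_day):
    """Mean sentiment on the day before, the day of, and the day after a game.

    `posts` is a list of (date, score) pairs; a window with no posts yields None.
    """
    windows = {
        "before": game_day - timedelta(days=1),
        "gameday": game_day,
        "after": game_day + timedelta(days=1),
    }
    result = {}
    for label, day in windows.items():
        scores = [score for d, score in posts if d == day]
        result[label] = mean(scores) if scores else None
    return result

# Toy posts from one team subreddit around a hypothetical loss.
toy_posts = [
    (date(2016, 9, 10), 0.2),
    (date(2016, 9, 10), 0.4),
    (date(2016, 9, 11), 0.1),
    (date(2016, 9, 12), -0.3),
    (date(2016, 9, 12), -0.1),
]

around = sentiment_around_game(toy_posts, date(2016, 9, 11))
print(around)
```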

[Figure: New York Giants, Sentiment Around Games. Mean sentiment on the day before, on the day of, and on the day after each game.]

We can generally see a decrease in sentiment when a team loses and an increase when they win. This observation becomes even clearer when looking at the mean sentiment across all teams:

[Figure: Mean Sentiment Across All Teams. Aggregated sentiment before, during, and after wins vs. losses.]

Key Finding: Even if the increase in sentiment after a win is relatively small, we can clearly see a steep decline after a loss. NFL fans get notably negative after their team loses!

Can We Predict Game Results Using Sentiment?

While looking at sentiment is interesting, can we go even further? Can we predict the results of NFL games using only sentiment?

We trained a Random Decision Tree model on results and sentiments from the 2015-16 NFL season, then used it to predict the results of the 2016-17 season. Using only three measures for sentiment (before, on, and after gamedays), we achieved a remarkable result:

84% prediction accuracy using only sentiment data

[Figure: Houston Texans, Predicted vs. Actual Wins. Cumulative actual and predicted wins throughout the 2016-17 season.]

This shows that, even though we are looking at relatively simple and polarizing events with binary outcomes (winning or losing), real-life events can significantly affect the overall sentiment of certain subreddits.
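The train-on-one-season, test-on-the-next setup can be illustrated with a deliberately simplified stand-in. The sketch below is not the model used in the study: it fits a single-split decision stump (the smallest possible decision tree) on one derived feature, the change in sentiment from the day before to the day after a game. The feature choice, the data, and all names are assumptions. Note that the day-after value is only observed once the result is known.

```python
def sentiment_shift(games):
    """Day-after minus day-before sentiment for each game."""
    return [after - before for before, _gameday, after, _won in games]

def accuracy(shifts, labels, threshold):
    """Fraction of games classified correctly by 'win if shift > threshold'."""
    hits = sum((s > threshold) == bool(y) for s, y in zip(shifts, labels))
    return hits / len(labels)

def fit_stump(shifts, labels):
    """Pick the threshold with the best training accuracy."""
    candidates = [float("-inf")] + sorted(set(shifts))
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = accuracy(shifts, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy games: (sentiment before, on gameday, after, won). Hypothetical values.
season_1 = [(0.10, 0.12, 0.25, 1), (0.15, 0.05, -0.10, 0),
            (0.05, 0.10, 0.12, 1), (0.20, 0.00, -0.05, 0)]
season_2 = [(0.10, 0.10, 0.30, 1), (0.10, 0.00, -0.20, 0),
            (0.00, 0.05, 0.02, 0)]

threshold = fit_stump(sentiment_shift(season_1), [g[3] for g in season_1])
test_accuracy = accuracy(sentiment_shift(season_2), [g[3] for g in season_2], threshold)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Fitting on the earlier season and scoring on the later one keeps the evaluation free of information from the test period.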

This made us hopeful that we could find even more interesting correlations and possibly even predictions in another domain that might be even more emotion-inducing than sports...

Analysis 02

Finance

When Reddit talks crypto, does the market listen?

Can Reddit Predict Markets?

We analyzed crypto and finance subreddits from 2014 to 2017 to see whether Reddit activity could inform trading decisions. For each day, we collected post counts and ran sentiment analysis, including VADER scores, anxiety and anger levels, and other NLP metrics.

Beyond raw post counts, we constructed derived signals: moving averages to smooth out daily noise, trend indicators comparing recent activity to historical baselines, and a "volume spike" ratio to identify days with unusually high posting activity.
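These derived signals could look roughly like the sketch below. The window lengths for the trend indicator (7 and 30 days) and the decision to include the current day in each trailing average are assumptions; only the 20-day window of the volume spike ratio comes from the text.

```python
from statistics import mean

def rolling_mean(values, window):
    """Trailing mean over the most recent `window` values, including today.
    Early days use however many values are available."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def trend(values, short=7, long=30):
    """Short-window average divided by long-window average; above 1 means rising."""
    short_avg = rolling_mean(values, short)
    long_avg = rolling_mean(values, long)
    return [s / l if l else None for s, l in zip(short_avg, long_avg)]

def volume_spike(values, window=20):
    """Today's post count divided by its trailing average."""
    averages = rolling_mean(values, window)
    return [v / a if a else None for v, a in zip(values, averages)]

# Toy daily post counts with a burst on the last day (hypothetical values).
toy_posts = [10, 10, 10, 10, 10, 40]

smoothed = rolling_mean(toy_posts, window=3)
spike = volume_spike(toy_posts, window=3)
rising = trend(toy_posts, short=2, long=3)
print(smoothed[-1], spike[-1], rising[-1])
```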

Our main finding is the following: Reddit activity correlates well with volatility, especially for Bitcoin. When posts spike, prices tend to swing, but not in any predictable direction.

2014-2017: analysis period
BTC + S&P: two asset classes tested
3,024: strategy combinations tested

1 First Attempt: Machine Learning

Our first approach was to train ML models to predict whether prices would go up or down the next day. We fed Reddit sentiment features (anxiety, anger, positive/negative emotion, VADER scores, and post counts) to five different classifiers. If Reddit discussions contained any predictive signal, these models should find it.

Result: All Models Failed

None could reliably beat random guessing (50%)

[Chart: ML Models: Predicting Bitcoin Price Direction from Reddit Sentiment. Accuracy (%) per model, on an axis from 40 to 65, against the 50% random baseline.]

Why did this fail? A likely explanation is market efficiency. If Reddit sentiment could reliably predict price direction, hedge funds would have already exploited that edge until it disappeared. The "wisdom of the crowd" is already priced in.

2 What Actually Works: Volatility Prediction

Since predicting direction failed, we asked a different question: can Reddit predict how much prices will move, regardless of direction? The answer is yes. When Reddit activity spikes, volatility follows. This is especially true for Bitcoin, where the retail-driven market responds strongly to social media buzz.

Higher Reddit Activity = Higher Volatility

Average daily price swings by Reddit posting level

[Charts: average market volatility by Reddit activity level, one panel for Bitcoin (axis from 0.0 to 0.6) and one for the S&P 500 (axis from 0.00 to 0.14).]

Key insight: For Bitcoin, volatility nearly doubles from low to high activity periods. The correlation is statistically significant (p = 0.003).

For S&P 500, the effect is much weaker and not statistically significant (p = 0.636). This makes sense: crypto was retail-driven in 2014-2017, while the stock market is dominated by institutions that don't post on Reddit.
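One way to produce a comparison like this is to rank days by posting activity, cut them into equal-sized levels, and average volatility within each level. The sketch below assumes three levels and a generic per-day volatility number; how activity levels and volatility were actually defined in the study is not stated, and the values are invented. It does not reproduce the significance test.

```python
from statistics import mean

def volatility_by_activity(post_counts, volatility, n_levels=3):
    """Average volatility per activity level, from lowest to highest activity."""
    order = sorted(range(len(post_counts)), key=lambda i: post_counts[i])
    size = len(order) // n_levels
    levels = []
    for k in range(n_levels):
        # The last level absorbs any leftover days.
        end = (k + 1) * size if k < n_levels - 1 else len(order)
        days = order[k * size:end]
        levels.append(mean([volatility[i] for i in days]))
    return levels

# Toy data: daily post counts and a daily volatility measure (hypothetical).
toy_counts = [5, 50, 20, 8, 45, 22]
toy_volatility = [0.01, 0.05, 0.02, 0.01, 0.03, 0.02]

low, medium, high = volatility_by_activity(toy_counts, toy_volatility)
print(low, medium, high)
```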

3 Building a Trading Strategy

Since Reddit predicts volatility but not direction, we built a simple strategy: exit the market when Reddit activity spikes (high volatility coming), then re-enter after things calm down. We tested 3,024 combinations of different signals, thresholds, and waiting periods to find what works best.

Understanding Strategy Parameters

Each strategy in the dropdown is defined by 4 parameters. For example, Volume Spike (p97, High, 10d) means:

Signal Variable

The Reddit metric we track. "Volume Spike" is today's post count divided by the 20-day average. Other options include raw Post Count, VADER sentiment scores, or Anxiety levels.

Threshold (p97, p65, etc.)

The percentile that triggers an exit. "p97" means we sell when the signal is higher than 97% of all historical values. Higher percentiles are more selective and trigger less often.

Direction (High or Low)

"High" means exit when the signal spikes up (too much attention). "Low" means exit when the signal drops (too little attention). Bitcoin works better with High exits, S&P with Low.

Re-entry Delay (10d, 2d, etc.)

How many days to wait before buying back in after exiting. Longer delays let volatility settle but risk missing rallies. Bitcoin's best strategy waits 10 days; S&P only waits 2.
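The four parameters above map onto a small exit/re-entry rule. The sketch below is one possible reading, with several assumptions stated plainly: the percentile is taken over the whole series (which uses future values, so a live version would need a trailing window), the position decided on one day is applied to the next day's return, and "delay" counts full days spent out of the market after the exit day. All names and numbers are illustrative.

```python
def percentile(values, p):
    """Nearest-rank style percentile of `values` for p between 0 and 100."""
    ranked = sorted(values)
    return ranked[round(p / 100 * (len(ranked) - 1))]

def positions(signal, p=97, direction="high", delay=10):
    """One boolean per day: True if invested after seeing that day's signal."""
    threshold = percentile(signal, p)
    held, days_left = [], 0
    for s in signal:
        triggered = s > threshold if direction == "high" else s < threshold
        if triggered:
            days_left = delay          # exit and restart the waiting period
            held.append(False)
        elif days_left > 0:
            days_left -= 1             # still waiting to re-enter
            held.append(False)
        else:
            held.append(True)
    return held

def cumulative_return(daily_returns, held):
    """Compound returns, applying each day's position to the next day's return."""
    value = 1.0
    for r, h in zip(daily_returns[1:], held[:-1]):
        if h:
            value *= 1 + r
    return value - 1

# Toy signal and returns (hypothetical): a spike on day 2, a crash on day 3.
toy_signal = [1, 1, 5, 1, 1, 1, 1]
toy_returns = [0.01, 0.02, 0.03, -0.20, 0.05, 0.01, 0.02]

held = positions(toy_signal, p=80, direction="high", delay=1)
strategy = cumulative_return(toy_returns, held)
buy_hold = cumulative_return(toy_returns, [True] * len(toy_returns))
print(held, round(strategy, 4), round(buy_hold, 4))
```

In this toy run the rule sidesteps the crash but also misses the rebound on the following day, which is the trade-off the re-entry delay controls.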

Let's see exactly how this works in practice. The chart below zooms into a 4-month period in 2016, showing what happens when Reddit activity spikes above our threshold.

How the Strategy Works

Zoomed view: June-September 2016

The chart shows how our signal works. The orange line is the "Volume Spike" ratio, calculated as today's post count divided by the 20-day average. When it crosses the red threshold (the 97th percentile), we exit the market.

The purple line shows Bitcoin's volatility. Notice how spikes in posting activity often coincide with or come right before volatility spikes. The red shaded zones mark periods when we're out of the market, sitting safely in cash while prices swing wildly.

The key insight: We don't try to predict direction. We just avoid being in the market when everyone is talking about it, because that's when the biggest crashes tend to happen.

[Chart: Volume Spike ratio (axis from 0.0 to 3.0) and volatility (axis from 0% to 6%), June to September, with the threshold line and out-of-market periods marked.]

Now let's put these strategies to the test. The interactive chart below shows a "race" between our Reddit-based strategy and simple buy-and-hold investing. You can switch between Bitcoin and S&P 500, and try different signals to see which combinations actually beat the market.

Strategy Race: Bitcoin (2014-2017)

Compare different Reddit signals against Buy & Hold

[Chart: return (%) by year, 2014-2017, for Buy & Hold vs. Reddit Strategy, on a return axis from 0 to 1500.]

Volume Over Sentiment

How much people post matters more than what they say. Raw activity beats fancy sentiment scores.

Patience Pays

After exiting, wait for the noise to die down. Jumping back in too early gets you caught in aftershocks.

Avoid the Worst

You don't need to catch every rally. Dodging the big crashes is what drives outperformance.

4 Overfitting Check: Do These Strategies Actually Work?

The strategy race above shows impressive results, but there's a problem: we found those strategies by searching the full 2014-2017 dataset. That's a classic overfitting trap. To test whether these patterns are real, we split the data temporally and validated on unseen data.

Train/Test Validation Results

Testing strategies on data they've never seen

Training period (60%): 2014-05 to 2016-02, used to find the top 10 strategies
Test period (40%): 2016-02 to 2017-04, unseen data for validation

Bitcoin: 4/10 strategies generalize
S&P 500: 4/10 strategies generalize

What this means: Only about 40% of strategies that looked great on training data actually beat buy-and-hold when tested on unseen data. The impressive backtested returns in the race above are likely overstated. The correlation between Reddit activity and volatility is real, but profitably timing the market based on it is much harder than the numbers suggest.
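The validation logic can be sketched in two steps: a chronological 60/40 split with no shuffling, then a count of how many selected strategies still beat buy-and-hold on the later segment. The helper names and the return figures below are hypothetical and only show the mechanics.

```python
def temporal_split(rows, train_fraction=0.6):
    """Split chronologically ordered rows into (train, test) without shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def count_generalizing(strategy_test_returns, buy_hold_test_return):
    """Number of strategies whose test-period return beats buy-and-hold."""
    return sum(r > buy_hold_test_return for r in strategy_test_returns)

# Ten placeholder "days" in chronological order.
train_days, test_days = temporal_split(list(range(10)))

# Hypothetical test-period returns for strategies picked on the training data.
toy_strategy_returns = [0.30, 0.10, 0.25, 0.05]
n_good = count_generalizing(toy_strategy_returns, buy_hold_test_return=0.20)
print(train_days, test_days, n_good)
```

Selecting strategies only on the first segment is what makes the second segment a fair check; any tuning that touches the test period would reintroduce the overfitting problem.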

Analysis & Interpretation

1 Reddit Measures Attention, Not Intelligence

Reddit's value as a financial signal lies not in the wisdom of the crowd, but in measuring crowd size. When posting volume reaches extreme levels (97th percentile and above), it typically means a topic has achieved peak mainstream attention. This is often the point where early investors have already positioned themselves and latecomers are piling in.

The useful signal is not what people are saying, but the fact that everyone is suddenly saying something. Extreme activity reliably precedes extreme volatility, even if the direction remains unpredictable.

2 Why Simple Beats Complex

Among all tested strategies, the best performers were remarkably simple. The "Volume Spike" indicator (today's posts divided by the 20-day average) outperformed sophisticated sentiment composites like our Fear Index. More features did not mean better results.

The problem with complex models is overfitting. More variables create more opportunities to find patterns that only existed in the training data. A simple signal that is informative for both Bitcoin and the S&P 500, even though the two assets required opposite exit rules, likely captures something real about how retail attention concentrates at market turning points.

3 Crypto vs. Traditional Markets

Bitcoin and S&P 500 required opposite strategies. For BTC, high Reddit activity was a sell signal (exit at the 97th percentile, wait 10 days). For S&P 500, low activity worked better (exit at the 65th percentile, re-enter after 2 days). This asymmetry reflects their different market structures.

Crypto in 2014-2017 was almost entirely retail-driven. Reddit communities like r/Bitcoin and r/CryptoCurrency were where actual market participants gathered. A surge in posts often signaled inexperienced buyers flooding in at the top, creating conditions for a correction. Exiting before the crowd peaked was profitable.

S&P 500 is dominated by institutional capital, pension funds, and algorithmic traders. Reddit finance discussions (r/investing, r/stocks) were more about sharing opinions than moving capital. Low activity there might coincide with market complacency or holiday periods, which are times when small corrections tend to occur. The effect was weaker (strategy beat buy-and-hold by ~9% vs. ~60% for BTC), but the pattern still appeared in the data.

4 Limitations

This analysis has notable constraints. The Reddit Hyperlinks dataset ends in April 2017, before the major 2017 bull run that brought crypto mainstream. Market dynamics have shifted since then: institutional players now dominate crypto, algorithmic traders scrape Reddit in real-time, and the subreddits themselves have grown by orders of magnitude.

We also did not account for transaction costs, slippage, or execution difficulties. A strategy that exits at the 97th percentile might trigger on days of extreme volatility, exactly when execution quality suffers most. Future work could incorporate more realistic trading assumptions and test on post-2017 data to see if these patterns persist.

Note: This is a retrospective analysis for educational purposes. Past performance in backtests does not predict future results. Markets adapt, edges decay, and what worked in 2014-2017 may not work today. Always do your own research.

Conclusion

Bringing It All Together

Throughout this analysis, we explored whether Reddit sentiment could serve as a meaningful signal for real-world events. We examined two distinct domains, sports and finance, each with different characteristics but united by one common thread: the passionate communities that discuss them on Reddit.

1 Sports: Emotion Follows Outcome

Our analysis of NFL team subreddits revealed a clear pattern: fan sentiment closely follows game outcomes. Using LIWC analysis, we found distinct vocabulary differences between fans of winning and losing teams. More importantly, by examining an external dataset from Cornell with richer data on 11 NFL teams, we observed consistent sentiment shifts around gamedays.

The most striking finding was the asymmetry of emotional response: the decline in sentiment after a loss is notably steeper than the uplift after a win. This pattern was consistent enough that a Random Decision Tree model, trained solely on sentiment features, achieved 84% accuracy in predicting game outcomes for the 2016-17 season.

2 Finance: Activity Predicts Volatility, Not Direction

In finance, we asked whether Reddit discussions could inform trading decisions. Traditional machine learning models failed to predict price direction, achieving accuracies no better than random guessing. This aligns with efficient market theory: if sentiment reliably predicted prices, that edge would quickly be arbitraged away.

However, we discovered a different relationship. Reddit activity strongly correlates with volatility. When posting volume spikes, prices tend to swing dramatically, although not in any predictable direction. This was especially true for Bitcoin during 2014-2017, a period when retail investors dominated the market and Reddit communities like r/Bitcoin were where actual participants gathered.

Building on this insight, we constructed a simple strategy: exit the market when Reddit activity reaches extreme levels (97th percentile), then re-enter after volatility subsides. For Bitcoin, this approach outperformed buy-and-hold by approximately 60% over the analysis period. However, our train/test validation revealed that only about 40% of strategies that appeared profitable on training data actually generalized to unseen data, a sobering reminder of overfitting risks in any backtested strategy.

3 The Common Thread

Despite their differences, both domains share a fundamental insight: how much people talk often matters more than what they say. In sports, the volume and intensity of discussion around gamedays reflect genuine emotional investment. In finance, extreme posting activity signals peak attention, often coinciding with market turning points.

Reddit, in this sense, functions not merely as a mirror reflecting real-world events but, in some settings such as Bitcoin volatility, as a leading indicator. Collective sentiment concentrates and amplifies at moments of significance, providing a window into public mood that traditional data sources cannot easily capture.

Limitations & Future Directions

This analysis uses the Reddit Hyperlinks dataset spanning 2014 to April 2017. Since then, both Reddit and the markets we studied have transformed significantly. Cryptocurrency has shifted from retail-dominated to institutionally-driven; Reddit itself has grown by orders of magnitude; and algorithmic traders now scrape social media in real-time. Patterns that existed in our historical window may no longer hold.

We also did not account for transaction costs, execution slippage, or the practical difficulties of trading during high-volatility periods. Future work could incorporate more realistic trading assumptions, extend the analysis to post-2017 data, and explore whether these signals persist as they become more widely known.

"The signal is the truth. The noise is what distracts us from the truth."