Open dataset of 1.78B links from the public web, 2016-2019 (396MB compressed)
GDELT, a digital news monitoring service, has released a massive open dataset of linking data. The dataset contains a "domain-level graph recording over the period April 22, 2016 through January 28, 2019 how many times each news outlet linked to URLs on any other domain (including subdomains of itself)." More info...
Download (396MB compressed / 986MB uncompressed)
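If you prefer scripting the download, here is a minimal sketch using only the standard library. The URL is a placeholder (use the actual link behind the Download button), and it assumes the ~396MB archive is a ZIP holding the single CSV.

import os
import urllib.request
import zipfile

dest = os.path.expanduser('~/data/gkg-domain-graph')
os.makedirs(dest, exist_ok=True)

# Placeholder URL -- replace with the real download link from the GDELT post
url = 'http://data.gdeltproject.org/.../outlinks.zip'
archive_path, _ = urllib.request.urlretrieve(url, os.path.join(dest, 'outlinks.zip'))

# Assumption: the compressed archive is a ZIP containing the single CSV
with zipfile.ZipFile(archive_path) as zf:
    zf.extractall(dest)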
Summary
The archive contains a single file, MASTER-GKG-OUTLINKS-2016-2018.CSV (986MB). That comfortably fits in RAM, so let's use pandas to see what's inside.
import pandas as pd
data = pd.read_csv('~/data/gkg-domain-graph/MASTER-GKG-OUTLINKS-2016-2018.CSV')
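The default object dtype for the two domain columns inflates the in-memory size well beyond the 986MB on disk. If memory is tight, a hedged variant: declaring dtypes up front (treating the repetitive domain columns as categories is an assumption, using the column names listed below).

# Optional: shrink the in-memory footprint with explicit dtypes
data = pd.read_csv(
    '~/data/gkg-domain-graph/MASTER-GKG-OUTLINKS-2016-2018.CSV',
    dtype={'fromsite': 'category', 'tosite': 'category',
           'numdays': 'int32', 'totlinks': 'int64'},
)
print(f"{data.memory_usage(deep=True).sum() / 1e6:.0f} MB in memory")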
First 10 rows
data.head(10)
The file contains 4 columns:

- `fromsite` - the domain that contains the link
- `tosite` - the domain being linked to
- `numdays` - how many days there was at least one link from `fromsite` to `tosite`
- `totlinks` - how many total links there were from `fromsite` to `tosite`
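Together the two counters give a simple intensity measure. For example, average links per active day, a derived column that is not part of the dataset itself:

# Average number of links per day on which at least one link appeared
data['links_per_day'] = data['totlinks'] / data['numdays']
data.nlargest(10, 'links_per_day')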
Descriptive statistics
pd.options.display.float_format = '{:.2f}'.format  # show floats with 2 decimals instead of scientific notation
data.describe()
Top 10 records by totlinks
data.nlargest(10, 'totlinks')
Most frequent domains in tosite
# For each tosite, count the rows that link to it (one row per linking domain)
grouped = data.groupby('tosite')[['fromsite']].count().nlargest(10, 'fromsite')
grouped.rename(columns={'fromsite': 'N'}, inplace=True)
grouped
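The same top 10 can be computed more tersely with value_counts, which is equivalent as long as fromsite has no missing values:

# Equivalent one-liner: count rows per tosite and keep the top 10
data['tosite'].value_counts().head(10)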
Check out other datasets from the GDELT project.