GKG Domain Data

Open dataset of 1.78b links from the public web, 2016-2019 (396MB compressed)

GDELT, a digital news monitoring service, has released a massive open dataset of linking data. The dataset contains a "domain-level graph recording over the period April 22, 2016 through January 28, 2019 how many times each news outlet linked to URLs on any other domain (including subdomains of itself)." More info...

Download (396MB compressed / 986MB uncompressed)

Summary

The archive contains a single file, MASTER-GKG-OUTLINKS-2016-2018.CSV (986MB). The file is small enough to load into RAM, so let's use pandas to see what's inside.

In [2]:
import pandas as pd
data = pd.read_csv('~/data/gkg-domain-graph/MASTER-GKG-OUTLINKS-2016-2018.CSV')
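If memory is tight, one option is to pass narrower integer types to read_csv. The snippet below is only a sketch: it reads a three-row stand-in (rows copied from the preview in this post) rather than the real file, and the int16/int32 choice is an assumption based on the maxima reported in the descriptive statistics (1013 for numdays, 3874533 for totlinks).

```python
import io
import pandas as pd

# Small stand-in for the real CSV; rows are copied from the preview in this post.
sample_csv = io.StringIO(
    "fromsite,tosite,numdays,totlinks\n"
    "theprovince.com,capcomvancouver.com,1,1\n"
    "house.gmw.cn,langya.cn,34,43\n"
    "teenvogue.com,cookpolitical.com,7,8\n"
)

# Assumed dtypes: numdays peaks at 1013 and totlinks at 3874533 in the full file,
# so int16 and int32 are wide enough for the two numeric columns.
dtypes = {"numdays": "int16", "totlinks": "int32"}

sample = pd.read_csv(sample_csv, dtype=dtypes)
print(sample.dtypes)
print(sample.memory_usage(deep=True))
```

For the full file, the same dtype mapping can be passed alongside the real path in place of the StringIO buffer.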

First 10 lines

In [4]:
data.head(10)
Out[4]:
    fromsite                tosite               numdays  totlinks
0   theprovince.com         capcomvancouver.com        1         1
1   house.gmw.cn            langya.cn                 34        43
2   businessinsider.com.au  robesonian.com             1         1
3   hayspost.com            paypal.com                 1         1
4   elmostrador.cl          womad.cl                  12        14
5   thequietus.com          inglebygallery.com         1         1
6   journaldequebec.com     Smartsource.ca             1         1
7   thedenverchannel.com    oevp.at                    1         1
8   mmajunkie.com           theatlantic.com            1         1
9   teenvogue.com           cookpolitical.com          7         8

The file contains 4 columns:

  • fromsite
  • tosite
  • numdays - number of days on which fromsite linked to tosite at least once
  • totlinks - total number of links from fromsite to tosite
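Taken together, the two numeric columns give a rough linking intensity: totlinks divided by numdays is the average number of links on days when fromsite linked to tosite at all. The sketch below uses three rows copied from the preview; the column name links_per_day is made up for illustration.

```python
import pandas as pd

# Three rows copied from the preview above.
rows = pd.DataFrame(
    {
        "fromsite": ["house.gmw.cn", "elmostrador.cl", "teenvogue.com"],
        "tosite": ["langya.cn", "womad.cl", "cookpolitical.com"],
        "numdays": [34, 12, 7],
        "totlinks": [43, 14, 8],
    }
)

# Every counted day has at least one link, so totlinks should never be below numdays.
assert (rows["totlinks"] >= rows["numdays"]).all()

# Hypothetical derived column: average links per day on which a link was seen.
rows["links_per_day"] = rows["totlinks"] / rows["numdays"]
print(rows)
```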

Descriptive statistics

In [6]:
pd.options.display.float_format = '{:.2f}'.format  # show floats with two decimals instead of scientific notation
data.describe()
Out[6]:
           numdays     totlinks
count  30072787.00  30072787.00
mean          3.94        14.50
std          22.62      1843.27
min           1.00         1.00
25%           1.00         1.00
50%           1.00         1.00
75%           2.00         2.00
max        1013.00   3874533.00
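The gap between the mean (14.50) and the median (1.00) of totlinks points to a heavily skewed distribution: most pairs have a single link, while a few have millions. One way to quantify this is the share of pairs with exactly one link plus a few upper quantiles. The sketch below runs on the ten totlinks values from the preview only, so its numbers describe that tiny sample, not the full dataset.

```python
import pandas as pd

# totlinks values copied from the ten preview rows shown earlier in this post.
preview = pd.DataFrame({"totlinks": [1, 43, 1, 1, 14, 1, 1, 1, 1, 8]})

# Fraction of pairs that were linked exactly once.
share_single = (preview["totlinks"] == 1).mean()

# Upper quantiles show how quickly the tail grows.
quantiles = preview["totlinks"].quantile([0.5, 0.9, 0.99])

print(f"share of pairs with a single link: {share_single:.0%}")
print(quantiles)
```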

Top 10 records by totlinks

In [21]:
data.nlargest(10, 'totlinks')
Out[21]:
          fromsite                   tosite             numdays  totlinks
24043720  keskustelu.kauppalehti.fi  kauppalehti.fi         872   3874533
9699529   entornointeligente.com     twitter.com            729   3406915
23084853  schoolloop.com             google.com             313   3254827
19416797  udn.com                    pixnet.net             918   2727253
15876001  cfi.net.cn                 cfi.cn                1009   2223133
29438959  special.tass.ru            tass.ru               1013   2054466
11909386  indiatimes.com             economictimes.com      817   1938563
21027686  news.meta.ua               meta.ua                993   1859567
8508727   iheart.com                 twitter.com           1013   1848048
8332068   hiphople.com               hiphopLE.com           805   1401873
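Several of the largest pairs are a subdomain linking to its parent domain (keskustelu.kauppalehti.fi to kauppalehti.fi, special.tass.ru to tass.ru, news.meta.ua to meta.ua), and hiphople.com to hiphopLE.com differs only in letter case. A rough way to flag such pairs is a case-insensitive suffix check. This is a naive heuristic written for illustration, not a proper registered-domain comparison: the helper is_same_site is hypothetical, and it will miss related pairs such as cfi.net.cn and cfi.cn.

```python
import pandas as pd

# Four pairs copied from the top-10 table above.
top = pd.DataFrame(
    {
        "fromsite": [
            "keskustelu.kauppalehti.fi",
            "entornointeligente.com",
            "special.tass.ru",
            "hiphople.com",
        ],
        "tosite": ["kauppalehti.fi", "twitter.com", "tass.ru", "hiphopLE.com"],
    }
)


def is_same_site(fromsite: str, tosite: str) -> bool:
    """Naive check: equal ignoring case, or one host is a subdomain of the other."""
    a, b = fromsite.lower(), tosite.lower()
    return a == b or a.endswith("." + b) or b.endswith("." + a)


top["same_site"] = [
    is_same_site(f, t) for f, t in zip(top["fromsite"], top["tosite"])
]
print(top)
```

Filtering out the flagged rows before ranking would leave only links between sites that are, by this heuristic, unrelated.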

Most frequent domains in tosite

In [25]:
grouped = data.groupby('tosite')[['fromsite']].count().nlargest(10, 'fromsite')
grouped.rename(columns={'fromsite': 'N'}, inplace=True)
grouped
Out[25]:
tosite               N
twitter.com      43538
facebook.com     41823
youtube.com      35025
google.com       29455
wikipedia.org    29160
instagram.com    24565
t.co             24396
nytimes.com      21208
theguardian.com  19395
bit.ly           18609
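Here N counts how many distinct fromsite rows point at each tosite, not how many links they carry. A complementary view is to sum totlinks per tosite as well. The sketch below does both in one groupby on three rows copied from the top-10 table; the column name links is made up for illustration, and the totals cover only these sample rows.

```python
import pandas as pd

# Three rows copied from the top-10 table earlier in this post.
rows = pd.DataFrame(
    {
        "fromsite": ["entornointeligente.com", "schoolloop.com", "iheart.com"],
        "tosite": ["twitter.com", "google.com", "twitter.com"],
        "totlinks": [3406915, 3254827, 1848048],
    }
)

# N = number of linking sites, links = total inbound links (hypothetical column name).
inbound = (
    rows.groupby("tosite")
    .agg(N=("fromsite", "count"), links=("totlinks", "sum"))
    .sort_values("links", ascending=False)
)
print(inbound)
```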

Check other datasets by the GDELT project

By Open Datasets
Tags: #Web