Telegram group leaders research. Pandas, Seaborn, Missingno

Due to i'm a member of a Telegram group chat i'd like to research about the most active users there by plotting them in different perspectives:

Top10 leaderboard.
Top10 leaderboard per month.
Message frequency of the top two leaders per month.

To calculate it the chat history of almost a half year has been downloaded by the convenient export function. Then explored and prepared data converting its types and dropping unnecessary fields.

Prepare exported data

Read and convert text to a DataFrame.

import json
import pandas as pd 

with open('ChatExport/result.json') as file:
    data = file.read()

jdata = json.loads(data)
messages_df = pd.json_normalize(jdata['messages'])

Explore data

Make a short general table view of the small data portion.

Start with filtering out missing data, type conversion and samples.

Missing values

The great utility for this purpose is missingno . It visualizes the missings for you as matrix, bar, heatmap, dendogram.

import matplotlib.pyplot as plt
import missingno as msno

msno.bar(messages_df)

plt.show()

png

Anonymization

Hide the actual names and IDs faking them with mimesis .

On the missing data chart, there are only the first 8 columns full of data, therefore the others can be dropped.

import mimesis

def to_md(df, index=False):  # helper function to represent DataFrame as a Markdown table
    print(df.to_markdown(index=index))

columns = messages_df.loc[:, 'type':'text_entities']
sample = columns.sample().transpose()
sample.loc['from'] = mimesis.Person().full_name()
sample.loc['from_id'] = mimesis.Cryptographic().uuid()

to_md(sample, index=True)

	43440
type	message
date	2022-12-21T12:28:44
date_unixtime	1671600524
from	Caleb Dickerson
from_id	aea0ceeb-9174-4bf1-bf79-8d60fee9540d
text	Добро, спасибо!
text_entities	[{'type': 'plain', 'text': 'Добро, спасибо!'}]

Types and samples

columns = columns.dtypes
columns = pd.concat([columns, sample], axis='columns')
columns = columns.reset_index()
columns.columns = ['column_name', 'type', 'sample']
  
to_md(columns)

column_name	type	sample
type	object	message
date	object	2022-12-21T12:28:44
date_unixtime	object	1671600524
from	object	Caleb Dickerson
from_id	object	aea0ceeb-9174-4bf1-bf79-8d60fee9540d
text	object	Добро, спасибо!
text_entities	object	[{'type': 'plain', 'text': 'Добро, спасибо!'}]

Ok, now we see how data from the exported history looks like.

Top10 leaderboard

The chart displays the overall messages count per user.

All the charts on the objectives are built upon time and user. So from here on out i'm dealing with three columns date, form, form_id. Give them better names.

renamed_df = messages_df[['date', 'from', 'from_id']]
renamed_df = renamed_df.rename(columns={'from': 'user_name', 'from_id': 'user_id'})

leaders_count = 10
leader_indexes = renamed_df['user_id'].value_counts().iloc[:leaders_count].index
top10_leaders = renamed_df[renamed_df['user_id'].isin(leader_indexes)]

names_ids = top10_leaders[['user_name', 'user_id']].value_counts()
anonymized = {'user_name': {v: mimesis.Person().full_name()
                             for v in names_ids.index.get_level_values('user_name')},
              'user_id': {v: mimesis.Cryptographic().uuid()
                             for v in names_ids.index.get_level_values('user_id')}}
top10_leaders = top10_leaders.replace(anonymized)

to_md(top10_leaders.sample(5))

date	user_name	user_id
2022-09-22T14:57:10	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e
2022-12-12T10:58:10	Jaleesa Barrett	300604c5-3564-431a-b09c-bf31be86779d
2022-10-19T15:34:31	Berry Herrera	71927cb8-9e3f-490c-ae5f-6b67791964dd
2022-07-09T03:00:47	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e
2022-09-15T19:39:32	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e

import seaborn as sns

order = top10_leaders['user_name'].value_counts().index
sns.countplot(top10_leaders, y='user_name', order=order, palette='Paired')
sns.set_theme(style='white')
sns.set_style({ 
    'axes.spines.left': False,
    'axes.spines.bottom': False,
    'axes.spines.right': False,
    'axes.spines.top': False
})
sns.set_context({
    'axes.labelsize': 0,
    'axes.titlesize': 0,
})

plt.show()

png

Done. Overall view of the leaders, the top two are the most active:

The 1st is, obviously, the group owner.
The 2nd is a regular user and talks much more frequently in comparison to the others.

The color palette Paired is taken for matching leaders pairwise and keeps it constant further.

Time to split it up on facets per month.

Top10 leaderboard per month

Add a month column for pointing it to in the Seaborn facet chart method.

top10_leaders['date'] = pd.to_datetime(top10_leaders['date'])
top10_leaders['month'] = top10_leaders['date'].dt.month_name()

to_md(top10_leaders.sample(5))

date	user_name	user_id	month
2022-09-21 01:48:03	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e	September
2022-12-03 23:24:58	Tyron Pickett	e1073fd1-2ca9-43fb-90d3-48aeb4d9b061	December
2022-09-02 14:27:54	Jaleesa Barrett	300604c5-3564-431a-b09c-bf31be86779d	September
2022-07-26 13:05:15	Tyron Pickett	e1073fd1-2ca9-43fb-90d3-48aeb4d9b061	July
2022-08-25 01:39:11	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e	August

fg = sns.catplot(top10_leaders, y='user_name', col='month', kind='count',
            order=order, col_wrap=3, palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

png

What do i see here?

The top two hold their places every month.
In December the 2nd leader is changed. And the month is only when the 9th was abnormally active.
The top two prevail over the others' activeness.
The top two's activeness proportion differs month by month.
The leaders from 3rd are evenly active.

Top2 leaders message frequency per month

The top two messaged more than others, so plotting a distribution could reveal new activity trends. Data of the others are dropped.

Add a day value for each message to see how it distributes over a month.

top10_leaders['day'] = top10_leaders['date'].dt.day
leaders_count = 2
leader_indexes = top10_leaders['user_id'].value_counts().iloc[:leaders_count].index
top2_leaders = top10_leaders[top10_leaders['user_id'].isin(leader_indexes)]

to_md(top2_leaders.sample(5))

date	user_name	user_id	month	day
2022-11-02 08:23:06	Jaleesa Barrett	300604c5-3564-431a-b09c-bf31be86779d	November	2
2022-07-24 22:33:38	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e	July	24
2023-02-17 16:00:17	Jaleesa Barrett	300604c5-3564-431a-b09c-bf31be86779d	February	17
2022-09-03 17:37:39	Nicky Prince	b86c8f80-686f-4d72-af98-b3f3ad19656e	September	3
2022-08-04 23:30:30	Jaleesa Barrett	300604c5-3564-431a-b09c-bf31be86779d	August	4

fg = sns.displot(data=top2_leaders, y='day', col='month', hue='user_name', multiple='stack', col_wrap=3, kind='hist', palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

png

Ok, November and December contain more spikes.

As the leaders are paired, their frequency can be compared side by side relatively.

fg = sns.displot(data=top2_leaders, y='day', col='month', hue='user_name', multiple='fill',
                 col_wrap=3, kind='kde', palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

png

The appearance is intuitive and wordy. The June chart's shape differs significantly.

Afterword

Before getting started, i already knew who talks too much and have proved it by making displaying it with all those charts. Along with it, i got a new understanding of:

the top two's messages ratio
the 2nd to others ratio