Telegram group leaders research. Pandas, Seaborn, Missingno

As a member of a Telegram group chat, I'd like to find out who its most active users are by plotting them from several perspectives:

  • Top10 leaderboard.
  • Top10 leaderboard per month.
  • Message frequency of the top two leaders per month.

To calculate this, almost half a year of chat history was downloaded with Telegram's convenient export function. The data was then explored and prepared by converting types and dropping unnecessary fields.

Prepare exported data

Read and convert text to a DataFrame.

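import json
import pandas as pd

# Read the export as UTF-8 (assumed encoding) so non-ASCII message text decodes correctly.
with open('ChatExport/result.json', encoding='utf-8') as file:
    jdata = json.load(file)

# Flatten the list of message records into a table, one row per message.
messages_df = pd.json_normalize(jdata['messages'])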
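To see what `pd.json_normalize` does with such records, here is a minimal sketch on hand-made data. The two records are hypothetical and only imitate the general shape of an export; the real file has many more fields.

```python
import pandas as pd

# Hypothetical records, loosely shaped like exported messages (not real data).
toy_messages = [
    {"id": 1, "type": "message", "from": "Alice", "from_id": "user1",
     "text": "hi", "photo": {"width": 100}},
    {"id": 2, "type": "service", "actor": "Bob", "action": "join"},
]

toy_df = pd.json_normalize(toy_messages)

# Nested dicts are flattened into dotted column names, and keys that a
# record lacks show up as missing values in that row.
print(sorted(toy_df.columns))
print(toy_df.isna().sum())
```

This is why the real DataFrame ends up with many sparsely filled columns: every key that appears in any message becomes a column for all of them.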

Explore data

Take a quick look at a small portion of the data in table form.

Start with filtering out missing data, then look at the column types and a sample.

Missing values

A great utility for this purpose is missingno. It visualizes missing values as a matrix, bar chart, heatmap, or dendrogram.

import matplotlib.pyplot as plt
import missingno as msno

msno.bar(messages_df)

plt.show()

[Figure: missingno bar chart of non-missing values per column]
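The numbers behind the bar chart can also be read with plain pandas. This is a minimal sketch on a made-up frame (the column values are hypothetical), not a replacement for the missingno plot:

```python
import pandas as pd

# Hypothetical frame: two columns are complete, one is mostly empty.
toy_df = pd.DataFrame({
    "type": ["message", "message", "service"],
    "date": ["2022-07-01T10:00:00", "2022-07-01T10:05:00", "2022-07-02T09:00:00"],
    "action": [None, None, "join"],
})

# Count non-missing values per column, the same quantity the bar chart shows.
non_missing = toy_df.notna().sum()

# Keep only the names of the columns that have a value in every row.
full_columns = non_missing[non_missing == len(toy_df)].index.tolist()

print(non_missing)
print(full_columns)
```

A list like `full_columns` is handy when the fully populated columns are not contiguous and a label slice would not select them.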

Anonymization

Hide the actual names and IDs by faking them with mimesis.

The missing data chart shows that only the first 8 columns are full of data, so the others can be dropped.

import mimesis

def to_md(df, index=False):  # helper function to represent DataFrame as a Markdown table
    print(df.to_markdown(index=index))

columns = messages_df.loc[:, 'type':'text_entities']
sample = columns.sample().transpose()
sample.loc['from'] = mimesis.Person().full_name()
sample.loc['from_id'] = mimesis.Cryptographic().uuid()

to_md(sample, index=True)
|               | 43440 |
|---------------|-------|
| type          | message |
| date          | 2022-12-21T12:28:44 |
| date_unixtime | 1671600524 |
| from          | Caleb Dickerson |
| from_id       | aea0ceeb-9174-4bf1-bf79-8d60fee9540d |
| text          | Добро, спасибо! |
| text_entities | [{'type': 'plain', 'text': 'Добро, спасибо!'}] |
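The sample above fakes a single row. When many rows are shown, each real user should always map to the same fake value, otherwise the counts per user stop making sense. Here is a minimal sketch of such a consistent mapping. It swaps mimesis for `uuid` from the standard library and uses placeholder names like "User 1", so the data and naming scheme are assumptions for illustration only.

```python
import uuid
import pandas as pd

# Hypothetical messages: Alice wrote twice, one service row has no author.
toy_df = pd.DataFrame({
    "from": ["Alice", "Bob", "Alice", None],
    "from_id": ["user1", "user2", "user1", None],
})

# Build one pseudonym per distinct real value, so repeats stay consistent.
fake_ids = {real: str(uuid.uuid4())
            for real in toy_df["from_id"].dropna().unique()}
fake_names = {real: f"User {i}"
              for i, real in enumerate(toy_df["from"].dropna().unique(), start=1)}

# Map the real values to pseudonyms; missing authors stay missing.
anonymized_df = toy_df.copy()
anonymized_df["from"] = toy_df["from"].map(fake_names)
anonymized_df["from_id"] = toy_df["from_id"].map(fake_ids)

print(anonymized_df)
```

Building the dictionaries first and mapping afterwards keeps the real-to-fake relation in one place, which makes it easy to check that no real value leaks through.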

Types and samples

columns = columns.dtypes
columns = pd.concat([columns, sample], axis='columns')
columns = columns.reset_index()
columns.columns = ['column_name', 'type', 'sample']
  
to_md(columns)
| column_name   | type   | sample |
|---------------|--------|--------|
| type          | object | message |
| date          | object | 2022-12-21T12:28:44 |
| date_unixtime | object | 1671600524 |
| from          | object | Caleb Dickerson |
| from_id       | object | aea0ceeb-9174-4bf1-bf79-8d60fee9540d |
| text          | object | Добро, спасибо! |
| text_entities | object | [{'type': 'plain', 'text': 'Добро, спасибо!'}] |

OK, now we can see what the data from the exported history looks like.
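Every column is read as `object`, so the dates are still plain strings at this point. A minimal sketch of converting them, assuming the ISO-like date format and the string timestamps seen in the sample (the second row's timestamp is a hypothetical value):

```python
import pandas as pd

# Strings in the same format as the exported 'date' and 'date_unixtime' fields.
toy_df = pd.DataFrame({
    "date": ["2022-12-21T12:28:44", "2022-09-22T14:57:10"],
    "date_unixtime": ["1671600524", "1663847830"],  # second value is made up
})

# Parse the ISO-like strings into real timestamps.
toy_df["date"] = pd.to_datetime(toy_df["date"])

# Turn the numeric strings into integers.
toy_df["date_unixtime"] = toy_df["date_unixtime"].astype("int64")

print(toy_df.dtypes)
```

Once `date` is a datetime column, the `.dt` accessor becomes available, which is what the month and day columns later in this post rely on.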

Top10 leaderboard

The chart displays the overall messages count per user.

All the charts in the objectives are built on time and user, so from here on I'm dealing with three columns: date, from, from_id. Let's give them better names.

renamed_df = messages_df[['date', 'from', 'from_id']]
renamed_df = renamed_df.rename(columns={'from': 'user_name', 'from_id': 'user_id'})

leaders_count = 10
leader_indexes = renamed_df['user_id'].value_counts().iloc[:leaders_count].index
top10_leaders = renamed_df[renamed_df['user_id'].isin(leader_indexes)]

names_ids = top10_leaders[['user_name', 'user_id']].value_counts()
anonymized = {'user_name': {v: mimesis.Person().full_name()
                             for v in names_ids.index.get_level_values('user_name')},
              'user_id': {v: mimesis.Cryptographic().uuid()
                             for v in names_ids.index.get_level_values('user_id')}}
top10_leaders = top10_leaders.replace(anonymized)

to_md(top10_leaders.sample(5))
| date                | user_name       | user_id |
|---------------------|-----------------|---------|
| 2022-09-22T14:57:10 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e |
| 2022-12-12T10:58:10 | Jaleesa Barrett | 300604c5-3564-431a-b09c-bf31be86779d |
| 2022-10-19T15:34:31 | Berry Herrera   | 71927cb8-9e3f-490c-ae5f-6b67791964dd |
| 2022-07-09T03:00:47 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e |
| 2022-09-15T19:39:32 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e |
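The leader selection used above boils down to two steps: count messages per user, then keep only the rows of the most frequent users. A toy version with made-up IDs, where `top_n` is a hypothetical name standing in for `leaders_count`:

```python
import pandas as pd

# Hypothetical message log: one row per message, identified by its author.
toy_df = pd.DataFrame({"user_id": ["a", "b", "a", "c", "a", "b", "d"]})

top_n = 2  # the post uses 10

# value_counts sorts by count in descending order, so the first entries are the leaders.
counts = toy_df["user_id"].value_counts()
top_ids = counts.iloc[:top_n].index

# Keep every message written by one of the leaders.
top_rows = toy_df[toy_df["user_id"].isin(top_ids)]

print(counts.head(top_n))
print(len(top_rows))
```

Filtering with `isin` keeps one row per message, which is the shape a count plot expects.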
import seaborn as sns

# Configure the theme before plotting so that the settings apply to this chart.
sns.set_theme(style='white')
sns.set_style({
    'axes.spines.left': False,
    'axes.spines.bottom': False,
    'axes.spines.right': False,
    'axes.spines.top': False
})
sns.set_context({
    'axes.labelsize': 0,  # hide the axis labels
    'axes.titlesize': 0,
})

# Order the bars from the most to the least active user.
order = top10_leaders['user_name'].value_counts().index
sns.countplot(top10_leaders, y='user_name', order=order, palette='Paired')

plt.show()

[Figure: Top10 leaderboard, message count per user]

Done. This is the overall view of the leaders. The top two are by far the most active:

  • The 1st is, obviously, the group owner.
  • The 2nd is a regular user who talks much more frequently than the others.

The Paired color palette is used to match the leaders pairwise, and it is kept the same in the charts that follow.

Time to split the chart into facets per month.

Top10 leaderboard per month

Add a month column to pass to Seaborn's facet chart function.

top10_leaders['date'] = pd.to_datetime(top10_leaders['date'])
top10_leaders['month'] = top10_leaders['date'].dt.month_name()

to_md(top10_leaders.sample(5))
| date                | user_name       | user_id | month |
|---------------------|-----------------|---------|-------|
| 2022-09-21 01:48:03 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e | September |
| 2022-12-03 23:24:58 | Tyron Pickett   | e1073fd1-2ca9-43fb-90d3-48aeb4d9b061 | December |
| 2022-09-02 14:27:54 | Jaleesa Barrett | 300604c5-3564-431a-b09c-bf31be86779d | September |
| 2022-07-26 13:05:15 | Tyron Pickett   | e1073fd1-2ca9-43fb-90d3-48aeb4d9b061 | July |
| 2022-08-25 01:39:11 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e | August |

fg = sns.catplot(top10_leaders, y='user_name', col='month', kind='count',
                 order=order, col_wrap=3, palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

[Figure: Top10 leaderboard per month, one facet per month]

What do I see here?

  • The top two hold their places every month.
  • In December the 2nd leader changes. It is also the only month when the 9th was unusually active.
  • The top two are far more active than the others.
  • The ratio between the top two's activity differs from month to month.
  • The leaders from 3rd place down are about equally active.
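Observations like these can be cross-checked with a plain table of counts per month and user. The sketch below uses made-up rows with hypothetical names. It also uses a monthly period instead of a month name, which is my own assumption rather than what the post does: a period keeps the year and sorts chronologically.

```python
import pandas as pd

# Hypothetical messages spanning a year boundary.
toy_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-11-03 10:00", "2022-11-15 12:30", "2022-11-20 18:45",
        "2022-12-01 09:10", "2022-12-24 21:00", "2023-01-05 08:00",
    ]),
    "user_name": ["Ann", "Ben", "Ann", "Ben", "Ben", "Ann"],
})

# A monthly period such as 2022-11 keeps the year, unlike a bare month name.
toy_df["month"] = toy_df["date"].dt.to_period("M")

# Rows are months, columns are users, cells are message counts.
monthly_counts = pd.crosstab(toy_df["month"], toy_df["user_name"])

print(monthly_counts)
```

Each row of `monthly_counts` corresponds to one facet of the chart, so a change of places between two users shows up as two columns swapping their order of magnitude.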

Top2 leaders message frequency per month

The top two posted more messages than the others, so plotting a distribution could reveal new activity trends. The other users' data is dropped.

Add a day value for each message to see how the messages are distributed over a month.

top10_leaders['day'] = top10_leaders['date'].dt.day
leaders_count = 2
leader_indexes = top10_leaders['user_id'].value_counts().iloc[:leaders_count].index
top2_leaders = top10_leaders[top10_leaders['user_id'].isin(leader_indexes)]

to_md(top2_leaders.sample(5))
| date                | user_name       | user_id | month | day |
|---------------------|-----------------|---------|-------|-----|
| 2022-11-02 08:23:06 | Jaleesa Barrett | 300604c5-3564-431a-b09c-bf31be86779d | November | 2 |
| 2022-07-24 22:33:38 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e | July | 24 |
| 2023-02-17 16:00:17 | Jaleesa Barrett | 300604c5-3564-431a-b09c-bf31be86779d | February | 17 |
| 2022-09-03 17:37:39 | Nicky Prince    | b86c8f80-686f-4d72-af98-b3f3ad19656e | September | 3 |
| 2022-08-04 23:30:30 | Jaleesa Barrett | 300604c5-3564-431a-b09c-bf31be86779d | August | 4 |

fg = sns.displot(data=top2_leaders, y='day', col='month', hue='user_name', multiple='stack',
                 col_wrap=3, kind='hist', palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

[Figure: stacked histograms of the top two leaders' messages by day, one facet per month]

Ok, November and December contain more spikes.
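The spikes can also be located numerically by counting messages per user and day. A minimal sketch on made-up rows: the names and numbers are hypothetical, and only the column names mirror `top2_leaders`.

```python
import pandas as pd

# Hypothetical rows with the same columns the histogram uses.
toy_df = pd.DataFrame({
    "user_name": ["Ann", "Ann", "Ann", "Ben", "Ben", "Ann"],
    "month": ["November"] * 4 + ["December"] * 2,
    "day": [2, 2, 2, 2, 17, 17],
})

# One row per (month, day, user) with the number of messages sent that day.
daily_counts = (
    toy_df.groupby(["month", "day", "user_name"])
          .size()
          .reset_index(name="messages")
)

# The busiest days come first.
busiest = daily_counts.sort_values("messages", ascending=False).head(3)

print(busiest)
```

Note that a histogram groups several days into one bin, so its bars can be taller than any single day in a table like this.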

Since the leaders are paired, their frequencies can also be compared side by side in relative terms.

fg = sns.displot(data=top2_leaders, y='day', col='month', hue='user_name', multiple='fill',
                 col_wrap=3, kind='kde', palette='Paired')
sns.set_context({'axes.labelsize': 'medium'})
fg.set_axis_labels("", "")
fg.set_titles(col_template="{col_name}")
fg.despine(top=True, right=True, left=True, bottom=True)

plt.show()

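[Figure: filled KDE plots of the top two leaders' relative message frequency by day, one facet per month]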

The result is intuitive and expressive. The shape of the June chart differs significantly from the others.

Afterword

Before getting started, I already knew who talks too much, and I have confirmed it with all these charts. Along the way, I gained a new understanding of:

  • the top two's messages ratio
  • the 2nd to others ratio
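These ratios can also be put into numbers instead of being read off the charts. A sketch with made-up data and hypothetical names, using `pd.crosstab` with `normalize='index'` to get each user's share of the messages within a month:

```python
import pandas as pd

# Hypothetical messages of two users over two months.
toy_df = pd.DataFrame({
    "month": ["November"] * 4 + ["December"] * 4,
    "user_name": ["Ann", "Ann", "Ann", "Ben", "Ann", "Ben", "Ben", "Ben"],
})

# normalize='index' divides every row by its total, so each month sums to 1.
shares = pd.crosstab(toy_df["month"], toy_df["user_name"], normalize="index")

print(shares)
```

This is the tabular counterpart of the filled plot: every row sums to 1, just as every slice of the filled chart sums to its full width.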