Research moderators activity in the Streamlit community. Dask, Pandas, Seaborn
I'm curious how well the Streamlit company develops its community . To make it better, a lot of questions from very beginners to advanced users have to be answered or, at least, responded to in a reasonable time. Fortunately, the community forum is based on the broadly spread forum engine Discourse . The engine exposes API and all the data can be pulled out from the website to analyze later on for how actively the empowered users(moderators) communicate to people.
To represent the activity the following subjects are researched:
-
Moderatos summary view.
-
Responded topics distribution.
-
First response delay stats.
-
Common delay stats.
To reveal the details about them data, first of all, are collected and then preprocessed into the compact format parquet.
Collect and view data
The two objects are requested from the website topic and post. Topic is nothing than the first post wrapped up with additional information. Both are accessable through the api endpoints:
https://{defaultHost}/latest.json
https://{defaultHost}/t/{id}/posts.json
The suffix .json
of an endpoint says that all the topics and posts you see on a web page are downloaded as json documets, very handy for machine processing. For example, https://discuss.streamlit.io/t/31582.json
.
The whole history of documents has been downloaded by me, not worth to desctibe it here.
What a file content is, how it looks like?
Files overview
There are thousands of small json files on disk named like topic-4288.json
.
$ find /collected -name "topic-*" | wc -l
25139
Each of them contains the topic's posts in the following structure
$ jq -r . files/topic-4288.json | head -n 9
{
"post_stream": {
"posts": [
{
"id": 12049,
"name": "Yordan Radev",
"username": "StuckDuckF",
"avatar_template": "/user_avatar/discourse.holoviz.org/stuckduckf/{size}/4347_2.png",
"created_at": "2022-09-22T19:27:47.715Z",
The whole file topic-4288.json .
Sample overview
What information should be extracted from every single topic? Taking a deeper look.
import json
import pandas as pd
with open('files/topic-4288.json') as file:
topic = json.load(file)
df = pd.json_normalize(topic)
sample = pd.concat([df.dtypes.to_frame(), df.sample(1).T], axis=1)
timeline_lookup | object | [[1, 2]] |
---|---|---|
suggested_topics | object | [] |
tags | object | [] |
id | int64 | 39050 |
title | object | Streamlit Drawable Canvas issue |
fancy_title | object | Streamlit Drawable Canvas issue |
posts_count | int64 | 1 |
created_at | object | 2023-03-09T17:06:26.380Z |
views | int64 | 20 |
reply_count | int64 | 0 |
The whole file sample-overview.txt .
Normalize json to dict records unfolding nested posts list and adding meta columns. The function is used later on while distributed computing.
def get_posts(json_data, as_df=False) -> pd.DataFrame:
df = pd.json_normalize(json_data,
record_prefix='post.',
record_path=['post_stream', 'posts'],
meta_prefix='topic.',
meta=['id', 'created_at'])
if as_df:
return df
return df.to_dict('records')
topic_posts_sample = get_posts(topic, as_df=True)
post.id | 83797 |
---|---|
post.name | Chinar Dankhara |
post.username | chinardankhara |
post.avatar_template | /user_avatar/discuss.streamlit.io/chinardankhara/{size}/20276_2.png |
post.created_at | 2023-03-09T17:06:26.464Z |
post.post_number | 1 |
post.post_type | 1 |
post.updated_at | 2023-03-09T17:06:26.464Z |
post.reply_count | 0 |
post.reply_to_post_number |
Information in the fields topic.id, post.user_id, post.username, post.id, post.created_at, post.post_number, post.staff, post.moderator, post.admin
is comprehensive to compute statistics on the subjects.
Preprocess data
Denormalized json data should be converted to tabular. Here is a phase where Dask is taken to handle multiple files and preprocess them in parallel in a pipeline. The result is saved as the column-oriented data format Parquet .
Run a cluster
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(dashboard_address='127.0.0.1:8787',
worker_dashboard_address='127.0.0.1:0',
n_workers=8,
threads_per_worker=1,
memory_limit='400MiB')
client = Client(cluster)
Serialize posts
# Serialize
import json
from pathlib import Path
import dask.bag
import pandas as pd
posts_bag = (dask.bag.read_text('streamlit/latest/order-created/ascending-False/page-*/topic-*')
.map(json.loads)
.map(get_posts))
Reduce fields and filter duplicates
from tabulate import tabulate
from IPython.display import display_markdown
total = posts_bag.flatten().pluck('post.id').count().compute()
duplicated = posts_bag.flatten().pluck('post.id').frequencies(sort=True).filter(lambda item: item[1] > 1).count().compute()
total | duplicated |
---|---|
44147 | 210 |
# Select necessary fields, drop duplicates, save on disk
def select_keys(dict_, *keys):
d = {key: value for key, value in dict_.items() if key in keys}
return d
posts_bag = posts_bag.flatten().map(select_keys, 'topic.id', 'post.user_id', 'post.username', 'post.id', 'post.created_at', 'post.post_number', 'post.staff', 'post.moderator', 'post.admin')
posts_bag = posts_bag.distinct('post.id')
posts_ddf = posts_bag.to_dataframe()
posts_ddf.to_parquet('files/preprocessed/')
Analyse data
Data is ready to compute statistics and find trends.
Extract moderatos. Summary view
# Filter moderators, admins, staff
posts_df = pd.read_parquet('files/preprocessed/part.0.parquet')
moderators_df = posts_df[posts_df[['post.admin', 'post.moderator', 'post.staff']].any(axis=1)]
groupped = moderators_df.groupby(['post.user_id', 'post.username'])
counts = groupped.size().rename('count')
firsts = groupped.first().loc[:, ['post.admin', 'post.moderator', 'post.staff']]
summary_df = pd.merge(counts, firsts, left_index=True, right_index=True).sort_values('count', ascending=False)
summary_df = summary_df.reset_index()
post.user_id | post.username | count | post.admin | post.moderator | post.staff | |
---|---|---|---|---|---|---|
0 | 1064 | randyzwitch | 5278 | True | False | True |
1 | -1 | system | 1294 | True | True | True |
2 | 4771 | snehankekre | 931 | True | True | True |
3 | 5621 | Caroline | 864 | True | True | True |
4 | 706 | andfanilo | 786 | False | True | True |
5 | 1108 | Charly_Wargnier | 610 | True | True | True |
6 | 11947 | blackary | 472 | False | True | True |
7 | 6 | thiago | 300 | False | True | True |
8 | 1326 | okld | 233 | False | True | True |
9 | 2 | tc1 | 179 | True | True | True |
10 | 18 | tim | 157 | False | True | True |
11 | 2511 | Jessica_Smith | 132 | False | True | True |
12 | -2 | streamlitbot | 122 | True | False | True |
13 | 686 | arnaud | 102 | True | True | True |
14 | 2064 | kmcgrady | 98 | False | True | True |
15 | 3241 | jrieke | 90 | False | True | True |
16 | 4146 | dataprofessor | 46 | False | True | True |
17 | 228 | kantuni | 45 | False | True | True |
18 | 14194 | tonykip | 25 | True | True | True |
19 | 3819 | vdonato | 24 | False | True | True |
20 | 15351 | StreamlitTeam | 15 | True | True | True |
21 | 13976 | jcarroll | 12 | False | True | True |
22 | 16008 | Alexandru_Toader | 8 | False | True | True |
23 | 10717 | kseniaanske | 2 | True | True | True |
# Visualize as three divisions
summary_df['response_amount'] = pd.qcut(summary_df['count'], 3, labels=['small', 'medium', 'big'])
sns.set_style('white')
sns.catplot(data=summary_df,
x='count', y='post.username',
col='response_amount', col_order=['big', 'medium', 'small'],
kind='bar', sharey=False, sharex=False,
color=sns.color_palette("flare_r")[0],
)
sns.despine(left=True, bottom=True)
The top two moderators are the most active, and the second is a bot. randyzwitch
is the only leader here.
Responded topics disrtibution
Responses distribution over the whole community lifetime.
# Drop bots, add flags and convert datetime
posts_df = posts_df[~posts_df['post.username'].isin(['system', 'streamlit'])]
posts_df['is_moderator'] = posts_df[['post.staff', 'post.moderator','post.admin']].any(axis=1)
posts_df['post.created_at'] = pd.to_datetime(posts_df['post.created_at'])
# Figure out whether the first posts were responded and when
responded = posts_df.groupby('topic.id').aggregate({'is_moderator': 'any', 'post.created_at': 'min'})
responded.rename(columns={'is_moderator': 'is_responded'}, inplace=True)
responded['each_month'] = responded['post.created_at'].dt.strftime('%Y-%m')
# Visualize distribution
sns.set_style('white')
hist = sns.histplot(responded, x='each_month', hue='is_responded', multiple='stack')
sns.despine(left=True, bottom=True)
hist.tick_params(left=False)
last_pos = hist.get_xticks()[-1]
hist.set_xticks([0, last_pos],
labels=[responded['each_month'].iloc[0],
responded['each_month'].iloc[-1]]
)
Findings:
- Interestingly, on the right part of the picture, there are lots of unresponded topics.
- The response frequency is decreased on the right part.
- Almost all topics on the left are responded. I guess they are simply closed after a long while.
First response delay stats
How fast moderators respond to a new topic.
posts_df['is_first_post'] = posts_df['post.post_number'] == 1
first_posts = posts_df[(posts_df['is_first_post']) & (~posts_df['is_moderator'])].sort_values('topic.id')
first_responses = posts_df[posts_df['is_moderator']].sort_values('post.created_at').groupby('topic.id', as_index=False).first()
# Set indexes for joining
first_posts.set_index('topic.id', verify_integrity=True, inplace=True)
first_responses.set_index('topic.id', verify_integrity=True, inplace=True)
responded = first_posts.join(first_responses, how='inner', lsuffix='.posts', rsuffix='.responses')
responded = responded[['post.username.posts', 'post.created_at.posts', 'post.username.responses', 'post.created_at.responses']]
responded['first_response_delay'] = responded['post.created_at.responses'] - responded['post.created_at.posts']
responded['first_response_delay'].describe()
first_response_delay | |
---|---|
count | 5584 |
mean | 145 days 04:17:46.949290294 |
std | 234 days 06:39:18.775907688 |
min | 0 days 00:00:01.231000 |
25% | 0 days 05:39:14.983000 |
50% | 1 days 18:19:36.327500 |
75% | 393 days 23:08:31.948750 |
max | 1093 days 14:33:57.247000 |
Too big dispearson. The dataset should get trimmed up to compute a better fit, due to huge difference between 50%
and both 75%
, max
measurements.
Three quantiles are inspected sepately, min - 50%
, 50% - 75%
, 75% - max
. Checking a count of each first.
pd.qcut(responded['first_response_delay'],
[0, .5, .75, 1],
labels=['min - 50%', '50% - 75%', '75% -max'])\
.value_counts()
first_response_delay | |
---|---|
min - 50% | 2792 |
50% - 75% | 1396 |
75% - max | 1396 |
Each of them contains the significant amount of responses. I think the responses are given within 393
and 1093
days shoudn't be considered as abnormaly big and have zero value to the questioners.
The seconds unit isn't appropriate for reporting, converting to hours.
# Take 0.5 and hours
responded['first_response_delay_hours'] = responded['first_response_delay'] / pd.Timedelta('1 hour')
responded_lte_50 = responded.loc[responded['first_response_delay'] <= responded['first_response_delay'].quantile(0.5)]
The representative sample responded_lte_50
is filtered, visualize its distribution and descriptive statistics.
sns.set_theme(style='white')
sns.set_style({
'axes.spines.left': False,
'axes.spines.bottom': False,
'axes.spines.right': False,
'axes.spines.top': False
})
f, axs = plt.subplots(1, 2)
sns.stripplot(responded_lte_50,
x='first_response_delay_hours', y='post.username.responses',
hue='post.username.responses', palette='deep',
legend=False, jitter=0.3, ax=axs[0])
box = sns.boxplot(responded_lte_50,
x='first_response_delay_hours', y='post.username.responses',
palette='deep', ax=axs[1])
box.yaxis.set_ticklabels('')
box.yaxis.set_label_text('')
Findings:
- The majority responses within
~25
hours or one day. - A few come to the community even later.
Common delay stats
The common stats as a summary of those above.
sns.set_theme(style='dark')
box = sns.boxenplot(responded_lte_50, x='first_response_delay_hours', color='xkcd:mauve', linewidth=0.5)
locs = box.xaxis.get_ticklocs()
locs = locs.tolist()
agg_ticks = responded_lte_50['first_response_delay_hours'].agg([lambda s: s.quantile(0.25),
'median',
lambda s: s.quantile(0.75)])
for agg in reversed(agg_ticks.to_list()):
locs.insert(1, round(agg, 2))
box.xaxis.grid(True)
box.xaxis.set_ticks(locs)
5.6
hours is the average time you will get responded to by a moderator within 1.59
- 15.86
hours on Streamlit community ..