Researching moderator activity in the Streamlit community. Dask, Pandas, Seaborn

I'm curious how well the Streamlit company develops its community. To keep it healthy, a lot of questions, from complete beginners to advanced users, have to be answered or, at least, responded to in a reasonable time. Fortunately, the community forum runs on the widely used forum engine Discourse. The engine exposes an API, so all the data can be pulled from the website and analyzed later to see how actively the empowered users (moderators) communicate with people.

To represent the activity, the following subjects are researched:

  • Moderators summary view.

  • Responded topics distribution.

  • First response delay stats.

  • Common delay stats.

To reveal the details, the data is first collected and then preprocessed into the compact Parquet format.

Collect and view data

Two objects are requested from the website: topic and post. A topic is nothing more than the first post wrapped up with additional information. Both are accessible through the API endpoints:

  • https://{defaultHost}/latest.json
  • https://{defaultHost}/t/{id}/posts.json

The .json suffix of an endpoint means that all the topics and posts you see on a web page can be downloaded as JSON documents, which is very handy for machine processing. For example, https://discuss.streamlit.io/t/31582.json.
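A single topic like this one can be fetched with a few lines of requests code; a minimal sketch (the timeout and error handling are my additions, not part of the original collector):

import requests

# Fetch one topic with all of its posts as a JSON document (sketch)
response = requests.get('https://discuss.streamlit.io/t/31582.json', timeout=30)
response.raise_for_status()
topic = response.json()
print(topic['post_stream']['posts'][0]['username'])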

I downloaded the whole history of documents; the crawl itself isn't worth describing in detail here.
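For the curious, it boils down to paging through the latest.json listing and fetching every topic id it returns. A rough sketch, assuming the Discourse order/ascending/page query parameters hinted at by the file layout used later; the pacing and output paths are illustrative only:

import json
import time
import requests

host = 'https://discuss.streamlit.io'
page = 0
while True:
    # Page through the topic listing, ordered by creation date, newest first
    listing = requests.get(f'{host}/latest.json',
                           params={'order': 'created', 'ascending': 'false', 'page': page},
                           timeout=30).json()
    topics = listing['topic_list']['topics']
    if not topics:
        break
    for item in topics:
        # Download the full topic with its posts and store it as a small JSON file
        topic = requests.get(f"{host}/t/{item['id']}.json", timeout=30).json()
        with open(f"topic-{item['id']}.json", 'w') as file:
            json.dump(topic, file)
        time.sleep(1)  # stay polite to the server
    page += 1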

What does a file contain, and what does it look like?

Files overview

There are thousands of small json files on disk named like topic-4288.json.

$ find /collected -name "topic-*" | wc -l
25139

Each of them contains the topic's posts in the following structure:

$ jq -r . files/topic-4288.json | head -n 9

{
  "post_stream": {
    "posts": [
      {
        "id": 12049,
        "name": "Yordan Radev",
        "username": "StuckDuckF",
        "avatar_template": "/user_avatar/discourse.holoviz.org/stuckduckf/{size}/4347_2.png",
        "created_at": "2022-09-22T19:27:47.715Z",

The whole file: topic-4288.json.

Sample overview

What information should be extracted from every single topic? Let's take a deeper look.

import json
import pandas as pd

# Load one topic and inspect dtypes next to a sample value for each field
with open('files/topic-4288.json') as file:
    topic = json.load(file)

df = pd.json_normalize(topic)
sample = pd.concat([df.dtypes.to_frame(), df.sample(1).T], axis=1)
timeline_lookup object [[1, 2]]
suggested_topics object []
tags object []
id int64 39050
title object Streamlit Drawable Canvas issue
fancy_title object Streamlit Drawable Canvas issue
posts_count int64 1
created_at object 2023-03-09T17:06:26.380Z
views int64 20
reply_count int64 0

The whole file: sample-overview.txt.

Normalize the JSON into dict records, unfolding the nested posts list and adding meta columns. The function is used later on during distributed computing.

def get_posts(json_data, as_df=False):
    # Unfold the nested posts list and attach topic-level meta columns
    df = pd.json_normalize(json_data,
                           record_prefix='post.',
                           record_path=['post_stream', 'posts'],
                           meta_prefix='topic.',
                           meta=['id', 'created_at'])
    if as_df:
        return df

    return df.to_dict('records')
    
topic_posts_sample = get_posts(topic, as_df=True)
post.id 83797
post.name Chinar Dankhara
post.username chinardankhara
post.avatar_template /user_avatar/discuss.streamlit.io/chinardankhara/{size}/20276_2.png
post.created_at 2023-03-09T17:06:26.464Z
post.post_number 1
post.post_type 1
post.updated_at 2023-03-09T17:06:26.464Z
post.reply_count 0
post.reply_to_post_number

The information in the fields topic.id, post.user_id, post.username, post.id, post.created_at, post.post_number, post.staff, post.moderator, and post.admin is sufficient to compute statistics on the subjects.

Preprocess data

The denormalized JSON data should be converted to a tabular form. This is the phase where Dask is used to handle multiple files and preprocess them in parallel in a pipeline. The result is saved in the column-oriented Parquet format.

Run a cluster

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(dashboard_address='127.0.0.1:8787',
                       worker_dashboard_address='127.0.0.1:0',
                       n_workers=8,
                       threads_per_worker=1,
                       memory_limit='400MiB')
client = Client(cluster)

Serialize posts

# Serialize
import json
from pathlib import Path
import dask.bag
import pandas as pd

posts_bag = (dask.bag.read_text('streamlit/latest/order-created/ascending-False/page-*/topic-*')
                     .map(json.loads)
                     .map(get_posts))

Reduce fields and filter duplicates

from tabulate import tabulate
from IPython.display import display_markdown

total = posts_bag.flatten().pluck('post.id').count().compute()
duplicated = (posts_bag.flatten()
                       .pluck('post.id')
                       .frequencies(sort=True)
                       .filter(lambda item: item[1] > 1)
                       .count()
                       .compute())
total duplicated
44147 210
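The tabulate and display_markdown imports above are presumably what turned the two counts into the small table shown; a minimal sketch under that assumption:

# Render the duplicate counts as a Markdown table (assumed presentation step)
table = tabulate([[total, duplicated]], headers=['total', 'duplicated'], tablefmt='github')
display_markdown(table, raw=True)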
# Select necessary fields, drop duplicates, save on disk
def select_keys(dict_, *keys):
    d = {key: value for key, value in dict_.items() if key in keys} 
    return d
    
posts_bag = posts_bag.flatten().map(select_keys,
                                    'topic.id', 'post.user_id', 'post.username',
                                    'post.id', 'post.created_at', 'post.post_number',
                                    'post.staff', 'post.moderator', 'post.admin')
posts_bag = posts_bag.distinct('post.id')

posts_ddf = posts_bag.to_dataframe()
posts_ddf.to_parquet('files/preprocessed/')

Analyse data

The data is now ready for computing statistics and finding trends.

Extract moderators. Summary view

# Filter moderators, admins, staff
posts_df = pd.read_parquet('files/preprocessed/part.0.parquet')
moderators_df = posts_df[posts_df[['post.admin', 'post.moderator', 'post.staff']].any(axis=1)]

grouped = moderators_df.groupby(['post.user_id', 'post.username'])
counts = grouped.size().rename('count')
firsts = grouped.first().loc[:, ['post.admin', 'post.moderator', 'post.staff']]
summary_df = pd.merge(counts, firsts, left_index=True, right_index=True).sort_values('count', ascending=False)
summary_df = summary_df.reset_index()
post.user_id post.username count post.admin post.moderator post.staff
0 1064 randyzwitch 5278 True False True
1 -1 system 1294 True True True
2 4771 snehankekre 931 True True True
3 5621 Caroline 864 True True True
4 706 andfanilo 786 False True True
5 1108 Charly_Wargnier 610 True True True
6 11947 blackary 472 False True True
7 6 thiago 300 False True True
8 1326 okld 233 False True True
9 2 tc1 179 True True True
10 18 tim 157 False True True
11 2511 Jessica_Smith 132 False True True
12 -2 streamlitbot 122 True False True
13 686 arnaud 102 True True True
14 2064 kmcgrady 98 False True True
15 3241 jrieke 90 False True True
16 4146 dataprofessor 46 False True True
17 228 kantuni 45 False True True
18 14194 tonykip 25 True True True
19 3819 vdonato 24 False True True
20 15351 StreamlitTeam 15 True True True
21 13976 jcarroll 12 False True True
22 16008 Alexandru_Toader 8 False True True
23 10717 kseniaanske 2 True True True
# Visualize as three divisions
import seaborn as sns

summary_df['response_amount'] = pd.qcut(summary_df['count'], 3, labels=['small', 'medium', 'big'])
sns.set_style('white')
sns.catplot(data=summary_df,
            x='count', y='post.username',
            col='response_amount', col_order=['big', 'medium', 'small'],
            kind='bar', sharey=False, sharex=False,
            color=sns.color_palette("flare_r")[0],
)
sns.despine(left=True, bottom=True)

The two top responders are by far the most active, and since the second of them is a bot, randyzwitch is the only real leader here.

Responded topics distribution

The distribution of responses over the whole community lifetime.

# Drop bots, add flags and convert datetime
posts_df = posts_df[~posts_df['post.username'].isin(['system', 'streamlit'])]
posts_df['is_moderator'] = posts_df[['post.staff', 'post.moderator','post.admin']].any(axis=1)
posts_df['post.created_at'] = pd.to_datetime(posts_df['post.created_at'])

# Figure out whether the first posts were responded and when
responded = posts_df.groupby('topic.id').aggregate({'is_moderator': 'any', 'post.created_at': 'min'})
responded.rename(columns={'is_moderator': 'is_responded'}, inplace=True)
responded['each_month'] = responded['post.created_at'].dt.strftime('%Y-%m')

# Visualize distribution
sns.set_style('white')
hist = sns.histplot(responded, x='each_month', hue='is_responded', multiple='stack')
sns.despine(left=True, bottom=True)
hist.tick_params(left=False)
last_pos = hist.get_xticks()[-1]
hist.set_xticks([0, last_pos],
                labels=[responded['each_month'].iloc[0],
                        responded['each_month'].iloc[-1]]
)

Findings:

  • Interestingly, the right part of the picture has lots of unresponded topics.
  • The response frequency decreases toward the right part.
  • Almost all topics on the left have responses. I guess they are simply closed after a long while.

First response delay stats

How fast do moderators respond to a new topic?

posts_df['is_first_post'] = posts_df['post.post_number'] == 1
first_posts = posts_df[(posts_df['is_first_post']) & (~posts_df['is_moderator'])].sort_values('topic.id')
first_responses = posts_df[posts_df['is_moderator']].sort_values('post.created_at').groupby('topic.id', as_index=False).first()

# Set indexes for joining
first_posts.set_index('topic.id', verify_integrity=True, inplace=True)
first_responses.set_index('topic.id', verify_integrity=True, inplace=True)
responded = first_posts.join(first_responses, how='inner', lsuffix='.posts', rsuffix='.responses')
responded = responded[['post.username.posts', 'post.created_at.posts', 'post.username.responses', 'post.created_at.responses']]
responded['first_response_delay'] = responded['post.created_at.responses'] - responded['post.created_at.posts']

responded['first_response_delay'].describe()
first_response_delay
count 5584
mean 145 days 04:17:46.949290294
std 234 days 06:39:18.775907688
min 0 days 00:00:01.231000
25% 0 days 05:39:14.983000
50% 1 days 18:19:36.327500
75% 393 days 23:08:31.948750
max 1093 days 14:33:57.247000

The dispersion is too large. The dataset should be trimmed to compute a better fit, due to the huge difference between the 50% quantile and both the 75% and max measurements.

Three quantile ranges are inspected separately: min - 50%, 50% - 75%, and 75% - max. First, check the count in each.

pd.qcut(responded['first_response_delay'],
        [0, .5, .75, 1],
        labels=['min - 50%', '50% - 75%', '75% - max'])\
  .value_counts()
first_response_delay
min - 50% 2792
50% - 75% 1396
75% - max 1396

Each of them contains a significant number of responses. Still, I think the responses given within 393 to 1093 days shouldn't be considered, as they are abnormally late and of zero value to the questioners.

The raw timedelta isn't a convenient unit for reporting, so it is converted to hours.

# Take 0.5 and hours
responded['first_response_delay_hours'] = responded['first_response_delay']  / pd.Timedelta('1 hour')
responded_lte_50 = responded.loc[responded['first_response_delay'] <=  responded['first_response_delay'].quantile(0.5)]

Now that the representative sample responded_lte_50 has been filtered, let's visualize its distribution and check its descriptive statistics.
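Before plotting, the descriptive statistics of the trimmed sample can be refreshed with a one-liner (output omitted here):

# Descriptive statistics of the trimmed sample, now expressed in hours
responded_lte_50['first_response_delay_hours'].describe()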

import matplotlib.pyplot as plt

sns.set_theme(style='white')
sns.set_style({
    'axes.spines.left': False,
    'axes.spines.bottom': False,
    'axes.spines.right': False,
    'axes.spines.top': False
})

f, axs = plt.subplots(1, 2)
sns.stripplot(responded_lte_50, 
              x='first_response_delay_hours', y='post.username.responses',
              hue='post.username.responses', palette='deep',
              legend=False, jitter=0.3, ax=axs[0])
box = sns.boxplot(responded_lte_50,
                  x='first_response_delay_hours', y='post.username.responses',
                  palette='deep', ax=axs[1])
box.yaxis.set_ticklabels('')
box.yaxis.set_label_text('')

Findings:

  • The majority of responses arrive within ~25 hours, i.e. about one day.
  • A few come even later.

Common delay stats

The common stats serve as a summary of those above.

sns.set_theme(style='dark')
box = sns.boxenplot(responded_lte_50, x='first_response_delay_hours', color='xkcd:mauve', linewidth=0.5)
# Add the 25%, median and 75% quantiles as extra x-axis ticks
locs = box.xaxis.get_ticklocs().tolist()
agg_ticks = responded_lte_50['first_response_delay_hours'].agg([lambda s: s.quantile(0.25),
                                                                'median',
                                                                lambda s: s.quantile(0.75)])
for agg in reversed(agg_ticks.to_list()):
    locs.insert(1, round(agg, 2))
box.xaxis.grid(True)
box.xaxis.set_ticks(locs)

On the Streamlit community forum, 5.6 hours is the median time to get a response from a moderator, with the middle half of first responses arriving within 1.59 - 15.86 hours.