Explore and visualize the parser events and performance. Pandas. Seaborn. KLib. ADTK
At first I was going to write an article about exploring the Plotly community, but I quickly changed my mind in favor of investigating how the multithreaded parser I had implemented for fetching that community data, discourse-parser, actually performed. Did it have performance issues? Did the multithreaded requesting and writing of data work as expected? Did parsing complete correctly, without anomalies?
While the parser was running it produced a lot of logs. Each log record is an event of a specific type representing what a thread was doing at that moment, for example `data.get` for requesting the community website. Those log records are what we will explore, visualize, and draw insights from. They are also time series.
For a broad view on the concerns the article is split into the following sections:
- Logs Overview.
- Continuous Running Index.
- Events Frequency.
- Network Latency Time Series.
- Write on Disk Time Series.
Logs Overview and Sample
The log file `parser-latest-log.tar.bz2` produced by the parser is uncompressed and read. Each log record inside is nothing more than a JSON line generated by the `structlog` logging library; I like using that library for event tracking. Let's print a sample of the records.
# Dataset
import tarfile
import json
import pandas as pd

# Extract the log file from the archive and parse each JSON line into a record
with tarfile.open('files/parser-latest-log.tar.bz2') as tfile:
    file = tfile.extractfile('parser-latest-log.txt')
    logs_data = file.read().decode()
json_lines = [json.loads(line) for line in logs_data.splitlines()]
df = pd.DataFrame.from_records(json_lines)
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Table
df.sample(5).to_markdown(index=False)
| event | timestamp | thread_name | target | args | path | exc_info |
|---|---|---|---|---|---|---|
| data.get | 2023-03-12 11:55:59.029086+00:00 | Consumer-11 | topic | {'page': '68', 'id': 67797} | nan | nan |
| data.get | 2023-03-12 12:23:21.008310+00:00 | Consumer-6 | topic | {'page': '232', 'id': 52642} | nan | nan |
| data.saved | 2023-03-12 12:59:10.965976+00:00 | Consumer-12 | topic | {'page': '447', 'id': 34298} | latest/order-created/ascending-False/page-447/topic-34298-2023-03-12T11:44:38+00:00.json | nan |
| data.saved | 2023-03-12 11:59:56.268552+00:00 | Consumer-2 | topic | {'page': '91', 'id': 65603} | latest/order-created/ascending-False/page-91/topic-65603-2023-03-12T11:44:38+00:00.json | nan |
| task.get | 2023-03-12 11:46:53.948818+00:00 | Producer-2 | latest | nan | nan | nan |
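For reference, records shaped like the ones above could be emitted with a `structlog` configuration along these lines. This is only a minimal sketch based on the sample fields, not the parser's actual setup.

# Sketch: render each event as a JSON line with an ISO timestamp,
# similar in shape to the sampled records (not the parser's real config)
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt='iso', utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()
# Prints something like:
# {"target": "topic", "args": {"page": "68", "id": 67797}, "event": "data.get", "timestamp": "..."}
log.info('data.get', target='topic', args={'page': '68', 'id': 67797})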
The handy library klib gives a brief dataset overview by visualizing the categorical columns (a few columns in the sample table): the most and least frequent value counts and the value density per category. There are three categories we need to plot: `event`, `thread_name`, and `target`. The others, `args`, `path`, and `exc_info`, exist for debugging purposes; `timestamp` is considered in the other sections.
import matplotlib.pyplot as plt
import klib

# Overview of the three categorical columns: frequent values and their density
klib.cat_plot(df[['event', 'thread_name', 'target']])
plt.suptitle('')  # drop the figure-level title
As mentioned, there are two blocks representing each column: frequency and density. Density is the bottom block filled in with thin lines; those lines are the most and least frequent values from the upper block. Zoom in on the image to see the text details.
A few interesting facts from the image:
- The `event` column's most frequent values fill its vertical bar with teal color almost completely, 77.0%.
- The `thread_name` column's most frequent values are not so dominant among all the category values: its vertical bar contains a lot of white, which means it is filled in with the other values.
- The `target` bar has only two unique values, and the most frequent one, `topic`, fills 93.6% of the bar.
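To read these shares as numbers rather than from the bars, the normalized value counts can be printed with plain pandas; a small companion check using the `df` loaded above.

# Normalized value counts per categorical column, to read the shares numerically
for col in ['event', 'thread_name', 'target']:
    print(df[col].value_counts(normalize=True).round(3), '\n')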
Continuous Running Index
The index is simply a number that is expected to stay constant; it doesn't matter what the value is. Constant means there were no interruptions while the parser was running. The basic way to visualize such a number is just a line, a straight line. So, is that line straight?
# Dataset
# Exclude the main thread; keep only worker threads
threads_df = df[~df['thread_name'].str.startswith('Main')]
# Count how many events of each type occurred per minute
events_per_min_df = threads_df.set_index('timestamp')\
    .pivot(columns='event', values='event')\
    .resample('min')\
    .count()\
    .loc[:, ['task.get', 'data.get', 'data.received', 'data.saved']]
# Gaps between consecutive per-minute ticks
ts_diff_df = events_per_min_df.reset_index()\
    .diff()\
    .loc[:, 'timestamp']
ts_diff_df = ts_diff_df.dt.seconds / 60  # Convert to minutes
ts_diff_df.rename('Diff in minutes', inplace=True)
# Plot
import seaborn as sns

sns.set_theme(style='white', context={'axes.labelsize': 16, 'axes.titlesize': 16, 'legend.fontsize': 16})
sns.set_palette('pastel')
sns.set_style({'axes.spines.left': False, 'axes.spines.bottom': False, 'axes.spines.right': False, 'axes.spines.top': False})
line = sns.lineplot(data=ts_diff_df, lw=4)
last_pos = line.get_xticks()[-2]
line.set_xticks([1, last_pos],
                labels=[events_per_min_df.index.min().strftime('%H:%M:%S'),
                        events_per_min_df.index.max().strftime('%H:%M:%S')])
line.figure.suptitle('Continuous Index', size=20)
line.figure.subplots_adjust(top=.9)
Throughout the whole running time, `Timedelta('0 days 02:20:00')`, there were no interruptions: the straight line at the constant value 1 means that the difference between every pair of consecutive ticks is 1 minute, no more, no less.
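The same conclusion can be asserted programmatically; a small check on the `ts_diff_df` series computed above (the first value is NaN and is dropped).

# Every gap between consecutive per-minute ticks should be exactly one minute
gaps = ts_diff_df.dropna()
assert (gaps == 1).all(), 'the parser run was interrupted somewhere'
print(f'checked {len(gaps)} gaps, all equal to 1 minute')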
Events Frequency
`event` is the most valuable category in the parser's work. We could count and interpret the logs manually: how many records of each event type were produced, by whom, when, etc. That works fine when there are only a few log records, which is not our case. There are thousands of records: the dataset shape `df.shape[0]` is 112880.
Frequency Per Target
To get a general picture of all the events we can group them into 5-minute bins (groups) and split them per `target`, visualizing the dataset in two dimensions at the same time. Grouping into small time ranges (bins) and counting event frequency within them makes the visualization easy to interpret. So, we will see how often each `event` type is produced per `target` within 5-minute bins.
Only the events `task.get`, `data.get`, `data.received`, and `data.saved` are kept, as the rest are about technical tracking and errors. Errors should be considered separately; I have manually checked their count and there are only a couple.
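Before the full pipeline below, here is a simpler sketch of the same 5-minute binning idea using `pd.Grouper`; it ignores the per-thread dimension and the custom bin labels that the actual code adds.

# Simplified 5-minute binning: count events per (5-minute bin, target, event)
kept_events = ['task.get', 'data.get', 'data.received', 'data.saved']
per_bin = (df[df['event'].isin(kept_events)]
           .groupby([pd.Grouper(key='timestamp', freq='5min'), 'target', 'event'])
           .size()
           .rename('count')
           .reset_index())
per_bin.head()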
# Dataset
sub_df = df.drop(columns=['args', 'path', 'exc_info'])\
    .dropna(subset=['target'])\
    .loc[:, ['timestamp', 'event', 'target', 'thread_name']]\
    .sort_index(axis=1)
# One row per occurred event, keyed by its thread_name, target and event type
events_occured_df = sub_df.pivot(index='timestamp',
                                 columns=['thread_name', 'target', 'event'],
                                 values='event')\
    .mask(lambda df: df.notna(), 1)\
    .fillna(0)\
    .melt(ignore_index=False)\
    .query('value == 1')\
    .drop(columns='value')\
    .sort_index()
# 5-minute bins covering the whole running time, labelled '5th', '10th', ...
cut_bins = pd.date_range(start=events_occured_df.index.min(),
                         end=events_occured_df.index.max() + pd.Timedelta(minutes=5),
                         freq='5min')
cut_labels = list(f'{i}th' for i in range(5, len(cut_bins) * 5, 5))
events_reduced_df = events_occured_df[~events_occured_df['event'].isin(['do', 'done', 'stop.blank', 'error.http'])].copy()
events_reduced_df.loc[:, '5min bin'] = pd.cut(events_reduced_df.index,
                                              bins=cut_bins,
                                              labels=cut_labels,
                                              include_lowest=True)
# Plot
displot = sns.displot(data=events_reduced_df,
                      x='5min bin',
                      hue='event',
                      col='target',
                      row='event',
                      multiple='stack')
loc, labels = plt.xticks()
displot.set_xticklabels(labels, rotation=45)
displot.fig.suptitle('Events Frequency Over Targets', size=20)
displot.fig.subplots_adjust(top=.9)
Nice, right away we see:
- The event amount for the target `latest` is an order of magnitude smaller than for `topic`.
- The `task.get` amount is bigger than the other event types at both targets, but this plot type doesn't show that well. It can be depicted better.

Visualizing the event amounts as ratios could help convey how much bigger `task.get` is.
with sns.axes_style('whitegrid'):
    displot = sns.displot(data=events_reduced_df,
                          x=events_reduced_df['5min bin'].apply(lambda val: val.removesuffix('th'))
                              .astype(int)
                              .rename('5min int bin'),
                          hue='event',
                          col='target',
                          multiple='fill',
                          kind='kde')
    displot.fig.suptitle('Events Frequency Ratio', size=20)
    displot.fig.subplots_adjust(top=.9)
Much clearer now:
- `task.get` for the target `topic` spans roughly 1.0 - 0.7, i.e. about 0.3 of the whole. It looks balanced.
- `task.get` for the target `latest` spans roughly 1.0 - 0.35; 0.65 is quite a big share.
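To back this visual reading with numbers, the per-target event shares can be computed directly; a sketch based on `events_reduced_df` from above.

# Share of each event type within every target
event_share = (events_reduced_df.groupby('target')['event']
               .value_counts(normalize=True)
               .rename('share')
               .reset_index())
event_share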
Frequency Per Thread
The dataset is plotted in a similar way to the frequency above, but along the per-thread dimension: how often each `event` type is produced per `thread_name` within 5-minute bins.
displot = sns.displot(data=events_reduced_df, x='5min bin', hue='event', col='thread_name', col_wrap=2, multiple='stack')
loc, labels = plt.xticks()
displot.set_xticklabels(labels, rotation=45)
displot.fig.suptitle('Events Frequency Over Threads', size=20)
displot.fig.subplots_adjust(top=.9)
The picture shows that each thread's workload looks the same. That's a good observation: it means the parser was working as expected.
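That impression can be confirmed numerically by comparing the per-thread event counts; a small sketch over the same `events_reduced_df`.

# Event counts per thread: similar rows support the "same workload" reading
per_thread = (events_reduced_df.groupby('thread_name')['event']
              .value_counts()
              .unstack(fill_value=0))
print(per_thread)
print(per_thread.std())  # low spread across threads means an even workload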
Network Latency Time Series
The network latency is the time delta between the `data.get` and `data.received` events. To compute all the deltas, the dataset has to be sorted by the two columns `thread_name` and `timestamp` to avoid mismatches when, for example, an event `data.get` is followed by another `data.get` belonging to a different thread.
The valid event order:
1. `task.get`
2. `data.get`
3. `data.received`
4. `data.saved`
Check the events order validity
To check that the order after sorting is valid, two columns are added: the previous and the next event. Once they are added, we can match each event with its previous and next one against the valid event order above to state whether it is valid.
# Dataset
order_state_df = events_reduced_df.reset_index()\
    .sort_values(['thread_name', 'timestamp'], ascending=True)\
    .assign(prev_event=lambda df: df['event'].shift(1),
            next_event=lambda df: df['event'].shift(-1))
# Table
order_state_df[order_state_df['thread_name'] == 'Consumer-1']\
    .iloc[10:16]\
    .loc[:, ['timestamp', 'event', 'prev_event', 'next_event']]
| timestamp | event | prev_event | next_event |
|---|---|---|---|
| 2023-03-12 11:44:48.148145+00:00 | data.received | data.get | data.saved |
| 2023-03-12 11:44:48.150142+00:00 | data.saved | data.received | task.get |
| 2023-03-12 11:44:48.150436+00:00 | task.get | data.saved | data.get |
| 2023-03-12 11:44:48.150829+00:00 | data.get | task.get | data.received |
| 2023-03-12 11:44:52.154315+00:00 | data.received | data.get | data.saved |
| 2023-03-12 11:44:52.155328+00:00 | data.saved | data.received | task.get |
The sample looks valid. Now define a function that matches against the valid event order, assigning the results to a new column `is_valid`.
# Dataset
def is_order_valid(series):
    if series['event'] == 'data.get':
        return series['prev_event'] == 'task.get'
    if series['event'] == 'data.received':
        return series['prev_event'] == 'data.get'
    if series['event'] == 'data.saved':
        return series['prev_event'] == 'data.received'
    return None

order_state_df['is_valid'] = order_state_df.apply(is_order_valid, axis=1)
# Table
order_state_df[['timestamp', 'event', 'prev_event', 'is_valid']].iloc[10:16]
| timestamp | event | prev_event | is_valid |
|---|---|---|---|
| 2023-03-12 11:44:48.148145+00:00 | data.received | data.get | True |
| 2023-03-12 11:44:48.150142+00:00 | data.saved | data.received | True |
| 2023-03-12 11:44:48.150436+00:00 | task.get | data.saved | |
| 2023-03-12 11:44:48.150829+00:00 | data.get | task.get | True |
| 2023-03-12 11:44:52.154315+00:00 | data.received | data.get | True |
| 2023-03-12 11:44:52.155328+00:00 | data.saved | data.received | True |
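As a design note, the same predecessor check can be written with a lookup table, which keeps the valid order in one place. This is an equivalent sketch, not the code used above.

# Expected predecessor for each event in the valid order
EXPECTED_PREV = {
    'data.get': 'task.get',
    'data.received': 'data.get',
    'data.saved': 'data.received',
}

def is_order_valid_by_table(series):
    expected = EXPECTED_PREV.get(series['event'])
    # Events without an entry (e.g. task.get) stay unvalidated, as in the original
    return None if expected is None else series['prev_event'] == expected

# order_state_df['is_valid'] = order_state_df.apply(is_order_valid_by_table, axis=1)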
`is_valid` marks rows with a boolean flag. Now we can count the dataset grouped both with and without the `is_valid` flag, just to visualize whether the group counts differ.
with sns.plotting_context({'ytick.labelsize': 16, 'xtick.labelsize': 0, 'xtick.major.width': 0, 'axes.labelsize': 0}):
    fig, axes = plt.subplots(2, 1, sharey=False)
    event_counts = order_state_df['event'].value_counts(dropna=False)\
        .reset_index()
    barplot = sns.barplot(data=event_counts,
                          ax=axes[0],
                          y='event',
                          x='count',
                          color=sns.colors.xkcd_rgb['light blue'])
    axes[0].title.set_text('Unvalidated Event Counts')
    for i in barplot.containers:
        barplot.bar_label(i, fontsize=16)
    valid_event_counts = order_state_df[['event', 'is_valid']]\
        .value_counts(dropna=False)\
        .reset_index()\
        .assign(group=lambda df: df['event'].astype(str) + ' - ' + df['is_valid'].astype(str))
    barplot = sns.barplot(data=valid_event_counts,
                          ax=axes[1],
                          y='group',
                          x='count',
                          color=sns.colors.xkcd_rgb['light blue'])
    axes[1].title.set_text('Validated Event Counts')
    for i in barplot.containers:
        barplot.bar_label(i, fontsize=16)
Obviously, all the events are valid, since all the counts on the top and bottom charts are equal. Consequently, the network latency can be computed with no data filtering or cleaning: the whole dataset matches the valid event order, so the time deltas computed next are correct.
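The same conclusion can also be checked without reading the bars, by cross-tabulating the events against the `is_valid` flag; a quick sketch over `order_state_df`.

# Counts of valid/invalid rows per event type (task.get has no flag and is excluded)
print(pd.crosstab(order_state_df['event'], order_state_df['is_valid']))
# Single boolean: every flagged event passed the order check
print(order_state_df['is_valid'].dropna().all())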
Latency Overall
Subtract the `data.get` timestamp from the `data.received` one. To do this, a new column `prev_event_timestamp` is added that contains the timestamps of `data.get`. Latency is measured in seconds.
# Dataset
latency_diffs = order_state_df.assign(prev_event_timestamp=lambda df: df['timestamp'].shift(1))\
.loc[