Building a fully automated Medium stats pipeline to track my writing performance
Combining Prefect, Selenium, Pandas and Metabase to measure and manage my writing success on Medium
On October 30th, I took a day off from work to virtually attend the Project A Knowledge Conference and expand my knowledge in marketing, data/AI and sales through its many practical talks. What started as a single talk grew into a bigger project idea that I want to share in this blogpost.
Motivation and Background
The project was mostly inspired by two particular talks from the conference, which helped me come up with the general idea and architecture:
1) Modern data stacks
This talk presented the most popular options for data ingestion, storage, transformation and visualization, as well as some examples of how Project A portfolio companies built their data stacks. During the talk they also shared two excellent articles that I can highly recommend:
This talk inspired me to finally build a fully automated end-to-end data pipeline that I had been dreaming of for quite some time.
2) From zero to hero — A marketing sophistication framework
In this talk, the authors presented six marketing sophistication stages and the respective approaches used in acquisition and engagement, analytics setup, tool infrastructure and organisation.
This talk pushed the idea that there are better ways to measure, promote and understand the “marketing” impact of my Medium posts. In the end, I realized that the available stats were interesting but not detailed or flexible enough to let me understand how well my Medium posts perform over time, or whether certain channels or trends yield better results.
One of my favorite quotes from Peter Drucker sums up the problem that I wanted to solve quite well:
You can’t manage what you don’t measure. — Peter Drucker
Hence, I started this project to better measure and manage my Medium writing.
Building the Architecture
Since this project is one of the more complex ones, I used my favorite visualization tool draw.io to show you a high-level overview of the architecture that I came up with:
In the subsequent parts, I explain why I picked these frameworks and libraries and how everything works in a bit more detail.
1. Scraping Medium Stats with Selenium
In order to get the stats out of Medium, I chose Selenium, mainly because of this:
- Complex login via Google OAuth/authenticator app, or with an alternative login link sent to your inbox
While Selenium is less performant than other scraping frameworks like Scrapy, you can still run it headless (without a browser window opening up), and since the data I scrape on a daily basis is quite small, performance did not matter that much.
For a more detailed perspective on which scraping framework to choose, have a look at my previous blog post, A step-by-step guide for creating an authentic data science portfolio project.
In my get_medium_stats.py script, I first go to Medium to trigger an email with the login link to my inbox. Another script then fetches the login link from my inbox and passes it to Selenium to log in. In the authenticated session, I first get all the main stats from the stats summary page and then loop through every post to get more detailed stats like external views:
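The two steps above can be sketched roughly like this. This is a simplified illustration, not my actual script: the mailbox credentials, the link pattern and the stats URL are assumptions, and the Selenium part only shows the navigation skeleton.

```python
import email
import imaplib
import re

# Assumed pattern for Medium's emailed sign-in links (illustrative only)
MEDIUM_LINK_PATTERN = re.compile(r"https://medium\.com/m/callback/email\S+")


def extract_login_link(body: str):
    """Pull the Medium sign-in link out of an email body, if present."""
    match = MEDIUM_LINK_PATTERN.search(body)
    return match.group(0) if match else None


def fetch_login_link(host: str, user: str, password: str):
    """Fetch the latest Medium sign-in email from the inbox and extract the link.
    Assumes a plain-text email body for simplicity."""
    with imaplib.IMAP4_SSL(host) as imap:
        imap.login(user, password)
        imap.select("INBOX")
        _, data = imap.search(None, '(FROM "noreply@medium.com")')
        latest_id = data[0].split()[-1]
        _, msg_data = imap.fetch(latest_id, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        body = msg.get_payload(decode=True).decode("utf-8", errors="ignore")
        return extract_login_link(body)


def scrape_stats(login_link: str) -> str:
    # Selenium is imported inside the function so the helpers above
    # remain usable without a browser driver installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(login_link)                      # authenticates the session
        driver.get("https://medium.com/me/stats")   # summary stats page
        # ...locate the stats table and loop over per-post pages here...
        return driver.page_source
    finally:
        driver.quit()
```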
2. Setting up the Extract, Transform Load Pipeline with Pandas
With the scraped data, I used pandas to set up a very simple ETL pipeline that extracts data from all CSVs, transforms them and loads them into an SQLite database. With my learnings from the Udacity Data Engineering nanodegree, I set up a star schema model, with one dimension table for posts and two fact tables for post stats and external views:
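In outline, the pipeline could look like the sketch below. The table and column names mirror the star schema described above, but the exact fields are illustrative, not my actual schema.

```python
import sqlite3
from pathlib import Path

import pandas as pd


def run_etl(csv_dir: str, db_path: str) -> None:
    """Extract all scraped CSVs, transform them into a star schema,
    and load them into SQLite. Column names are illustrative."""
    # Extract: concatenate every daily CSV into one frame
    frames = [pd.read_csv(p) for p in sorted(Path(csv_dir).glob("*.csv"))]
    raw = pd.concat(frames, ignore_index=True)

    # Transform: one dimension table and two fact tables
    dim_posts = raw[["post_id", "title", "published_at"]].drop_duplicates("post_id")
    fact_stats = raw[["post_id", "scraped_at", "views", "reads", "claps"]]
    fact_external = raw[["post_id", "scraped_at", "external_views"]]

    # Load: write each table into the SQLite database
    with sqlite3.connect(db_path) as conn:
        dim_posts.to_sql("posts", conn, if_exists="replace", index=False)
        fact_stats.to_sql("post_stats", conn, if_exists="append", index=False)
        fact_external.to_sql("external_views", conn, if_exists="append", index=False)
```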
To learn more about how I got started with Python Programming as a business student, check out this blogpost.
3. Orchestrating workflows with Prefect
To make this project really work, the orchestration and scheduling of the scraping and ETL jobs were key. While I learned some of the basics of Airflow during my Udacity Nanodegree, I found Airflow difficult and annoying to set up and run.
After some more research, I settled on Prefect, a newer and more modern alternative to Airflow. With a really sleek UI and excellent documentation, it took me very little time to schedule and automate my first workflow.
Prefect also provides other really nice visualisations, such as your flow run schematics. Here’s an example of the script that I described in step 1:
Along the way, I really appreciated the help of the Prefect Slack community and Jimmy Le, who helped me with some Selenium issues in combination with Prefect. He also wrote a very practical blog post, How to Automate Python Workflows in Prefect.
4. Visualization of Results
For the visualization of results, I chose Metabase, an open-source BI tool that I work with every day at N26. The tool is not only free, but it also lets you set up some very nice-looking dashboards with filters in a very short time:
I designed the Dashboard with two sections and two filters in mind:
1. Total Stats
This section helps me to see the “bigger picture” of my writing and gives me a motivating high-level view of how my main key performance indicators develop each day. I particularly love to see my total earnings (not a millionaire yet) as well as how much time Medium members spent reading my blog posts.
2. KPI daily Series Changes
In this section, I wanted to better understand how KPIs shift from day to day. With the sorted stacked bar chart, I can easily see how different KPIs perform day by day (biggest ones sorted at the bottom). As I collect more and more data, I also plan to include some weekly aggregations in the future.
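Rolling the daily snapshots up into weekly figures is straightforward with pandas. A minimal sketch, assuming a frame of daily KPI rows (the column names are illustrative):

```python
import pandas as pd


def weekly_kpis(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily KPI snapshots into calendar-week totals
    (weeks ending on Sunday, pandas' default for 'W')."""
    daily = daily.assign(date=pd.to_datetime(daily["date"]))
    return (
        daily.set_index("date")
        .resample("W")[["views", "reads", "claps"]]
        .sum()
    )
```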
An exciting new feature that came with the latest release of Metabase is the possibility to set up click behaviour: I can click on one segment of the stacked bar chart to set the filter for that post, and the whole dashboard updates:
Outlook and some even more exciting ideas
This dashboard is the first part of a much bigger automation plan. A key chart that I wanted to share is the external views chart:
In the coming weeks, I am planning to advance my Twitter automation script on Prefect even further. Currently, I check every morning at 8 am whether anyone has shared my Medium posts on Twitter. If so, my Twitter profile automatically likes those tweets.
A smart Twitter Bot
For the next big automation step, I plan to scrape tweets about relevant topics that I have written Medium posts about before. For example, if someone asks for advice on a data science portfolio project, I want to automatically share my blog post A step-by-step guide for creating an authentic data science portfolio project on Twitter.
In order not to spam people, the plan is for Prefect to send all scraped tweets on a certain topic to my Slack account at an hourly interval. Once I get a message, I will classify and label the tweet on my phone and send the result back into a database. Another worker then checks my classification and shares my blog post via Twitter if there’s a match. Finally, with (hopefully) enough labeled data, I will also build an NLP classifier that predicts whether those tweets are relevant for sharing my writing.
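Once enough tweets are labeled, the classifier could start out as a simple TF-IDF plus logistic regression baseline, for example with scikit-learn. This is a hypothetical sketch of that future step, not something I have built yet; the function name and pipeline choices are my own placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_relevance_model(tweets, labels):
    """Train a baseline model predicting whether a tweet is worth
    replying to with one of my blog posts (1 = relevant, 0 = not)."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # word and bigram features
        LogisticRegression(max_iter=1000),
    )
    model.fit(tweets, labels)
    return model
```

A trained model could then score fresh tweets, e.g. `model.predict(["any tips for a data science portfolio project?"])`, and only the confident positives would go to Slack for manual review.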
If I ever find time to build this, I could then directly see the impact of automation and how it influences my KPIs in my Metabase dashboard.
And that’s it, the grand-marketing-automation-ai-scheduler-bot script that really ties together some of my favorite learnings and tools to build something useful that solves my own problems.