I’m back!
With the whole lockdown situation, I had a little more time to explore a space I’m in love with: Machine Learning.
I feel like data management and intelligence are going to be at the heart of growth hacking/engineering, so I was looking for an opportunity to dabble in it.
Growth engineering and machine learning are fields that require very similar skills:
- Mostly Python development
- Data processing
- Analytical and critical thinking
After going through a few very practical tutorials on kaggle.com/learn, I was able to get my hands on my first few projects.
The audio route
The idea was suggested by a friend who is really anti-ads and was getting super annoyed with sponsored content in his podcasts.
He told me about SponsorBlock, which is a crowdsourced ad skipper for YouTube.
When you install their extension, you’ll see a highlighted portion corresponding to sponsored content.
You can also label content yourself for other users to skip if the video isn’t already labeled.
What’s really cool about SponsorBlock is that their database is completely open!
You get access to every single labeled video, with the start and end time of the sponsors.
Their SQL database is updated often, so I used it to answer my next question:
Is it possible to detect sponsored content in YouTube videos?
I didn’t know where to start, but I knew I had two paths:
- Train a model on the audio
- Train a model on the transcripts
I felt like audio would carry much more information than plain text (music, cadence…), so I started down that road.
Without knowing anything about the subject, I searched around for audio classification algorithms and found my way to the Panotti repo, which is based on a CNN (Convolutional Neural Network).
It was used to successfully detect 12 different guitar effects with 99.7% accuracy, making only 11 mistakes on 4000 test examples!
In my case I only needed two classes (sponsor and not sponsor).
The results were mixed:
- When trained on a single podcast, the model reached 99% accuracy on the training set and 95% on the test set, meaning it detected commercial portions correctly on 95% of the held-out examples the network hadn’t seen.
- When fed different podcasts, the model wasn’t able to detect commercials reliably, with prediction confidence staying below 60%.
I figured that I simply needed to train the model on multiple different podcasts. The problem was that the data was getting super tough to handle: the dataset for 300 podcasts was about 50 GB :(
So, to get more “information” out of lighter data, I finally went with the transcripts!
The captions route
In the meantime, I learnt about SponsorBlock, which made everything easier.
I also looked around for NLP (natural language processing) tutorials and found this excellent repo about sentiment analysis.
Once I went through it, I decided to go for the Transformer model, which is based on BERT, an NLP model developed by Google.
To build the dataset, I still downloaded the videos with youtube-dl, except this time I fetched the automatic captions when they were available, giving me about 80k examples (it took about 35 hours on a 100 Mb/s internet connection).
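A minimal sketch of what the caption-fetching step could look like, assuming youtube-dl is installed and on the PATH. The helper names `build_caption_command` and `fetch_captions` are made up for illustration, and skipping the media download is an assumption too (the original run downloaded the videos as well).

```python
import subprocess
from typing import List


def build_caption_command(video_id: str, lang: str = "en") -> List[str]:
    """Build a youtube-dl command that asks for auto-generated captions."""
    return [
        "youtube-dl",
        "--skip-download",        # assumption: fetch captions only, no media
        "--write-auto-sub",       # request automatically generated subtitles
        "--sub-lang", lang,
        "--sub-format", "vtt",
        "-o", "%(id)s.%(ext)s",   # name the output file after the video ID
        f"https://www.youtube.com/watch?v={video_id}",
    ]


def fetch_captions(video_id: str) -> bool:
    """Run the command and report whether youtube-dl exited cleanly.

    A clean exit does not guarantee that captions exist for the video,
    so the caller should still check that a .vtt file was written.
    """
    result = subprocess.run(build_caption_command(video_id), capture_output=True)
    return result.returncode == 0
```

Looping `fetch_captions` over the video IDs from the SponsorBlock database would give one caption file per video whenever automatic captions are available.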
To keep things reliable, I only trained on ads longer than 10 s and shorter than 5 minutes, and I kept only the videos that had a single ad.
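Those two rules (ad length between 10 s and 5 min, one ad per video) can be expressed as a single SQL query. The sketch below runs against a tiny in-memory SQLite table; the table name `sponsorTimes` and the columns `videoID`, `startTime`, `endTime` are assumptions for illustration and may not match the real SponsorBlock schema.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a toy table with a hypothetical SponsorBlock-like layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sponsorTimes (videoID TEXT, startTime REAL, endTime REAL)"
    )
    conn.executemany(
        "INSERT INTO sponsorTimes VALUES (?, ?, ?)",
        [
            ("vidA", 30.0, 95.0),    # single 65 s ad: kept
            ("vidB", 10.0, 15.0),    # 5 s ad: too short
            ("vidC", 0.0, 400.0),    # 400 s ad: too long
            ("vidD", 20.0, 80.0),    # two ads in one video: dropped
            ("vidD", 300.0, 360.0),
        ],
    )
    return conn


# With COUNT(*) = 1 there is exactly one row per group, so MIN() simply
# returns that row's start and end times.
USABLE_ADS_QUERY = """
SELECT videoID, MIN(startTime), MIN(endTime)
FROM sponsorTimes
GROUP BY videoID
HAVING COUNT(*) = 1
   AND MIN(endTime) - MIN(startTime) > 10
   AND MIN(endTime) - MIN(startTime) < 300
ORDER BY videoID
"""


def select_usable_ads(conn: sqlite3.Connection) -> List[Tuple[str, float, float]]:
    """Return (videoID, start, end) for videos with one ad of usable length."""
    return conn.execute(USABLE_ADS_QUERY).fetchall()
```

On the toy data, only `vidA` survives the filter.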
To have a balanced dataset, I took equal parts sponsor and content from each video (if the ad was 3 min long, the content training example was also 3 min long).
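Here is one way the balancing could be sketched, assuming the captions are available as `(start, end, text)` cues. Where the content window is taken from is a guess: this version takes the window right after the ad and falls back to the one right before it. The names `text_between` and `balanced_pair` are hypothetical.

```python
from typing import List, Optional, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, caption text)


def text_between(cues: List[Cue], start: float, end: float) -> str:
    """Join the text of every cue that starts inside [start, end)."""
    return " ".join(text for cue_start, _, text in cues if start <= cue_start < end)


def balanced_pair(
    cues: List[Cue], ad_start: float, ad_end: float, video_length: float
) -> Optional[Tuple[str, str]]:
    """Return (sponsor_text, content_text) covering equal durations.

    Returns None when no content window of the same length fits
    before or after the ad.
    """
    duration = ad_end - ad_start
    if ad_end + duration <= video_length:
        content_start = ad_end                # window right after the ad
    elif ad_start - duration >= 0:
        content_start = ad_start - duration   # window right before the ad
    else:
        return None
    sponsor = text_between(cues, ad_start, ad_end)
    content = text_between(cues, content_start, content_start + duration)
    return sponsor, content
```

Each video then contributes one sponsor example and one content example of the same duration, which keeps the two classes balanced by construction.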
You can find the dataset right here: https://www.kaggle.com/anaselmhamdi/sponsor-block-subtitles-80k
The model yielded 93.79% test accuracy!
You can find the notebook I ran over here: https://www.kaggle.com/anaselmhamdi/transformers-sponsorblock/ (It’s really raw and undocumented though)
Working with the model
The next step was to automatically label a random video to see what would come out of it.
So here’s what popped up in my recommendations: https://www.youtube.com/watch?v=MlOPPuNv4Ec, a video by Linus Tech Tips.
The model correctly detected the two sponsored segments (Ting segments) but also labelled some regular content as sponsor with a high confidence level.
A few remarks about the mistakes:
- The mislabelled segments all contain upper-case letters. I kept the capitalization because I thought it would indicate a brand, but in captions, names, towns and the like show up more frequently.
- Some of them also contained brackets, which probably biases the model since [Music] and [Applause] are often in sponsored content.
- A conversation about MacBooks was flagged as sponsored content.
- I chunked the captions very roughly into 10 s parts for inference on the labelled video. I feel like I need finer chunks, and to cross-check the results between them, to get better predictions.
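A few of these remarks could be tackled with some preprocessing and overlapping chunks. The sketch below is only one possible approach: it lower-cases the text, strips bracketed tags like [Music], cuts the captions into 10 s windows that start every 5 s, and averages the scores of all windows covering each 5 s slot. The function names are hypothetical, the window is assumed to be a multiple of the step, and the scores would come from the model (made-up numbers are used in the usage note).

```python
import math
import re
from typing import Dict, List, Tuple

Cue = Tuple[float, float, str]     # (start seconds, end seconds, caption text)
Window = Tuple[int, int, str]      # (start seconds, stop seconds, cleaned text)


def normalize(text: str) -> str:
    """Lower-case the text and drop bracketed tags such as [Music]."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", text)
    return " ".join(without_tags.lower().split())


def sliding_windows(cues: List[Cue], window: int = 10, step: int = 5) -> List[Window]:
    """Cut the captions into overlapping windows of `window` seconds."""
    if not cues:
        return []
    end_of_video = max(end for _, end, _ in cues)
    windows = []
    for start in range(0, math.ceil(end_of_video), step):
        stop = start + window
        text = " ".join(t for cue_start, _, t in cues if start <= cue_start < stop)
        windows.append((start, stop, normalize(text)))
    return windows


def average_overlaps(
    windows: List[Window], scores: List[float], step: int = 5
) -> Dict[int, float]:
    """Average the scores of every window that covers each `step`-second slot."""
    slots: Dict[int, List[float]] = {}
    for (start, stop, _), score in zip(windows, scores):
        for slot in range(start, stop, step):
            slots.setdefault(slot, []).append(score)
    return {slot: sum(vals) / len(vals) for slot, vals in sorted(slots.items())}
```

For example, with three windows scored 0.25, 0.75 and 0.5, the slot starting at 5 s is covered by the first two windows and gets their average, which smooths out a single overconfident chunk.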
I feel like there’s a huge margin for improvement, both from fixing these few issues and from more macro improvements such as:
- Extending the dataset to more videos (I only had a subset of them)
- Recalculating the BERT model parameters and fine-tuning the neural network
- Implementing feedback from experienced data scientists, which I’m absolutely not
It’s really telling about how far the field has advanced that somebody like me, a complete newbie, could make something somewhat functional.
All the data and the work I found were really well documented and easy to pick up.
The toughest part was actually deploying the model as a service, since I’m not used to systems that need a lot of resources.
I tried though!
You can try my (buggy) sponsor detecting tool here. Let me know what worked for you!