Detecting sponsored content in YouTube videos

(original article published on my blog right here)

I’m back!

With the whole lockdown situation, I had a little more time to explore a field I’m in love with: machine learning.

I feel like data management and intelligence are going to be at the heart of growth hacking/engineering, so I was looking for an opportunity to dabble in it.

Growth engineering and machine learning are fields that require very similar skills:

  • Mostly Python development
  • Scraping
  • Data processing
  • Analytical and critical thinking

After going through a few very practical tutorials on the specifics, I was able to get started on my first few projects.

The audio route

The idea was suggested by a friend who is really anti-ads and was getting super annoyed with sponsored content in his podcasts.

He told me about SponsorBlock, which is a crowdsourced ad skipper for YouTube.

When you install their extension, you’ll see a highlighted portion corresponding to sponsored content.

You can also label content yourself for other users to skip if the video isn’t already labeled.

What’s really cool about SponsorBlock is that their database is completely open!

You get access to every single labeled video, with the start and end time of the sponsors.

Their SQL database is open and updated often, so I used it to answer my next question:

Is it possible to detect sponsored content in YouTube videos?

I didn’t know where to start, but I knew I had two paths:

  • Train a model on the audio
  • Train a model on the transcripts

I felt like audio would carry much more information than plain text (music, cadence…), so I started down that road.

Without knowing anything about the subject, I searched around for audio classification algorithms and found my way to the Panotti repo, which is based on a CNN (Convolutional Neural Network).

It was used to successfully detect 12 different guitar effects with 99.7% accuracy, making 11 mistakes on 4000 test examples!
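As a quick sanity check, the quoted accuracy follows directly from the error count. This is only a back-of-the-envelope sketch, and the variable names are mine:

```python
# Figures quoted above: 11 mistakes on 4000 test examples.
mistakes = 11
test_examples = 4000

# Accuracy is the share of test examples classified correctly.
accuracy = 1 - mistakes / test_examples
accuracy_pct = round(accuracy * 100, 1)
print(f"{accuracy_pct}% accuracy")
```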

In my case I only needed two classes (sponsor and not sponsor).

I followed the instructions, learned how to use youtube-dl to download podcast highlights, and scraped Radiocentre’s commercials database because, at that point, I hadn’t learned about SponsorBlock yet.

The results were mixed:

  • When trained on a single podcast, the model showed 99% accuracy on the training set and 95% on the test set, meaning it correctly classified 95% of the examples from that podcast it hadn’t seen during training.
  • When fed different podcasts, the model couldn’t detect commercials reliably: its predictions came out with less than 60% confidence.

I figured that I simply needed to train the model on multiple different podcasts. The problem was that the data was getting super tough to handle: the dataset for 300 podcasts was about 50 GB :(

So, in order to get more “information” out of lighter data, I finally went with the transcripts!

The captions route

In the meantime I learnt about SponsorBlock, which made everything easier.

I also looked around for NLP (natural language processing) tutorials and found this excellent repo about sentiment analysis.

Once I went through it, I decided to go for the Transformer model, which is based on BERT, an NLP model developed by Google.

To build the dataset, I still downloaded the videos with youtube-dl, except this time I fetched the automatic captions when they were available, giving me about 80k examples. (This took about 35 hours on 100 mb/s internet.)
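For reference, here is a minimal sketch of how fetching the automatic captions with youtube-dl could look from Python. It is not the exact script I ran: the `caption_command` and `fetch_captions` helpers, the output folder and the choice of English VTT captions are assumptions for illustration, and `--skip-download` (captions only, no video file) is my assumption too.

```python
import subprocess


def caption_command(video_id, out_dir="captions"):
    """Build a youtube-dl command that fetches auto-generated English captions."""
    return [
        "youtube-dl",
        "--write-auto-sub",      # ask for the automatically generated subtitles
        "--sub-lang", "en",      # assumption: English captions only
        "--sub-format", "vtt",
        "--skip-download",       # assumption: keep the captions, skip the video file
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]


def fetch_captions(video_id):
    """Run youtube-dl for one video and report whether it exited cleanly."""
    result = subprocess.run(caption_command(video_id), capture_output=True, text=True)
    return result.returncode == 0


# Print the command for a placeholder id instead of hitting the network here.
print(" ".join(caption_command("VIDEO_ID")))
```

Videos without automatic captions simply produce no subtitle file, so a real pipeline would also need to check that the file exists before using it.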

To keep things reliable, I only trained on ads longer than 10 seconds and shorter than 5 minutes, and I kept only the videos that had a single ad.
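The filtering rules above can be sketched in a few lines. This is an illustration rather than my actual code: the `video_id`, `start` and `end` keys are hypothetical names for the labeled segments, not SponsorBlock's real column names.

```python
from collections import Counter

MIN_AD_SECONDS = 10
MAX_AD_SECONDS = 5 * 60


def select_training_segments(segments):
    """Keep ads between 10 s and 5 min that are the only ad in their video."""
    ads_per_video = Counter(seg["video_id"] for seg in segments)
    return [
        seg
        for seg in segments
        if ads_per_video[seg["video_id"]] == 1
        and MIN_AD_SECONDS < seg["end"] - seg["start"] < MAX_AD_SECONDS
    ]


# Made-up segments, times in seconds.
example = [
    {"video_id": "a", "start": 30.0, "end": 95.0},    # kept: 65 s, single ad
    {"video_id": "b", "start": 0.0, "end": 5.0},      # dropped: too short
    {"video_id": "c", "start": 10.0, "end": 60.0},    # dropped: video has two ads
    {"video_id": "c", "start": 300.0, "end": 360.0},  # dropped: video has two ads
    {"video_id": "d", "start": 0.0, "end": 400.0},    # dropped: too long
]
kept = select_training_segments(example)
print([seg["video_id"] for seg in kept])
```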

To get a balanced dataset, I took equal parts sponsor and content from each video (if the ad was 3 minutes long, the content training example was also 3 minutes long).
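Here is a rough sketch of that balancing idea, assuming captions come as `(start_time, text)` pairs. Where the content window is taken from (right after the ad, falling back to right before it) is an assumption made for the example, and the `text_between` and `balanced_pair` helpers are hypothetical.

```python
def text_between(captions, start, end):
    """Join the caption texts whose start time falls in [start, end)."""
    return " ".join(text for t, text in captions if start <= t < end)


def balanced_pair(captions, ad_start, ad_end, video_length):
    """Return one sponsor example (label 1) and one content example (label 0)
    covering the same duration, or None if the video is too short."""
    ad_length = ad_end - ad_start
    if ad_end + ad_length <= video_length:
        content_start = ad_end                # content right after the ad
    elif ad_start - ad_length >= 0:
        content_start = ad_start - ad_length  # otherwise right before it
    else:
        return None
    sponsor = text_between(captions, ad_start, ad_end)
    content = text_between(captions, content_start, content_start + ad_length)
    return [(sponsor, 1), (content, 0)]


# Made-up captions: (start time in seconds, text).
captions = [
    (0, "welcome back"),
    (5, "this video is sponsored by"),
    (10, "use my code"),
    (15, "now onto the build"),
    (20, "first the case"),
]
pair = balanced_pair(captions, ad_start=5, ad_end=15, video_length=30)
print(pair)
```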

You can find the dataset right here:

The model yielded 93.79% test accuracy!

You can find the notebook I ran over here: (it’s really raw and undocumented, though)

Working with the model

The next step was to try to automatically label a random video and see what would come out of it.

So here’s what popped up in my recommendations: a video by Linus Tech Tips.

I took my machine for a spin and you can find the raw results here or on this spreadsheet.

It correctly detected the two sponsored segments (the Ting segments), but it also labelled some regular content as sponsored with a high confidence level.

A few remarks about the mistakes:

  • The mislabelled segments all contain uppercase letters. I left the capitalization in because I thought it would indicate a brand, but in the captions, names, towns and such show up capitalized more frequently.
  • Some of them also contained brackets, which probably biases the model, since [Music] and [Applause] often appear in sponsored content.
  • A conversation about MacBooks was flagged as sponsored content.
  • For inference on this video, I chunked the captions very roughly into 10-second parts. I feel like I need a finer set of chunks and to cross their predictions to get better results.
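To make the first, second and last remarks concrete, here is a sketch of what I have in mind, not something I have run against the model yet. It strips bracketed tags, lowercases the text, cuts the captions into overlapping 10-second windows and averages the scores of the windows covering each 5-second slot. The function names, the 5-second step and the scores in the example are all made up for illustration.

```python
import re


def clean_caption(text):
    """Drop bracketed tags like [Music] and lowercase everything."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return " ".join(text.lower().split())


def sliding_windows(captions, window=10.0, step=5.0):
    """Yield (start, end, text) for overlapping windows over (time, text) captions."""
    if not captions:
        return
    last = max(t for t, _ in captions)
    start = 0.0
    while start <= last:
        end = start + window
        raw = " ".join(c for t, c in captions if start <= t < end)
        yield start, end, clean_caption(raw)
        start += step


def score_per_slot(scored_windows, step=5.0):
    """Average the sponsor scores of every window covering each step-sized slot."""
    slots = {}
    for start, end, score in scored_windows:
        slot = start
        while slot < end:
            slots.setdefault(slot, []).append(score)
            slot += step
    return {slot: sum(s) / len(s) for slot, s in sorted(slots.items())}


captions = [
    (0.0, "[Music] Thanks to Ting"),
    (6.0, "for sponsoring"),
    (12.0, "Back to the MacBook"),
]
windows = list(sliding_windows(captions))

# Placeholder scores standing in for the model's sponsor probabilities.
fake_scores = [0.9, 0.6, 0.1]
scored = [(start, end, p) for (start, end, _), p in zip(windows, fake_scores)]
averaged = score_per_slot(scored)
print(windows)
print(averaged)
```

Lowercasing throws away whatever signal capital letters carry for brand names, so it is a trade-off to test rather than an obvious win.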

I feel like there’s a huge margin for improvement, both from fixing these few issues and from more macro improvements such as:

  • Extending the dataset to more videos (I only had a subset of them)
  • Revisiting the BERT model parameters and fine-tuning the neural network
  • Implementing feedback from experienced data scientists, which I’m absolutely not

It says a lot about how far the field has advanced that a complete newbie like me could build something somewhat functional.

All the data and projects I found were really well documented and easy to pick up.

The toughest part was actually deploying the model as a service, since I’m not used to systems that need a lot of resources.

I tried, though!

You can try my (buggy) sponsor-detecting tool here. Let me know what worked for you!


Growth engineer @Quable