Detecting sponsored content in YouTube videos

  • Mostly Python development
  • Scraping
  • Data processing
  • Analytical and critical thinking

The audio route

  • Train a model on the audio
  • Train a model on the transcripts
  • Trained on a single podcast, the model reached 99% accuracy on the training set and 95% on the test set, which means it detected commercial portions accurately in 95% of the episodes the network had not seen.
  • Given different podcasts, the model could not detect commercials reliably: its predictions came out with less than 60% confidence.

The captions route

Working with the model

  • The mislabelled chunks all contain upper-case letters. I left the capitals in because I thought they would indicate a brand, but in captions, names, towns and the like appear more frequently.
  • Some of them also contained brackets, which is probably biasing the process, since [Music] and [Applause] often appear in sponsored content.
  • A conversation about MacBooks was flagged as sponsored content
  • I chunked the captions very roughly into 10-second parts for inference on the labelled videos. I think I need finer chunks, and to cross-check the results across them, to get better predictions.
  • Extending the dataset to more videos (I only had a subset of them)
  • Recalculating the BERT model parameters and fine-tuning the neural network
  • Implementing feedback from an experienced data scientist, which I'm absolutely not
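
The caption clean-up and finer chunking mentioned above could look roughly like the sketch below. It is only an illustration under assumptions, not the code used for the results in this post: the caption format (dicts with "start" in seconds and "text"), the function names, and the 5-second step are all hypothetical. It strips bracketed tags such as [Music], lowercases the text, builds overlapping 10-second windows, and averages the scores of the windows that cover each segment as one possible way of "crossing" the results.

```python
import re

# Matches bracketed caption tags such as [Music] or [Applause].
BRACKET_TAG = re.compile(r"\[[^\]]*\]")


def clean_text(text):
    """Drop bracketed tags, lowercase, and collapse whitespace."""
    text = BRACKET_TAG.sub(" ", text)
    return " ".join(text.lower().split())


def chunk_captions(captions, window=10.0, step=5.0):
    """Group captions into overlapping windows.

    `captions` is a list of dicts with "start" (seconds) and "text";
    this format is an assumption made for the example.
    A new window of `window` seconds starts every `step` seconds.
    Windows with no text left after cleaning are skipped.
    """
    if not captions:
        return []
    last_start = max(c["start"] for c in captions)
    chunks = []
    t = 0.0
    while t <= last_start:
        texts = [
            clean_text(c["text"])
            for c in captions
            if t <= c["start"] < t + window
        ]
        joined = " ".join(x for x in texts if x)
        if joined:
            chunks.append({"start": t, "end": t + window, "text": joined})
        t += step
    return chunks


def average_scores(chunks, scores, step=5.0):
    """Average the scores of every window covering each `step`-second segment.

    Assumes `window` is a multiple of `step`, so segments line up.
    Returns a dict mapping segment start time to the mean score.
    """
    per_segment = {}
    for chunk, score in zip(chunks, scores):
        t = chunk["start"]
        while t < chunk["end"]:
            per_segment.setdefault(t, []).append(score)
            t += step
    return {t: sum(v) / len(v) for t, v in sorted(per_segment.items())}


if __name__ == "__main__":
    captions = [
        {"start": 0.0, "text": "Hello [Music] World"},
        {"start": 6.0, "text": "This video is sponsored by ACME"},
        {"start": 12.0, "text": "[Applause]"},
    ]
    chunks = chunk_captions(captions)
    for chunk in chunks:
        print(chunk)
    # The scores here are placeholders; in practice they would come from the classifier.
    print(average_scores(chunks, [0.2, 0.8]))
```

Lowercasing removes the capitalization signal entirely, which is a trade-off: brand names lose their cue, but so do the names and towns that caused the false positives.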


