What do people watch on YouTube? I got curious while learning how to analyze social media data. So I go to SocialBlade, look at the “Top 500 Influential YouTube Channels (sorted by SB rank)”, and apply the lxml parser to generate the above data. In other words, I extract data from a website by web scraping, then clean up and organize the data into a format (a dataframe) that I can use.
First, I get a table of the names of the top 500 YouTube channels, along with their numbers of views and subscribers.
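A minimal sketch of this scraping step is shown below. The HTML snippet and its `class="row"` layout are hypothetical stand-ins for the downloaded page (the real SocialBlade markup may differ, and the page would first have to be fetched over HTTP); only `lh.fromstring`, `xpath` and `text_content` are actual lxml calls.

```python
import lxml.html as lh

# Hypothetical markup standing in for the downloaded page. The values are the
# sample entry shown in this post; the tag structure is an assumption.
sample_html = """
<html><body>
  <div class="row">
    <div>13th</div>
    <div>A+</div>
    <div><a>Toys and Colors</a></div>
    <div>199</div>
    <div>8,352,442</div>
    <div>4,637,771,013</div>
  </div>
</body></html>
"""

# Parse the HTML string into an element tree.
doc = lh.fromstring(sample_html)

# Collect one text blob per table row; text_content() concatenates the text
# of all nested elements, whitespace and newlines included.
col = [row.text_content() for row in doc.xpath('//div[@class="row"]')]

print(col[0])
```

Each element of `col` is a single string that still contains the line breaks from the page, which is why later steps have to split it apart.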
The first entry of the list looks something like:
13th
A+
Toys and Colors
199
8,352,442
4,637,771,013
Next, I convert the list ‘col’ into a dataframe.
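The conversion can be sketched as follows, assuming `col` is a plain Python list with one string per channel. The exact newline layout of the sample string is illustrative only.

```python
import pandas as pd

# Illustrative stand-in for the scraped list: one raw string per channel.
col = ["\n13th\n\nA+\n\n\nToys and Colors\n\n\n199\n\n8,352,442\n\n4,637,771,013\n"]

# A list of strings becomes a dataframe with a single column labelled 0.
df = pd.DataFrame(col)

print(df.shape)
```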
All information that I need is contained in 1 column of data:
0
0 \n1st\n\nA++ \n\n\nT-Series\n\n\n13,062\n\n82,…
The third step is to extract the information contained in that one long string of text and put it into separate columns in the dataframe. I need to ignore the unimportant parts, such as the newline characters and the empty pieces between them, and split up the rest.
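One possible way to do the splitting is sketched below. The column names (`rank`, `grade`, `channel`, `uploads`, `subscribers`, `views`) and the helper `split_row` are my own labels for illustration; in particular, reading the fourth field as an upload count is an assumption.

```python
import pandas as pd

# Single-column dataframe of raw strings, as produced in the previous step
# (the newline layout here is illustrative).
df = pd.DataFrame(
    ["\n13th\n\nA+\n\n\nToys and Colors\n\n\n199\n\n8,352,442\n\n4,637,771,013\n"]
)

def split_row(text):
    """Split on newlines and drop the empty pieces (hypothetical helper)."""
    return [part.strip() for part in text.split("\n") if part.strip()]

# Assumed column labels; they are not taken from the website.
columns = ["rank", "grade", "channel", "uploads", "subscribers", "views"]
parsed = pd.DataFrame(df[0].apply(split_row).tolist(), columns=columns)

# Remove thousands separators so the counts can be summed later.
# int64 is needed because view counts exceed the 32-bit integer range.
for name in ["uploads", "subscribers", "views"]:
    parsed[name] = parsed[name].str.replace(",", "", regex=False).astype("int64")

print(parsed)
```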
To find out which YouTube channel belongs to which category, I use a different method (etree.tostring) to extract that information, rather than the lh.fromstring approach used previously.
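The sketch below illustrates the general idea of `etree.tostring`: it serializes an element back to markup, so information stored in tags or attributes (which `text_content()` would discard) can be searched. The `data-category` attribute, the channel names and the regular expression are all hypothetical; where the category actually lives in the real page is not shown in this post.

```python
import re

import lxml.html as lh
from lxml import etree

# Hypothetical markup: the category is assumed to sit in an attribute.
sample_html = """
<html><body>
  <div class="row" data-category="music"><a>Example Channel A</a></div>
  <div class="row" data-category="games"><a>Example Channel B</a></div>
</body></html>
"""

doc = lh.fromstring(sample_html)

categories = []
for row in doc.xpath('//div[@class="row"]'):
    # Serialize the element, tags and attributes included, to a string.
    markup = etree.tostring(row, encoding="unicode")
    # Pull the assumed attribute value out of the serialized markup.
    match = re.search(r'data-category="([^"]+)"', markup)
    categories.append(match.group(1) if match else None)

print(categories)
```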
The next important step is to group YouTube channels of the same category together and then sum the number of subscribers and views for each category.
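Grouping and summing can be sketched with pandas `groupby`. The dataframe below uses made-up placeholder channels and numbers, and the column names are assumptions rather than the ones used in the original analysis.

```python
import pandas as pd

# Placeholder data for illustration only; these are not real channel figures.
channels = pd.DataFrame(
    {
        "channel": ["A", "B", "C"],
        "category": ["music", "music", "games"],
        "subscribers": [100, 50, 30],
        "views": [1000, 500, 300],
    }
)

# Group rows by category, sum the two numeric columns, and sort so that the
# most-viewed category comes first.
totals = (
    channels.groupby("category")[["subscribers", "views"]]
    .sum()
    .sort_values("views", ascending=False)
)

print(totals)
```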
We are almost there! Finally, I plot the data in a bar chart.
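A bar chart of the per-category totals could be drawn roughly like this, assuming matplotlib is installed. The `totals` dataframe holds placeholder numbers, and the figure layout (two side-by-side panels) is my own choice, not necessarily the one used for the original chart.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder per-category totals for illustration only.
totals = pd.DataFrame(
    {"subscribers": [150, 30], "views": [1500, 300]},
    index=["music", "games"],
)

# One panel per metric, because views and subscribers differ in scale.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
totals["subscribers"].plot.bar(ax=axes[0], title="Subscribers by category")
totals["views"].plot.bar(ax=axes[1], title="Views by category")
fig.tight_layout()

# Call fig.savefig("some_file.png") or switch to an interactive backend and
# use plt.show() to actually view the chart.
```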
It looks like music and entertainment have the most subscribers and views. After these, people watch games and look up information about film (movies) on YouTube. Education-related channels (including shows designed for toddlers and young children) rank pretty high as well. Contrary to the popular belief that people watch cute cats all day long on YouTube, the ‘animal’ class falls outside the top 10 most popular categories.