
How to Scrape Soundcloud data using Selenium? (from scratch)

Photo by Clément H on Unsplash

Hello there! If you are new to web scraping, or want to learn how to scrape data from websites using Selenium, then this article is for you.

In this article we are going to scrape data from SoundCloud, but you can use the same technique to scrape data from other websites as well.

Before we move further and jump into coding, let’s take a look at what web scraping is. If you already know about scraping, you can jump straight to the coding section.

Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

As mentioned in the title of the article, we are going to use Selenium to scrape the data. In case you don’t know what Selenium is:

In simple words, “Selenium is a tool which automates web browsers”.

First of all, you need to install a few things to set up the environment for scraping:

Install the Selenium library.

Install the webdriver for your browser.

For an installation guide to the above-mentioned library and drivers, visit

So, now that we have installed the library and the web driver, we are good to go. Our goal is to scrape the top-50 music charts data from SoundCloud. I will keep my code as simple as possible so that beginners can understand it easily. So, let’s start coding.

Note: I am using the Chrome browser. In case you are using Firefox or any other browser, use your browser’s driver instead, for example: webdriver.Firefox(). This will open a new browser window, and all the scraping will be done in this window only.

Next, we will call the web driver to navigate to the SoundCloud top charts page.
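The navigation step could look like the sketch below. The chart URL shown is an assumption based on SoundCloud’s charts page at the time of writing; check it in your own browser first:

```python
CHARTS_URL = "https://soundcloud.com/charts/top"  # assumed charts URL

def open_top_charts(driver):
    # Point the already-open browser window at the top-charts page.
    driver.get(CHARTS_URL)
```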

For getting the top-50 charts for all music genres, we need to find all the genres present on the website. To do so, we first need to click on the dropdown menu. Why? Because clicking on the dropdown button will reveal all the categories inside that dropdown. Right-click on the dropdown button labelled “All music genres” and click on Inspect. This will open an inspection window on the side of the browser; from there, right-click on the highlighted (yellow) element and copy the selector.

Now we will find the button by the copied CSS selector and trigger a click event on that button using the .click() function.

Now that we have clicked on the dropdown menu, we can get all the different genres present in that menu. Along with the genres, we also need the links to those genres’ top charts. To get those links and genre names, look for the element named “linkMenu__list”.

An easy way to find this element is by right-clicking on any of the genres in the dropdown menu. This element has three section elements which are identical in structure and path. The first section contains details about the top-50 charts for ‘All music genres’ and ‘All Audio categories’. The second section contains each individual music genre and its respective link, which is what we need. To get the second section, we will write a custom XPath.

Now we have the element holding the data for the different genres and their links; all we have to do is extract that data from the element.

Next, we will create a data frame to hold this information together.

So now we have the genres and their links; our next step is to get the top-50 chart data for each genre. For this we will create a function which takes the links data frame as input and returns a data frame with the top-50 chart data for all the genres.

This final dataframe will contain 6 columns: genre, song title, song rank, artist name or username, score this week, and total score. Here, score refers to the listen count. But again, to get all this data we need to go to any genre’s top chart page and inspect the elements, though this time it is easier to locate them.

Go to any music genre’s top chart, right-click on any song, and click on Inspect. You will find the element with the class name ‘chartTrack__details’, which is unique in the inspection window. Similarly, you can find the class ‘chartTrack__score’, which holds the score data for each song; this class is also unique. So this time we will use the class name to locate the elements.

In our function, we first create an empty dataframe with the column names mentioned above. This will be our final dataframe, which the function will return.

Next, we will use a for loop to iterate over each row of the links data frame and grab the genre and link from each row.

Now, we will open that link using the driver.get() function and scroll the web page all the way to the bottom so that all the elements are loaded.
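The scroll can be done by executing a little JavaScript from Selenium. This sketch keeps scrolling until the page height stops changing, which is one common way to force lazily loaded content to appear:

```python
import time

def scroll_to_bottom(driver, pause=2.0):
    # Repeatedly jump to the bottom of the page until the document height
    # stops growing, i.e. no more chart entries are being lazy-loaded.
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```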

The next step is to locate the elements by class name. We will use the class ‘chartTrack__details’ to get the song title and artist name, and the class ‘chartTrack__score’ to get the song’s score data. After that, we will extract the data from these elements using a for loop.

Now we will filter the extracted data. We will use the split function to split the text and store it in separate lists. Also, if you notice, there are some tracks for which scores are not available; in this case we will use NA as the score. Since it is a top-50 chart, the first song will have rank 1, the second song rank 2, and so on, so we will create a list of numbers from 1 to 50 and use it as the songs’ ranks.

Now that we have all the data filtered, we will create a temporary dataframe holding all the data for the given genre, append this dataframe to the final dataframe holding the data for all the genres, and return the final dataframe.

Here is the combined code for the ‘scrap’ function.

Now it’s time to call the function and scrape some data. We will pass the link_data dataframe to the function.

So, finally, we have scraped our data. This data is not ready for training ML models and requires further cleaning, but let’s leave that for another article.

Here is the github link for the complete code:

Watch this video and learn how the pooling layer works in a CNN.

Check out my article on neural networks, where I explain how neural networks work in a very simple way without using any complex math.



