Part 1 — Data Collection Using Python's BeautifulSoup Library

Yusuf Gulcan
5 min read · Jul 18, 2022

Recently, I completed a project that required collecting, wrangling, and visualizing data. In this article I will share only the data collection step in detail; check out my Medium profile for the other steps.

The goal of the project is to collect data about smartphones in order to understand the state of the Turkish market. I chose Trendyol.com, one of the most popular e-commerce retail sites in Turkey.

First, I asked myself: what kind of information do I want, and in how much detail?

I would like to have every piece of detail about each product, but that is not realistic. So I target the most critical and most accessible information: for each product I want the CPU, RAM, storage, screen size, camera resolution, color, operating system, brand name, model, and price.

Now that I know what I want, I need a plan to follow. I check the structure of the website and browse until I reach the section where all smartphones are listed. However, the list page includes only a limited amount of information about each product, such as brand name, model name, price, and storage, whereas I also want features like color, CPU, RAM, screen size, and camera resolution.

In this case, I need to extract the link to each product page from the list page and then write a script that gathers all the information I want from those product pages. The list page looks like this:

The Main List Page For Smartphones

A product page contains a section like this:

After looking at the structure, I start coding to collect the link to each product page from the main category page. As I collect the data for each product, I want to store it in lists so that I can later use them to build a pandas data frame.

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
import re

So I first import all the libraries the project needs: pandas to create a data frame, csv to write it to a CSV file, requests to make HTTP requests, BeautifulSoup to parse the HTML, and re (regular expressions) to find patterns in the parsed text.

Then I add try/except blocks so that the whole process does not stop when an error occurs inside the loop, which mostly happens when a requested piece of data is missing.
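As a minimal sketch of that idea, the per-product version of the pattern looks roughly like this; the generic Exception and the placeholder link list are my own simplification for illustration, not the exact code from the project:

import requests
from bs4 import BeautifulSoup

linklist = []   # product-page URLs collected from the list pages (placeholder)

for link in linklist:
    try:
        response1 = requests.get(link)
        soup1 = BeautifulSoup(response1.text, 'html.parser')
        data = soup1.find('ul', class_='detail-attr-container').text
        # feature extraction with regular expressions would go here
    except Exception as e:
        # A missing element or a failed request only skips this product
        # instead of stopping the whole scrape.
        print(f'Skipping {link}: {e}')
        continue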

After that, I create my lists, one per feature, so that each kind of data is stored separately.

try:
    linklist = []
    Ramlist = []
    BatteryPowerlist = []
    Storagelist = []
    Screensizelist = []
    CameraResolutionlist = []
    OperatingSystemlist = []
    Colorlist = []
    CPUlist = []
    Pricelist = []
    Modellist = []
    brandlist = []

I send a request to the website and get a response. After confirming that the request works, I create a soup by passing the response text to BeautifulSoup.

    for num in range(1, pagenum):
        url = f'https://www.trendyol.com/akilli-cep-telefonu-x-c109460?pi={num}'
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        Phones = soup.find_all('div', class_='p-card-wrppr')

I wrap this part in a Python loop to collect information from multiple result pages. The URL contains the page number of the search results (the value inside the braces '{ }'), which I can step through with a range loop.

After parsing the HTML with BeautifulSoup, I look for the element that contains the product page link, which in this case is the href attribute of an <a> tag.

HTML Structure

Since this is a results page, there are many product links, all under identically structured <a> tags.

So I target all the product cards with the 'find_all' function of the BeautifulSoup library and create another loop to visit each of them.

        for each in Phones:
            # First, find the link to the individual product page on the search result page.
            link = each.find('div', class_='p-card-chldrn-cntnr').a['href']
            link = f'https://www.trendyol.com{link}'
            linklist.append(link)
            print(link)

            # Then create another soup for the product-specific data.
            response1 = requests.get(link)
            soup1 = BeautifulSoup(response1.text, 'html.parser')
            data = soup1.find('ul', class_='detail-attr-container').text

As I get the product page links one by one, I pull the information from each of those pages, repeating the same request-and-parse process with BeautifulSoup.

My target in the product-specific pages is the section where product specifications are listed. Here is what that section looks like:

Part Of The Product Page Where Specifications Are Shown In Detail

The code above returns the specification section as plain text, so I can search for patterns in that text and pull out the information I need.

The tags are named the same

Frankly, this is not the ideal way to get the information; normally you would target specific tags to get specific pieces of data. But the pages are not structured that way: the specification list shown above is lumped under a single tag, so targeting an individual attribute directly is not practical.

            # Match the number that comes right after a label ending in 'tesi'
            # (e.g. 'RAM Kapasitesi'); '[0-9]+' allows multi-digit values such as 12 GB.
            rammatch = re.search('tesi ([0-9]+)', data)

            if rammatch:
                RAM = rammatch.group(1)
            else:
                RAM = None
            Ramlist.append(RAM)

I use regular expressions to mine the data from patterns in the text. The snippet above shows how I extracted the RAM value; I will not include every line I wrote for the other specifications.
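As a rough illustration of how another specification could be extracted with the same approach, here is a sketch for the screen size; the label text 'Ekran Boyutu' and the sample spec string are assumptions made for the demonstration, not values taken from the article:

import re

# Sketch only: 'Ekran Boyutu' (screen size) is an assumed label in the spec text.
data = 'Ekran Boyutu 6.5 inç RAM Kapasitesi 8 GB'   # example spec text for illustration
Screensizelist = []

screenmatch = re.search(r'Ekran Boyutu ([0-9]+(?:[.,][0-9]+)?)', data)

if screenmatch:
    ScreenSize = screenmatch.group(1)
else:
    ScreenSize = None
Screensizelist.append(ScreenSize)
print(ScreenSize)   # prints 6.5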

I did similar things for all of the product features except for price and brand, because those two are wrapped in their own tags and relatively easy to reach. The code for the price is as follows:

            price = soup1.find('span', class_='prc-dsc').text.strip()
            price = price[:-2].replace('.', '')   # drop the currency sign and the thousands separators
            price = price.split(',')[0]           # keep only the part before the decimal comma
            Pricelist.append(price)

I also clean up the output format, since the raw text includes thousands separators, a decimal comma, and the currency sign.
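The brand extraction itself is not shown in the article. As a rough sketch of the idea, the brand could be read from the product title heading; the class name 'pr-new-br', the sample markup, and the assumption that the brand sits in the first link of that heading are all my own guesses for illustration:

from bs4 import BeautifulSoup

# Sketch only: the class name and the markup below are assumptions, not taken from the article.
sample_html = '<h1 class="pr-new-br"><a>Samsung</a> <span>Galaxy A13 128 GB</span></h1>'
soup1 = BeautifulSoup(sample_html, 'html.parser')

brandtag = soup1.find('h1', class_='pr-new-br')
brand = brandtag.a.text.strip() if brandtag and brandtag.a else None

brandlist = []
brandlist.append(brand)
print(brand)   # prints Samsung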

After writing similar extraction code for each feature, I gather the data into a pandas data frame and close the try block with a ValueError handler.

    # Gather all of the lists in a dictionary and create a pandas dataframe from it.
    specifications = {'Model': Modellist,
                      'Brand': brandlist,
                      'Price': Pricelist,
                      'CPU': CPUlist,
                      'RAM': Ramlist,
                      'Storage': Storagelist,
                      'Operating System': OperatingSystemlist,
                      'Camera Resolution': CameraResolutionlist,
                      'Screen Size': Screensizelist,
                      'Battery Power': BatteryPowerlist,
                      'Color': Colorlist,
                      'Link': linklist
                      }

    df = pd.DataFrame(specifications)
    # Write the dataframe to a CSV file so it can be processed later in a Jupyter Notebook.
    df.to_csv('Trendyol_detailed33.csv')

except ValueError as e:
    print(e)

I write the data to a CSV file for further processing. Finally, I wrap the entire script in a function that takes the number of pages to scrape as its argument.

The general structure looks like this:

def Trendyoldata(pagenum):

    try:
        ...
    except ValueError as e:
        print(e)

Trendyoldata(pagenum)

This article covers the data collection stage of my project. The data wrangling and data visualization stages are explained in separate articles on my Medium page.

You can download the script file here.
