Part 2 — Data Wrangling And Visualization

Yusuf Gulcan
5 min readJul 19, 2022

This is part 2 of my end-to-end data science project. The last article was about collecting data through Beautifulsoup. I will explain how to process the data in this article. Here is the link for part 1.

After saving the data into CSV, I start working on it using Jupyter Notebook. Since this is real data, it is full of null, cracked, and partial data. I need to process it to make it appropriate for the presentation and later for machine learning algorithms.

The raw data looks like this:

Preview of the raw data

At first glance, the scraping does not look so successful in gathering CPU and Battery Power values. The reason might be the messy structure of the website or that I made mistakes in writing my scrape code. Nevertheless, I still need to use the data, so I need to patch it up. First, I drop the ‘Unnamed: 0 ‘, ‘Links’, and ‘Battery Power’ columns because I think they are not helpful.

Information about the data
df.drop(columns=[‘Unnamed: 0’], axis= 1, inplace=True)
df.drop(columns = ['Link','Battery Power'],axis=1,inplace=True)

To fill the null values on the CPU column, I searched the CPU values for each phone model and assigned what I found on the internet to the corresponding rows.

pp = df.loc[(df.Brand==’Apple’)&(df.CPU.isna()==True)&(df.Model.str.contains(‘13’)==True),’Model’].index 
df.loc[pp,’CPU’] = 3.2
rr = df.loc[(df.Brand=='Apple')& (df.CPU.isna()==True)&(df.Model.str.contains('S')),'Model'].index
df.loc[rr,'CPU'] = 2.6
tt = df.loc[(df.Brand=='Apple')& (df.CPU.isna()==True)&(df.Model.str.contains('12')),'Model'].index
df.loc[tt,'CPU'] = 3.2
df.loc[tt]
df.loc[(df.Brand=='Apple') & (df.CPU.isna()==True),'CPU'] = 2.8

I finished filling the null CPU values on Apple products. It did not take long because there are relatively fewer Apple models.
I do the same for Samsung, Huawei, and Xiaomi because those are the most dominant brands.
Setting values is a way to make use of data with a lot of null values. Some analysts fill nulls with mean, median, or mode values. Some completely drop the rows with nulls. However, If you know the value or you can reach the specific value, just set it.
Another problem is the ‘Color’ column which includes partial Turkish values. I need to convert that column into English.

colors  = {'Siyah':'Black','Mavi':'Blue','Beyaz':'White','Ye':'Green','Gri':'Gray','Mor':'Purple','K':'Red',              'Alt':'Golden','Pembe':'Pink','Sar':'Yellow','Turuncu':'Orange','Metalik':'Metallic','Lacivert':'Blue',                'Turkuaz':'Turquoise','Bej':'Cream','Bordo':'Red','Kahverengi':'Brown'}

I create a dictionary for each unique value in the color column to convert the values. There is no shortcut for this, you need to do it manually. This is an example of why data wrangling is the most time-taking step of data analysis.

brands=df.groupby('Brand').count().sort_values('Model',ascending=False).index[:10]
df =df.loc[df.Brand.isin(brands)==True]

The data includes many brands with no recognition. I want to cut them off because I want my analysis to be relevant to the real world. Therefore, I only choose the 10 most common brands. Others are mostly mislabeled products on the website.

Number of Products per Brand

The most common brands on the website are Samsung, Xiaomi, and Apple. Other brands are simply not in demand. %73 of the products are from these three brands.

According to the data, if we would create a smartphone from the most common features it would be a Black Android 6-inch Samsung with 4 GB RAM,128 GB Storage, 2.3 GHz CPU, and 12 MP camera resolution.

The Number of product features per column

There is a deep accumulation between 3500 and 7000 Turkish liras in terms of price. When the products are sorted by ‘brand’, we see that most brands offer products where the accumulation zone is.

Distribution of Products by Price

Apple products are insurgent to the market norms. The average price of Apple products is 23,819 liras whereas the average of the rest of the market is only 6,431 liras.

Brands and Price

It is a huge difference. I want to check if the difference comes from hardware capacity. I create a set of scatter plots to see the place of the Apple products compared to others.

The graphs show us that the price difference, which is almost 3X, is not reflected in the hardware capacity. Apple products are still with the highest hardware capacity overall, but there is not as much difference.

The distribution of the smartphones on the several feature axes, as opposed to price, looks like this:

df['Hardware'] = df.RAM * df.Storage * df.CPU * df['Screen Size'] 
df['Price/Hardware'] = df.Hardware / df.Price

I create two new columns named ‘Hardware Capacity’ by simply multiplying the feature values for each product. I also calculated the price paid per hardware unit. Here is the table of mean values per brand sorted by Price/Hardware:

Price charged per Hardware Unit

I add the correlation chart by reminding you that ‘correlation does not necessarily mean causation.

Correlation Matrix

We see that price is in a strong correlation with CPU. Storage comes second, and third, RAM is weakly correlated with price.

Finally, I created a PowerBI dashboard assuming that all products can be sold at current prices. Here is a preview;

Dashboard Showing the Overall Situation

This article shows the highlights of my data analysis. I did not have a chance to show everything I did in the article. I hope it is still helpful. You can access the full code here

--

--