Mobile Price Classification

4 min read · Jan 9, 2021

Hey, guys!

I've been studying analytics and data science, and I'd like to share some competitions I have worked through. For this article I am doing the Mobile Price Classification. More details about the data can be found in this link.
For this project, I'll use Python and Jupyter Notebook, along with modules such as scikit-learn, matplotlib, seaborn, numpy, and pandas.

The objective of this project is to classify phones by price. The goal is not to calculate the exact price but to assign each phone to one of a set of price-range categories.
Firstly, I’ll present the data.

battery_power — Total energy a battery can store in one time measured in mAh
blue — Has bluetooth or not
clock_speed — speed at which microprocessor executes instructions
dual_sim — Has dual sim support or not
fc — Front Camera mega pixels
four_g — Has 4G or not
int_memory — Internal Memory in Gigabytes
m_dep — Mobile Depth in cm
mobile_wt — Weight of mobile phone
n_cores — Number of cores of processor
pc — Primary Camera mega pixels
px_height — Pixel Resolution Height
px_width — Pixel Resolution Width
ram — Random Access Memory in Megabytes
sc_h — Screen Height of mobile in cm
sc_w — Screen Width of mobile in cm
talk_time — longest time that a single battery charge will last when you are
three_g — Has 3G or not
touch_screen — Has touch screen or not
wifi — Has wifi or not
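The columns above mix yes/no flags (such as blue and wifi) with measured quantities (such as ram). A minimal sketch of how the two groups could be separated automatically is shown here; the toy values are invented for illustration, and the rule "a column is a flag if it only contains 0 and 1" is an assumption, not something the dataset documentation states.

```python
import pandas as pd

# Toy rows reusing a few column names from the list above; values are invented.
toy = pd.DataFrame({
    "battery_power": [842, 1021, 563, 615],
    "blue": [0, 1, 1, 0],
    "ram": [2549, 2631, 2603, 2769],
    "wifi": [1, 0, 0, 1],
})

# Assumption: a column is a yes/no flag when every value is 0 or 1.
binary_cols = [col for col in toy.columns if toy[col].isin([0, 1]).all()]
numeric_cols = [col for col in toy.columns if col not in binary_cols]

print("flags:", binary_cols)
print("measurements:", numeric_cols)
```

On the real data, the same two lists could be built from the full dataframe once it is loaded, which helps decide which columns are worth passing to summary statistics.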

Let's start by looking at each available variable.
To do that, import all the modules. If you do not have them, you can install them with the pip command. If you are using Anaconda, use the conda command in the prompt instead.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Now that we have imported all the modules, let’s create our dataframe.
It's important to download all the data and put it in the same directory as your script.

By default, pandas limits how many columns and rows it displays, so I added the code below to show all of them before creating the dataframe.

pd.options.display.max_columns = None 
pd.options.display.max_rows = None

Now we are ready to create our dataframe and observe the table with the commands below.

dataframe = pd.read_csv('train.csv')
dataframe.head(5)

It's good practice to view some summary statistics to spot possible errors, outliers, or missing values in the dataframe. I'll run this only on the continuous variables so we don't get meaningless statistics for the yes/no columns. Actually, that's one of the most important parts of the project and where you spend the most time.

features_describe = ['battery_power', 'clock_speed', 'fc', 'int_memory', 'm_dep',
                    'mobile_wt', 'n_cores', 'pc', 'px_height', 'px_width', 'ram',
                    'sc_h', 'sc_w', 'talk_time']

dataframe[features_describe].describe()

dataframe.isna().sum()

As we can see, there are no missing values. The next step is to look for outliers. A good way to do that is with boxplots.

I'll post just some of them to give a general idea.

sns.boxplot(x='battery_power', data=dataframe)

There are some outliers, but too few to cause a problem for the final result.
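To put a number on "how many outliers", one option is to count the points a boxplot would draw beyond its whiskers. The sketch below assumes the usual 1.5 × IQR whisker rule (the seaborn default); the helper name count_iqr_outliers and the sample values are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Invented values: seven ordinary points and one obvious outlier.
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 100])
print(count_iqr_outliers(values))
```

Applied column by column to the real dataframe, a count like this makes it easier to judge whether the outliers are few enough to ignore.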

Next, we can look at the correlation between variables. A good way to do so is to create a heatmap. For a decision tree, multicollinearity is not a big deal because, unlike regression, the tree splits on one variable at a time. The only feature strongly correlated with price_range is ram, at 0.92, but we would lose too much data if we discarded all the other features, so I am going to choose every feature that ranks better than 0.2.

plt.figure(figsize=(20, 10))
# Use the dataframe loaded from train.csv
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':16}, pad=16);

The variable most strongly correlated with price_range is ram. However, I'll choose all features that rank at 0.22 or higher.

Filtering the data, we have:

features = ['battery_power', 'px_height', 'px_width', 'ram']
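The list above was picked by reading the heatmap. A rough sketch of how a threshold-based selection could be done in code follows; the toy values are invented, the 0.22 cutoff is the one quoted in the text, and comparing the absolute correlation is my assumption about how negative correlations should be treated.

```python
import pandas as pd

# Invented toy data: ram tracks price_range closely, wifi does not.
toy = pd.DataFrame({
    "ram": [500, 1500, 2500, 3500, 600, 1400, 2600, 3900],
    "wifi": [1, 0, 0, 1, 1, 0, 0, 1],
    "price_range": [0, 1, 2, 3, 0, 1, 2, 3],
})

threshold = 0.22  # cutoff quoted in the text

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["price_range"].drop("price_range")

# Keep features whose absolute correlation reaches the cutoff.
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(selected)
```

On the real dataframe the same three lines would work unchanged, and the resulting list can be compared against the hand-picked features.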

With the features selected, let's train our model and evaluate it with an accuracy score.

Data = dataframe[features]
Target = dataframe['price_range']

X_train, X_test, Y_train, Y_test = train_test_split(Data, Target, test_size=0.2, random_state=100)

dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(X_train, Y_train)

y_pred = dtc.predict(X_test)

accuracy = accuracy_score(Y_test,y_pred)*100
print(accuracy)

With these settings we got an accuracy of 80.25%. But we can improve it a little with this trick:

for i in range(1, 51):
    dtc = tree.DecisionTreeClassifier(max_depth=i, min_samples_leaf=5, random_state=100)
    dtc = dtc.fit(X_train, Y_train)
    y_pred = dtc.predict(X_test)
    accuracy = accuracy_score(Y_test,y_pred)*100
    print('Level = {} | Accuracy = {}'.format(i, accuracy))

The best accuracy was 83.75%, but that tree has a lot of nodes and leaves. Actually, how many nodes are acceptable will depend on your business. To me, a level 5 would be great.
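The trade-off between accuracy and tree size can also be handled in code instead of by eye. The sketch below is only an illustration: it uses synthetic data from make_classification rather than the phone dataset, and the one-point tolerance is an arbitrary value I picked. It stores the accuracy for each depth and then takes the shallowest tree that comes within the tolerance of the best score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data: 4 features, 4 classes.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

# Record the accuracy (in percent) for each depth instead of only printing it.
scores = {}
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5,
                                   random_state=100)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, model.predict(X_test)) * 100

best_depth = max(scores, key=scores.get)

# Assumption: losing up to 1 percentage point is acceptable for a smaller tree.
tolerance = 1.0
simplest_depth = min(d for d, s in scores.items()
                     if s >= scores[best_depth] - tolerance)

print("best depth:", best_depth, "| simplest acceptable depth:", simplest_depth)
```

Because the depth is chosen on the same test split that reports the accuracy, the final number is somewhat optimistic; holding out a separate validation split would give a cleaner estimate.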


Written by Matheus Baars