Mobile Price Classification

Matheus Baars
4 min read · Jan 9, 2021

Hey, guys!

I’ve been studying analytics and data science, and I’d like to share some competitions I have worked through. In this article I tackle Mobile Price Classification. You can find more details about the data in this link.
For this project, I’ll use Python and Jupyter Notebook, plus some modules such as scikit-learn, matplotlib, numpy and pandas, among others.

The objective of this project is to classify phones by price. The goal is not to calculate the exact price but to assign each phone to one of a range of price categories.
First, I’ll present the data.

  • battery_power — Total energy a battery can store in one time measured in mAh
  • blue — Has bluetooth or not
  • clock_speed — speed at which microprocessor executes instructions
  • dual_sim — Has dual sim support or not
  • fc — Front Camera mega pixels
  • four_g — Has 4G or not
  • int_memory — Internal Memory in Gigabytes
  • m_dep — Mobile Depth in cm
  • mobile_wt — Weight of mobile phone
  • n_cores — Number of cores of processor
  • pc — Primary Camera mega pixels
  • px_height — Pixel Resolution Height
  • px_width — Pixel Resolution Width
  • ram — Random Access Memory in Megabytes
  • sc_h — Screen Height of mobile in cm
  • sc_w — Screen Width of mobile in cm
  • talk_time — longest time that a single battery charge will last when you are
  • three_g — Has 3G or not
  • touch_screen — Has touch screen or not
  • wifi — Has wifi or not

Let’s start by looking at each available variable.
To do that, import all the modules. If you do not have them, you can install them with the pip command. If you are using Anaconda, use the conda command in the prompt.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Now that we have imported all the modules, let’s create our dataframe.
It’s important to download the data and put it in the same directory as your script.

To avoid truncated output, I set the options below so that all columns and rows are visible, and then create the dataframe.

pd.options.display.max_columns = None 
pd.options.display.max_rows = None

Now we are ready to create our dataframe and observe the table with the commands below.

dataframe = pd.read_csv('train.csv')
dataframe.head(5)
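
Before going further, it can help to check the size of the table and how the target classes are distributed, since accuracy is only a fair metric when the classes are roughly balanced. The sketch below uses a tiny made-up table in place of train.csv (the values are invented for illustration). With the real data you would call the same methods on dataframe.

```python
import pandas as pd

# Tiny made-up stand-in for train.csv; the values are invented for illustration.
sample = pd.DataFrame({
    'ram': [512, 1024, 2048, 3072, 4096, 768, 1536, 3584],
    'battery_power': [800, 1200, 1500, 1800, 1900, 900, 1300, 1700],
    'price_range': [0, 1, 2, 3, 3, 0, 1, 2],
})

# Number of rows and columns in the table
n_rows, n_cols = sample.shape

# Share of rows in each price_range class, sorted by class label
class_share = sample['price_range'].value_counts(normalize=True).sort_index()

print(n_rows, n_cols)
print(class_share)
```

If one class held most of the rows, a model could reach a high accuracy just by always predicting that class, so this check is worth doing before trusting the score.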

It’s good practice to look at some summary statistics in order to spot possible errors, outliers or missing values in the dataframe. I’ll run this only on the continuous variables so we won’t get weird values. This is actually one of the most important parts of the project, and where you spend the most time.

features_describe = ['battery_power', 'clock_speed', 'fc', 'int_memory', 'm_dep',
'mobile_wt', 'n_cores', 'pc', 'px_height', 'px_width', 'ram',
'sc_h', 'sc_w', 'talk_time']
dataframe[features_describe].describe()
dataframe.isna().sum()

As we can see, there are no missing values. The next step is to look for outliers. A convenient way to do that is with boxplots.

I’ll post just some of them to give a general idea.

sns.boxplot(x='battery_power', data=dataframe)

There are some outliers, but too few to cause any problem for the final result.
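
To put a number on what the boxplots show, one option is the 1.5 × IQR rule, which is the default rule boxplots use to decide which points to draw individually. This is only a sketch: the helper name count_iqr_outliers and the sample values are my own assumptions, not part of the dataset.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values for illustration; 40 sits far away from the rest.
fc_sample = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 40])
n_outliers = count_iqr_outliers(fc_sample)
print(n_outliers)
```

With the real data, you could apply the same function to each column in features_describe and compare the counts with the number of rows.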

Next, we can look at the correlation between variables. A good way to do so is to create a heatmap. For a decision tree, multicollinearity is no big deal because, unlike regression, the tree splits on one variable at a time. The only strongly correlated feature is ram, at 0.92, but we’ll lose too much information if we discard all the other features, so I am going to keep every feature that ranks better than 0.2.

plt.figure(figsize=(20, 10))
heatmap = sns.heatmap(dataframe.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':16}, pad=16);

The variable with the best correlation with price_range is ram. However, I’ll keep all features that rank at 0.22 or above.

Filtering the features, we have:

features = ['battery_power', 'px_height', 'px_width', 'ram']
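
Instead of reading the names off the heatmap, the same filter can be written in code. This is a minimal sketch under a few assumptions: the helper name select_by_correlation is hypothetical, the demo table is synthetic, and the 0.5 cutoff is only for the demo. With the real data you would pass dataframe, 'price_range' and your own cutoff.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df, target, threshold):
    """Names of columns whose absolute Pearson correlation with target
    is at least threshold (hypothetical helper)."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()

# Synthetic stand-in data: 'strong' tracks the target, 'noise' does not.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
demo = pd.DataFrame({
    'strong': signal + 0.1 * rng.normal(size=n),
    'noise': rng.normal(size=n),
    'target': signal,
})

selected = select_by_correlation(demo, 'target', threshold=0.5)
print(selected)
```

I use the absolute value so that a strong negative correlation would also be kept, which a plain "greater than" comparison would miss.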

With the features selected, let’s train our model and evaluate it with an accuracy score.

Data = dataframe[features]
Target = dataframe['price_range']
# Hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(Data, Target, test_size=0.2, random_state=100)
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(X_train, Y_train)
y_pred = dtc.predict(X_test)
# Accuracy as a percentage
accuracy = accuracy_score(Y_test, y_pred) * 100
print(accuracy)

With these conditions we got an accuracy of 80.25%. But we can improve on that a little bit with this trick:

for i in range(1, 51):
    dtc = tree.DecisionTreeClassifier(max_depth=i, min_samples_leaf=5, random_state=100)
    dtc = dtc.fit(X_train, Y_train)
    y_pred = dtc.predict(X_test)
    accuracy = accuracy_score(Y_test, y_pred) * 100
    print('Level = {} | Accuracy = {}'.format(i, accuracy))

The best accuracy was 83.75%, but that tree has a lot of levels and nodes. In practice, the number of levels you accept will depend on your business. To me, a level 5 would be great.
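
One caveat with the loop above is that it picks the depth using the test set, so the reported accuracy can be a bit optimistic. A common alternative, which is not what I did above, is to choose the depth with cross-validation on the training split and only touch the test split once at the end. Here is a sketch of that idea, assuming synthetic four-class data from make_classification in place of the phone data.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic four-class data standing in for the phone features (an assumption).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

cv_scores = {}
for depth in range(1, 11):
    model = tree.DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5,
                                        random_state=100)
    # Mean accuracy over 5 folds of the training split only
    cv_scores[depth] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Depth with the highest cross-validated accuracy
best_depth = max(cv_scores, key=cv_scores.get)

# Fit once with the chosen depth and score on the untouched test split
final_model = tree.DecisionTreeClassifier(max_depth=best_depth, min_samples_leaf=5,
                                          random_state=100).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test) * 100
print('Best depth = {} | Test accuracy = {:.2f}'.format(best_depth, test_accuracy))
```

With the real data, you would replace X and y with dataframe[features] and dataframe['price_range']. If two depths score almost the same, I would still go for the shallower tree.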
