INSIGHT ON ANALYTICS: May 2018

Monday, May 7, 2018

Reading a confusion Matrix... A misclassifier

How to read a confusion matrix

Suppose you have 95 records available for prediction. Each record has a target indicating whether the person has a particular disease or not.

The matrix is displayed as:

	Confirmed	Not confirmed
Confirmed	35	20
Not confirmed	25	15

What do you understand?

Here is the same matrix with the row and column totals added:

		Predicted values
		Confirmed	Not confirmed	total
Actual values	Confirmed	35	20	55
Actual values	Not confirmed	25	15	40
total		60	35	95

Can you get some information?

The row totals (green cells) are the actuals, the column totals (yellow cells) are the predicted values, and the off-diagonal cells (pink) are the errors.

Actual values are the ones that were tested physically, so the results are on hand. In our case, 55 people had the disease and 40 did not.

What did the algorithm do? It predicted that 60 people have the disease and 35 do not.

How to interpret this? Out of the 55 people who have the disease, the algorithm wrongly classified 20 as not having it.

Secondly, out of the 40 people who do not have the disease, it predicted that 25 may have it, which is a very serious issue.
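The reading above can be turned into a few summary numbers. The sketch below uses the four cell counts from the table; the metric names (accuracy, recall, specificity, precision) are the standard ones and are not used in the post itself, so treat them as an added illustration.

```python
# Cell counts taken from the matrix above (rows = actual, columns = predicted).
tp = 35  # actual Confirmed, predicted Confirmed
fn = 20  # actual Confirmed, predicted Not confirmed (missed cases)
fp = 25  # actual Not confirmed, predicted Confirmed (false alarms)
tn = 15  # actual Not confirmed, predicted Not confirmed

total = tp + fn + fp + tn

accuracy = (tp + tn) / total      # share of all 95 records predicted correctly
recall = tp / (tp + fn)           # share of the 55 diseased people that were caught
specificity = tn / (tn + fp)      # share of the 40 healthy people correctly cleared
precision = tp / (tp + fp)        # share of the 60 "Confirmed" predictions that were right

print(f"total       = {total}")            # 95
print(f"accuracy    = {accuracy:.3f}")     # 0.526
print(f"recall      = {recall:.3f}")       # 0.636
print(f"specificity = {specificity:.3f}")  # 0.375
print(f"precision   = {precision:.3f}")    # 0.583
```

An accuracy of about 53% with a specificity of 37.5% matches the interpretation above: the classifier is wrong about most of the people who do not have the disease.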

Reached 1000...more to go

Saturday, May 5, 2018

A few algorithms a data science professional should know

1. Perceptron algorithm
2. Hoeffding's inequality
3. Data preparation: normalisation, feature scaling, binary scaling, standardisation
4. Machine learning algorithms such as linear and logistic regression, principal component analysis, artificial neural networks, k-nearest neighbours, naive Bayes, k-means clustering, support vector machines, random forest
5. Validation algorithms

Perceptron Algorithm

1. Initialize the weights and threshold to small random numbers.
2. Present a vector x to the neuron inputs and calculate the output.
3. Update the weights.
4. Repeat steps 2 and 3 until:
   - the iteration error is less than a user-specified error threshold, or
   - a predetermined number of iterations has been completed.
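The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the learning rate, the step activation, the use of a bias term in place of the threshold, and the AND-gate training data are all assumptions made for the example.

```python
import random


def step(z):
    """Step activation: fire (1) when the weighted sum reaches the threshold."""
    return 1 if z >= 0 else 0


def predict(weights, bias, x):
    # The bias plays the role of the (negated) threshold from step 1.
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(samples, lr=0.1, max_epochs=100, max_errors=0, seed=0):
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Step 1: small random weights and threshold (bias).
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    bias = rng.uniform(-0.05, 0.05)
    # Step 4: stop after max_epochs passes, or earlier when the error is low enough.
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            # Step 2: present the vector and compute the output.
            output = predict(weights, bias, x)
            delta = target - output
            if delta != 0:
                errors += 1
                # Step 3: move the weights toward the correct answer.
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors <= max_errors:
            break
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_GATE)
predictions = [predict(weights, bias, x) for x, _ in AND_GATE]
print(predictions)
```

AND is linearly separable, so the loop stops once a full pass over the data produces no misclassifications. For data that is not linearly separable, only the iteration limit would end the training.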

Hoeffding’s inequality
A powerful technique, perhaps the most important inequality in learning theory, for bounding the probability that sums of bounded random variables are too large or too small.
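To make this concrete: for n independent random variables that each take values in [0, 1], one common form of the inequality says that the probability of the sample mean deviating from its expected value by at least t is at most 2·exp(-2·n·t²). The sketch below compares that bound with a simulation of coin flips; the function names, the choice of n and t, and the number of trials are assumptions picked for the example.

```python
import math
import random


def hoeffding_bound(n, t):
    """Upper bound on P(|sample mean - true mean| >= t) for n variables in [0, 1]."""
    return 2 * math.exp(-2 * n * t * t)


def empirical_deviation_rate(n, t, p=0.5, trials=2000, seed=0):
    """Fraction of simulated experiments whose sample mean is off by at least t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One experiment: n coin flips with success probability p.
        sample_mean = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        # Small tolerance so floating-point rounding does not drop boundary cases.
        if abs(sample_mean - p) >= t - 1e-12:
            hits += 1
    return hits / trials


n, t = 100, 0.1
bound = hoeffding_bound(n, t)
rate = empirical_deviation_rate(n, t)

print(f"Hoeffding bound : {bound:.4f}")
print(f"simulated rate  : {rate:.4f}")
```

The simulated rate should come out well below the bound. The bound is loose, but it holds for any bounded distribution without knowing anything else about it.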

Univariate Time series

We have a data set that varies with time, for instance the temperatures of a particular place over the past 20 years. What insight can we get?
1. Plot the entire data as a time series.
2. Check for any cyclic pattern.
3. Plot the mean temperature for every season (3-4 months grouped as a season).
4. Plot the variable (temperature in our case) for every month, with the mean temperature (over all 20 years) on the y axis and the months on the x axis.
5. If there is still no visible pattern, try moving-average methods (100 DMA, 200 DMA, etc.) or an ARIMA model.
6. Plot the % variation of the mean year over year.
7. Plot the % variation of the mean month over month.
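Some of the steps above can be sketched with the standard library alone. The data here is synthetic and made up purely for illustration (a sine-shaped seasonal cycle plus a small upward trend), the names are hypothetical, and the 12-month window stands in for the 100 or 200 DMA, which assume daily data. The resulting lists and dictionaries are what you would hand to a plotting tool.

```python
import math
from statistics import mean

YEARS = list(range(1998, 2018))  # 20 years of monthly readings


def synthetic_temp(year, month):
    """Made-up temperature: seasonal cycle plus a slow warming trend."""
    seasonal = 10 * math.sin(2 * math.pi * (month - 1) / 12)
    trend = 0.05 * (year - YEARS[0])
    return 20 + seasonal + trend


series = [(y, m, synthetic_temp(y, m)) for y in YEARS for m in range(1, 13)]
values = [temp for _, _, temp in series]

# Step 4: mean temperature per calendar month over all 20 years.
monthly_mean = {
    m: mean(temp for _, month, temp in series if month == m) for m in range(1, 13)
}


# Step 5: moving average to smooth out the seasonal cycle.
def moving_average(data, window):
    return [mean(data[i - window + 1 : i + 1]) for i in range(window - 1, len(data))]


smoothed = moving_average(values, 12)

# Step 6: % variation of the annual mean, year over year.
annual_mean = {y: mean(temp for year, _, temp in series if year == y) for y in YEARS}
yoy_pct = {
    y: 100 * (annual_mean[y] - annual_mean[y - 1]) / annual_mean[y - 1]
    for y in YEARS[1:]
}

warmest_month = max(monthly_mean, key=monthly_mean.get)
print("warmest month:", warmest_month)
print("first smoothed value:", round(smoothed[0], 3))
print("YoY % change, last year:", round(yoy_pct[YEARS[-1]], 3))
```

On this synthetic series the monthly means expose the seasonal cycle, while the 12-month moving average removes it and leaves only the trend.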

A few more things can be done. But this is weather data, and there is always a possibility of random variations that cannot be fully predicted.

Pass on your comments.