Monday, May 7, 2018

Reading a confusion matrix... a misclassifier


How to read a confusion matrix
Suppose you have 95 records available for prediction. Each record has a target indicating whether the person has a particular disease or not.
The matrix displays as:

                 Confirmed   Not confirmed
Confirmed           35            20
Not confirmed       25            15

What do you understand?

Below, I have added the row and column totals:

                            Predicted values
                        Confirmed   Not confirmed   Total
Actual   Confirmed          35            20          55
values   Not confirmed      25            15          40
         Total              60            35          95
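In case you want to reproduce this in code, here is a minimal sketch using scikit-learn; the label arrays below are made-up stand-ins constructed to match the counts above, not real patient data:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the 95 records: 1 = Confirmed, 0 = Not confirmed.
# These arrays are illustrative stand-ins built to reproduce the counts above.
y_actual    = np.array([1] * 55 + [0] * 40)
y_predicted = np.array([1] * 35 + [0] * 20 + [1] * 25 + [0] * 15)

# Rows = actual values, columns = predicted values.
cm = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
print(cm)
# [[35 20]
#  [25 15]]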

Can you extract some information from this table?
The row totals (55 and 40) are the actual counts, the column totals (60 and 35) are the predicted counts, and the off-diagonal cells (20 and 25) are the errors.
Actual values are the ones that were tested physically, with the results available on hand. This means that, in our case, 55 people had the disease and 40 did not.
What did the algorithm do? It predicted that 60 people have the disease and 35 do not.
How to interpret this? Out of the 55 people who actually have the disease, the algorithm made a wrong prediction by classifying 20 of them as disease-free.
Secondly, out of the 40 people who do not have the disease, it predicted that 25 of them may have it, which is a very serious issue.
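To put numbers on this interpretation, here is a short sketch in plain Python that uses only the four cells above:

# Cells of the matrix (positive class = "Confirmed").
tp, fn = 35, 20   # actual Confirmed:     predicted Confirmed / Not confirmed
fp, tn = 25, 15   # actual Not confirmed: predicted Confirmed / Not confirmed

total = tp + fn + fp + tn                 # 95 records
accuracy  = (tp + tn) / total             # fraction classified correctly
recall    = tp / (tp + fn)                # of the 55 sick, how many were caught
precision = tp / (tp + fp)                # of the 60 flagged, how many were really sick

print(f"accuracy  = {accuracy:.2f}")      # 0.53
print(f"recall    = {recall:.2f}")        # 0.64
print(f"precision = {precision:.2f}")     # 0.58

The recall of 0.64 quantifies the 20 missed cases, and the precision of 0.58 quantifies the 25 false alarms.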

Reached 1000...more to go


Saturday, May 5, 2018

A few algorithms a data science professional should know

1. Perceptron algorithm
2. Hoeffding's inequality
3. Data preparation: normalisation, feature scaling, binary scaling, standardisation (a short sketch follows this list)
4. Machine learning algorithms such as linear and logistic regression, principal component analysis, artificial neural networks, k-nearest neighbours, naive Bayes, k-means clustering, support vector machines, and random forest
5. Validation algorithms
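For item 3, here is a minimal sketch of normalisation and standardisation with scikit-learn; the small matrix X is a made-up example, not data from this post:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Normalisation (min-max scaling): rescale each feature to [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardisation: rescale each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)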

Perceptron Algorithm

1. Initialize the weights and threshold to small random numbers.
2. Present a vector to the neuron inputs and calculate the output.
3. Update the weights.
4. Repeat steps 2 and 3 until:
   - the iteration error is less than a user-specified error threshold, or
   - a predetermined number of iterations have been completed.
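Here is a minimal sketch of these four steps in Python; the step activation, learning rate, and AND-gate data at the end are my illustrative choices, not prescribed by the description above:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100, error_threshold=0):
    """Steps 1-4 above: random init, predict, update, repeat."""
    rng = np.random.default_rng(0)
    # Step 1: small random weights; the last entry acts as the threshold (bias).
    w = rng.uniform(-0.5, 0.5, X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant bias input

    for _ in range(epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            # Step 2: present a vector and calculate the output (step activation).
            output = 1 if np.dot(w, xi) >= 0 else 0
            # Step 3: update the weights in proportion to the error.
            update = lr * (target - output)
            w += update * xi
            errors += int(update != 0)
        # Step 4: stop once the iteration error is small enough.
        if errors <= error_threshold:
            break
    return w

# Illustrative example: learn the AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))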

Hoeffding’s inequality
A powerful technique, perhaps the most important inequality in learning theory, for bounding the probability that sums of bounded random variables are too large or too small.
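For reference, the standard two-sided form: if $X_1, \ldots, X_n$ are independent random variables with each $X_i$ bounded in $[a_i, b_i]$, and $S_n = X_1 + \cdots + X_n$, then

$$
P\big(|S_n - \mathbb{E}[S_n]| \ge t\big) \;\le\; 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).
$$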

Univariate Time series

We have a data set which varies with time, for instance the temperatures of a particular place over the past 20 years. What insight can we get?
1. Plot the entire data as a time series.
2. Check for any cyclic pattern.
3. Plot the mean of the temperature for every season (3-4 months grouped as a season).
4. Plot the variable (temperature in our case) for every month: mean temperature (across all 20 years) on the y axis and months on the x axis.
5. If we still do not see any visible pattern, try moving-day-average methods (100 DMA, 200 DMA, etc.) or an ARIMA model.
6. Plot the % variation of the mean year over year.
7. Plot the % variation of the mean month over month.
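A minimal pandas sketch of steps 1, 3, 4, and 5 above, assuming a hypothetical file temps.csv with date and temp columns (the file name and column names are illustrative, not from this post):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily data: a 'date' column and a 'temp' column.
df = pd.read_csv("temps.csv", parse_dates=["date"], index_col="date")

# Step 1: plot the entire data as a time series.
df["temp"].plot(title="Daily temperature")
plt.show()

# Step 3: mean temperature for every season (calendar quarters as seasons).
df["temp"].groupby(df.index.quarter).mean().plot(kind="bar", title="Seasonal mean")
plt.show()

# Step 4: mean temperature for every calendar month, averaged over all years.
df["temp"].groupby(df.index.month).mean().plot(kind="bar", title="Monthly mean")
plt.show()

# Step 5: moving-day averages (100 DMA, 200 DMA) to smooth out the noise.
df["temp"].plot(alpha=0.3, label="daily")
df["temp"].rolling(100).mean().plot(label="100 DMA")
df["temp"].rolling(200).mean().plot(label="200 DMA")
plt.legend()
plt.show()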

A few more things can be done. But this is weather data, and there is always a possibility of random variations that cannot be fully predicted.

Pass on your comments.