Thursday, July 25, 2019

Problem solving using R and Python - a caution



Recently, I had the task of identifying repeat customers. For instance, customer A purchases a product on 17th May and comes back on 22nd May, 24th May and so on. Likewise, customer B purchases a product on 18th May and purchases similar products on later dates. I had data on 3000+ customers and a repeat-visit list of 21,000 records. I needed to compute a day-wise list of repetition, something like the table below. The expected outcome is "which date of first purchase yielded the maximum repeat customers"; later we could identify the revenue.
Date of first purchase / Repeat    18 May    19 May    20 May
17 May                                  3         4         7
18 May                                 12         6         5
19 May                                 10         4         2

I used Python to read both the lists, coded and computed the day-wise list. Out of curiosity, I used the merge command and tabulated day-wise in R to cross-check. To my shock there was a difference of 25%. It spoiled my entire enthusiasm for doing something useful, as there was a substantial difference that should not have been there. I compared the day-wise tabulated values obtained from Python with the values obtained from R.
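A rough sketch of the computation in Python/pandas (the file and column names are only placeholders for illustration):

import pandas as pd

# Hypothetical files: one row per first purchase, one row per repeat visit
first = pd.read_csv("first_purchases.csv", parse_dates=["first_purchase_date"])
repeats = pd.read_csv("repeat_visits.csv", parse_dates=["repeat_date"])

# Attach the first-purchase date to every repeat visit of the same customer
merged = repeats.merge(first, on="customer_id", how="inner")

# Day-wise table: rows = date of first purchase, columns = date of repeat visit
day_wise = pd.crosstab(merged["first_purchase_date"].dt.date,
                       merged["repeat_date"].dt.date)
print(day_wise)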
This simple piece of work cost me a few hours, but the learning is forever, which I thought of sharing.
Findings: There is a marginal difference in every row. This is because the machine/data source duplicated some records while capturing the repeat customers. Cross-checking is always required, perhaps with a different tool/strategy/approach. The duplication could have resulted in wrong computations and projections, but I avoided them entirely.
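One way to spot such duplication before tabulating, continuing the sketch above:

# Count exact duplicate rows in the repeat-visit list
print(repeats.duplicated().sum(), "duplicated rows")

# Keep one record per customer per repeat date before merging
repeats = repeats.drop_duplicates(subset=["customer_id", "repeat_date"])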

Tuesday, May 21, 2019

Is 55 lakh records of data difficult to analyse?

Firstly, how does one get such a big volume of data? What kind of file is it? What is the source?

Any industry associated with sensor data, e-commerce sites, online portals or telecom has such huge data. It can't be stored in the usual csv format or in Excel files; it typically sits on a server, if not in files in the json format. The volume of the files could run into GBs.

Coming to the point of 5.5 million records [in the Indian context, more than half a crore], the files could be in the range of 3.5 GB in the json format.


How could one read those files and get the insights?

While there are several tools available, open source tools such as R and Python do the magic.

Python has several features; reading the file takes just a couple of commands. One needs to be cautious, though, as the data involves 'Date-time' as a variable, which can often put us in trouble.
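For instance, a minimal sketch with pandas, assuming a line-delimited json file named records.json with a field called event_time (both names are placeholders):

import pandas as pd

# Read the file in chunks so a multi-GB json does not exhaust memory
chunks = pd.read_json("records.json", lines=True, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)

# Date-time is the usual trouble spot: parse it explicitly and inspect failures
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
print(df["event_time"].isna().sum(), "rows failed to parse")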

Getting insights:
Before starting the analysis, what is important is to understand what type conversions have happened while reading. There are possibilities that the same type of data is not understood by the package:
an 'int' can end up being read as 'char'.
Since the source is in the json format, chances are that some variables get stored in 'dict' format.
The major work will be to clean up the data and make it workable.
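Continuing with the dataframe read above, a quick way to see what the package actually understood and to flatten any dict-valued column (the column names are assumptions):

# Check how each column was interpreted: an 'int' read as 'object' is a warning sign
print(df.dtypes)

# A dict-valued column can be expanded into ordinary columns
details = pd.json_normalize(df["details"])          # 'details' is a placeholder name
df = pd.concat([df.drop(columns=["details"]), details], axis=1)

# Force numeric columns back to numbers where needed
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")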


'Groupby' can be used in such cases to draw conclusions and form tables.
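For example, again with assumed column names:

# Records and revenue per day
summary = df.groupby(df["event_time"].dt.date).agg(
    records=("status", "size"),
    revenue=("amount", "sum"),
)
print(summary.head())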

Once things are set in order, it is the usual dataframe work, and the analysis part is just a piece of cake.
Happy Analysis !

Tuesday, April 16, 2019

How Excel can put you in trouble...

I have a large data set; the head of it looks like the one given below:
2.33546E+11   ee   21-04-2017 13:04   ACTIVE   Pop    4   2017
2.33246E+11   gg   21-04-2017 15:43   ACTIVE   SMS    4   2017
2.33245E+11   hh   22-04-2017 21:52   ACTIVE   Intg   4   2017
2.33244E+11   jj   22-04-2017 22:09   ACTIVE   Intg   4   2017

My file did not have headers. I have 10 such files, and I need to perform analysis and get some insights.

The very first step I did was to insert a row and give names to the headers:
mobile_num    keyword   Date and time      status_name   mode_name   month   year
2.33546E+11   ee        21-04-2017 13:04   ACTIVE        Pop         4       2017

I saved the file in the csv format and used R to read it. It gave the summary, and everything went fine.
When I was about to compare the mobile_num column of one file with another, it gave unexpected output which put me in a lot of trouble.
What went wrong ?
After several attempts to check my code and test my logic, I was not convinced that my R program had malfunctioned. Finally, I decided to look at the data R had actually read. The mobile numbers were read as 6 significant digits followed by zeros, exactly as the exponential value suggests. Truly unimaginable to the core!! How did it happen? I don't know. But I have figured out what should have been done to avoid it.

1. While saving a column with an exponential number / a 12-digit number, format it as a number with no decimal places.
2. Insert a row above.
3. Paste the header names.
4. Save the file and reopen it for confirmation.

You could save at least your own time if you know the above. On the reading side too, the column can be forced to stay as text, as sketched below.
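A minimal sketch in Python (the file name is assumed):

import pandas as pd

# Read the 12-digit mobile numbers as text so they never turn into 2.33546E+11
df = pd.read_csv("file1.csv", dtype={"mobile_num": str})
print(df["mobile_num"].head())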

Thursday, April 11, 2019

Few of the problems we face...

While we know the syntax, the algorithms and the context, and have the ability to analyse the given data, which may also be in a weird format, there are always challenges which consume tons of our time by throwing errors.
The errors and challenges I faced, quite a few:
1. A csv file which contained a Date column and quite a few other variables. While R was reading this file, it interpreted the dates as characters, something like '44326'. I struggled with all my known skills/strategies and was not able to succeed. Finally, when I went through the data, I found the string 'Total' at the end of the date column, which was almost impossible for me to spot.
2. The second challenge was date conversion. Whether the package is R or Python, there are a lot of chances that the date is not interpreted properly and yet the program runs. While cross-checking the data I found the error; by the time I found it, hours had gone! (A small check for this is sketched after this list.)
3. Use of reserved words / standard words of the package, for example 'count'.
4. Spaces between the variable names in the header.
5. Blank lines in the first row.
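For the date problem in point 2, a small check like the following (a Python sketch with assumed file and column names) surfaces the bad rows immediately instead of hours later:

import pandas as pd

df = pd.read_csv("data.csv")                    # hypothetical file name
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y", errors="coerce")

# Rows such as a stray 'Total' line become NaT and can be listed and dropped
print(df[df["date"].isna()])
df = df.dropna(subset=["date"])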

Lots more... I will add them as and when I come across them.

Thursday, March 28, 2019

Capabilities of Python


What special things can be done using Python?
Recently I happened to analyse a set of csv files, at least 100 of them.
Each file contains several fields (variables); among them, the most important are the purchaser id and the purchase date.
The task is:
List all the purchaser ids and find them in the files of the successive dates, to identify whether the purchaser has visited again.
If so, list the dates and the value of all purchases the purchaser has made.
The illustration:
Day 1 – 01-12-2018
Ids          value of purchase
123         10
234         15
456         25
Get the summary: 3 customers, Rs 50/-

Find these ids in the next day's file, 02-12-2018.
If found, get the summary: only one purchaser id is present, with a value of Rs 20.
Extend this for the next 99 days. This is cycle 1.
Pick up the list of purchasers of day 2 and continue the process for the next 98 days.

Wow, what a wonderful tool this is!
It counts the number of files available, runs the iterations (nearly 5,000 pairwise day comparisons: 99 + 98 + ... + 1), tabulates and summarises the results as output in Excel format, and the capability is still more.
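A rough outline of how such a run can look in Python, assuming files named day_001.csv ... day_100.csv with columns purchaser_id and value (all names are placeholders):

import glob
import pandas as pd

files = sorted(glob.glob("day_*.csv"))
days = [pd.read_csv(f) for f in files]

rows = []
for start, base_day in enumerate(days):                      # one cycle per starting day
    base_ids = set(base_day["purchaser_id"])
    for offset, later_day in enumerate(days[start + 1:], start=1):
        repeats = later_day[later_day["purchaser_id"].isin(base_ids)]
        rows.append({"base_day": files[start],
                     "later_day": files[start + offset],
                     "repeat_customers": repeats["purchaser_id"].nunique(),
                     "repeat_value": repeats["value"].sum()})

summary = pd.DataFrame(rows)
summary.to_excel("repeat_summary.xlsx", index=False)          # needs openpyxl installed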




Thursday, February 7, 2019

Market Basket Analysis

Few terminologies:


A transaction is a set of items (an itemset).

Confidence: the measure of uncertainty or trustworthiness associated with each discovered pattern.

Support: the measure of how often the collection of items in an association occurs together, as a percentage of all transactions.

Frequent itemset: if an itemset satisfies minimum support, then it is a frequent itemset.

Strong association rules: rules that satisfy both a minimum support threshold and a minimum confidence threshold.

In association rule mining, we first find all the frequent itemsets and then generate strong association rules from them.



The Apriori algorithm is the most established algorithm for frequent itemset mining.

The basic principle of Apriori is: "Any subset of a frequent itemset must be frequent."

We use these frequent itemsets to generate association rules.



Association rule mining

Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases

Understand customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”

Applications: Basket data analysis, cross‐marketing, catalog design, loss‐leader analysis, web log analysis, fraud detection  (supervisor‐>examiner)

Rule form

Antecedent → Consequent [support, confidence]

(support and confidence are user defined measures of interestingness)

Let the rule discovered be {Jamun,...} → {Potato Chips}


Potato chips as consequent => Can be used to determine what should be done to boost its sales

Jamun in the antecedent => Can be used to see which products would be affected if the store discontinues selling Jamun

Jamun in antecedent and Potato chips in the consequent =>  Can be used to see what products should be sold with Jamun to promote sale of Potato Chips


Find all itemsets that have high support; these are known as frequent itemsets. Then generate association rules from the frequent itemsets.
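As an illustration, the mlxtend library provides ready-made apriori and association_rules functions; a minimal sketch on a one-hot encoded basket table (the data here is made up, and exact arguments may vary slightly between mlxtend versions):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction, each column an item (True = present)
baskets = pd.DataFrame({
    "Shirts":   [1, 0, 1, 1, 1],
    "Trousers": [1, 1, 0, 0, 1],
    "Ties":     [1, 0, 1, 1, 0],
}).astype(bool)

frequent = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])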


Let us see an example: finding the support, confidence and lift of (i) shirts and ties, (ii) trousers and ties.
Transaction Id   Shirts   Trousers   Ties
001              1        1          1
002              0        1          0
003              1        0          1
004              1        0          1
005              1        1          0

For our data there are 3 transactions with both shirts and ties (shirts ∩ ties) out of 5 transactions in total.
Support = 3/5 = 0.6 or 60%

Confidence for the association is calculated using the following formula:
Confidence(shirts → ties) = P(ties | shirts) = support(shirts ∩ ties) / support(shirts)
In our example, there are 3 transactions with both shirts and ties together out of 4 transactions containing shirts. The calculation of confidence for our dataset is:
Confidence = 3/4 = 0.75 or 75%

A third useful metric for association analysis is lift; it is defined as:
Lift = confidence / expected confidence = P(ties | shirts) / P(ties)
Expected confidence in the above formula is the presence of ties in the overall dataset, i.e. there are 3 transactions with ties out of 5.
Lift = (3/4) / (3/5) = 15/12 = 1.25 or 125%
The value for lift, 125%, shows that purchases of ties improve when customers buy shirts. The point to note here is that if a customer buys a shirt, his chance of buying ties goes up, i.e. the value of lift is above 100%.

Similarly, for trousers and ties:
Support = 1/5
Confidence = 1/3
Lift = (1/3) / (3/5) = 5/9 or 55.6%
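The same numbers can be reproduced directly from the table, for example:

import pandas as pd

data = pd.DataFrame({
    "Shirts":   [1, 0, 1, 1, 1],
    "Trousers": [1, 1, 0, 0, 1],
    "Ties":     [1, 0, 1, 1, 0],
})

def metrics(antecedent, consequent):
    n = len(data)
    both = ((data[antecedent] == 1) & (data[consequent] == 1)).sum()
    support = both / n                                   # P(antecedent and consequent)
    confidence = both / (data[antecedent] == 1).sum()    # P(consequent | antecedent)
    lift = confidence / ((data[consequent] == 1).sum() / n)
    return support, confidence, lift

print(metrics("Shirts", "Ties"))       # (0.6, 0.75, 1.25)
print(metrics("Trousers", "Ties"))     # (0.2, 0.33..., 0.55...)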




Customer Lifetime Value


Just as we use Net Present Value (NPV) to evaluate investments and companies, we use CLV to evaluate customer relationships. CLV is the expected NPV of the cash flows from a customer relationship: it is defined as the discounted sum of all future customer revenue streams minus product, servicing and re-marketing costs.
Let us assume we are analysing the customer lifetime value of a company which offers services of a kind. (The sample figures and the CLV formula were presented here as an image.)
This is a simple model of calculation. The formula differs when the service is offered before the contribution from the customer [e.g. credit cards].
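As an illustration only (the post's own figures and formula were in the image above), one common simple formulation takes a constant yearly contribution, a retention rate and a discount rate, and sums the discounted expected contributions:

def clv(margin, retention, discount, years, acquisition_cost=0.0):
    # Discounted sum of expected yearly margins from one customer (simple model)
    value = sum(margin * retention**t / (1 + discount)**t
                for t in range(1, years + 1))
    return value - acquisition_cost

# Illustrative numbers only: Rs 2,000 yearly margin, 80% retention, 10% discount, 5 years
print(round(clv(2000, 0.80, 0.10, 5), 2))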


Segmentation of data -K- Means



  1. Demographic segmentation
In business, demographic segmentation is when an organization uses data about the demographic characteristics of its customers to better target and enhance its marketing efforts.
By segmenting a market by demographic variables such as the age of the customer, gender, income, education, religion and family life cycle, business professionals are able to create groups of customers that display similar wants and needs. Examples:

  Age
  Gender
  Income

  Education
  Family life cycle

Other segmentations:

  1. Behavioral segmentation
  2. Psychographic segmentation
  3. Geographic segmentation



Illustration of the algorithm: choose two random centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing. (The intermediate steps, originally labelled Step 3 and Step 4, were shown as figures.)

What we expect and what we actually get can be different.



This can be accomplished by making the WCSS (within-cluster sum of squares) as small as possible. This metric also helps determine the optimum number of clusters.
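A minimal sketch of this 'elbow' check with scikit-learn, on made-up two-dimensional data:

import numpy as np
from sklearn.cluster import KMeans

# Made-up demographic-style data: three loose groups of points in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (0, 5, 10)])

# WCSS (inertia_) for k = 1..10; the bend ('elbow') of this curve suggests a good k
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
print(wcss)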