Wednesday, March 11, 2020

Happiness at work


What is happiness?
While there are a number of ways to express and define it, one particular form is to challenge your brain, get the work done, and prove to yourself that you are good at your work.
Well, I was stuck in the middle of a Python program, hunting for the right logic.
Suppose we have website visit data to analyse, as given below.


(Columns from 08-12 onward are the repeated visit dates; all dates are December 2019.)

First visit date | Customers | 08-12 | 09-12 | 10-12 | 11-12 | 16-12 | 17-12 | 19-12 | 20-12 | 21-12
08-12-2019       |    15     |   2   |  12   |   3   |   4   |   7   |   8   |   1   |   1   |   2
09-12-2019       |    25     |   4   |  11   |   2   |   3   |   6   |   7   |   0   |   0   |   1
15-12-2019       |    30     |   5   |  13   |   4   |   5   |   8   |   9   |   2   |   2   |   3
16-12-2019       |    42     |   7   |   8   |   1   |   1   |   2   |   3   |   4   |   7   |   3
17-12-2019       |    54     |  21   |   7   |   0   |   0   |   1   |   2   |   3   |   6   |   2
19-12-2019       |    22     |   6   |   9   |   2   |   2   |   3   |   4   |   5   |   8   |   4
20-12-2019       |    12     |   8   |   8   |   1   |   3   |   8   |   1   |   1   |   2   |   4

The business problem is: how many times do the customers acquired each day revisit within the first week of their first visit?
We have thousands of records spanning months, so the missing days certainly cannot be added manually.
Now look at the data: the first visit dates are not continuous, and neither are the revisit dates.
What is needed? Say on 8th Dec we acquired 15 customers; within their first week they revisited 2 + 12 + 3 + 4 = 21 times. Likewise, we need to identify and aggregate this for all the customers.
If the rows ran from 8th Dec to 20th Dec without a break, and similarly for the columns, the logic would be simple: the sum of 7 numbers.
Here it is different. The first problem to address is to turn the data frame into a square matrix with continuous dates. Finding the logic is the most important thing.
1. Get the minimum first-visit date and the maximum revisit date.
2. Build a pd.Series covering that full date range.
3. Convert it to a data frame, then left join it with the current data frame.
4. Delete the additional column that is created automatically and replace NaN with 0.
5. The rows now have continuous dates; what about the columns? Transpose and repeat the process.
6. Transpose again to restore the original orientation, and the matrix (df) is complete (see the sketch after this list).
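
Here is a minimal pandas sketch of those six steps, with the data typed in from the table above. Note that in place of the manual join/transpose route, pandas' reindex does the same row-and-column completion in a single call; the variable names are my own.

    import pandas as pd

    # Rebuild the table above; dates are DD-MM-YYYY, hence dayfirst=True.
    # 'Customers' is kept aside so the matrix holds only revisit counts.
    first = pd.to_datetime(
        ["08-12-2019", "09-12-2019", "15-12-2019", "16-12-2019",
         "17-12-2019", "19-12-2019", "20-12-2019"], dayfirst=True)
    revisit = pd.to_datetime(
        ["08-12-2019", "09-12-2019", "10-12-2019", "11-12-2019",
         "16-12-2019", "17-12-2019", "19-12-2019", "20-12-2019",
         "21-12-2019"], dayfirst=True)
    counts = [[2, 12, 3, 4, 7, 8, 1, 1, 2],
              [4, 11, 2, 3, 6, 7, 0, 0, 1],
              [5, 13, 4, 5, 8, 9, 2, 2, 3],
              [7, 8, 1, 1, 2, 3, 4, 7, 3],
              [21, 7, 0, 0, 1, 2, 3, 6, 2],
              [6, 9, 2, 2, 3, 4, 5, 8, 4],
              [8, 8, 1, 3, 8, 1, 1, 2, 4]]
    df = pd.DataFrame(counts, index=first, columns=revisit)

    # Steps 1-6 in one shot: reindex both axes onto a continuous daily
    # range and fill the newly created gaps with 0.
    full = pd.date_range(df.index.min(), df.columns.max(), freq="D")
    square = df.reindex(index=full, columns=full, fill_value=0)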
Wow! We got the result. It cost me good hours to find the logic, but patting myself on the back for the work done, that is happiness.
Now, to compute the problem's solution.
Logic: for the 1st row (first visit 08-12), sum the seven revisit columns 08-12 through 14-12;
for the 2nd row, sum 09-12 through 15-12, and so on, sliding the 7-column window one step to the right for each row.
Now we get the catch: two for loops and it is fixed; the sketch below does it with a single sliding window per row. Enjoy coding and analytical solutions!
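
Continuing the sketch above: because the square matrix has identical row and column date ranges, row i's first-week revisits are simply the seven columns starting at column i.

    # For customers acquired on day i, the first-week revisits are the
    # seven columns starting at column i (day d through day d+6).
    first_week = {}
    for i, day in enumerate(square.index):
        first_week[day] = int(square.iloc[i, i:i + 7].sum())

    result = pd.Series(first_week, name="first_week_revisits")
    print(result.loc["2019-12-08"])   # 21, matching 2 + 12 + 3 + 4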


Thursday, July 25, 2019

Problem solving using R and Python: a caution



Recently, I had the task of identifying repeat customers. For instance, customer A purchases a product on 17th May and comes back on 22nd May, 24th May, and so on. Likewise, customer B purchases a product on 18th May and purchases similar products on later dates. I had data on 3,000+ customers and a repeated-visit list of 21,000 records. I needed a day-wise tabulation of the repetition, something like the table below. The expected outcome was to find which date of purchase yielded the maximum number of repeat customers; later we could identify the revenue.
Date of first purchase \ Repeat date | 18 May | 19 May | 20 May
17 May                               |    3   |    4   |    7
18 May                               |   12   |    6   |    5
19 May                               |   10   |    4   |    2

I used Python to read both lists, coded the logic, and computed the day-wise tabulation. Out of curiosity, I used the merge command in R to tabulate day-wise and cross-check. To my shock, there was a difference of 25%. It spoiled my entire enthusiasm for doing something useful, since a substantial difference like that simply should not exist. So I compared the day-wise tabulated values from Python against the values obtained from R.
This simple piece of work cost me a few hours, but the learning lasts forever, which is why I thought of sharing it.
Findings: there was a marginal difference in every row. This was because the machine/data source had duplicated records while capturing the repeated customers. Cross-checking is always required, ideally with a different tool, strategy, or approach. The duplicates could have led to wrong computations and projections, but I avoided them in full.
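
The check itself is cheap compared with the hours it cost. A minimal pandas sketch, with placeholder file and column names, of how the duplicates could be caught before tabulating:

    import pandas as pd

    # Placeholder file name for the 21,000-row repeated-visit list.
    repeats = pd.read_csv("repeat_visits.csv")

    # Count the rows the capture process duplicated, then drop them so
    # the day-wise tabulation runs on clean data.
    n_dupes = repeats.duplicated().sum()
    print(f"duplicated rows: {n_dupes} of {len(repeats)}")
    repeats = repeats.drop_duplicates()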

Tuesday, May 21, 2019

Is 55 lakh data difficult to analyse?

Firstly, how can you get such a big volume of data? What kind of file? What is the source?

Any industry associated with sensor data, e-commerce sites, online portals, or telecom has data at this scale. It cannot be stored in the normal CSV format or in Excel files.
It may sit on a server or, if in files, in JSON format; the files can run into gigabytes.

Coming to the point of 5.5 million records (in the Indian context, more than half a crore), the files can be in the range of 3.5 GB in JSON format.


How could one read those files and get insights?

While there are several tools available, open-source tools such as R and Python do the magic.

Python has several conveniences; reading the file can take just a couple of commands. One needs to be more cautious, though, when 'date-time' is involved as a variable, which can often put us in trouble.
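
A hedged sketch of that two-command read, assuming a line-delimited JSON file; 'records.json' and the column name are placeholders:

    import pandas as pd

    # Read a line-delimited JSON file (drop lines=True for one big array).
    df = pd.read_json("records.json", lines=True)

    # Parse the timestamp explicitly rather than trusting the automatic
    # conversion; errors="coerce" marks unparseable values as NaT.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")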

Getting insights: 
Before starting the analysis, what is important is to understand what type conversions happened on load. There is every chance that the same type of data is not understood by the package: an 'int' can be read as a 'char'. And since the source is JSON, chances are some variables get stored in 'dict' format.
The major work is to clean up the data and make it workable.
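
A quick way to see what conversions happened, again with assumed column names:

    # See what dtype each variable actually landed as.
    print(df.dtypes)

    # An 'int' that arrived as text can be coerced back to numeric.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

    # A variable stored as dicts can be flattened into its own columns.
    details = pd.json_normalize(df["details"])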


'groupby' can be used in such cases to draw conclusions and form tables.
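
For example, continuing the sketch above (column names assumed), a one-line groupby gives a day-wise table:

    # Day-wise record counts, grouped on the date part of the timestamp.
    daily = df.groupby(df["event_time"].dt.date).size()
    print(daily.head())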

Once things are set in order, it is the usual data frame, and the analysis part is just a piece of cake.
Happy Analysis!

Tuesday, April 16, 2019

How Excel can put you in trouble...

I had a large data set; its head looked like the one given below.
2.33546E+11 | ee | 21-04-2017 13:04 | ACTIVE | Pop  | 4 | 2017
2.33246E+11 | gg | 21-04-2017 15:43 | ACTIVE | SMS  | 4 | 2017
2.33245E+11 | hh | 22-04-2017 21:52 | ACTIVE | Intg | 4 | 2017
2.33244E+11 | jj | 22-04-2017 22:09 | ACTIVE | Intg | 4 | 2017

My file did not have headers. I had 10 such files and needed to perform analysis and get some insights.

The very first step I took was to insert a row and give the header names:
mobile_num  | keyword | Date and time    | status_name | mode_name | month | year
2.33546E+11 | ee      | 21-04-2017 13:04 | ACTIVE      | Pop       | 4     | 2017

I saved the file in the CSV format and used R to read it. It gave the summary; everything went fine.
Then, when I was about to compare the mobile_num column of one file with another, I got unexpected output that put me in a lot of trouble.
What went wrong?
After several attempts at checking my code and testing my logic, I was still not convinced that my R program had malfunctioned. I finally decided to look at the data R had actually read: the mobile numbers had come in as 6 significant digits followed by zeros, exactly as the exponential value shows. Truly unimaginable to the core! How it happened, I don't know, but I have figured out what should have been done to avoid it:

1. While saving, format a column holding an exponential/12-digit number as a number with no decimal places.
2. Insert a row above.
3. Paste the header names.
4. Save the file and reopen it for confirmation (a reading-side precaution is sketched below).
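
The same trap exists on the reading side in any tool. A minimal pandas sketch, with the file name as a placeholder and the column name taken from the headers above, that keeps the 12-digit numbers as text:

    import pandas as pd

    # Read mobile_num as a string so nothing rounds it through a float.
    df = pd.read_csv("visits.csv", dtype={"mobile_num": str})
    print(df["mobile_num"].head())

R's read.csv accepts a colClasses argument for the same purpose.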

You could save at least your own time if you know the above.