Thursday, July 25, 2019

Problem solving using R and Python: a caution



Recently, I had the task of identifying repeat customers. For instance, customer A purchases a product on 17th May and comes back on 22nd May, 24th May and so on. Likewise, customer B purchases a product on 18th May and purchases similar products on later dates. I had data on 3000+ customers and a list of 21000 repeat visits. I needed a day-wise list of repeat visits, something like this. The expected outcome is "which date of first purchase yielded the maximum number of repeat customers". Later we could identify the revenue.
Date of first purchase / Repeat    18 May    19 May    20 May
17 May                                  3         4         7
18 May                                 12         6         5
19 May                                 10         4         2
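A day-wise table of this kind could be computed roughly as follows. This is only a minimal sketch using pandas, not the code I actually ran: the column names (`customer`, `first_purchase`, `repeat_date`) and the handful of rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; names and values are assumptions for illustration.
# Dates are kept as plain strings because the year is not relevant here.
first = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "first_purchase": ["17 May", "18 May", "17 May"],
})
repeats = pd.DataFrame({
    "customer": ["A", "A", "B", "C"],
    "repeat_date": ["22 May", "24 May", "22 May", "22 May"],
})

# Attach each repeat visit to the customer's date of first purchase.
merged = first.merge(repeats, on="customer", how="inner")

# Rows: date of first purchase. Columns: date of repeat visit.
table = pd.crosstab(merged["first_purchase"], merged["repeat_date"])
print(table)
```

Each cell counts the repeat visits on a given day by customers who first purchased on the row's date. The row with the largest total answers the "which date of first purchase" question.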

I used Python to read both lists and wrote code to compute the day-wise list. Out of curiosity, I then used the merge command in R to tabulate the same day-wise counts as a cross-check. To my shock, there was a difference of 25%. It spoiled my enthusiasm for doing something useful, since there should not have been any substantial difference. So I compared the day-wise values obtained from Python with those obtained from R.
This simple piece of work cost me a few hours, but the learning is forever, so I thought of sharing it.
Findings: There is a marginal difference in every row. This is because the machine / data source duplicated records while capturing the repeat customers. Cross-checking is always required, maybe with a different tool / strategy / approach. The duplicates could have resulted in wrong computations and projections, but I was able to avoid them entirely.
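To illustrate how duplicated records can shift a merged tabulation, here is a small pandas sketch. It is an assumption-laden toy example, not my original code or data: the column names are hypothetical, and one repeat visit is deliberately entered twice to mimic the capture problem.

```python
import pandas as pd

# Hypothetical data; customer A's 22 May visit is captured twice on purpose.
first = pd.DataFrame({
    "customer": ["A", "B"],
    "first_purchase": ["17 May", "18 May"],
})
repeats = pd.DataFrame({
    "customer": ["A", "A", "A", "B"],
    "repeat_date": ["22 May", "22 May", "24 May", "22 May"],
})

# Count fully identical rows in the repeat list.
n_dupes = int(repeats.duplicated().sum())

# Tabulate once on the raw list and once after dropping duplicates.
raw = first.merge(repeats, on="customer")
clean = first.merge(repeats.drop_duplicates(), on="customer")
raw_table = pd.crosstab(raw["first_purchase"], raw["repeat_date"])
clean_table = pd.crosstab(clean["first_purchase"], clean["repeat_date"])

# Non-zero cells show where duplicates inflated the counts.
diff = raw_table - clean_table
print(n_dupes)
print(diff)
```

Whether an identical row is a capture error or a genuine second purchase on the same day has to be decided from the data itself, so checking `duplicated()` before merging is a cross-check rather than an automatic fix.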
