Recently, I had the task of identifying repeat customers.
For instance, customer A purchases a product on 17th May and later
comes back on 22nd May, 24th May and so on. Likewise, customer
B purchases a product on 18th May and purchases similar products
on later dates. I had data on 3000+ customers and a repeat-visit
list of 21000 entries. I needed a day-wise list of repetitions,
something like this. The expected outcome is "which date of first
purchase yielded the maximum number of repeat customers". Later, we
could identify the revenue.
Date of first purchase / Repeat | 18 May | 19 May | 20 May
17 May                          | 3      | 4      | 7
18 May                          | 12     | 6      | 5
19 May                          | 10     | 4      | 2
I used Python to read both lists and wrote code to compute the
day-wise list. Out of curiosity, I also used the merge command in R
and tabulated the day-wise counts to cross-check. To my shock, there
was a difference of 25%. It spoiled my enthusiasm about having done
something useful, since there was a substantial difference where
there should have been none.
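The day-wise tabulation can be sketched roughly as below. This is only a minimal illustration using the Python standard library, not my exact code; the sample data, the variable names and the `tabulate_daywise` helper are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical sample data; the real lists held 3000+ customers
# and 21000 repeat visits.
first_purchase = {"A": "17 May", "B": "18 May", "C": "17 May"}  # customer -> first purchase date
repeat_visits = [  # (customer, repeat date), one entry per repeat visit
    ("A", "22 May"), ("A", "24 May"),
    ("B", "22 May"),
    ("C", "22 May"),
]

def tabulate_daywise(first_purchase, repeat_visits):
    """Count repeat visits per (first purchase date, repeat date) pair."""
    counts = Counter()
    for customer, repeat_date in repeat_visits:
        first_date = first_purchase.get(customer)
        if first_date is None:
            continue  # visit by a customer missing from the first-purchase list
        counts[(first_date, repeat_date)] += 1
    return counts

daywise = tabulate_daywise(first_purchase, repeat_visits)
for (first_date, repeat_date), n in sorted(daywise.items()):
    print(first_date, repeat_date, n)
```

Each key of `daywise` corresponds to one cell of the table: the row is the first purchase date and the column is the repeat date.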
I then compared the day-wise tabulated values obtained from Python
with those obtained from R.
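A cell-by-cell comparison of two such tabulations could look like the sketch below. It assumes both tables have been loaded as dictionaries keyed by (first purchase date, repeat date); the `compare_tables` helper and the counts are hypothetical and are not my actual figures.

```python
def compare_tables(table_a, table_b):
    """Return {cell: (count_a, count_b)} for every cell where the tables disagree."""
    diffs = {}
    for cell in sorted(set(table_a) | set(table_b)):
        a = table_a.get(cell, 0)  # a missing cell counts as zero
        b = table_b.get(cell, 0)
        if a != b:
            diffs[cell] = (a, b)
    return diffs

# Hypothetical counts keyed by (first purchase date, repeat date)
from_python = {("17 May", "18 May"): 3, ("17 May", "19 May"): 4}
from_r = {("17 May", "18 May"): 4, ("17 May", "19 May"): 4}

mismatches = compare_tables(from_python, from_r)
for cell, (a, b) in mismatches.items():
    print(cell, "Python:", a, "R:", b)
```

Listing only the cells that disagree makes it easier to see whether the difference is concentrated in a few dates or spread across every row.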
This simple piece of work cost me a few hours, but the learning is
forever, so I thought of sharing it.
Findings: There is a marginal difference in every row.
This is because the machine/data source duplicated records while
capturing the repeat customers. Cross-checking is always required,
maybe with a different tool, strategy or approach. The duplicates
could have resulted in wrong computations and projections, but I
avoided them completely.
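A quick duplicate check before tabulating might look like the sketch below. It assumes a duplicate is an identical (customer, repeat date) record, so a customer who genuinely visits twice on the same day would also be collapsed; the helper names and the sample visits are hypothetical.

```python
from collections import Counter

def find_duplicates(repeat_visits):
    """Return {record: occurrences} for records that appear more than once."""
    counts = Counter(repeat_visits)
    return {visit: n for visit, n in counts.items() if n > 1}

def drop_duplicates(repeat_visits):
    """Keep the first occurrence of each record, preserving the original order."""
    return list(dict.fromkeys(repeat_visits))

# Hypothetical repeat-visit list with one duplicated record
visits = [("A", "22 May"), ("A", "22 May"), ("B", "22 May")]

duplicates = find_duplicates(visits)
clean_visits = drop_duplicates(visits)
print(len(visits), "records,", len(clean_visits), "after removing duplicates")
```

Comparing the record count before and after removing duplicates gives an early warning that the source data needs a closer look.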