INSIGHT ON ANALYTICS: July 2019

Recently, I had the task of identifying repeat customers. For instance, customer A purchases a product on 17th May and comes back on 22nd May, 24th May and so on. Likewise, customer B purchases a product on 18th May and purchases similar products on later dates. I had data on 3000+ customers and a repeat-visit list of 21000 entries. I needed a day-wise list of repetitions, something like this. The expected outcome is "which date of first purchase yielded the maximum number of repeat customers". Later we could identify the revenue.

Date of first purchase / Date of repeat	18 May	19 May	20 May
17 May	3	4	7
18 May	12	6	5
19 May	10	4	2

I used Python to read both lists, then coded and computed the day-wise list. Out of curiosity, I also used the merge command in R and tabulated the result day-wise as a cross-check. To my shock, the two results differed by 25%. It spoiled my enthusiasm for doing something useful, since there should have been no such difference. I then compared, day by day, the tabulated values obtained from Python with those obtained from R.
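The day-wise tabulation described above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the code I actually ran; the sample customers, the dates, and the names `first_purchase`, `repeat_visits` and `daywise_table` are all made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data: customer id -> date of first purchase.
first_purchase = {
    "A": date(2019, 5, 17),
    "B": date(2019, 5, 18),
    "C": date(2019, 5, 17),
}

# Hypothetical repeat-visit list: (customer id, date of repeat visit).
repeat_visits = [
    ("A", date(2019, 5, 22)),
    ("A", date(2019, 5, 24)),
    ("B", date(2019, 5, 22)),
    ("C", date(2019, 5, 22)),
]


def daywise_table(first_purchase, repeat_visits):
    """Count repeat visits per (first-purchase date, repeat date) pair."""
    table = Counter()
    for customer, visit_date in repeat_visits:
        start = first_purchase.get(customer)
        # Skip unknown customers and visits that are not after the first purchase.
        if start is None or visit_date <= start:
            continue
        table[(start, visit_date)] += 1
    return table


table = daywise_table(first_purchase, repeat_visits)
for (start, visit), count in sorted(table.items()):
    print(start.isoformat(), visit.isoformat(), count)
```

Each key of `table` corresponds to one cell of the grid: the row is the date of first purchase and the column is the date of the repeat visit.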

This simple piece of work cost me a few hours, but the learning is forever, so I thought of sharing it.

Findings: There is a marginal difference in every row. This is because the machine / data source duplicated records while capturing the repeat customers. Cross-checking is always required, maybe with a different tool / strategy / approach. Left unnoticed, the duplicates would have resulted in wrong computations and projections, but the cross-check let me avoid them in full.
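A quick way to spot this kind of problem is to look for exact duplicate rows before counting anything. The sketch below is a minimal illustration with invented rows; the helper names `find_duplicates` and `deduplicate` are hypothetical, and it assumes that a repeated (customer, date) pair is a capture error rather than two genuine visits on the same day, which may not hold for every data source.

```python
from collections import Counter

# Hypothetical repeat-visit rows as (customer id, ISO date).
rows = [
    ("A", "2019-05-22"),
    ("A", "2019-05-22"),  # same row captured twice by the source
    ("A", "2019-05-24"),
    ("B", "2019-05-22"),
]


def find_duplicates(rows):
    """Return each row that occurs more than once, with its occurrence count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def deduplicate(rows):
    """Drop exact repeats while keeping the first-seen order."""
    return list(dict.fromkeys(rows))


duplicates = find_duplicates(rows)
clean_rows = deduplicate(rows)

print("duplicated rows:", duplicates)
print("raw rows:", len(rows), "rows after de-duplication:", len(clean_rows))
```

Comparing the raw and de-duplicated row counts gives an early warning before any day-wise table is built, whichever tool does the tabulation afterwards.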


Thursday, July 25, 2019

Problem solving using R and Python: a caution