Tuesday, May 21, 2019

Are 55 Lakh Records Difficult to Analyse?

Firstly, how do you get such a big volume of data? What kind of file is it? What is the source?

Any industry that deals with sensor data, e-commerce sites, online portals, or telecom companies has data at this scale. It can't be stored in the normal CSV format or in Excel files.
The data typically sits on a server or, if in files, in the JSON format. The volume of those files can run into GBs.

Coming to the point of 5.5 million records [in the Indian context, more than half a crore], the file size could be in the range of 3.5 GB in the JSON format.


How could one read those files and get insights?

While there are several tools available, open-source tools such as R and Python do the magic.

Python has several features, and reading the file takes just two commands. One needs to be cautious, though, as the data involves 'Date-time' as a variable, which can often put us in trouble.
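Here is a minimal sketch of those two steps with pandas. The file name records.json and the column name event_time are assumptions, and the sketch assumes a JSON Lines file (one JSON object per line), which is common for exports of this size:

import pandas as pd

# Hypothetical file and column names; adjust to your data.
# lines=True expects one JSON object per line (JSON Lines).
# For very large files, pass chunksize=... to iterate in pieces.
df = pd.read_json('records.json', lines=True)

# Parse the date-time column explicitly; errors='coerce'
# turns unparseable values into NaT instead of raising.
df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce')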

Getting insights:
Before starting the analysis, it is important to understand how the data types were converted on loading. There is a possibility that the package does not interpret the data as the type you expect:
an 'int' can also be read as a 'char'.
Since the data is in the JSON format, chances are that some variables get stored in a 'dict' format.
The major work will be to clean up the data and make it workable. A sketch of that cleanup follows below.
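Assuming a hypothetical numeric column amount that arrived as strings and a details column holding dicts, the cleanup could look like this:

# Inspect how each column was interpreted on load.
print(df.dtypes)

# A numeric column read in as strings ('object' dtype)
# can be coerced back; bad values become NaN.
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Flatten a column of dicts into separate columns,
# then replace the original column with them.
details = pd.json_normalize(df['details'].tolist())
df = df.drop(columns=['details']).join(details)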


'Groupby' can be used in such cases to draw conclusions and form summary tables.
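For instance (the column names are again assumptions):

# Hypothetical example: record count, total and average
# amount per day, using the parsed date-time column.
daily = df.groupby(df['event_time'].dt.date)['amount'].agg(['count', 'sum', 'mean'])
print(daily.head())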

Once things are set in order, it is the usual dataframe, and the analysis part is just a piece of cake.
Happy Analysis!