Tuesday, April 16, 2019

How Excel can put you in trouble...

I have a Large data , the head of it will look like the one as given below
2.33546E+11
ee
21-04-2017 13:04
ACTIVE
Pop
4
2017
2.33246E+11
gg
21-04-2017 15:43
ACTIVE
SMS
4
2017
2.33245E+11
hh
22-04-2017 21:52
ACTIVE
Intg
4
2017
2.33244E+11
jj
22-04-2017 22:09
ACTIVE
Intg
4
2017

My file did not  have headers. I have 10 such files, need to perform analysis  and get some insights.

The very first step I did was to insert a row and gave the names of the headers .
mobile_num
keyword
Date and time
status_name
mode_name
month
year
2.33546E+11
ee
21-04-2017 13:04
ACTIVE
Pop
4
2017

Saved the file in the csv format. Used R to read the file . It gave the summary . Every thing went fine.
I was about to compare the column of mobile_num of one file with another, it gave unexpected  output  which put me in lot of trouble.
What went wrong ?
After several attempts to check my code, test my logic, I was not convinced that my R prog has malfunctioned. Finally decided to see the data what R has read. The mobile numbers were read as 6 numbers followed by zeros as given in the exponential value. Truly unimaginable to the core!! how did it happen.  I don't know. But I have figured out what should have been done to avoid this.

1. While saving a column with exponential number / a 12 digit number it has to be formatted as number with no decimal places.
2. Insert a row above
3. paste the header names
4. save it and re open for confirmation.

You could save atleast  your own time if you know the above.

Thursday, April 11, 2019

Few of the problems we face...

While we know the syntax, algorithms, context and have the ability to anlayse the given data which may also be in a weird format, there are always challenges which consumes tons of our time by throwing error.
The error and the challenges I faced- quite a few,
1. A csv file which contained Date column and quite other variables. While R was reading this file, it interpreted as character, the character was something like ' 44326'. I was struggling with all my known skills/ strategies, I was not able to succeed.
Finally when gone through the data, I found the String ' Total' at the end of the date column, which was almost impossible for me to identify.
2. The second challenge was the date conversion, while the package may be R or Python, there are lot more chances that the date is not interpreted properly and the program might run.While I was cross checking the data I found the error, by the time I found it hours have gone !.
3.Use of reserved words / Standard words for the package - Example- count.
4. Space between the variables in the header
5. Blank lines in the first row

Lot more... will add then and there.