Friday, March 02, 2018

Simplicity

A easy tasks but became complicate because my way of thinking.

Datetime           Action
Date A                  A
Date B                  B
Date C                  A
Date D                  B

Example as above where Datetime column contains different date and time. Column Action contains string A and B, where A represents a request and B is response.

The task was to find the time different between each request-response pair. So simple as it is,  just compare the time different between Date A and Date B pair,  and another pair Date C and Date D,  and so on.

Viola, task completed. The result commonly turned out to be few seconds different,  and not expecting more than 8 seconds. But somehow,  they were 5 to 10% turned out more than that. Somewhere was wrong.

Data are in millions row, manually check was nightmare but had too. Turned out sometime the data are not in proper pair,  could be A paired with Z, which is not related, a dirty data occurred.

So based on experiences,  conditions check must apply. And this was where For...If looping came to my mind. Begun the mistake. Wrote a looping to check where next row must be B after row A, only then compare to get the time different, else would be ignored.

For...If looping is fundamental logic as learned during college time,  doing some programming, as well as using Case,  or Switch or While looping. It is good to use when data are small, but when come to millions row and multiple columns, it took few hours to process. This was why I kept the computers running overnight to get the result,  after waited for about a hour per process still not getting result.

Then I was stuck on how to improve the For..If looping,  to make it more efficient. Thought for 2 days,  then I asked myself, why not rethink and rewrite. Looked back to the first solution attempted that was fast but without condition checking, can easily enhance on there.

Datetime      Action     A_Next   DT_Next
Date A                  A          B           Date B
Date B                  B          A           Date C
Date C                  A          B           Date D
Date D                  B        ....           .....

Basically,  just copy the next row Datetime and Action values to current row, to new columns. Then just ignore those rows not in A-B sequence,  and only calculate the time different on those valid rows : Datetime - DT_Next. 

From hours of waiting the result by using For..If looping,  reduces to, within few seconds can get the result,  for millions row of data.

The next challenge is how to use apply,  lambda and RegEx to replace str.contains(),  to extract data. For millions row data, currently took few minutes to extract,  hope can reduce it much more.

Learning is fun.

No comments:

Post a Comment