Tuesday, February 27, 2018

Data Analysis

When there are millions of rows of data, it is not advisable to loop over rows with condition checks and column assignments. It is very slow.

for index, row in df.iterrows():
    if df.loc[index, "x"] == "y":
        if df.loc[index + 1, "x"] == "z":
            df.loc[index, "k"] = df.loc[index + 1, "a"]
...

Some advise not to use iterrows() at all if possible. Well, I need to find an alternative way.
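One alternative is to vectorize the comparison against the next row with shift() and assign through a boolean mask, instead of looping. A minimal sketch, using made-up columns "x", "a", "k" like the loop above:

```python
import pandas as pd

# Hypothetical sample frame matching the loop's columns.
df = pd.DataFrame({
    "x": ["y", "z", "y", "z"],
    "a": [10, 20, 30, 40],
    "k": [0, 0, 0, 0],
})

# Vectorized equivalent of the row loop: shift(-1) aligns each row
# with the one below it, so both conditions become one boolean mask.
mask = (df["x"] == "y") & (df["x"].shift(-1) == "z")
df.loc[mask, "k"] = df["a"].shift(-1)[mask]
```

This does the condition checks and the column assignment in a handful of whole-column operations, which pandas runs in C instead of a Python loop.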

A while back, a simple problem puzzled me for a few months: read_csv was slow when looping over multiple CSV files, each up to a GiB in size, where the data were originally generated and extracted from text files. The cause was the datetime format; it seems Python/pandas parses much faster when dates are in a format it recognizes, or when the format is stated explicitly.
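A sketch of the fix, assuming a hypothetical "timestamp" column: read the column as plain text, then parse it with an explicit format string so pandas does not have to guess the format row by row.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for one of the extracted files.
csv_text = "timestamp,value\n2018-02-27 08:30:00,1\n2018-02-27 08:31:00,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids per-row format guessing, which is the
# slow path that made the looped read_csv calls drag.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
```

The same idea applies when looping over real files: parse the datetime column once per file with a known format rather than letting pandas infer it.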

The fun of data analysis is the problem-solving, the logic, and getting statistics as evidence to support your own suggestions.

Others may do data analysis to gain insight into business performance; I do it to validate system structure and system performance.

Fun when you get the solution, headaches along the way.
