Wednesday, February 27, 2019
Broken . Python . World Map
Well, it sounds easy, but first it kept throwing a "cannot insert level_0" error when refreshing data... The solution: drop the index when unpacking the dataframe, before going on to the next calculation. A simple fix, but it wasted a day.
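For the record, a minimal sketch of that fix (the dataframe and values here are made up; the error comes from reset_index() trying to insert the old index back as a level_0 column):

import pandas as pd

df = pd.DataFrame({"country": ["MY", "SG"], "value": [1, 2]})  # made-up data

# drop=True discards the old index instead of inserting it back as a
# "level_0" column, so the next refresh no longer collides
df = df.reset_index(drop=True)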
Then the 2nd problem: when the data was updated, the map should redraw and zoom to the selected country. Another day gone for this. The solution: don't set x_range and y_range at the initial stage; let Bokeh auto-set them.
plot = figure()  # no x_range / y_range at creation; let Bokeh auto-set them

So when the data is updated:

def update(attr, old, new):
    # assigning fresh Range1d objects redraws the plot at the new bounds
    plot.x_range = Range1d(x_min, x_max)
    plot.y_range = Range1d(y_min, y_max)
It redraws the plot to the new x and y range; this is how the map gets zoomed to a specific area.
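Putting the pieces together, a hedged sketch of a Bokeh server app in that shape; the Select widget and the per-country bounds lookup are my illustration, not the exact app (run with: bokeh serve app.py):

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import Range1d, Select
from bokeh.plotting import figure

# made-up (x0, x1, y0, y1) bounds per country, for illustration only
bounds = {"Malaysia": (99.6, 119.3, 0.8, 7.4),
          "Japan": (129.4, 145.8, 31.0, 45.5)}

plot = figure()  # note: no x_range / y_range at creation
select = Select(title="Country", value="Malaysia", options=list(bounds))

def update(attr, old, new):
    # assigning fresh Range1d objects forces a redraw zoomed to the country
    x0, x1, y0, y1 = bounds[new]
    plot.x_range = Range1d(x0, x1)
    plot.y_range = Range1d(y0, y1)

select.on_change("value", update)
curdoc().add_root(column(select, plot))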
Only after reading hundreds of posts and comments on online resources, Stack Overflow included, did the solutions come out. Fixating too hard on a problem is also not a good thing.
She was coughing in the morning. I heard it, and I worried. But then again, I was probably overthinking.
Thursday, August 09, 2018
Interactive Charts
I have been learning data analysis for years, starting from MS Excel, then pandas, but the charts I produced were all static. That used to feel like enough, since I only needed them occasionally, for submitting reports and the like.
Since the end of last year, that no longer satisfied me; I felt I should challenge myself and learn interactive charts. After a lot of searching and comparing references, I finally chose bokeh. It is compatible with pandas, which spared me some trouble.
Two months in, I finally have something small to show: I can draw simple ones. Although pandas and bokeh are both part of the Python world, I had always worked in Jupyter notebook, and suddenly converting to a Python script took a few more days. Luckily the internet is full of material now; when I hit a problem, I search around, refer to other people's examples, think it over, and the answer comes out.
This part is done. Onward to more challenges.
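For reference, a minimal sketch of that notebook-to-script switch, with made-up data; output_file() takes the place of the notebook's inline display:

import pandas as pd
from bokeh.io import output_file
from bokeh.plotting import figure, show

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})  # made-up data

output_file("chart.html")  # a script renders to an HTML file instead

plot = figure(title="pandas + bokeh")
plot.line(df["x"], df["y"], line_width=2)
show(plot)  # opens the chart in the browser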
Friday, March 02, 2018
Simplicity
An easy task that became complicated because of my way of thinking.
Datetime    Action
Date A      A
Date B      B
Date C      A
Date D      B
Take the example above, where the Datetime column contains different dates and times. The Action column contains the strings A and B, where A represents a request and B a response.
The task was to find the time difference between each request-response pair. As simple as it sounds: just compare the time difference between the Date A and Date B pair, then the Date C and Date D pair, and so on.
Voila, task completed. The differences commonly turned out to be a few seconds, and nothing above 8 seconds was expected. But somehow 5 to 10% of them turned out to be more than that. Something was wrong.
The data run to millions of rows, so manual checking was a nightmare, but it had to be done. It turned out the data were sometimes not in proper pairs: an A could be paired with a Z, which is unrelated. Dirty data.
So, based on experience, condition checks had to be applied. And this was where for...if looping came to mind. That was where the mistake began. I wrote a loop to check that the row after an A must be a B, and only then compare the two to get the time difference, otherwise ignore the row.
For...if looping is fundamental logic learned back in college while doing some programming, along with case, switch, and while loops. It works fine when the data are small, but with millions of rows and multiple columns it took hours to process. This is why I kept the computer running overnight for results, after waiting about an hour per run without getting any.
Then I was stuck on how to improve the for...if loop, to make it more efficient. After two days of thinking I asked myself: why not rethink and rewrite? Looking back at the first attempt, which was fast but had no condition checking, I could easily build on that.
Datetime    Action    A_Next    DT_Next
Date A      A         B         Date B
Date B      B         A         Date C
Date C      A         B         Date D
Date D      B         ...       ...
Basically, just copy the next row's Datetime and Action values into new columns on the current row. Then ignore the rows that are not in an A-B sequence, and calculate the time difference only on the valid rows: DT_Next - Datetime.
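Here is a sketch of that shift-and-filter idea, with made-up timestamps and the column names from the tables above:

import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2018-03-02 09:00:01", "2018-03-02 09:00:04",
                                "2018-03-02 09:00:10", "2018-03-02 09:00:13"]),
    "Action": ["A", "B", "A", "B"],
})

# copy the next row's values up into new columns on the current row
df["A_Next"] = df["Action"].shift(-1)
df["DT_Next"] = df["Datetime"].shift(-1)

# keep only proper request-response pairs (A followed by B); the elapsed
# time is then a single vectorised subtraction, with no Python loop at all
pairs = df[(df["Action"] == "A") & (df["A_Next"] == "B")]
print(pairs["DT_Next"] - pairs["Datetime"])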
From hours of waiting for the result with the for...if loop, it dropped to getting the result within a few seconds, for millions of rows of data.
The next challenge is how to use apply, lambda, and regex to replace str.contains() for extracting data. For millions of rows it currently takes a few minutes to extract; I hope to cut that down much further.
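The rough direction, as a sketch with a made-up log column and pattern; the pre-compiled regex plus apply is the part being tried out, and str.extract() is the vectorised alternative to benchmark against:

import re
import pandas as pd

df = pd.DataFrame({"log": ["id=123 ok", "id=456 fail", "no match"]})  # made up
pattern = re.compile(r"id=(\d+)")  # made-up pattern, for illustration

def extract_id(text):
    # a pre-compiled regex keeps the per-row work small
    match = pattern.search(text)
    return match.group(1) if match else None

df["id"] = df["log"].apply(extract_id)

# the vectorised alternative worth benchmarking against apply():
# df["id"] = df["log"].str.extract(pattern, expand=False)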
Learning is fun.
Tuesday, February 27, 2018
Data Analysis
When there are millions of rows of data, looping + condition checks + column assignment is not the way to go. It is very slow:
for index, row in df.iterrows():
    if df.loc[index, "x"] == "y":
        if df.loc[index + 1, "x"] == "z":
            df.loc[index, "k"] = df.loc[index + 1, "a"]
    # ... more condition checks and assignments
Some advise not to use iterrows() at all if possible. Well, an alternative way is needed.
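One alternative is to vectorise the same logic: shift the next row's values up and assign through a boolean mask. A sketch with the same made-up columns as above:

import pandas as pd

df = pd.DataFrame({"x": ["y", "z", "y", "w"], "a": [1, 2, 3, 4]})  # made up

next_x = df["x"].shift(-1)  # next row's "x", aligned to the current row
next_a = df["a"].shift(-1)

# the same condition as the loop above, applied to every row at once
mask = (df["x"] == "y") & (next_x == "z")
df.loc[mask, "k"] = next_a[mask]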
A while back, a simple problem bugged me for a few months: read_csv was slow when looping over multiple CSV files, sized up to a GiB, with data originally generated and extracted from text files. The problem was the datetime format; it seems python/pandas prefers a certain format.
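If I had to guess at the fix in code form, it would be to skip the inference and hand pandas an explicit format; the file name, column name, and format string below are made up:

import pandas as pd

df = pd.read_csv("extract_001.csv")  # hypothetical file from the text extracts

# an explicit format avoids pandas guessing the datetime format row by
# row, which is where the slowdown tends to come from
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y-%m-%d %H:%M:%S")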
The fun of data analysis is the problem-solving skills, the logic, and getting statistics as evidence to support your own suggestions.
Others may do data analysis to gain insight into business performance; I do it to validate a system's structure and performance.
Fun when the solutions arrive, headache along the way.