Showing posts with label python. Show all posts

Tuesday, November 05, 2019

Bokeh Visualization . Wedges


Struggled for a few hours on how to get the donut chart labels properly onto the wedges.

Finally realized it is just a math problem: getting the x and y coordinates. Damn me.
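For the record, the coordinate math is just polar-to-Cartesian conversion. A minimal sketch, assuming the donut is centred at (0, 0) and angles are in radians, as bokeh's wedge glyphs use:

```python
import math

def wedge_label_pos(start_angle, end_angle, radius):
    """Place a label at the mid-angle of a wedge, at the given radius."""
    mid = (start_angle + end_angle) / 2
    return radius * math.cos(mid), radius * math.sin(mid)

# e.g. a wedge spanning 0..pi/2, labelled at radius 1.0
x, y = wedge_label_pos(0, math.pi / 2, 1.0)
```

The same x and y can then be fed to a bokeh text or LabelSet glyph at those coordinates.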

Thursday, June 27, 2019

Missing what

Bokeh ah bokeh, I cannot take it anymore lah.

To draw an interactive graph, it is easy with one or two widget buttons, without changing the source or replotting. But when the chart's source changes and the charts are redrawn, oh my....

I am stuck and stuck.

What am I missing?

Monday, June 03, 2019

Neural Prediction

Learning with pain. Picked an example to follow, but accidentally set the training data to nearly 3 million rows, killing the laptop.

First, understand the dataset size.

Second, when setting the train, validation, and test datasets, make sure the dataset is not too big.
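A sketch of what I mean, with a hypothetical cap on the row count before a 70/15/15 split (the numbers are only examples):

```python
import numpy as np

def train_val_test_split(data, max_rows=100_000, seed=0):
    """Subsample to a laptop-friendly size, then split 70/15/15."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))[:max_rows]  # shuffle, then cap the size
    n = len(idx)
    a, b = int(n * 0.7), int(n * 0.85)
    return data[idx[:a]], data[idx[a:b]], data[idx[b:]]

# 3 million rows in, but only 100k ever reach the model
train, val, test = train_val_test_split(np.arange(3_000_000), max_rows=100_000)
```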

Thursday, May 30, 2019

Thanos vs Theano

I like Thanos; he is ambitious and goal-focused.

I like Theano too, for manipulating data and giving neural output.

Work is life, and life is work too.

Wednesday, May 29, 2019

Learning Learning and Learning

Decoded 4 million geohashes. Big mistake. Learned something new: coordinates, lat and long, can be hashed / encoded.
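For illustration, the standard geohash encoding can be sketched from scratch. This is not the library I used, just the textbook bit-interleaving algorithm:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash's base32 alphabet

def geohash_encode(lat, lon, precision=11):
    """Interleave longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, ch, bit, even = [], 0, 0, True
    while len(chars) < precision:
        if even:                        # even bits encode longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch = (ch << 1) | 1
                lon_lo = mid
            else:
                ch <<= 1
                lon_hi = mid
        else:                           # odd bits encode latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch = (ch << 1) | 1
                lat_lo = mid
            else:
                ch <<= 1
                lat_hi = mid
        even = not even
        bit += 1
        if bit == 5:                    # every 5 bits -> one base32 char
            chars.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(chars)
```

Decoding is the same bisection run in reverse, which is why doing it 4 million times in pure Python hurts.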

Thursday, May 09, 2019

Headache

Headache, what a headache.

CCTV, IP cam, 1080p? 2MP? ONVIF support? Price?!!

Work: how to break every n columns into a group and save each group as a new row? Looping?
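One possible sketch, assuming the column count divides evenly by n (the data here are made up):

```python
import numpy as np
import pandas as pd

# A single wide row whose columns repeat in groups of n -- hypothetical data.
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6]])

def split_columns(df, n):
    """Break every n columns into a group and stack each group as a new row."""
    return pd.DataFrame(df.to_numpy().reshape(-1, n))

grouped = split_columns(wide, 3)  # 1 row of 6 columns -> 2 rows of 3
```

A reshape does the regrouping in one shot, with no loop at all.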

Thursday, August 09, 2018

Dynamic Charts

I have studied data analysis for years, starting from MS Excel and then pandas, but the charts I drew were all static. That used to feel sufficient, since I only needed them occasionally, for reports and such.

Since late last year I have no longer been satisfied, and wanted to challenge myself to learn dynamic charts. After a lot of searching and reading, I finally chose bokeh. It is compatible with pandas, which saved some trouble.

After two months, I finally have a small achievement: I can draw simple ones. Although pandas and bokeh are both part of the Python ecosystem, I had always worked in Jupyter notebook, and suddenly converting to Python scripts took another few days. Luckily there is plenty of material online now; when I hit a problem, I search, refer to others' examples, think a bit, and the answer comes out.

This part is done. On to more challenges.

Friday, March 02, 2018

Simplicity

An easy task that became complicated because of my way of thinking.

Datetime    Action
Date A      A
Date B      B
Date C      A
Date D      B

Example as above, where the Datetime column contains different dates and times. The Action column contains strings A and B, where A represents a request and B a response.

The task was to find the time difference between each request-response pair. Simple as it is: just compare the time difference between the Date A and Date B pair, then the Date C and Date D pair, and so on.

Voila, task completed. The difference commonly turned out to be a few seconds, and I was not expecting more than 8 seconds. But somehow, 5 to 10% of them turned out to be more than that. Something was wrong.

The data run into millions of rows, so checking manually was a nightmare, but it had to be done. It turned out that sometimes the data were not in proper pairs: A could be paired with Z, which is unrelated. Dirty data.

So, based on experience, condition checks must apply. And this was where for...if looping came to mind. Thus began the mistake. I wrote a loop to check that the row after row A must be B, and only then compare to get the time difference; otherwise the row would be ignored.

For...if looping is fundamental logic learned back in college while doing some programming, along with case, switch, and while loops. It is fine when the data are small, but with millions of rows and multiple columns, it took a few hours to process. This was why I kept the computer running overnight to get the result, after waiting about an hour per process and still getting nothing.

Then I was stuck on how to improve the for...if loop, to make it more efficient. I thought about it for two days, then asked myself: why not rethink and rewrite? I looked back at the first attempted solution, which was fast but had no condition checking; it could easily be enhanced from there.

Datetime    Action    A_Next    DT_Next
Date A      A         B         Date B
Date B      B         A         Date C
Date C      A         B         Date D
Date D      B         ....      .....

Basically, just copy the next row's Datetime and Action values into new columns on the current row. Then ignore the rows not in an A-B sequence, and calculate the time difference only on the valid rows: DT_Next - Datetime.

From hours of waiting for the result with the for...if loop, it came down to a few seconds, for millions of rows of data.
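The shift trick above can be sketched in pandas like this, with made-up timestamps standing in for the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2018-03-01 10:00:00", "2018-03-01 10:00:03",
         "2018-03-01 10:00:10", "2018-03-01 10:00:14"]),
    "Action": ["A", "B", "A", "B"],
})

# Copy the next row's values into new columns on the current row.
df["A_Next"] = df["Action"].shift(-1)
df["DT_Next"] = df["Datetime"].shift(-1)

# Keep only valid request-response (A then B) rows, then take the difference.
pairs = df[(df["Action"] == "A") & (df["A_Next"] == "B")]
elapsed = pairs["DT_Next"] - pairs["Datetime"]
```

Everything is vectorised, so the same code runs on millions of rows in seconds.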

The next challenge is how to use apply, lambda, and regex to replace str.contains() for extracting data. For millions of rows it currently takes a few minutes; I hope to reduce that much more.
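One vectorised option worth trying is pandas str.extract with a regex; a sketch on made-up strings (the pattern and data are only examples):

```python
import pandas as pd

s = pd.Series(["id=123 ok", "id=456 fail", "no match here"])

# Named capture group pulls the value out in one vectorised pass;
# rows without a match come back as NaN instead of raising.
extracted = s.str.extract(r"id=(?P<id>\d+)")
```

Unlike str.contains(), which only flags matching rows, str.extract returns the captured text itself, so no second pass is needed.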

Learning is fun.

Tuesday, February 27, 2018

Data Analysis

When there are millions of rows of data, it is not advisable to use looping + condition checks + column assignment. It is very slow.

for index, row in df.iterrows():
    if df.loc[index, "x"] == "y":
        if df.loc[index + 1, "x"] == "z":
            df.loc[index, "k"] = df.loc[index + 1, "a"]
......

Some advise not to use iterrows() if possible. Well, need to find an alternative way.
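One alternative is to vectorise the check with shift instead of iterating; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x": ["y", "z", "y", "q"],
                   "a": [1, 2, 3, 4]})

# Vectorised equivalent of the loop: look at the next row with shift(-1)
# and assign column "k" only where both conditions hold.
mask = (df["x"] == "y") & (df["x"].shift(-1) == "z")
df.loc[mask, "k"] = df["a"].shift(-1)
```

The whole pass is two column operations, regardless of how many rows there are.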

Last time a simple problem bugged me for a few months: read_csv was slow when looping over multiple CSV files, sizes up to GiB, where the data were originally generated and extracted from text files. The problem was the datetime format; it seems python/pandas prefers a certain format.
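Specifying the datetime format explicitly, instead of letting pandas guess row by row, is one fix; a sketch with an in-memory CSV and a hypothetical day-first format:

```python
import io
import pandas as pd

# Stand-in for one of the extracted CSV files (made-up data).
csv = io.StringIO("ts,value\n01/03/2018 10:00:00,1\n02/03/2018 10:00:00,2\n")

df = pd.read_csv(csv)
# Explicit format string: parsed in one vectorised pass, no per-row guessing.
df["ts"] = pd.to_datetime(df["ts"], format="%d/%m/%Y %H:%M:%S")
```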

The fun of data analysis is: problem-solving skills, logic, and getting statistics as evidence to support your own suggestions.

Others may do data analysis to gain insight into business performance; I do it to validate a system's structure and performance.

Fun when the solutions come; headache along the way.