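Learning Data Analysis (Part 1)

In my spare time I have started working through the book Python for Data Analysis (《利用Python进行数据分析》), with the goal of running statistics on the housing data I have scraped and digging out something useful or fun.
To warm up, I am starting with part of the book's dataset of US baby names from 1880 to 2010, which is also a chance to get familiar with IPython and Jupyter Notebook, a rather fancy tool.

1. Consolidating the data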

import pandas as pd
pieces = []
years = range(1880,2011)
columns=['name','sex','births']
opath = "/Users/xuxp/learn-env/pydata-book-master/ch02/names"
for year in years:
    path = "%s/yob%d.txt" % (opath,year)
    frame = pd.read_csv(path,names=columns)
    frame['year'] = year
    pieces.append(frame)

# Concatenate all the yearly frames into a single DataFrame
names = pd.concat(pieces,ignore_index=True)
    
names.head()
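
As a small aside, here is a minimal sketch of what ignore_index=True does in the concat call above. The two frames and their values are stand-ins made up for illustration, not rows read from the real files.

```python
import pandas as pd

# Two tiny stand-in frames; the values are illustrative only.
frame_a = pd.DataFrame({"name": ["Mary", "Anna"], "sex": ["F", "F"], "births": [10, 5]})
frame_a["year"] = 1880
frame_b = pd.DataFrame({"name": ["Mary", "John"], "sex": ["F", "M"], "births": [12, 9]})
frame_b["year"] = 1881

# Without ignore_index the original row labels are kept, so 0 and 1 repeat.
kept = pd.concat([frame_a, frame_b])
# With ignore_index=True the result gets a fresh 0..n-1 index.
fresh = pd.concat([frame_a, frame_b], ignore_index=True)

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```

Since read_csv numbers the rows of every yearly file from 0, dropping those labels avoids duplicate index values in the combined frame.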

With everything in one DataFrame, the data can be aggregated at the year and sex level using groupby or pivot_table.

2. Total births by sex and year

total_births = names.pivot_table('births',index='year',columns='sex',aggfunc=sum)
total_births.tail()

%matplotlib inline
total_births.plot(title='Total births by sex and year')
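
For comparison, the groupby route should give the same table as pivot_table. The sketch below assumes a small made-up frame called toy with the same four columns as names; it is only meant to show the shape of the two calls.

```python
import pandas as pd

# Made-up sample with the same columns as the real data.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Route 1: pivot_table sums births into a year x sex table.
by_pivot = toy.pivot_table(values="births", index="year", columns="sex", aggfunc="sum")

# Route 2: group by both keys, sum, then move sex from the index to the columns.
by_group = toy.groupby(["year", "sex"])["births"].sum().unstack("sex")

print(by_pivot)
print(by_group)
```

pivot_table is shorter to write here, while the groupby version makes the intermediate step (one total per year/sex pair) explicit.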

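Next, add a prop column holding the fraction of babies given each name relative to the total number of births in that year/sex group. A prop of 0.02 means that 2 out of every 100 babies were given that name.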

def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births/births.sum()
    return group

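# group_keys=False keeps the original row index instead of adding year/sex index levels
names = names.groupby(['year','sex'], group_keys=False).apply(add_prop)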
names.head()
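
The same column can probably be built without apply, which may be less sensitive to how different pandas versions treat the result of groupby.apply. This is a sketch on a made-up frame named toy, not the approach used in the book: transform("sum") returns one group total per row, so a plain division gives the proportion.

```python
import pandas as pd

# Made-up sample with the same columns as the real data.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform("sum") broadcasts each group's total back onto its own rows,
# so the result lines up with toy row by row.
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_totals

print(toy)
```

For example, Mary in 1880 gets 10 / 15, because the two girls' names in that year add up to 15 births.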

Check that the prop values sum to 1 within every group:

import numpy as np
np.allclose(names.groupby(['year','sex']).prop.sum(),1)

True

3. The top 1000 names

Take a subset of the data: the top 1000 names for each sex/year combination.

def get_top1000(group):
    return group.sort_values(by="births",ascending=False)[:1000]
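# group_keys=False avoids year/sex index levels that would clash with the columns of the same name
grouped = names.groupby(['year','sex'], group_keys=False)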
top1000 = grouped.apply(get_top1000)
top1000.head()
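
An alternative that skips apply altogether is to sort once and then take the first rows of every group with head. The sketch below uses a made-up frame and a hypothetical top_n of 2 in place of 1000 so the effect is visible.

```python
import pandas as pd

# Made-up sample; Anna is the least common girls' name in 1880.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Will", "Mary", "Anna"],
    "sex": ["F", "F", "F", "M", "M", "F", "F"],
    "births": [10, 5, 7, 8, 3, 12, 4],
    "year": [1880, 1880, 1880, 1880, 1880, 1881, 1881],
})

top_n = 2  # stands in for 1000 on the real data

top = (
    toy.sort_values("births", ascending=False)  # most common names first
       .groupby(["year", "sex"])
       .head(top_n)                             # first top_n rows of each group
       .sort_values(["year", "sex", "births"], ascending=[True, True, False])
       .reset_index(drop=True)
)

print(top)
```

Anna drops out of the 1880 girls' group because only the two most common names are kept, while she stays in 1881 where the group has just two names.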

4. Analyzing naming trends

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
total_births = top1000.pivot_table('births',index='year',columns='name',aggfunc=sum)
subset = total_births[['Minnie','Harry']]
subset.plot(subplots=True,figsize=(12,10),grid=False,title="Number of births per year")
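
To see what the year-by-name table looks like before plotting, here is a minimal sketch on a made-up frame called toy_top. A name that does not appear in a given year shows up as NaN, which the plot simply leaves as a gap.

```python
import pandas as pd

# Made-up sample: Harry only appears in 1881, John only in 1880.
toy_top = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "Harry"],
    "sex": ["F", "M", "F", "M"],
    "births": [10, 8, 12, 9],
    "year": [1880, 1880, 1881, 1881],
})

# One row per year, one column per name, cells hold the summed births.
per_name = toy_top.pivot_table(values="births", index="year", columns="name", aggfunc="sum")

print(per_name)
print(per_name[["Mary", "Harry"]])  # select a few names, as done above
```

Selecting a name that never made it into the table would raise a KeyError, so it is worth checking per_name.columns first when trying other names.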

5. Measuring the increase in naming diversity

table = top1000.pivot_table('prop',index='year',columns='sex',aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',yticks=np.linspace(0,1.2,13),xticks=range(1880,2020,10))
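
The idea behind this plot is that the share of births covered by the top names shrinks when parents spread out over more names. A minimal sketch with made-up numbers, using a hypothetical top_n of 2 instead of 1000:

```python
import pandas as pd

# Made-up sample: 1881 has one more name and a flatter distribution than 1880.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "Mary", "Anna", "Emma", "Ruth"],
    "sex": ["F"] * 7,
    "births": [60, 30, 10, 40, 25, 20, 15],
    "year": [1880, 1880, 1880, 1881, 1881, 1881, 1881],
})

# Proportion of each name within its year/sex group.
toy["prop"] = toy["births"] / toy.groupby(["year", "sex"])["births"].transform("sum")

top_n = 2  # stands in for 1000 on the real data
top = toy.sort_values("births", ascending=False).groupby(["year", "sex"]).head(top_n)

# Share of all births covered by the top_n names, per year and sex.
share = top.pivot_table(values="prop", index="year", columns="sex", aggfunc="sum")

print(share)
```

In this toy data the top two names cover 0.90 of births in 1880 but only 0.65 in 1881, which is the kind of downward trend the real plot is meant to reveal.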

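Personal summary

I really admire the experts abroad who built such good tools; they are a pleasure to use and hard to put down.
Now that Anaconda is installed, I have started learning the basics of data analysis and processing, and I must not give up on it.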