数据来源

https://github.com/wesm/pydata-book

>>> import pandas as pd
>>> from pandas import DataFrame,Series
>>> import numpy as np
>>> data = pd.read_csv('D:\python\DataAnalysis\data\Haiti.csv')
>>> data[:5]
   Serial   ...    VERIFIED
0    4052   ...          NO
1    4051   ...          NO
2    4050   ...          NO
3    4049   ...          NO
4    4042   ...          NO

[5 rows x 10 columns]

查看数据

看看那些数据是我们想要的,每行表示一条从某人手机上发生的紧急或其他问题的报告,每条报告包含一个时间戳和位置(经度和纬度)

['INCIDENT DATE','LATITUDE','LONGITUDE'] 将要查看的列索引

[:10]   前十行

>>> data[['INCIDENT DATE','LATITUDE','LONGITUDE']][:10]
      INCIDENT DATE   LATITUDE   LONGITUDE
0  05/07/2010 17:26  18.233333  -72.533333
1  28/06/2010 23:06  50.226029    5.729886
2  24/06/2010 16:21  22.278381  114.174287
3  20/06/2010 21:59  44.407062    8.933989
4  18/05/2010 16:26  18.571084  -72.334671
5  26/04/2010 13:14  18.593707  -72.310079
6  26/04/2010 14:19  18.482800  -73.638800
7  26/04/2010 14:27  18.415000  -73.195000
8  15/03/2010 10:58  18.517443  -72.236841
9  15/03/2010 11:00  18.547790  -72.410010

category字段含有一组以逗号分隔的代码,这些代码表示消息的类型:

>>> data['CATEGORY'][:6]
0          1. Urgences | Emergency, 3. Public Health, 
1    1. Urgences | Emergency, 2. Urgences logistiqu...
2    2. Urgences logistiques | Vital Lines, 8. Autr...
3                            1. Urgences | Emergency, 
4                            1. Urgences | Emergency, 
5                       5e. Communication lines down, 
Name: CATEGORY, dtype: object

调用describe还能发现一些异常的地理位置,根据经纬度判断

>>> data.describe()
            Serial     LATITUDE    LONGITUDE
count  3593.000000  3593.000000  3593.000000
mean   2080.277484    18.611495   -72.322680
std    1171.100360     0.738572     3.650776
min       4.000000    18.041313   -74.452757
25%    1074.000000    18.524070   -72.417500
50%    2163.000000    18.539269   -72.335000
75%    3088.000000    18.561820   -72.293570
max    4052.000000    50.226029   114.174287

清理数据

错误位置信息并移除缺失分类信息,海地的经纬度的处理。

通过条件选择,过滤掉不在海地范围内的数据,并排除分类信息为空的数据

>>> data = data[(data.LATITUDE >18)&(data.LATITUDE<20)&(data.LONGITUDE > -75)&(data.LONGITUDE < -70)&(data.CATEGORY.notnull())]
>>> data.describe()
            Serial     LATITUDE    LONGITUDE
count  3569.000000  3569.000000  3569.000000
mean   2081.498459    18.592503   -72.424994
std    1170.311824     0.273695     0.291018
min       4.000000    18.041313   -74.452757
25%    1074.000000    18.524200   -72.417498
50%    2166.000000    18.539269   -72.335000
75%    3089.000000    18.561800   -72.293939
max    4052.000000    19.940630   -71.099489

由于每个分类可能包含多个分类,每个名称除了英文还包含法文,所以编写了三个函数,前两个用于获取所有分类列表,第三个是获取编码和英文名称

to_cat_list(catstr):

移除字符串的首位空格并返回

get_all_categories(cat_series):

将传入的字符串处理后放到set集合中,并排序返回

get_english(cat):

处理名称,获得‘|’后面的英文字符。

>>> def to_cat_list(catstr):
...     stripped = (x.strip() for x in catstr.split(','))
...     return [x for x in stripped if x]
... 
>>> def get_all_categories(cat_series):
...     cat_sets = (set(to_cat_list(x)) for x in cat_series)
...     return sorted(set.union(*cat_sets))
... 
>>> def get_english(cat):
...     code,names = cat.split('.')
...     if '|' in names:
...         names = names.split('|')[1]
...     return code,names.strip()
... 

测试get_english

>>> get_english('2.Urgences logistiques | Vital Lines')
('2', 'Vital Lines')

制作一个将编码和名称映射的字典

>>> all_cats = get_all_categories(data.CATEGORY)
>>> english_mapping = dict(get_english(x) for x in all_cats)
>>> english_mapping['2a']
'Food Shortage'
>>> english_mapping['6c']
'Earthquake and aftershocks'

构造全零的DataFrame,列为分类编码,索引跟data的索引一样

>>> def get_code(seq):
...     return [x.split('.')[0] for x in seq if x]
... 
>>> all_codes = get_code(all_cats)
>>> code_index = pd.Index(np.unique((all_codes)))
>>> dummy_frame = DataFrame(np.zeros((len(data),len(code_index))),index=data.index,columns=code_index)
>>> dummy_frame.ix[:,:6]
        1   1a   1b   1c   1d    2
0     0.0  0.0  0.0  0.0  0.0  0.0
4     0.0  0.0  0.0  0.0  0.0  0.0
5     0.0  0.0  0.0  0.0  0.0  0.0
6     0.0  0.0  0.0  0.0  0.0  0.0
7     0.0  0.0  0.0  0.0  0.0  0.0
8     0.0  0.0  0.0  0.0  0.0  0.0
9     0.0  0.0  0.0  0.0  0.0  0.0
10    0.0  0.0  0.0  0.0  0.0  0.0
11    0.0  0.0  0.0  0.0  0.0  0.0
12    0.0  0.0  0.0  0.0  0.0  0.0
13    0.0  0.0  0.0  0.0  0.0  0.0
14    0.0  0.0  0.0  0.0  0.0  0.0
15    0.0  0.0  0.0  0.0  0.0  0.0
16    0.0  0.0  0.0  0.0  0.0  0.0
17    0.0  0.0  0.0  0.0  0.0  0.0
18    0.0  0.0  0.0  0.0  0.0  0.0
19    0.0  0.0  0.0  0.0  0.0  0.0
20    0.0  0.0  0.0  0.0  0.0  0.0
21    0.0  0.0  0.0  0.0  0.0  0.0
22    0.0  0.0  0.0  0.0  0.0  0.0
23    0.0  0.0  0.0  0.0  0.0  0.0
24    0.0  0.0  0.0  0.0  0.0  0.0
25    0.0  0.0  0.0  0.0  0.0  0.0
26    0.0  0.0  0.0  0.0  0.0  0.0
27    0.0  0.0  0.0  0.0  0.0  0.0
28    0.0  0.0  0.0  0.0  0.0  0.0
29    0.0  0.0  0.0  0.0  0.0  0.0
30    0.0  0.0  0.0  0.0  0.0  0.0
31    0.0  0.0  0.0  0.0  0.0  0.0
32    0.0  0.0  0.0  0.0  0.0  0.0
   ...  ...  ...  ...  ...  ...
3563  0.0  0.0  0.0  0.0  0.0  0.0
3564  0.0  0.0  0.0  0.0  0.0  0.0
3565  0.0  0.0  0.0  0.0  0.0  0.0
3566  0.0  0.0  0.0  0.0  0.0  0.0
3567  0.0  0.0  0.0  0.0  0.0  0.0
3568  0.0  0.0  0.0  0.0  0.0  0.0
3569  0.0  0.0  0.0  0.0  0.0  0.0
3570  0.0  0.0  0.0  0.0  0.0  0.0
3571  0.0  0.0  0.0  0.0  0.0  0.0
3572  0.0  0.0  0.0  0.0  0.0  0.0
3573  0.0  0.0  0.0  0.0  0.0  0.0
3574  0.0  0.0  0.0  0.0  0.0  0.0
3575  0.0  0.0  0.0  0.0  0.0  0.0
3576  0.0  0.0  0.0  0.0  0.0  0.0
3577  0.0  0.0  0.0  0.0  0.0  0.0
3578  0.0  0.0  0.0  0.0  0.0  0.0
3579  0.0  0.0  0.0  0.0  0.0  0.0
3580  0.0  0.0  0.0  0.0  0.0  0.0
3581  0.0  0.0  0.0  0.0  0.0  0.0
3582  0.0  0.0  0.0  0.0  0.0  0.0
3583  0.0  0.0  0.0  0.0  0.0  0.0
3584  0.0  0.0  0.0  0.0  0.0  0.0
3585  0.0  0.0  0.0  0.0  0.0  0.0
3586  0.0  0.0  0.0  0.0  0.0  0.0
3587  0.0  0.0  0.0  0.0  0.0  0.0
3588  0.0  0.0  0.0  0.0  0.0  0.0
3589  0.0  0.0  0.0  0.0  0.0  0.0
3590  0.0  0.0  0.0  0.0  0.0  0.0
3591  0.0  0.0  0.0  0.0  0.0  0.0
3592  0.0  0.0  0.0  0.0  0.0  0.0

将需要的项设置为1,在于data进行连接。

for row,cat in zip(data.index,data.CATEGORY):
    codes = get_code(to_cat_list(cat))
    dummy_frame.ix[row,codes] = 1
 
data = data.join(dummy_frame.add_prefix('category_'))

绘制海地地图

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
def basic_haiti_map(ax=None,lllat=17.25,urlat=20.25,lllon=-75,urlon=-71):
    m = Basemap(ax=ax,projection='stere',
        lon_0 = (urlon+lllon)/2,
        lat_0 = (urlat+lllat)/2,
        llcrnrlat = lllat,urcrnrlat = urlat,
        llcrnrlon = lllon,urcrnrlon = urlon,
        resolution = 'f')
    m.drawcoastlines()
    m.drawstates()
    m.drawcountries()
    return m

绘制点

fig,axes = plt.subplots(nrows = 2,ncols = 2,figsize=(12,10))
fig.subplots_adjust(hspace=0.05,wspace=0.05)
to_plot = ['2a','1','3c','7a']
maplist = []
for code,ax in zip(to_plot,axes.flat):
    m = basic_haiti_map(ax)
    maplist.append(m)
for code,ax ,m in zip(to_plot,axes.flat,maplist):	 
    cat_data = data[data['category_%s' %code] == 1]
    x, y = m(cat_data.LONGITUDE.values, cat_data.LATITUDE values)
    m.plot(x,y,'k.',alpha=0.5)
    ax.set_title('%s: %s' % (code,english_mapping[code]))

Logo

旨在为数千万中国开发者提供一个无缝且高效的云端环境,以支持学习、使用和贡献开源项目。

更多推荐