目录

一.DataFrame是什么

二.DataFrame的创建

1.直接通过字典创建DataFrame

2.通过字典创建时指定列的顺序和行索引

三.DataFrame的常用属性

四.DataFrame的常用方法

五.DataFrame的布尔索引

六.DataFrame的运算

1.DataFrame与标量运算

2.DataFrame与DataFrame运算

七.DataFrame的更改操作

1.设置行索引

2.修改行索引名和列名

八.DataFrame数据的导入与导出

1.导出数据

2.导入数据


一.DataFrame是什么

        DataFrame是Pandas 中的另一个核心数据结构,类似于一个二维的表格或数据库中的数据表。它是一个表格型的数据结构,它含有一组有序的列(列有固定从 0 开始的数字索引,顺序不乱),每列可以是不同的值类型(数值、字符串、布尔型值),既有行索引也有列索引。

        简单说就是与sql表的结构极为相似。

        DataFrame中的数据是以一个或多个二维块存放的(而不是列表、字典或别的一维数据结构)。它可以被看做由Series组成的字典(共同用一个索引)。提供了各种功能来进行数据访问、筛选、分割、合并、重塑、聚合以及转换等操作,广泛用于数据分析、清洗、转换、可视化等任务。

二.DataFrame的创建

1.直接通过字典创建DataFrame

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]})
print(df)

        运行结果:

2.通过字典创建时指定列的顺序和行索引

import pandas as pd

df = pd.DataFrame({"name": ["张三", "李四", "王五"], "age": [15, 16, 17]},
                  columns= ["age", "name"], index= [101, 102, 103])
print(df)

        运行结果:

三.DataFrame的常用属性

属性

说明

index

DataFrame的行索引

columns

DataFrame的列标签

values

DataFrame的值

ndim

DataFrame的维度

shape

DataFrame的形状

size

DataFrame的元素个数

dtypes

DataFrame的元素类型

T

行列转置

loc[]

显式索引,按行列标签索引或切片

iloc[]

隐式索引,按行列位置索引或切片

at[]

使用行列标签访问单个元素

iat[]

使用行列位置访问单个元素

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的行索引
print(df.index)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的列标签
print(df.columns)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的值
print(df.values)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的维度
print(df.ndim)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的形状
print(df.shape)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])
# DataFrame的元素个数
print(df.size)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# DataFrame的元素类型
print(df.dtypes)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# 行列转置
print(df.T)

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# 显式索引,按行列标签索引或切片
print(df.loc["bb":"cc", :"name"])

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# 隐式索引,按行列位置索引或切片
print(df.iloc[1:3, 0:1])

        注意:位置索引都是含左不含右的。

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# 使用行列标签访问单个元素
print(df.at["aa", "name"])

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],
                   "name": ["张三", "李四", "王五"],
                   "age": [15, 16, 17]}, index=["aa", "bb", "cc"])

# 使用行列位置访问单个元素
print(df.iat[1, 1])

四.DataFrame的常用方法

方法

说明

head()

查看前n行数据,默认5行

tail()

查看后n行数据,默认5行

isin()

元素是否包含在参数集合中

isna()

元素是否为缺失值

sum()

求和

mean()

平均值

min()

最小值

max()

最大值

var()

方差

std()

标准差

median()

中位数

mode()

众数

quantile()

指定位置的分位数,如quantile(0.5)

describe()

常见统计信息

info()

基本信息

value_counts()

每个元素的个数

count()

非空元素的个数

drop_duplicates()

去重

sample()

随机采样

replace()

用指定值代替原有值

equals()

判断两个DataFrame是否相同

cummax()

累计最大值

cummin()

累计最小值

cumsum()

累计和

cumprod()

累计积

diff()

一阶差分,对序列中的元素进行差分运算,也就是用当前元素减去前一个元素得到差值默认情况下,它会计算一阶差分,即相邻元素之间的差值。参数:

periods:整数,默认为 1。表示要向前或向后移动的周期数,用于计算差值。正数表示向前移动,负数表示向后移动。

axis:指定计算的轴方向。0 或 'index' 表示按列计算,1 或 'columns' 表示按行计算,默认值为 0。

sort_index()

按行索引排序

sort_values()

按某列的值排序,可传入列表来按多列排序,并通过ascending参数设置升序或降序

nlargest()

返回某列最大的n条数据

nsmallest()

返回某列最小的n条数据

        在Pandas的 DataFrame 方法里,axis 是一个非常重要的参数,它用于指定操作的方向。

axis 参数可以取两个主要的值,即 0 或 'index',以及 1 或 'columns' ,其含义如下:

  • axis=0 或 axis='index':表示操作沿着行的方向进行,也就是对每一列的数据进行处理。例如,当计算每列的均值时,就是对每列中的所有行数据进行计算。
  • axis=1 或 axis='columns':表示操作沿着列的方向进行,也就是对每行的数据进行处理。例如,当计算每行的总和时,就是对每行中的所有列数据进行计算。

        其实和numpy中的概念相同。

import pandas as pd

df = pd.DataFrame(data={"id": [101, 102, 103,104,105,106,101],
                        "name": ["张三", "李四", "王五","赵六","冯七","周八","张三"],
                        "age": [10, 20, 30, 40, None, 60,10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff","aa"])

# head() 查看前n行数据,默认5行
print(df.head())

import pandas as pd

df = pd.DataFrame(data={"id": [101, 102, 103,104,105,106,101],
                        "name": ["张三", "李四", "王五","赵六","冯七","周八","张三"],
                        "age": [10, 20, 30, 40, None, 60,10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff","aa"])

# tail() 查看后n行数据,默认5行
print(df.tail(3))

import pandas as pd

df = pd.DataFrame(data={"id": [101, 102, 103,104,105,106,101],
                        "name": ["张三", "李四", "王五","赵六","冯七","周八","张三"],
                        "age": [10, 20, 30, 40, None, 60,10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff","aa"])

# isin() 元素是否包含在参数集合中
print(df.isin(["王五"]))

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# isna() 元素是否为缺失值
print(df.isna())

        还是那句话,None会在Pandas数据结构里转化为np.nan。

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# sum() 求和
print(df["age"].sum())

        df[列标签名]取出来的就是这个列。换言之,对一个列求和是合理的,对一行求和就不合理了,因为一行里的数据一般不是一种数据类型,也不会有这种需求。

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# mean() 平均值
print(df["age"].mean())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# min() 最小值
print(df["age"].min())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# max() 最大值
print(df["age"].max())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# var() 方差
print(df["age"].var())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# std() 标准差
print(df["age"].std())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# median() 中位数
print(df["age"].median())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# mode() 众数
print(df["age"].mode())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# quantile() 指定位置的分位数
print(df["age"].quantile(0.5, interpolation="linear"))

        这在Series篇详细介绍过,不再介绍。

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# describe() 常见统计信息
print(df["age"].describe())

        也可查看整个df的describe(),而不是只能看单独一列Series的。

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# info() 基本信息
print(df["age"].info())

        也可查看整个df的info(),而不是只能看单独一列Series的。

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# value_counts() 每个元素的个数
print(df.value_counts())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# count() 非空元素的个数
print(df.count())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# drop_duplicates()
print(df.drop_duplicates())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# sample()
print(df.sample())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# replace() 用指定值代替原有值
print(df.replace(20, "haha"))

import pandas as pd
import numpy as np

# equals() 判断两个DataFrame是否相同
df1 = pd.DataFrame(data={"id": [101, 102, 103],
                         "name": ["张三", "李四", "王五"],
                         "age": [10, 20, 30]})
df2 = pd.DataFrame(data={"id": [101, 102, 103],
                         "name": ["张三", "李四", "王五"],
                         "age": [10, 20, 30]})
print(df1.equals(df2))

import pandas as pd
import numpy as np

# cummax() 累计最大值
df3 = pd.DataFrame(data={'A': [2, 5, 3, 7, 4],
                         'B': [1, 6, 2, 8, 3]})
# 按列 等价于 axis=0 默认
print(df3.cummax(axis="index"))
print("----------------------------------")
# 按行 等价于 axis=1
print(df3.cummax(axis="columns"))

import pandas as pd
import numpy as np

df3 = pd.DataFrame(data={'A': [2, 5, 3, 7, 4],
                         'B': [1, 6, 2, 8, 3]})

# cummin() 累计最小值
print(df3.cummin())

import pandas as pd
import numpy as np

df3 = pd.DataFrame(data={'A': [2, 5, 3, 7, 4],
                         'B': [1, 6, 2, 8, 3]})

# cumsum() 累计和
print(df3.cumsum())

import pandas as pd
import numpy as np

df3 = pd.DataFrame(data={'A': [2, 5, 3, 7, 4],
                         'B': [1, 6, 2, 8, 3]})

# cumprod()累计积
print(df3.cumprod())

import pandas as pd
import numpy as np

df3 = pd.DataFrame(data={'A': [2, 5, 3, 7, 4],
                         'B': [1, 6, 2, 8, 3]})

# diff() 一阶积分
print(df3.diff())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# sort_index() 按行索引排序
print(df.sort_index())

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# sort_values() 按某列的值排序,可传入列表来按多列排序,并通过ascending参数设置升序或降序
print(df.sort_values(by = ["age"]))

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# nlargest() 返回某列最大的n条数据
print(df.nlargest(n=2, columns="age"))

import pandas as pd
import numpy as np

df = pd.DataFrame(data={"id": [101, 102, 103, 104, 105, 106, 101],
                        "name": ["张三", "李四", "王五", "赵六", "冯七", "周八", "张三"],
                        "age": [10, 20, 30, 40, None, np.nan, 10]},
                  index=["aa", "bb", "cc", "dd", "ee", "ff", "aa"])

# nsmallest() 返回某列最小的n条数据
print(df.nsmallest(n=1, columns="age"))

五.DataFrame的布尔索引

        可以使用布尔索引从DataFrame中筛选满足某些条件的行。

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10],
                        "name": ["张三", "李四", "王五", "赵六"]},
                    columns=["name", "age"],
                    index=[101, 104, 103, 102])
print(df["age"] > 25)
print("---------------")
print(df[df["age"] > 25])

六.DataFrame的运算

1.DataFrame与标量运算

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10],
                        "name": ["张三", "李四", "王五", "赵六"]},
            columns=["name", "age"],
            index=[101, 104, 103, 102],)
print(df * 2)

2.DataFrame与DataFrame运算

        根据标签索引进行对位计算,索引没有匹配上的用NaN填充。

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={"age": [10, 20, 30, 40],
                         "name": ["张三", "李四", "王五", "赵六"]},
columns=["name", "age"],
index=[101, 102, 103, 104],)
df2 = pd.DataFrame(data={"age": [10, 20, 30, 40],
                         "name": ["张三", "李四", "王五", "田七"]},
columns=["name", "age"],
index=[102, 103, 104, 105],)

print(df1 + df2)

七.DataFrame的更改操作

1.设置行索引

        创建DataFrame时如果不指定行索引,pandas会自动添加从0开始的索引。

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
print(df)

        通过set_index()设置行索引:

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})

# inplace=True:这是一个布尔类型的参数。当设为 True 时,会直接在原
# DataFrame上进行修改;若设为 False(默认值),则会返回一个新的
# DataFrame,原DataFrame 保持不变
df.set_index("id", inplace=True)
print(df)

        通过reset_index()重置行索引:

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})

# inplace=True:这是一个布尔类型的参数。当设为 True 时,会直接在原
# DataFrame上进行修改;若设为 False(默认值),则会返回一个新的
# DataFrame,原DataFrame 保持不变
df.set_index("id", inplace=True)
print(df)
df.reset_index(inplace=True)
print("---------------")
print(df)

2.修改行索引名和列名

        通过rename()修改行索引名和列名:

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
print(df)

print("------------------")
df.rename(index={101: "一", 102: "二", 103: "三", 104: "四"},
          columns={"age": "年龄", "name": "姓名"}, inplace=True)
print(df)

        将index和columns重新赋值:

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
# print(df)

# print("------------------")
df.rename(index={101: "一", 102: "二", 103: "三", 104: "四"},
          columns={"age": "年龄", "name": "姓名"}, inplace=True)
# print(df)
df.index= ["I", "II", "III", "IV"]
df.columns = ["年齡", "名稱"]
print(df)

        添加列:

        通过 df[“列名”] 添加列。

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
df["phone"] = ["13333333333", "14444444444", 
               "15555555555", "16666666666"]
print(df)

        删除列:

        通过 df.drop(“列名”, axis=1) 删除,也可是删除行 axis=0。

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
df["phone"] = ["13333333333", "14444444444",
               "15555555555", "16666666666"]
# print(df)
df.drop("phone", axis=1, inplace=True)
print(df)

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
df["phone"] = ["13333333333", "14444444444",
               "15555555555", "16666666666"]
del df["phone"]
print(df)

​​​​​​​        插入列:

        通过 insert(loc, column, value) 插入。该方法没有inplace参数,直接在原数据上修改。

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [20, 30, 40, 10],
                   "name": ["张三", "李四", "王五", "赵六"],
                   "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)
df.insert(loc=0, column="phone", value=df["age"] * df.index)
print(df)

八.DataFrame数据的导入与导出

1.导出数据

方法

说明

to_csv()

将数据保存为csv格式文件,数据之间以逗号分隔,可通过sep参数设置使用其他分隔符,可通过index参数设置是否保存行标签,可通过header参数设置是否保存列标签。

to_pickle()

如要保存的对象是计算的中间结果,或者保存的对象以后会在Python中复用,可把对象保存为.pickle文件。如果保存成pickle文件,只能在python中使用。文件的扩展名可以是.p、.pkl、.pickle。

to_excel()

保存为Excel文件,需安装openpyxl包。

to_clipboard()

保存到剪切板。

to_dict()

保存为字典。

to_hdf()

保存为HDF格式,需安装tables包。HDF(Hierarchical Data Format)是一种面向科学与工程的分层二进制数据格式,核心用于存储、组织和共享海量、多维、复杂结构的数据,常见后缀为.hdf/.h4(HDF4)、.h5/.hdf5(HDF5)。

to_html()

保存为HTML格式,需安装lxml、html5libbeautifulsoup4包。

to_json()

保存为JSON格式。

to_feather()

feather是一种文件格式,用于存储二进制对象。feather对象也可以加载到R语言中使用。feather格式的主要优点是在Python和R语言之间的读写速度要比csv文件快。feather数据格式通常只用中间数据格式,用于Python和R之间传递数据,一般不用做保存最终数据。需安装pyarrow包。

to_sql()

保存到数据库。

import os
import pandas as pd

os.makedirs("data", exist_ok=True)
df = pd.DataFrame({"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"], "id": [101, 102, 103, 104]})
df.set_index("id", inplace=True)

df.to_csv("data/df.csv")
df.to_csv("data/df.tsv", sep="\t")  # 设置分隔符为 \t
df.to_csv("data/df_noindex.csv", index=False)  # index=False 不保存行索引
df.to_pickle("data/df.pkl")
df.to_excel("data/df.xlsx")
df.to_clipboard()
df_dict = df.to_dict()
df.to_hdf("data/df.h5", key="df")
df.to_html("data/df.html")
df.to_json("data/df.json")
df.to_feather("data/df.feather")

2.导入数据

方法

说明

read_csv()

加载csv格式的数据。可通过sep参数指定分隔符,可通过index_col参数指定行索引。

read_pickle()

加载pickle格式的数据。

read_excel()

加载Excel格式的数据。

read_clipboard()

加载剪切板中的数据。

read_hdf()

加载HDF格式的数据

read_html()

加载HTML格式的数据

read_json()

加载JSON格式的数据。

read_feather()

加载feather格式的数据。

read_sql()

加载数据库中的数据

df_csv = pd.read_csv("data/df.csv", index_col="id")  # 指定行索引
df_tsv = pd.read_csv("data/df.tsv", sep="\t")  # 指定分隔符
df_pkl = pd.read_pickle("data/df.pkl")
df_excel = pd.read_excel("data/df.xlsx", index_col="id")
df_clipboard = pd.read_clipboard(index_col="id")
df_from_dict = pd.DataFrame(df_dict)
df_hdf = pd.read_hdf("data/df.h5", key="df")
df_html = pd.read_html("data/df.html", index_col=0)[0]
df_json = pd.read_json("data/df.json")
df_feather = pd.read_feather("data/df.feather")

print(df_csv)
print(df_tsv)
print(df_pkl)
print(df_excel)
print(df_clipboard)
print(df_from_dict)
print(df_hdf)
print(df_html)
print(df_json)
print(df_feather)
Logo

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念,把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起,为开发者提供从开发、训练到部署的一站式体验。

更多推荐