Python数据分析基础（经济金融）

站长

2024年04月16日 16:12 · 阅读数 146

numpy

arrays操作

创建数组

sizes = (2, 3, 4)
x = np.zeros(sizes)

y = np.ones(sizes)

数组切片

x_1d = np.array([1, 2, 3])
print(x_1d[0])
print(x_1d[0:2])

x_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x_2d[1, 1])
print(x_2d[1, :])

数组属性: .shape .dtype

x = np.array([[1, 2, 3], [4, 5, 6]])
print(x.shape)
rows, columns = x.shape

广播操作：
- Operations between an array and a single number.
- Operations between two arrays of the same shape.
其他一些函数：

x = np.linspace(0, 25, 10)
np.mean(x)
np.std(x)
np.max(x)
np.diff(x) #前后项比较
np.reshape(x, (5, 2))
np.eye(3)#返回的是一个二维2的数组(N,M)，对角线的地方为1，其余的地方为0.

#链接广播特性和特定的处理函数：
np.random.seed(42)
x = np.random.rand(10)
print(x)

def f(val):
    if val < 0.3:
        return "low"
    else:
        return "high"

print(f(0.1)) # scalar, no problem
# f(x) # array, fails since f() is scalar
f_vec = np.vectorize(f)
print(f_vec(x))

向量/矩阵操作

向量

在numpy中，一个向量即为一维数组。 numpy中的某些广播操作也是向量操作。
点积：np.dot(x, y) x @ y

矩阵

一个 N × M 矩阵可以被认为是一个由M个n元素的向量组成的集合，这些向量以列的形式并排堆叠。在numpy中，一个矩阵即为一个二维数组。

#两个向量堆叠，3 × 2 的矩阵。
x = np.array([[1, 2, 3], [4, 5, 6]])

矩阵乘法：(三种方式)np.dot(x, y) x @ y np.matmul(x, y)

Matrix multiplication: Let v=x⋅yv = x \cdot yv=x⋅y then we can write vij=∑k=1Nxikykjv_{ij} = \sum_{k=1}^N x_{ik} y_{kj}vij=∑k=1Nxikykj where xijx_{ij}xij is notation that denotes the element found in the ith row and jth column of the matrix xxx.

转置：

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x.transpose())

单位矩阵: I = np.eye(3)
逆矩阵: np.linalg.inv(x)

A = np.array([[1, 2, 0], [3, 1, 0], [0, 1, 2]])
print(np.linalg.inv(A))

matplotlib

简单制图步骤

创建一个图形 Figure 和轴 Axis 对象，用于存储来自我们的图形的信息。
生成我们要绘制的数据。
传递数据，并通过调用plot方法在轴ax上绘制线形图。

# Step 1
fig, ax = plt.subplots()

# Step 2
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
#print(x, y)

# Step 3
ax.plot(x, y)

Figure 与 Axis 的区别

Figure 是整个画框，Axis 是画布，Figure 包含 Axis。这意味着可以在一个Figure中创建多个Axis。如下图所示：

# Specify the shape of the axes
fig, axes = plt.subplots(2, 3)

fig.set_facecolor("gray")

# Can choose hex colors
colors = ["#065535", "#89ecda", "#ffd1dc", "#ff0000", "#6897bb", "#9400d3"]

# axes is a numpy array and we want to iterate over a flat version of it
for (ax, c) in zip(axes.flat, colors):
    ax.set_facecolor(c)

fig.tight_layout()

Python数据分析基础（经济金融）

图种

Bar 条形图

countries = ["CAN", "MEX", "USA"]
populations = [36.7, 129.2, 325.700]
land_area = [3.850, 0.761, 3.790]

fig, ax = plt.subplots(2)

ax[0].bar(countries, populations, align="center")
ax[0].set_title("Populations (in millions)")

ax[1].bar(countries, land_area, align="center")
ax[1].set_title("Land area (in millions miles squared)")

fig.tight_layout()

Python数据分析基础（经济金融）

散点图（包含批注）

N = 50

np.random.seed(42)

x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

fig, ax = plt.subplots()

ax.scatter(x, y, s=area, c=colors, alpha=0.5)

ax.annotate(
    "First point", xy=(x[0], y[0]), xycoords="data",
    xytext=(0.5, 0.8),
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=0.6")
)

Python数据分析基础（经济金融）

填充图

x = np.linspace(0, 1, 500)
y = np.sin(4 * np.pi * x) * np.exp(-5 * x)

fig, ax = plt.subplots()

ax.grid(True)
ax.fill(x, y);

Python数据分析基础（经济金融）

其他常用属性

ax.set_xlabel('..')
ax.legend(["Employed", "Unemployed"])

pandas

基础介绍

Series

Series是一列数据，每个数据都有行标签。pandas将行标签作为系列的索引。

创建方法：

values = [5.6, 5.3, 4.3, 4.2, 5.8, 5.3, 4.6, 7.8, 9.1, 8., 5.7]
years = list(range(1995, 2017, 2))

unemp = pd.Series(data=values, index=years, name="Unemployment")
print(pd.Series(data=values, index=years, name="Unemployment"))
print(pd.Series(values))

属性：unemp.index unemp.values unemp.head unemp.tail
辅助制图: unemp.plot()
查找：

#找到Series中唯一的值
unemp.unique()
#索引定位，.loc[index_items]
unemp.loc[1995]
unemp.loc[[1995, 2005, 2015]]

DateFrame

DataFrame是pandas存储一列或多列数据的方式。我们可以将一个DataFrames看作是并排堆叠的多个Series。这类似于Excel工作簿中的工作表或SQL数据库中的表。除了行标签(索引)，DataFrames还有列标签。我们将这些列标签称为列或列名。

创建方法：

data = {
    "NorthEast": [5.9,  5.6,  4.4,  3.8,  5.8,  4.9,  4.3,  7.1,  8.3,  7.9,  5.7],
    "MidWest": [4.5,  4.3,  3.6,  4. ,  5.7,  5.7,  4.9,  8.1,  8.7,  7.4,  5.1],
    "South": [5.3,  5.2,  4.2,  4. ,  5.7,  5.2,  4.3,  7.6,  9.1,  7.4,  5.5],
    "West": [6.6, 6., 5.2, 4.6, 6.5, 5.5, 4.5, 8.6, 10.7, 8.5, 6.1],
    "National": [5.6, 5.3, 4.3, 4.2, 5.8, 5.3, 4.6, 7.8, 9.1, 8., 5.7]
}

unemp_region = pd.DataFrame(data, index=years)

属性：unemp.index unemp.values unemp.head unemp.tail
辅助制图: unemp.plot()

Python数据分析基础（经济金融）

查找：

#找到Series中唯一的值
unemp.unique()
#索引定位，.loc[index_xitems, index_yitems]
print(unemp_region.loc[1999, "NorthEast"])
print(unemp_region.loc[[1995, 2005], "South"])

列计算：类似于numpy的广播操作。

print(unemp_region)
unemp_region["West"] / 100
unemp_region["West"].max() # 找出所在列的最大值
unemp_region["West"] - unemp_region["MidWest"] # 计算两列数据的差额
unemp_region.West.corr(unemp_region["MidWest"]) # 计算两列数据的相关性
unemp_region.sum() # 所在列的数据加和

修改数据：
- 添加新列df["New Column Name"] = new_values
- 更改标签名或列名(重新赋值)df.loc[index, column] = value
- 修改现有数据(例如，做一些算术或将一列字符串改为小写)
```
names = {"NorthEast": "NE",
     "MidWest": "MW",
     "South": "S",
     "West": "W"}
unemp_region.rename(columns=names)
```

应用

内置聚合操作

下述操作返回的都是Series数据类型。

Mean (mean)
Variance (var)
Standard deviation (std)
Minimum (min)
Median (median)
Maximum (max)
etc… 如上所示，聚合的默认值是聚合每个列。但是，通过使用axis关键字参数，您也可以按行进行聚合。

print(unemp.head()) # 按列
print(unemp.var(axis=1).head()) # 按行

自定义聚合操作

# 定义聚合函数（聚合规则）
def high_or_low(s):
    """
    This function takes a pandas Series object and returns high
    if the mean is above 6.5 and low if the mean is below 6.5
    """
    if s.mean() < 6.5:
        out = "Low"
    else:
        out = "High"

    return out

# 传递聚合规则
unemp.agg(high_or_low)

内置变换操作

应用于Series的函数的输出可能需要是一个新的Series。

累计和/最大值/最小值/积 (cum(sum|min|max|prod))
相邻两项的差额 (diff)
Elementwise addition/subtraction/multiplication/division (+, -, *, /)
百分比变化 (pct_change)
每个不同值的出现次数 (value_counts)
绝对值 (abs)

自定义变换操作

自定义系列转换步骤：编写一个Python函数（变换规则），接受一个Series并输出一个新的Series。将新函数（规则）作为参数传递给apply方法(或者，transform方法)。

def standardize_data(x):
    """
    Changes the data in a Series to become mean 0 with standard deviation 1
    """
    mu = x.mean()
    std = x.std()

    return (x - mu)/std
    
std_unemp = unemp.apply(standardize_data)
std_unemp.head()

自定义标量转换步骤：定义一个接受标量并生成标量的Python函数（变换规则）。将此函数作为参数传递给applymapSeries或DataFrame方法。

布尔值筛选

布尔值筛选的应用例子有：

只对18岁以上的人进行分析。
看看特定时间段的数据。
获取特定产品或客户ID的数据。我们可以通过使用Series或布尔值列表来索引Series或DataFrame来做到这一点。

步骤：

创建一个布尔值DataFrames/Series。（可以直接创建，也可以通过函数制定规则来返回一个布尔值DataFrames/Series）

print(unemp_small)
unemp_small["Texas"] < 4.5

将这个布尔值DataFrames/Series作为索引转递给原始数据集提取子集。

unemp_small.loc[unemp_small["Texas"] < 4.5]

其他操作

isin 检查一个数据点是否具有几个固定值中的一个。

print(unemp_small)
unemp_small["Michigan"].isin([3.3, 3.2])

进而可以定位到含有特定值的子集：

unemp_small.loc[unemp_small["Michigan"].isin([3.3, 3.2])]

.any 和 .all : Python函数any和all是接受布尔值集合并返回单个布尔值的聚合函数。当至少有一个输入为True时，any返回True;当所有输入都为True时，all返回True。

附：蒙特卡罗模拟

蒙特卡罗模拟的用途：

我们研究复杂的系统时，这些系统可能具有内在的随机性，但它们并不容易向我们揭示它们的潜在分布。在我们面临这种困难的情况下，我们求助于一套被称为蒙特卡罗方法的工具。

这些方法可以归结为重复模拟某些事件(或多个事件)并查看结果分布。它的背后是大数定理。

蒙特卡罗模拟的基本步骤：

生成随机数：

#离散
rn = np.random.rand()

操作随机数
重复上述操作
得到最终结果，解释或可视化

转载自:https://juejin.cn/post/7099447149296877581