<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ver 的 Blog]]></title><description><![CDATA[哈喽~欢迎光临]]></description><link>https://blog.verxie.org</link><image><url>https://blog.verxie.org/innei.svg</url><title>Ver 的 Blog</title><link>https://blog.verxie.org</link></image><generator>Shiro (https://github.com/Innei/Shiro)</generator><lastBuildDate>Mon, 13 Apr 2026 23:14:48 GMT</lastBuildDate><atom:link href="https://blog.verxie.org/feed" rel="self" type="application/rss+xml"/><pubDate>Mon, 13 Apr 2026 23:14:47 GMT</pubDate><language><![CDATA[zh-CN]]></language><item><title><![CDATA[论文阅读:焊接]]></title><description><![CDATA[<link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/r49p30ukyrgspbons1.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/6dtse6ori0c6q0emo4.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/96cm50y4b4xminqkxd.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/r13zoaqd9l6nyjiln8.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/lxa9562dzc74l2ioxd.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/hr16u6tnamvkqz4izo.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/1euywgw7vleoc64v6m.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/mr9zs3muw701xras3y.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/0kggy5hwxbu7lts2sa.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/m1q71lqjppincmzz2j.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/c908me5s7onfxgjud6.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/05j0pbz3zhnqj97cds.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/5u495hwvqt8oz4mdvn.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/oj31xba2cce19b9djt.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/2t0bl6x45qig63a910.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/ozcm2oel7ok7qyjlj0.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/5wcw7ibgdafayui8uw.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/780i50w3doqaei3v6a.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/un27cnk9wr9tjrwei7.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/pyqfl8bpvqfk01upyw.jpg"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/file/g20wwuctx3mjcwmy8q.jpg"/><div><blockquote>该渲染由 Shiro API 生成，可能存在排版问题，最佳体验请前往：<a href="https://blog.verxie.org/posts/study/literature_welding">https://blog.verxie.org/posts/study/literature_welding</a></blockquote><div><h2 id="1-derivation-of-physical-equations-for-high-speed-laser-welding-using-large-language-modelshttpswwwsciencedirectcomsciencearticlepiis0890695525000756">1. 
<a href="https://www.sciencedirect.com/science/article/pii/S0890695525000756">Derivation of physical equations for high-speed laser welding using large language models</a></h2><blockquote><p>2026.1.20</p><p>多模态融合, 语言模型, 激光焊接</p></blockquote>
<h3 id="">一、研究问题</h3><p>高速激光焊接中的<strong>隆起（humping）缺陷</strong>机理复杂、数据稀疏。<br/>传统做法依赖大量实验数据与经验公式/量纲分析，成本高、跨材料迁移差。<br/>本文目标：<strong>在稀疏数据条件下，用文献知识 + LLM 推导可解释的物理方程</strong>，并用于隆起发生预测与工艺优化。</p><h3 id="">二、核心思路</h3><p>提出 <strong>T2EGPT（Text-to-Equation Generation Transformer）框架</strong>：<br/>把“<strong>文本里的领域知识</strong>”（来自筛选后的私有文献库）与“<strong>稀疏实验数据</strong>”连接起来，自动生成候选方程，并通过规则化评分筛选最优方程。</p><p>关键点：</p><ul><li>先建一个<strong>私有数据库</strong>：文献按预定义主题标准筛选（面向隆起缺陷相关机理/参数）。</li><li>输入一组与隆起相关的<strong>物理参数</strong>（例如：最大融化速度、熔池长度、热导率、密度、比热、表面张力系数等）。</li><li>LLM 先生成<strong>“相关性报告”</strong>：对变量关系按“类型/形式/效果”分类。</li><li>再把这些关系用于<strong>候选方程生成与打分筛选</strong>（结合模式搜索/评分量表 rubric）。</li></ul><h3 id="">三、方法论</h3><p>传统经验方程发展（数据驱动/量纲分析）</p><ul><li>输入：Humping Data（稀疏数据）
<ol start="1"><li>构建输入维度矩阵</li><li>搜索无量纲空间（dimensionless space）</li><li>评估打分</li><li>从拟合结果选最优方程</li></ol></li></ul><p>本文方法（LLM + 文献知识驱动）</p><ul><li>输入：Literature Domain Knowledge（私有文献库）
<ol start="1"><li>相关性结果（Correlation results）
<ul><li>直接/间接相关分析</li><li>线性/非线性关系分析</li></ul></li><li>Rubric 评分（可由人定义标准）
<ul><li>Positive / Negative / Not significant</li></ul></li><li>从 LLM 推导出的候选方程中选最优方程</li></ol></li></ul><h3 id="">四、结论</h3><p>隆起是一种<strong>不平衡状态</strong>：</p><ul><li><strong>惯性效应</strong>主导于<strong>毛细稳定性</strong>。</li><li><p>隆起来源于：<strong>惯性驱动的后向熔融回流</strong> 与 <strong>毛细力驱动的表面稳定</strong>之间的竞争。</p><ul><li>惯性力促使熔融金属向后流动</li><li>毛细力抵抗表面形变，抑制隆起</li></ul></li><li>在<strong>不锈钢、铝、钛合金</strong>之间表现出<strong>高预测准确性与可迁移性</strong>（跨材料泛化）。</li><li>在<strong>有限数据</strong>场景下，比传统纯数据驱动方法更稳健。</li><li>生成的物理方程具备<strong>可解释性</strong>，可用于<strong>指导工艺优化</strong>（不只是“黑盒预测”）。</li></ul><h3 id="">五、创新</h3><ol start="1"><li><strong>文本 + 数据 + LLM</strong>联合：推动“物理法则/经验方程”的自动发现。</li><li>用 LLM 建私有知识库并自动抽取关系、构造候选方程，减少人工整理与文献格式不统一的痛点。</li><li>候选方程通过<strong>固定评分规则</strong>评估，形成“可控”的方程筛选链路，而非完全自由生成。</li></ol><h3 id="">六、缺陷</h3><ul><li><strong>LLM 幻觉</strong>是潜在风险。</li><li>缓解思路：提示模型<strong>输出支持结论的原始文本片段</strong>，并由研究者结合领域知识做校验（人机协作闭环）。</li></ul><hr/><h2 id="2-multimodal-data-fusion-for-welding-defect-detection-using-ensemble-deep-learninghttpswwwsciencedirectcomsciencearticlepiis0926580525007344">2. <a href="https://www.sciencedirect.com/science/article/pii/S0926580525007344">Multimodal data fusion for welding defect detection using ensemble deep learning</a></h2><blockquote><p>2026.1.20</p><p>多模态融合，深度学习，缺陷检测，电阻点焊，模型解释</p></blockquote>
<h3 id="">一、研究问题</h3><p>焊接作为复杂热力学过程，易出现裂纹、气孔、夹渣、未熔合、不完全穿透、切口等缺陷；诱因包括材料变化、参数波动、环境因素、操作误差等。</p><p>传统机器学习/单模态深度学习存在局限：</p><ul><li><strong>效率与鲁棒性不足</strong>：对缺陷形状/分布变化、噪声干扰敏感；小样本与类别不平衡下性能不稳。</li><li><strong>信息不完备</strong>：单一模态只能提供局部特征，缺乏全局完整性与冗余校验。</li><li><strong>工业可用性不足</strong>：缺少可解释的定量指标，难以支撑质量控制决策与传感器布局优化。</li></ul><p>本研究目标：</p><ol start="1"><li>在有限且不平衡数据下，实现高准确率、稳健的缺陷分类；</li><li>降低计算复杂性，支持多模态数据流高效处理；</li><li>提供定量可解释性指标，增强决策可信度并指导传感器优化。</li></ol><h3 id="">二、核心思路</h3><p>构建三个子分类器，分别从<strong>红外图像、RGB图像、焊接参数</strong>提取信息；通过 <strong>Dempster–Shafer（DS）证据理论</strong>进行多模态融合形成集成框架。</p><p>其中图像子网络采用 <strong>ResNet-based 权重共享的双输入结构（多视角双图像）</strong>，并引入 <strong>FPN增强（F-ResNet / DF-ResNet）</strong>提升多尺度特征表达。</p><p>解释性方面结合：</p><ul><li><strong>MM-SHAP</strong>：量化各模态对最终分类的贡献；</li><li><strong>Grad-CAM</strong>：可视化图像网络关注区域（尤其是DF-ResNet在图像处理中的显著区域）。</li></ul><h3 id="">三、方法论</h3><p>整体流程（三步）：</p><ol start="1"><li><strong>采集与预处理</strong>：红外图像、RGB图像、焊接参数（含处理多重共线性）；并处理数据不平衡（增强）。</li><li><strong>单模态预训练 + DS融合</strong>：分别训练各模态基础分类器，再用DS证据理论融合输出。</li><li><strong>可解释性分析</strong>：MM-SHAP 量化模态贡献 + Grad-CAM 可视化注意区域，形成“定量 + 定性”解释闭环。</li></ol><h4 id="31-">3.1 网络结构</h4><ul><li><p>ResNet-18 两类残差结构： 
(a) 维持特征图大小；(b) 改变特征图大小
<img src="https://mixapi2.verxie.org/api/v2/objects/file/r49p30ukyrgspbons1.jpg" alt="Two types of residual structures in ResNet-18: (a) Maintaining feature map size, and (b) Changing feature map size."/></p></li><li><p>FPN增强 ResNet：<strong>F-ResNet</strong><br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/6dtse6ori0c6q0emo4.jpg" alt="FPN-enhanced Resnet (F-ResNet) network architecture."/></p></li><li><p>双输入 FPN增强 ResNet：<strong>DF-ResNet</strong>（并使用<strong>权重共享</strong>）<br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/96cm50y4b4xminqkxd.jpg" alt="Dual-input FPN-enhanced ResNet (DF-ResNet) network architecture."/></p></li></ul><blockquote><p>评价指标：Accuracy / Precision / Recall / F1-score</p></blockquote>
<ul><li>MM-SHAP 计算过程<br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/r13zoaqd9l6nyjiln8.jpg" alt="MM-SHAP calculation process."/></li></ul><h4 id="32-welding-parameters">3.2 焊接参数定义表（Welding parameters）</h4><table><thead><tr><th> Type   </th><th> Variable                    </th><th> 描述       </th><th> Unit </th><th> Range            </th></tr></thead><tbody><tr><td> Input  </td><td> Pressure（压力）                </td><td> 气动缸上的压力  </td><td> PSI  </td><td> {35, 60, 80, 95} </td></tr><tr><td> Input  </td><td> Welding time（焊接时间）          </td><td> 焊接过程时间   </td><td> ms   </td><td> 200–1500         </td></tr><tr><td> Input  </td><td> Electrode angle（电极角度）       </td><td> 电极之间的角度  </td><td> Deg  </td><td> {0, 15}          </td></tr><tr><td> Input  </td><td> Electrode force（电极力）        </td><td> 施加在电极上的力 </td><td> N    </td><td> 0–133.53         </td></tr><tr><td> Input  </td><td> Welding current（焊接电流）       </td><td> 电流通过金属板  </td><td> A    </td><td> 639.81–5009.43   </td></tr><tr><td> Input  </td><td> Material thickness A（材料厚度A） </td><td> 材料A厚度    </td><td> mm   </td><td> 0.61–1.057       </td></tr><tr><td> Input  </td><td> Material thickness B（材料厚度B） </td><td> 材料B厚度    </td><td> mm   </td><td> 0.608–1.01       </td></tr><tr><td> Output </td><td> Pull test force（拉力测试力）      </td><td> 焊接接头机械强度 </td><td> N    </td><td> 1410.3–5806.5    </td></tr><tr><td> Output </td><td> Nugget diameter（颗粒直径）       </td><td> 焊接点直径    </td><td> mm   </td><td> 1.9–4.72         </td></tr></tbody></table><h4 id="33-vif">3.3 多重共线性处理（VIF，逐步处理）</h4><table><thead><tr><th> Variable                  </th><th> Step 1 </th><th> Step 2 </th><th> Step 3 </th><th> Step 4 </th><th> Step 5 </th></tr></thead><tbody><tr><td> Pressure（压力）              </td><td>  15.76 </td><td>  15.40 </td><td>  15.16 </td><td>  15.15 </td><td>      / </td></tr><tr><td> Welding time（焊接时间）        </td><td>   7.76 </td><td>   7.76 </td><td>   7.61 </td><td>   5.75 </td><td>   5.45 </td></tr><tr><td> Electrode angle（电极角度）     </td><td>   3.67 </td><td>   3.67 </td><td>   3.54 </td><td>   2.82 </td><td>   2.74 </td></tr><tr><td> Electrode force（电极力）      </td><td>  15.46 </td><td>  14.94 </td><td>  13.07 </td><td>  10.00 </td><td>   8.12 </td></tr><tr><td> Welding current（焊接电流）     </td><td>  13.60 </td><td>  13.54 </td><td>  10.64 </td><td>   9.85 </td><td>   9.71 </td></tr><tr><td> Material thickness A（厚度A） </td><td> 603.48 </td><td>      / </td><td>      / </td><td>      / </td><td>      / </td></tr><tr><td> Material thickness B（厚度B） </td><td> 602.69 </td><td>  30.95 </td><td>  28.03 </td><td>  13.45 </td><td>   4.93 </td></tr><tr><td> Pull test force（拉力测试力）    </td><td>  59.84 </td><td>  59.84 </td><td>  40.69 </td><td>      / </td><td>      / </td></tr><tr><td> Nugget diameter（颗粒直径）     </td><td>  76.83 </td><td>  76.66 </td><td>      / </td><td>      / </td><td>      / </td></tr></tbody></table><h4 id="34-">3.4 数据增强前后样本分布</h4><table><thead><tr><th> Category    </th><th> Original Train </th><th> Original Test </th><th> Original Total </th><th> Augmented Train </th><th> Augmented Test </th><th> Augmented Total </th></tr></thead><tbody><tr><td> Good（好）     </td><td>            309 </td><td>           134 </td><td>            443 </td><td>             309 </td><td>            134 </td><td>             443 </td></tr><tr><td> Bad（坏）      </td><td>             15 </td><td>             6 </td><td>             21 </td><td>             300 </td><td>            120 </td><td>             420 </td></tr><tr><td> Explode（爆炸） </td><td>             22 
</td><td>             9 </td><td>             31 </td><td>             308 </td><td>            126 </td><td>             434 </td></tr><tr><td> <strong>Total</strong>   </td><td>        <strong>346</strong> </td><td>       <strong>149</strong> </td><td>        <strong>495</strong> </td><td>         <strong>917</strong> </td><td>        <strong>380</strong> </td><td>        <strong>1297</strong> </td></tr></tbody></table><h4 id="35-">3.5 超参数设置</h4><table><thead><tr><th> Hyper Parameter   </th><th> Value                                    </th></tr></thead><tbody><tr><td> Batch size        </td><td> 64                                       </td></tr><tr><td> Learning rate     </td><td> 0.0005                                   </td></tr><tr><td> Epochs            </td><td> 100                                      </td></tr><tr><td> Optimizer         </td><td> SGD                                      </td></tr><tr><td> Weight decay      </td><td> 5 × 10⁻⁵                                 </td></tr><tr><td> Loss function     </td><td> Cross-Entropy Loss                       </td></tr><tr><td> Early stopping    </td><td> No early stopping                        </td></tr><tr><td> Evaluation metric </td><td> Accuracy / Precision / Recall / F1-score </td></tr></tbody></table><h3 id="">四、结论</h3><ol start="1"><li>在多种缺陷场景下，实现 <strong>91.6%</strong> 整体准确率；</li><li><strong>双输入 + 权重共享</strong>使分类准确率提升 <strong>7.87%</strong>，并增强小样本场景鲁棒性；</li><li>在识别不良样本时，模型更依赖<strong>红外图像</strong>信息（由解释结果支持）。</li></ol><h4 id="41-">4.1 消融实验（组件贡献）</h4><table><thead><tr><th> Model            </th><th> FPN </th><th> Double input（双输入） </th><th> Weight sharing（权重共享） </th><th> Accuracy </th><th> Precision </th><th> Recall </th><th>    F1 </th></tr></thead><tbody><tr><td> ResNet           </td><td>     </td><td>                   </td><td>                      </td><td>    0.788 </td><td>     0.804 </td><td>  0.790 </td><td> 0.787 </td></tr><tr><td> D-ResNet         </td><td>     </td><td>         √         </td><td>           √          </td><td>    0.816 </td><td>     0.824 </td><td>  0.818 </td><td> 0.818 </td></tr><tr><td> F-ResNet         </td><td>  √  </td><td>                   </td><td>                      </td><td>    0.792 </td><td>     0.797 </td><td>  0.796 </td><td> 0.794 </td></tr><tr><td> DF-ResNet（不共享权重） </td><td>  √  </td><td>         √         </td><td>                      </td><td>    0.821 </td><td>     0.842 </td><td>  0.822 </td><td> 0.818 </td></tr><tr><td> <strong>DF-ResNet</strong>    </td><td>  √  </td><td>         √         </td><td>           √          </td><td>    0.850 </td><td>     0.851 </td><td>  0.854 </td><td> 0.852 </td></tr></tbody></table><h4 id="42-">4.2 不同模型指标对比</h4><table><thead><tr><th> Model         </th><th>  Accuracy </th><th> Precision </th><th>    Recall </th><th>        F1 </th></tr></thead><tbody><tr><td> IrNet         </td><td>     0.787 </td><td>     0.784 </td><td>     0.793 </td><td>     0.785 </td></tr><tr><td> DF-ResNet     </td><td>     0.850 </td><td>     0.851 </td><td>     0.854 </td><td>     0.852 </td></tr><tr><td> ANN           </td><td>     0.855 </td><td>     0.858 </td><td>     0.860 </td><td>     0.857 </td></tr><tr><td> <strong>EMMDL（本文）</strong> </td><td> <strong>0.916</strong> </td><td> <strong>0.926</strong> </td><td> <strong>0.920</strong> </td><td> <strong>0.917</strong> </td></tr></tbody></table><h4 id="43-backbone">4.3 不同主干对比（Backbone）</h4><table><thead><tr><th> Backbone           </th><th>  Accuracy </th><th> Precision </th><th>    Recall </th><th>    
    F1 </th></tr></thead><tbody><tr><td> AlexNet            </td><td>     0.524 </td><td>     0.388 </td><td>     0.537 </td><td>     0.432 </td></tr><tr><td> VggNet             </td><td>     0.611 </td><td>     0.645 </td><td>     0.610 </td><td>     0.602 </td></tr><tr><td> GoogleNet          </td><td>     0.761 </td><td>     0.819 </td><td>     0.771 </td><td>     0.743 </td></tr><tr><td> MobileNetv2        </td><td>     0.734 </td><td>     0.740 </td><td>     0.739 </td><td>     0.738 </td></tr><tr><td> SqueezeNet         </td><td>     0.789 </td><td>     0.802 </td><td>     0.795 </td><td>     0.792 </td></tr><tr><td> Vision Transformer </td><td>     0.703 </td><td>     0.727 </td><td>     0.703 </td><td>     0.702 </td></tr><tr><td> ResNet             </td><td>     0.816 </td><td>     0.824 </td><td>     0.818 </td><td>     0.818 </td></tr><tr><td> <strong>F-ResNet</strong>       </td><td> <strong>0.850</strong> </td><td> <strong>0.851</strong> </td><td> <strong>0.854</strong> </td><td> <strong>0.852</strong> </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>双输入 + 权重共享 + 集成学习</strong>：提升缺陷分类准确率与小样本鲁棒性。</li><li><strong>DS证据理论融合</strong>：以“证据融合”方式整合多模态输出，增强抗噪与互补性利用。</li><li><strong>可解释性闭环（Grad-CAM + MM-SHAP）</strong>：既能可视化图像关注区域，又能定量比较模态贡献，支持质量控制与传感器布局优化。</li><li>在材料变化、环境干扰等工业典型扰动下，较单模态框架更不易失效，具备更强工程落地潜力。</li></ol><h3 id="">六、缺陷</h3><ul><li><strong>融合策略偏后期决策融合</strong>：DS融合主要作用于分类证据层，可能未充分利用模态间的细粒度交互（特征级互补）；对复杂缺陷的时空演化信息利用有限。</li><li><strong>泛化边界未完全明确</strong>：任务是电阻点焊缺陷分类，是否能跨材料牌号、不同设备、不同工装/光照/热辐射条件保持稳定，需要更多域外测试。</li><li><strong>可解释性仍是“相关性解释”</strong>：Grad-CAM/MM-SHAP提供的是注意/贡献线索，但不等价于因果解释；对工艺优化的指导需要结合物理与工艺约束进一步验证。</li></ul><hr/><h2 id="3-deep-multimodal-fusion-of-spectral-and-visual-data-for-laser-welding-defect-classificationhttpswwwsciencedirectcomsciencearticlepiis0952197625035043dgcidrsssdall">3. <a href="https://www.sciencedirect.com/science/article/pii/S0952197625035043?dgcid=rss_sd_all">Deep multimodal fusion of spectral and visual data for laser welding defect classification</a></h2><blockquote><p>2026.1.20</p><p>多模态融合，激光焊接，光谱-视觉，交叉注意力，缺陷分类，通道选择</p></blockquote>
<h3 id="">一、研究问题</h3><p>激光焊接缺陷检测需要准确解读异构信号，其中<strong>焊缝图像（表面/几何/飞溅等）</strong>与<strong>光谱时间序列（等离子体/热/材料状态等）</strong>提供互补信息，但二者在数据形态、噪声与对齐方式上差异大，<strong>有效融合仍具挑战性</strong>。</p><h3 id="">二、核心思路</h3><p>构建面向<strong>汽车电池母线焊接</strong>的多模态数据集，并提出<strong>基于交叉注意力（cross-attention）的视觉-光谱融合框架</strong>：</p><ul><li>先对焊缝图像做分割以抑制背景干扰；</li><li>对光谱做相关性分析，筛选信息更丰富的通道以降维；</li><li>使用<strong>反向光谱嵌入（inverted spectral embedding）</strong><ul><li><strong>视觉到光谱的交叉注意力（vision-to-spectrum cross-attention）</strong>建模细粒度跨模态交互。</li></ul></li></ul><h3 id="">三、方法论</h3><h4 id="31-">3.1 数据采集与工况设置</h4><p>数据集包含 7 种焊接工况</p><table><thead><tr><th> Welding status（状态）   </th><th> Definition（定义）                 </th></tr></thead><tbody><tr><td> Baseline（基线）         </td><td> 激光设备默认设置，工件未处理                 </td></tr><tr><td> Low Power（低功率）       </td><td> 焊接功率低于基线                       </td></tr><tr><td> Low Gap（间隙过小）        </td><td> 工件间隙 &lt; <strong>0.5mm</strong>             </td></tr><tr><td> Defocus（失焦）          </td><td> 失焦量为 <strong>4mm</strong> 与 <strong>6mm</strong> </td></tr><tr><td> Water Treatment（水处理） </td><td> 焊接前用水清洗工件                      </td></tr><tr><td> Oil Treatment（油处理）   </td><td> 焊接前除油清洗工件                      </td></tr><tr><td> Cold Weld（冷焊）        </td><td> 焊后未能正确结合                       </td></tr></tbody></table><h4 id="32-">3.2 预处理与输入构建</h4><ul><li><strong>焊缝图像分割</strong>：用 U-Net 分割焊缝区域，降低背景噪声对特征提取的影响。</li><li><strong>光谱通道选择（通道维压缩）</strong>：对光谱信号做相关性分析（Pearson correlation），筛选信息通道以降维、降计算。</li><li><strong>对齐策略（焊缝-光谱配对）</strong>：将输入图像裁剪为两条独立焊缝，并对第一条焊缝垂直翻转；每条焊缝与其对应光谱配对进入模型（保证对齐）。</li><li><strong>光谱时间长度</strong>：将光谱时间序列长度设置为 <strong>560 time steps</strong>（用于捕捉时间变化）。</li></ul><h4 id="33-">3.3 模型结构</h4><ol start="1"><li><strong>视觉编码器</strong>：MobileNetV2（强调效率，面向实时工业场景）。</li><li><strong>光谱嵌入</strong>：MLP 将原始光谱序列投影到高维；采用<strong>反向嵌入</strong>以更好捕捉通道间依赖。</li><li><strong>融合模块</strong>：可学习的<strong>交叉注意力</strong>，实现视觉特征与光谱表征的显式交互（比简单拼接/相加更能建模跨模态依赖）。</li><li><strong>检测头</strong>：对融合后的 token 做自注意力（MHSA）聚合全局上下文，再输出焊接状态类别。</li><li><strong>损失函数</strong>：Focal Loss（缓解类别不平衡，强调困难样本）。</li></ol><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/lxa9562dzc74l2ioxd.jpg" height="256" width="736"/></p><h3 id="">四、结论</h3><h4 id="41-">4.1 不同融合方式对比（逐类指标）</h4><blockquote><p>逐元素相加、按通道拼接、交叉注意力</p></blockquote>
<table><thead><tr><th> Defect type     </th><th> Add: Pre </th><th> Add: Rec </th><th> Add: F1 </th><th> Add: AUC </th><th> Concat: Pre </th><th> Concat: Rec </th><th> Concat: F1 </th><th> Concat: AUC </th><th> CrossAttn: Pre </th><th> CrossAttn: Rec </th><th> CrossAttn: F1 </th><th> CrossAttn: AUC </th></tr></thead><tbody><tr><td> Baseline        </td><td>    100.0 </td><td>     91.0 </td><td>    95.2 </td><td>     95.6 </td><td>        95.2 </td><td>        90.9 </td><td>       93.0 </td><td>        95.0 </td><td>          100.0 </td><td>           95.6 </td><td>          97.8 </td><td>           97.7 </td></tr><tr><td> Low power       </td><td>    100.0 </td><td>    100.0 </td><td>   100.0 </td><td>    100.0 </td><td>       100.0 </td><td>        95.2 </td><td>       97.6 </td><td>        97.6 </td><td>          100.0 </td><td>          100.0 </td><td>         100.0 </td><td>          100.0 </td></tr><tr><td> Low gap         </td><td>    100.0 </td><td>    100.0 </td><td>   100.0 </td><td>    100.0 </td><td>       100.0 </td><td>       100.0 </td><td>      100.0 </td><td>       100.0 </td><td>          100.0 </td><td>          100.0 </td><td>         100.0 </td><td>          100.0 </td></tr><tr><td> Defocus         </td><td>     95.2 </td><td>    100.0 </td><td>    97.6 </td><td>     98.8 </td><td>        95.2 </td><td>       100.0 </td><td>       97.6 </td><td>        98.8 </td><td>          100.0 </td><td>          100.0 </td><td>         100.0 </td><td>          100.0 </td></tr><tr><td> Water treatment </td><td>    100.0 </td><td>    100.0 </td><td>   100.0 </td><td>    100.0 </td><td>       100.0 </td><td>       100.0 </td><td>      100.0 </td><td>       100.0 </td><td>           80.0 </td><td>          100.0 </td><td>          88.9 </td><td>           99.6 </td></tr><tr><td> Oil treatment   </td><td>    100.0 </td><td>    100.0 </td><td>   100.0 </td><td>    100.0 </td><td>       100.0 </td><td>       100.0 </td><td>      100.0 </td><td>       100.0 </td><td>          100.0 </td><td>          100.0 </td><td>         100.0 </td><td>          100.0 </td></tr><tr><td> Cold weld       </td><td>    100.0 </td><td>    100.0 </td><td>   100.0 </td><td>    100.0 </td><td>       100.0 </td><td>       100.0 </td><td>      100.0 </td><td>       100.0 </td><td>          100.0 </td><td>          100.0 </td><td>         100.0 </td><td>          100.0 </td></tr></tbody></table><h4 id="42-">4.2 与单模态模型对比（加权指标）</h4><table><thead><tr><th> Metrics           </th><th> Vision-only（VGG16） </th><th> Vision-only（MobileNetV2） </th><th> Vision-only（GoogleNet） </th><th> Vision-only（ResNet50） </th><th> Vision-only（ViT-B/16） </th><th> Spectrum-only（Informer） </th><th> Spectrum-only（DLinear） </th><th> Ours（Cross attention Fusion） </th></tr></thead><tbody><tr><td> Weighted Pre. (%) </td><td>               97.7 </td><td>                     97.5 </td><td>                   98.5 </td><td>                  97.9 </td><td>                  83.2 </td><td>                    93.8 </td><td>                   93.1 </td><td>                     <strong>99.4</strong> </td></tr><tr><td> Weighted Rec. (%) </td><td>               97.6 </td><td>                     97.6 </td><td>                   98.4 </td><td>                  97.6 </td><td>                  82.9 </td><td>                    93.5 </td><td>                   92.7 </td><td>                     <strong>99.2</strong> </td></tr><tr><td> Weighted F1. 
(%)  </td><td>               97.6 </td><td>                     97.5 </td><td>                   98.4 </td><td>                  97.6 </td><td>                  83.0 </td><td>                    93.5 </td><td>                   92.6 </td><td>                     <strong>99.2</strong> </td></tr></tbody></table><h4 id="43--">4.3 复杂度-性能权衡（消融/变体）</h4><table><thead><tr><th> Model variant                </th><th> Fusion module </th><th> Cross-attention </th><th> Params </th><th> FLOPs </th><th> Weighted F1 (%) </th></tr></thead><tbody><tr><td> Image only（仅图像）              </td><td>       ✗       </td><td>        ✗        </td><td>   5.6M </td><td>  8.5G </td><td>            97.5 </td></tr><tr><td> + Spectrum branch（+光谱分支）     </td><td>       ✓       </td><td>        ✗        </td><td>  15.9M </td><td> 12.6G </td><td>            98.4 </td></tr><tr><td> + Cross-attention fusion（本文） </td><td>       ✓       </td><td>        ✓        </td><td>  25.9M </td><td> 18.5G </td><td>        <strong>99.2</strong> </td></tr></tbody></table><p>总体结论：</p><ul><li>方法整体准确率达到 <strong>99.2%</strong>，提升光谱嵌入维度后可进一步提升到 <strong>100.0%</strong>；消融验证了分割、通道选择、嵌入/融合设计的收益。</li><li>论文也强调其在其他工业缺陷数据集（NEU、DAGM）上验证了泛化能力。</li></ul><h3 id="">五、创新</h3><ol start="1"><li><strong>交叉注意力用于视觉-光谱融合</strong>：显式建模跨模态依赖，实现动态交互（区别于简单拼接/相加）。</li><li><strong>“分割 + 通道选择”式输入净化</strong>：U-Net 抑制背景；Pearson 相关性筛通道，兼顾性能与效率。</li><li><strong>反向光谱嵌入 + 视觉到光谱注意力</strong>：面向“光谱是时间序列且通道相关强”的特性做结构化建模。</li><li><strong>工程可用性倾向</strong>：使用 MobileNetV2 等轻量视觉骨干，并在文中强调计算效率与实时部署潜力。</li></ol><h3 id="">六、缺陷</h3><ul><li><strong>融合模块偏“注意力对齐”但缺少物理约束</strong>：交叉注意力提升效果显著，但其对焊接物理过程的可解释性（哪些波段/哪些纹理对应哪些缺陷机理）需要更多验证（如波段重要性/注意力与物理量关联）。</li><li><strong>数据对齐策略存在潜在偏差</strong>：将两条焊缝裁剪并对第一条翻转的做法，可能引入形态先验；若未来换相机视角/焊缝形态变化，泛化边界需要额外验证。</li><li><strong>性能接近满分时，更需要“域外测试”</strong>：99%+ 的结果容易受到数据集划分、同工件泄漏、增强/采样策略等影响；建议关注跨批次、跨设备、跨材料、跨日期的外部验证设置。</li></ul><hr/><h2 id="4-an-effective-penetration-depth-and-width-prediction-method-in-pulsed-gta-welding-based-on-multimodal-transformer-serial-fusion-networkhttpswwwsciencedirectcomsciencearticlepiis1526612525012071">4. <a href="https://www.sciencedirect.com/science/article/pii/S1526612525012071">An effective penetration depth and width prediction method in pulsed GTA welding based on multimodal transformer-serial fusion network</a></h2><blockquote><p>2026.1.21</p><p>多模态融合, 熔深预测, 脉冲GTAW, 变压器融合网络</p></blockquote>
<h3 id="">一、研究问题</h3><p>P-GTAW（脉冲钨极氩弧焊）过程中熔池行为高度动态且复杂，尽管深度学习在焊接质量管理（WQM）中潜力显著，但在此类场景下的适用性仍需进一步验证。</p><p>同时，多传感器融合的机制仍缺乏可解释性：</p><ul><li>不同模态信息在模型中如何被挖掘与融合不清晰；</li><li>“直接互补性”与“各模态贡献边界”缺少明确说明。</li></ul><h3 id="">二、核心思路</h3><p>提出 <strong>AM-TSFNet（Attention-based Multimodal Transformer-Serial Fusion Network）</strong>：
利用 <strong>熔池图像 + 弧声信号 + 红外热像</strong> 的多模态输入，进行实时回归预测，输出：</p><ul><li><strong>背面焊缝宽度（back-bead width）</strong></li><li><strong>穿透深度（penetration depth）</strong></li></ul><p>目标是构建一个更准确、对噪声更鲁棒、并更“可解释”的焊接质量预测模型。</p><h3 id="">三、方法论</h3><h4 id="31-am-tsfnet-">3.1 框架概述（AM-TSFNet 总链路）</h4><p>AM-TSFNet 由三部分构成：</p><ol start="1"><li><strong>多模态输入处理与同步</strong>：熔池图像 / 声学信号 / 红外热像对齐后输入。</li><li><strong>高频与低频特征提取</strong>：通过共享特征提取器提取特征，并引入 <strong>STSL</strong> 抑制冗余低频、增强有效高频。</li><li><strong>跨模态融合与回归预测</strong>：通过注意力机制（Q/K/V 投影的自注意力结构）建模跨模态交互，最终进入回归头同时预测宽度与熔深。</li></ol><p>关键设计点：</p><ul><li><strong>权重共享（weight sharing）</strong>：三种模态的特征提取器共享参数，以保证一致性并降低模型复杂度。</li><li><strong>STSL（Soft Thresholding Shrinkage Layer）</strong>：自适应抑制近零激活（通常对应噪声/非焊接区域/低显著频率成分），增强信噪分离。</li><li><strong>正则化策略</strong>：FC层 Dropout=0.5；L2 正则（weight decay=1e-4）；Early stopping（patience=10）。</li></ul><p>提出的 AM-TSFNet 架构<br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/hr16u6tnamvkqz4izo.jpg" alt="Architecture of proposed AM-TSFNet" height="402" width="801"/></p><h4 id="32-tsfpc">3.2 特征提取器（TSFPC）</h4><p>提出 <strong>TSFPC（Transformer-Serial Fusion Partial Convolution）</strong> 特征提取器：</p><ul><li>先用 <strong>Transformer 模块</strong>在空间与时频域提取局部特征；</li><li>再用 <strong>CNN</strong> 建模长程依赖；</li><li>为提升效率，引入 <strong>PConv（Partial Convolution）</strong> 替代标准卷积：只对部分通道卷积，其余通道直连，以平衡效率与表达能力；</li><li>将 Transformer 的局部特征与 PConv 的全局特征进行<strong>串行融合</strong>（serial fusion）以增强表达；</li><li>结合 <strong>STSL</strong> 提升抗噪与泛化，并增强可解释性。</li></ul><p>提出的 TSFPC 特征提取器架构 (a) 整体框架 (b) PCIR 块框架 (c) PCMV 块框架<br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/1euywgw7vleoc64v6m.jpg" alt="Architecture of proposed TSFPC feature extractor (a) Overall framework (b) PCIR block framework (c) PCMV block framework" height="344" width="607"/></p><h4 id="33-tsfpc-">3.3 TSFPC 层级配置表（以图像分支为例）</h4><table><thead><tr><th> Layer（层）           </th><th> Output size（输出大小） </th><th> Kernel（卷积核） </th><th> Stride（步幅） </th><th> Padding（填充） </th><th> Output Channels（输出通道） </th></tr></thead><tbody><tr><td> RGB Image（输入）      </td><td>           256×256 </td><td>           – </td><td>          – </td><td>           – </td><td>                     3 </td></tr><tr><td> Conv1              </td><td>           128×128 </td><td>         3×3 </td><td>          2 </td><td>           1 </td><td>                    16 </td></tr><tr><td> PCIR Block         </td><td>             64×64 </td><td>           – </td><td>          – </td><td>           – </td><td>                    24 </td></tr><tr><td> PCIR + PCMV Block1 </td><td>             32×32 </td><td>           – </td><td>          – </td><td>           – </td><td>                    48 </td></tr><tr><td> PCIR + PCMV Block2 </td><td>             16×16 </td><td>           – </td><td>          – </td><td>           – </td><td>                    64 </td></tr><tr><td> PCIR + PCMV Block3 </td><td>               8×8 </td><td>           – </td><td>          – </td><td>           – </td><td>                    80 </td></tr><tr><td> Conv2              </td><td>               8×8 </td><td>         1×1 </td><td>          1 </td><td>           0 </td><td>                   320 </td></tr><tr><td> Adaptive GAP       </td><td>               1×1 </td><td>           – </td><td>          – </td><td>           – </td><td>                   320 </td></tr><tr><td> Fully Connected    </td><td>                 – </td><td>           – </td><td>          – </td><td>           – </td><td>                   256 </td></tr></tbody></table><h4 id="34-stsl">3.4 
STSL（软阈值收缩层）</h4><p>STSL 作用于所有模态的特征图（视觉 / 声学 / 红外）：</p><ul><li>将接近 0 的激活视为噪声或低显著成分；</li><li>通过软阈值抑制这些响应，同时保留物理上有意义的负响应；</li><li>目的：提升特征空间的信噪分离，从而提升宽度与熔深回归性能。</li></ul><p>软阈值收缩层（STSL）的架构<br/><img src="https://mixapi2.verxie.org/api/v2/objects/file/mr9zs3muw701xras3y.jpg" alt="Architecture of the Soft Thresholding Shrinkage Layer (STSL)" height="508" width="382"/></p><h3 id="">四、结论</h3><p>整体效果：多模态网络取得更高的 (R^2)（最高约 0.97）与更低 MSE（平均约 0.16），优于单传感器预测方法；STSL 与多模态融合均带来稳定增益。</p><h4 id="41-----stsl">4.1 消融：输入模态 × 融合 × STSL</h4><table><thead><tr><th> Exp </th><th> img </th><th> audio </th><th>  ir </th><th> Feature Fusion </th><th> STSL </th><th> Width MSE </th><th> Width R² </th><th> Depth MSE </th><th> Depth R² </th><th>  Avg MSE </th><th>   Avg R² </th></tr></thead><tbody><tr><td>   1 </td><td>  √  </td><td>       </td><td>     </td><td>        –       </td><td>   ✗  </td><td>      0.98 </td><td>     0.92 </td><td>      0.22 </td><td>     0.82 </td><td>     0.60 </td><td>     0.87 </td></tr><tr><td>   1 </td><td>  √  </td><td>       </td><td>     </td><td>        –       </td><td>   ✓  </td><td>      0.76 </td><td>     0.94 </td><td>      0.18 </td><td>     0.86 </td><td>     0.47 </td><td>     0.90 </td></tr><tr><td>   2 </td><td>     </td><td>   √   </td><td>     </td><td>        –       </td><td>   ✗  </td><td>      1.10 </td><td>     0.91 </td><td>      0.32 </td><td>     0.77 </td><td>     0.71 </td><td>     0.84 </td></tr><tr><td>   2 </td><td>     </td><td>   √   </td><td>     </td><td>        –       </td><td>   ✓  </td><td>      1.05 </td><td>     0.92 </td><td>      0.23 </td><td>     0.82 </td><td>     0.64 </td><td>     0.87 </td></tr><tr><td>   3 </td><td>     </td><td>       </td><td>  √  </td><td>        –       </td><td>   ✗  </td><td>      1.61 </td><td>     0.87 </td><td>      0.31 </td><td>     0.77 </td><td>     0.96 </td><td>     0.82 </td></tr><tr><td>   3 </td><td>     </td><td>       </td><td>  √  </td><td>        –       </td><td>   ✓  </td><td>      1.31 </td><td>     0.91 </td><td>      0.29 </td><td>     0.79 </td><td>     0.80 </td><td>     0.85 </td></tr><tr><td>   4 </td><td>  √  </td><td>   √   </td><td>     </td><td>        √       </td><td>   ✗  </td><td>      0.63 </td><td>     0.94 </td><td>      0.23 </td><td>     0.86 </td><td>     0.43 </td><td>     0.90 </td></tr><tr><td>   4 </td><td>  √  </td><td>   √   </td><td>     </td><td>        √       </td><td>   ✓  </td><td>      0.61 </td><td>     0.95 </td><td>      0.15 </td><td>     0.89 </td><td>     0.38 </td><td>     0.92 </td></tr><tr><td>   5 </td><td>  √  </td><td>       </td><td>  √  </td><td>        √       </td><td>   ✗  </td><td>      0.71 </td><td>     0.94 </td><td>      0.21 </td><td>     0.84 </td><td>     0.46 </td><td>     0.89 </td></tr><tr><td>   5 </td><td>  √  </td><td>       </td><td>  √  </td><td>        √       </td><td>   ✓  </td><td>      0.46 </td><td>     0.96 </td><td>      0.16 </td><td>     0.88 </td><td>     0.31 </td><td>     0.92 </td></tr><tr><td>   6 </td><td>     </td><td>   √   </td><td>  √  </td><td>        √       </td><td>   ✗  </td><td>      1.37 </td><td>     0.89 </td><td>      0.23 </td><td>     0.83 </td><td>     0.80 </td><td>     0.86 </td></tr><tr><td>   6 </td><td>     </td><td>   √   </td><td>  √  </td><td>        √       </td><td>   ✓  </td><td>      0.66 </td><td>     0.94 </td><td>      0.22 </td><td>     0.84 </td><td>     0.44 </td><td>     0.89 </td></tr><tr><td>   7 </td><td>  √  </td><td>   √   </td><td>  √  </td><td>        –       </td><td>   ✗  
</td><td>      0.59 </td><td>     0.96 </td><td>      0.17 </td><td>     0.88 </td><td>     0.38 </td><td>     0.92 </td></tr><tr><td>   7 </td><td>  √  </td><td>   √   </td><td>  √  </td><td>        –       </td><td>   ✓  </td><td>      0.38 </td><td>     0.97 </td><td>      0.12 </td><td>     0.91 </td><td>     0.25 </td><td>     0.94 </td></tr><tr><td>   8 </td><td>  √  </td><td>   √   </td><td>  √  </td><td>        √       </td><td>   ✗  </td><td>      0.30 </td><td>     0.97 </td><td>      0.10 </td><td>     0.93 </td><td>     0.20 </td><td>     0.95 </td></tr><tr><td>   8 </td><td>  √  </td><td>   √   </td><td>  √  </td><td>        √       </td><td>   ✓  </td><td>  <strong>0.25</strong> </td><td> <strong>0.98</strong> </td><td>  <strong>0.07</strong> </td><td> <strong>0.96</strong> </td><td> <strong>0.16</strong> </td><td> <strong>0.97</strong> </td></tr></tbody></table><h4 id="42-">4.2 与基线模型对比（回归性能）</h4><table><thead><tr><th> Model                       </th><th> Width MSE </th><th> Width R² </th><th> Depth MSE </th><th> Depth R² </th><th>  Avg MSE </th><th>   Avg R² </th></tr></thead><tbody><tr><td> AF-FTTSnet                  </td><td>      1.18 </td><td>     0.91 </td><td>      0.24 </td><td>     0.83 </td><td>     0.71 </td><td>     0.87 </td></tr><tr><td> ViT + Cross-attention       </td><td>      1.12 </td><td>     0.92 </td><td>      0.32 </td><td>     0.76 </td><td>     0.72 </td><td>     0.84 </td></tr><tr><td> ResNet18 + Cross-attention  </td><td>      0.83 </td><td>     0.93 </td><td>      0.23 </td><td>     0.83 </td><td>     0.53 </td><td>     0.88 </td></tr><tr><td> MobileViT + Cross-attention </td><td>      0.42 </td><td>     0.97 </td><td>      0.15 </td><td>     0.89 </td><td>     0.29 </td><td>     0.93 </td></tr><tr><td> ViT + FFM                   </td><td>      0.75 </td><td>     0.94 </td><td>      0.19 </td><td>     0.86 </td><td>     0.47 </td><td>     0.90 </td></tr><tr><td> ResNet18 + FFM              </td><td>      0.59 </td><td>     0.96 </td><td>      0.17 </td><td>     0.88 </td><td>     0.38 </td><td>     0.92 </td></tr><tr><td> MobileViT + FFM             </td><td>      0.41 </td><td>     0.97 </td><td>      0.11 </td><td>     0.93 </td><td>     0.26 </td><td>     0.95 </td></tr><tr><td> <strong>Ours（AM-TSFNet）</strong>         </td><td>  <strong>0.25</strong> </td><td> <strong>0.98</strong> </td><td>  <strong>0.07</strong> </td><td> <strong>0.96</strong> </td><td> <strong>0.16</strong> </td><td> <strong>0.97</strong> </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>AM-TSFNet 混合架构</strong>：将 Transformer 与 PConv 串行融合，兼顾局部结构信息与全局上下文建模，面向动态焊接过程增强表征能力。</li><li><strong>STSL 抗噪机制</strong>：自适应抑制噪声相关特征、突出信息性特征，提升高噪声条件下鲁棒性，并提升“特征可解释性”（更偏向“噪声抑制解释”）。</li><li><strong>注意力式跨模态融合</strong>：用基于 Q/K/V 的注意力机制选择性整合图像/声学/红外互补信息，促进模态间信息交换。</li><li><strong>系统性验证</strong>：多场景数据集 + 消融 + 多基线对比，验证 STSL 与融合模块对宽度/熔深回归的增益。</li></ol><h3 id="">六、缺陷</h3><ul><li><strong>“权重共享”可能压制模态特异性</strong>：三种模态共享同一特征提取器有利于降参，但图像/声谱/热像统计特性差异大；共享是否牺牲了某些模态的最优表征，需要更细致对比（共享 vs 部分共享 vs 独立）。</li></ul><hr/><h2 id="5-online-penetration-prediction-based-on-multimodal-continuous-signals-fusion-of-cmt-for-full-penetrationhttpswwwsciencedirectcomsciencearticlepiis1526612524001725">5. <a href="https://www.sciencedirect.com/science/article/pii/S1526612524001725">Online penetration prediction based on multimodal continuous signals fusion of CMT for full penetration</a></h2><blockquote><p>2026.1.22</p><p>音频-视觉信号, 渗透状态, 深度学习, CMT, 在线预测</p></blockquote>
<h3 id="">一、研究问题</h3><p>复杂对接焊接的在线渗透监测面临挑战，主要原因是钢板槽口不稳定与焊接热变形导致的过程波动，使得仅依赖单一信号难以稳定、准确地判别渗透状态与渗透深度。</p><h3 id="">二、核心思路</h3><p>本研究提出一种混合方法，结合深度学习、计算机视觉与声信号处理，实现全渗透条件下槽口焊接渗透的实时监测。</p><p>提出多模态连续信号特征强化网络（MCRNet）：通过 3D 卷积捕捉时空信息，结合多尺度 2D 卷积与通道注意力提升轻量网络的特征提取能力，并设计相似性损失来约束视觉与声学特征在“同一渗透状态”上的一致性，实现多模态连续序列数据融合回归熔池渗透深度。</p><h3 id="">三、方法论</h3><p>MCRNet 在有限深度结构内高效提取熔池连续信号特征。整体由 3D 卷积块、多尺度特征筛选模块（MFS）与融合模块组成，并通过相似性损失强化跨模态一致性。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/0kggy5hwxbu7lts2sa.jpg" alt="MCRNet framework" height="263" width="513"/></p><h4 id="31-3dcov-">3.1 3Dcov 块</h4><p>对多帧图像，选取三个连续帧并整合到不同通道，构成尺寸为 3 × 256 × 256 的输入。采用 3D 卷积以捕捉连续视频帧中的时空信息，使空间与时间特征可以在卷积运算中统一建模。</p><p>三维卷积也有助于保持特征一致性。相邻帧通常共享相似模式，3D 卷积可利用这种一致性降低模型复杂度、提升计算效率，并在一定程度上缓解过拟合风险。</p><h4 id="32-mfs-">3.2 MFS 模块</h4><p>多特征筛选（MFS）模块由多特征提取（MFE）块与挤压激励（SE）块组成。MFE 块包含五个分支以提取多样特征，并将分支输出进行协调形成模块输出。</p><p>为增强网络对复杂空间细节的捕捉能力，引入 1 × 3 与 3 × 1 的非对称卷积以加强水平与垂直方向的表征；同时融合 1 × 1 卷积以增强非线性处理能力。SE 块用于通道注意力加权，实现对有效特征的筛选与强化。</p><h4 id="33-">3.3 融合模块</h4><p>融合模块通过线性层与批量归一化（BN）层，将视频与声音模态映射到匹配维度的特征图空间，再进行融合表示学习。</p><p>考虑到视频特征与声音特征共同表征同一渗透状态，设计额外损失约束二者在特征空间保持相似，从而提升融合稳定性与一致性。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/m1q71lqjppincmzz2j.jpg" alt="Fusion module structure" height="230" width="380"/></p><h4 id="34-">3.4 相似性损失</h4><p>引入损失项 $L_{V&amp;S}$，基于两个特征集之间的余弦相似性度量跨模态一致性：</p><p>$$similarity=\cos(\theta)=\frac{V\cdot S}{|V||S|}=\frac{\sum<em>{i=1}^{n}V</em>{i}\times S<em>{i}}{\sqrt{\sum</em>{i=1}^{n}\left(V<em>{i}\right)^{2}}\times\sqrt{\sum</em>{i=1}^{n}\left(S_{i}\right)^{2}}}$$</p><p>$$L_{V&amp;S}=1-\frac{\cos^{-1}(similarity)}{\pi}$$</p><h3 id="">四、结论</h3><p>相较单模态输入，多模态方法整体效果至少提升 18%；实验显示 MCRNet 的 MSE 相比主流深度学习框架提升 44%（误差更低），在多模态输入下推理速度达到 57 FPS，实现熔池在线渗透深度的准确预测。</p><table><thead><tr><th> Network  网络                     </th><th> MAE (mm)  平均绝对误差 (mm) </th><th> MSE (mm)  均方误差 (mm) </th></tr></thead><tbody><tr><td> MCRNet                          </td><td>                0.2538 </td><td>              0.1190 </td></tr><tr><td> MCRNet-Video only  MCRNet-仅视频   </td><td>                0.2833 </td><td>              0.1555 </td></tr><tr><td> MCRNet-Sound only  MCRNet-仅声音   </td><td>                0.2893 </td><td>              0.1796 </td></tr><tr><td> Without  没有（原文此处为某模块/策略占位）      </td><td>                0.2754 </td><td>              0.1314 </td></tr><tr><td> Without 3Dcov block  没有 3Dcov 块 </td><td>                0.2876 </td><td>              0.1514 </td></tr><tr><td> Without MFE block  没有 MFE 块     </td><td>                0.3565 </td><td>              0.2041 </td></tr><tr><td> Without SE block  没有 SE 块       </td><td>                0.2984 </td><td>              0.1833 </td></tr></tbody></table><table><thead><tr><th> Network  网络                                      </th><th> Time (ms)  时间（毫秒） </th></tr></thead><tbody><tr><td> MCRNet                                           </td><td>              17.4 </td></tr><tr><td> MCRNet- without reparameterization  MCRNet-无重参数化 </td><td>              24.3 </td></tr></tbody></table><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/c908me5s7onfxgjud6.jpg" alt="MCRNet vs. 
other network of comparison experiments" height="279" width="802"/></p><h3 id="">五、创新</h3><ol start="1"><li>提出面向全渗透槽口对接焊在线监测的多模态连续信号融合框架，将视频序列与声学信号联合建模，实现渗透深度实时回归预测。</li><li>设计轻量高效的 MCRNet 结构，将 3D 卷积（时空建模）、多尺度 2D 卷积（多分支特征提取）与通道注意力（SE）组合，在有限网络深度下提升特征强化与筛选能力。</li><li>提出跨模态相似性损失 $L_{V&amp;S}$，以“同一渗透状态”一致性为约束，提升视觉与声学特征在融合前的对齐程度与融合稳定性。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li>相似性损失假设两模态特征应高度一致，但在实际焊接中视觉与声学的敏感性可能对不同扰动源具有差异，过强一致性约束可能在某些工况下抑制“互补性特征”，需要进一步讨论权重系数与适用边界。</li><li>视频采用三帧输入（3 × 256 × 256）对更长时间尺度的动态变化建模能力有限；对热变形引起的慢变化、以及突发扰动的持续影响，是否需要更长序列或显式时序建模模块仍需验证。</li></ol><hr/><h2 id="6-construction-of-a-cnn-sk-weld-penetration-recognition-model-based-on-the-mel-spectrum-of-a-cmt-arc-sound-signalhttpsjournalsplosorgplosonearticleid101371journalpone0311119">6. <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0311119">Construction of a CNN-SK weld penetration recognition model based on the Mel spectrum of a CMT arc sound signal</a></h2><blockquote><p>2026.1.22</p><p>音频信号, Mel谱, CMT, 穿透状态识别, 轻量CNN, SKNet注意力</p></blockquote>
<h3 id="">一、研究问题</h3><p>弧声信号在焊接过程中易受工况波动与噪声影响，稳定性不理想；传统特征提取方法往往流程繁琐、效率偏低。与此同时，弧声在穿透状态判别中的信息价值常被低估或未被充分利用，因此需要一种更高效、能自适应提取有效特征的识别方法。</p><h3 id="">二、核心思路</h3><p>提出一种紧凑的卷积神经网络（CNN）用于自适应提取弧声特征，并用于焊接穿透状态识别。</p><p>输入侧以弧声信号经 STFT 得到的 <strong>Mel 谱图</strong>（含 Mel 滤波器组转换步骤）作为网络输入表征。为提升识别能力，将动态选择核网络（SKNet）中的选择性核机制引入 CNN，形成 <strong>CNN-SK</strong> 模型，使网络能够在不同卷积核尺度之间进行动态选择，从而更有效地捕捉穿透状态相关的声学特征。</p><h3 id="">三、方法论</h3><h4 id="31--cnn-">3.1 定制轻量级 CNN 网络</h4><p>网络由以下组件组成：</p><ul><li>6 个卷积层（Conv）：用于特征提取</li><li>6 个归一化层：用于稳定训练、缓解梯度问题</li><li>1 个平均池化层：用于参数近似与降维</li><li>1 个全连接层（FC）：用于最终分类</li></ul><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/05j0pbz3zhnqj97cds.png" height="1082" width="1750"/></p><h4 id="32-sknet--skattention">3.2 动态选择核网络（SKNet / SKAttention）</h4><p>SKAttention 通过动态选择卷积核来强化关键信息提取。其核心是选择性核（SK）构建块：包含多个不同核大小的分支，并在融合后通过 SoftMax 完成信息选择与权重分配。</p><p>SKNet 的关键操作可概括为：</p><ul><li>Split（分裂）：多分支不同尺度卷积并行提取</li><li>Fuse（融合）：聚合分支信息形成全局表征</li><li>Select（选择）：计算注意力权重，对不同尺度特征进行选择性重标定</li></ul><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/5u495hwvqt8oz4mdvn.png" height="514" width="2000"/></p><h4 id="33--cnn-sk-">3.3 定制 CNN-SK 模型结构</h4><p>总体流程（按你笔记描述整理）：</p><ol start="1"><li>输入特征先经过 4 种不同核大小的并行特征提取，得到 4 个特征映射；</li><li>将 4 个特征映射组合得到全局综合表示，用于后续权重选择；</li><li>为降低计算量，对聚合特征进行下采样；</li><li>生成的特征向量分别与 4 个注意力系数向量进行卷积，形成不同角度的特征重聚合；</li><li>通过 Softmax 计算各分支特征权重，并将加权结果传递到后续卷积层。</li></ol><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/oj31xba2cce19b9djt.png" height="800" width="1750"/></p><h3 id="">四、结论</h3><p>CNN-SK 在三种穿透状态识别任务上取得最高准确率，并在计算资源不显著浪费的前提下优于多种对比模型。对比结论包括：LeNet 虽然训练更快、占用更小、FLOPs 更低，但精度明显不及 CNN-SK；VGG 与 AlexNet 等更复杂网络在多项指标上也不如 CNN-SK，说明引入动态核选择机制能够以较高性价比提升识别性能。</p><p>对比结果（按你笔记给出的数值汇总）：</p><table><thead><tr><th> Model   </th><th>       Accuracy（%） </th><th> 备注           </th></tr></thead><tbody><tr><td> CNN-SK  </td><td>             98.83 </td><td> 最优           </td></tr><tr><td> LeNet   </td><td>             92.33 </td><td> 更轻量但精度低      </td></tr><tr><td> VGG     </td><td>             94.17 </td><td> 复杂但不占优       </td></tr><tr><td> AlexNet </td><td>             95.50 </td><td> 复杂但不占优       </td></tr><tr><td> TF-CNN  </td><td> 98.20（100 epochs） </td><td> 接近但低于 CNN-SK </td></tr><tr><td> VGG-SE  </td><td> 98.25（100 epochs） </td><td> 接近但低于 CNN-SK </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li>将弧声信号以 <strong>Mel 谱图</strong>形式输入深度网络，利用 Mel 滤波器对低频更敏感的特性突出弧声关键模式，尤其适配 CMT 场景下弧声主要能量集中于 0–2 kHz 的现象。</li><li>构建 <strong>6 层轻量 CNN</strong>，面向工程应用保持较低复杂度，并通过引入动态选择核机制弥补轻量网络表达能力不足的风险。</li><li>将 <strong>SKNet 选择机制集成到 CNN 架构</strong>形成 CNN-SK，使模型能在不同卷积尺度间自适应选择，提升对穿透状态差异的辨识能力。</li><li>通过实验观察指出：SK 机制集成在早期卷积层（第 1 层或第 2 层）效果更显著，为“注意力插入位置”提供了经验性指导。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li><strong>Mel 表征可能丢失高频信息</strong>：强调 0–2 kHz 低频合理，但若某些缺陷或工况变化在更高频段有诊断信息，Mel 压缩可能削弱可分性；需要讨论频带选择的充分性。</li><li><strong>解释性仍有限</strong>：SK 的动态选择体现“不同尺度的重要性”，但仍不足以解释“哪些频带/哪些时段”驱动决策；若能补充频带贡献可视化或对关键时频区域的定位，会更利于工程可信度与传感器优化。</li></ol><hr/><h2 id="7-prediction-of-penetration-based-on-infrared-thermal-and-visual-images-during-pulsed-gtaw-processhttpswwwsciencedirectcomsciencearticlepiis1526612521005466">7. <a href="https://www.sciencedirect.com/science/article/pii/S1526612521005466">Prediction of penetration based on infrared thermal and visual images during pulsed GTAW process</a></h2><blockquote><p>2026.1.23</p><p>穿透状态识别快速, R-CNN, 卷积描述符选择红外热像, GTAW</p></blockquote>
<h3 id="">一、研究问题</h3><p>脉冲 GTAW 过程中，<strong>穿透状态（penetration state</strong>的在线识别需要同时满足：</p><ul><li><strong>高准确率与鲁棒性</strong>（抗电弧闪烁、背景热辐射干扰）</li><li><strong>工业现场可部署</strong>（工控机算力/显存受限，要求推理快、模型轻、训练周期短）</li></ul><p>论文针对“双模态（IR 热像 + CCD 可见光）”条件下，如何在 <strong>不依赖复杂预处理/分割</strong>的前提下实现快速准确识别提出模型方案。</p><h3 id="">二、核心思路</h3><p>构建 <strong>双输入 Dual-input Faster R-CNN</strong>：输入为<strong>原始 IR 热像</strong>与<strong>原始 CCD 图像</strong>。
通过三类关键设计提升实用性：</p><ul><li><strong>同步特征提取与融合</strong>：降低电弧闪烁对 CCD 的负面影响，并利用 IR 的温度场信息补足视觉缺失信息</li><li><strong>卷积描述符选择（Convolutional Descriptor Selection）</strong>：抑制 IR 特征图中的背景无关热辐射干扰</li><li><strong>共享 RPN 与 ROI Pooling + 标签集成层（Label-integrated Layer）</strong>：在保证精度的同时降低计算负担与存储占用</li></ul><h3 id="">三、方法论</h3><h4 id="31-faster-r-cnn-">3.1 Faster R-CNN 基本结构</h4><p>Faster R-CNN 由四部分构成：</p><ol start="1"><li><strong>特征提取器 Backbone</strong>：对输入图像提取卷积特征图</li><li><strong>RPN（Region Proposal Network）</strong>：在特征图上生成候选区域 proposals（含锚框生成与筛选）</li><li><strong>ROI Pooling</strong>：将 proposals 映射回特征图并池化到固定尺寸</li><li><strong>分类与回归头</strong>：输出类别标签与边界框回归结果</li></ol><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/2t0bl6x45qig63a910.jpg" alt="Faster R-CNN 的结构图" height="91" width="714"/></p><p>RPN 的作用是对特征图生成区域建议（proposals），为后续 ROI Pooling + 分类回归提供候选区域。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/ozcm2oel7ok7qyjlj0.jpg" alt="Structure diagram of RPN" height="116" width="714"/></p><h4 id="32--dual-input-faster-r-cnn-">3.2 双输入 Dual-input Faster R-CNN 结构设计</h4><p>为验证“共享哪些模块更优”，构建了 <strong>四种双输入 Faster R-CNN（DFR-1~DFR-4）结构变体</strong>，用于比较不同共享策略（例如共享 RPN、共享 ROI Pooling 等）对精度与速度的影响。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/5wcw7ibgdafayui8uw.jpg" alt="Structure diagrams of Dual-input Faster R-CNN models" height="536" width="715"/></p><p>同时提出两类增强模型：</p><ul><li><strong>SSCD-DFR</strong>：对 IR 分支特征图引入“卷积描述符选择”，以抑制背景无关热辐射</li><li><strong>DSCD-DFR</strong>：在更深层对 IR 与 CCD 两路特征共同进行描述符选择/融合（你笔记里写作 DDSC-DFR，正文建议统一为 DSCD-DFR 或以原文为准）</li></ul><p>整体流程可概括为：</p><ol start="1"><li>生成激活图（activation maps）</li><li>获取掩膜图（mask maps）</li><li>选择原始特征图的描述符（descriptor selection），得到更“干净”的 IR/融合特征用于识别</li></ol><blockquote><p>论文明确强调：采用原始 IR/CCD 作为输入，减少数据集制作与预处理误差，并通过同步特征提取 + 描述符选择提升抗干扰能力。</p></blockquote>
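<p>针对上面“激活图 → 掩膜图 → 描述符选择”的三步流程，下面给出一个最小化的 PyTorch 示意：对 IR 分支特征图按通道聚合得到激活图，以其均值作阈值生成掩膜，再用掩膜抑制低响应位置上的描述符。其中通道聚合方式与阈值选取均为假设，论文中的具体定义以原文为准。</p><pre><code class="language-python">import torch

def descriptor_selection(feat):
    """feat：IR 分支卷积特征图，形状 (B, C, H, W)，返回抑制背景热辐射后的特征图。"""
    activation = feat.sum(dim=1)                         # 1) 激活图：通道维求和 (B, H, W)
    thresh = activation.mean(dim=(1, 2), keepdim=True)   # 每张图的自适应阈值（此处假设取均值）
    mask = activation.ge(thresh).float().unsqueeze(1)    # 2) 掩膜图：高响应位置为 1 (B, 1, H, W)
    return feat * mask                                   # 3) 描述符选择：低响应位置整列置零

ir_feat = torch.randn(2, 256, 14, 14)        # 虚构的 IR 特征图
print(descriptor_selection(ir_feat).shape)   # torch.Size([2, 256, 14, 14])
</code></pre>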
<h4 id="33-">3.3 训练超参数</h4><table><thead><tr><th> Hyperparameter 超参数             </th><th>       Value 值 </th><th> Hyperparameter 超参数            </th><th>                        Value 值 </th></tr></thead><tbody><tr><td> Anchor ratios 锚点比例             </td><td> [0.5, 1, 1.5] </td><td> Feature Extractor backbone 主干 </td><td>                       ResNet18 </td></tr><tr><td> Anchor scales 锚尺度              </td><td>   [8, 16, 32] </td><td> Epoch 轮次                      </td><td>                             40 </td></tr><tr><td> Train<em>num</em>before_NMS (pre-NMS) </td><td>         12000 </td><td> Learning rate 学习率             </td><td> 初始 1e−4； </td></tr><tr><td> Train<em>num</em>after_NMS (post-NMS) </td><td>          2000 </td><td>                  </td><td>             StepLR(step=1, γ=0.95)               </td></tr><tr><td> Test<em>num</em>before_NMS (pre-NMS)  </td><td>          3000 </td><td>        Optimizer 优化器      </td><td>                  Adam          </td></tr><tr><td> Test<em>num</em>after_NMS (post-NMS)  </td><td>           300 </td><td>                               </td><td>               L2 regularization: 5e−4                 </td></tr></tbody></table><h3 id="">四、结论（你笔记整理版）</h3><p>论文结论强调“双输入 + 结构共享 + 描述符选择”的综合收益：</p><ul><li><strong>识别准确率 &gt;95%</strong></li><li><strong>每对 IR&amp;CCD 图像识别时间 &lt;270 ms</strong></li></ul><p>你记录的对比结果如下（建议直接保留为性能对照表）：</p><table><thead><tr><th> Model name 模型名称 </th><th> Accuracy 准确率 </th><th> Recognition time (per frame) 识别时间（每帧） </th><th> Training time (Epoch=40) 训练时间 </th><th> Storage occupation 存储占用 </th></tr></thead><tbody><tr><td> DFR-1           </td><td>       95.58% </td><td>                                230 ms </td><td>               1 h 45 min 35 s </td><td>                94.98 MB </td></tr><tr><td> DFR-2           </td><td>       93.69% </td><td>                                243 ms </td><td>                1 h 23 min 3 s </td><td>                52.63 MB </td></tr><tr><td> DFR-3           </td><td>       92.34% </td><td>                                256 ms </td><td>               1 h 20 min 46 s </td><td>                53.43 MB </td></tr><tr><td> DFR-4           </td><td>       94.47% </td><td>                                246 ms </td><td>               1 h 22 min 20 s </td><td>                52.60 MB </td></tr><tr><td> SSCD-DFR        </td><td>       95.87% </td><td>                                264 ms </td><td>                2 h 2 min 23 s </td><td>                52.60 MB </td></tr><tr><td> DSCD-DFR        </td><td>       96.10% </td><td>                                320 ms </td><td>               3 h 12 min 14 s </td><td>                52.60 MB </td></tr><tr><td> FR-IR           </td><td>       90.84% </td><td>                                138 ms </td><td>                   46 min 55 s </td><td>                47.43 MB </td></tr><tr><td> FR-CCD          </td><td>       91.57% </td><td>                                182 ms </td><td>                   47 min 10 s </td><td>                47.43 MB </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>双输入 Faster R-CNN 用于焊接穿透状态识别</strong>：同时利用 IR 温度场与 CCD 视觉信息的互补性，提高识别鲁棒性。</li><li><strong>共享 RPN 与 ROI Pooling 的轻量化思路</strong>：在保证精度的同时降低计算/存储负担，面向工控机部署。</li><li><strong>卷积描述符选择（尤其针对 IR 特征图）</strong>：显式抑制背景无关热辐射干扰，提升抗干扰能力。</li><li><strong>标签集成层（Label-integrated Layer）</strong>：在融合后的决策阶段加强稳定输出，使整体更贴近“快速、准、轻”的工业约束目标。</li></ol><h3 id="">六、缺陷</h3><ol 
start="1"><li><strong>任务形式偏“状态识别/分类”</strong>：更多是判别穿透状态，而不是直接给出连续熔深/背宽等可控量；用于闭环控制时可能还需要回归模型或额外映射。</li><li><strong>对时序信息利用有限</strong>：以逐帧/图像对为主，未显式建模焊接过程的时间动态（脉冲节拍、热惯性），在工况快速切换或短暂扰动下可能不如时序模型稳定。</li><li><strong>工程依赖与泛化风险</strong>：双相机（IR+CCD）同步、标定、视角一致性会影响可迁移性；跨设备/镜头/滤光/发射率变化时可能需要重新训练或域适配。</li></ol><hr/><h2 id="8-multi-sensing-signals-diagnosis-and-cnn-based-detection-of-porosity-defect-during-al-alloys-laser-weldinghttpswwwsciencedirectcomsciencearticlepiis0278612521002533">8. <a href="https://www.sciencedirect.com/science/article/pii/S0278612521002533">Multi-sensing signals diagnosis and CNN-based detection of porosity defect during Al alloys laser welding</a></h2><blockquote><p>2026.1.23</p><p>铝合金激光焊接, 多传感信号诊断, 孔隙缺陷检测, 钥孔三维形态特征, 时频光谱图, 卷积神经网络</p></blockquote>
<h3 id="">一、研究问题</h3><p>孔隙（porosity）是铝合金激光焊接中常见且危害显著的内部缺陷。在线监测的关键难点在于：</p><ol start="1"><li>孔隙形成与<strong>钥孔（keyhole）动态失稳/塌陷</strong>强相关，但这种动态行为难以用单一传感信号稳定表征。</li><li>传统人工特征（频域/时域统计量等）在复杂焊接动态下易失效，且难以实现<strong>在线定位</strong>。</li></ol><h3 id="">二、核心思路</h3><p>搭建<strong>多传感平台</strong>，用钥孔的<strong>三维形态特征</strong>来“机理驱动”地锁定孔隙发生区间，再把该区间的动态形态变化转成可供 CNN 识别的<strong>时频（TF）谱图</strong>，实现在线孔隙检测与定位。</p><p>多传感/多信号分工（论文符号）</p><table><thead><tr><th> 信号/模态    </th><th>                       缩写 </th><th> 获取方式                                </th><th> 作用定位                      </th></tr></thead><tbody><tr><td> 钥孔深度     </td><td>                       KD </td><td> 相干光测量系统（coherent light measurement） </td><td> 用于<strong>诊断孔隙发生区域</strong>（塌陷→KD突变）   </td></tr><tr><td> 钥孔开口图像   </td><td>                KO images </td><td> 高速相机获取钥孔开口序列                        </td><td> 提取<strong>形态特征序列</strong>（不稳定/高频振荡→孔隙） </td></tr><tr><td> 钥孔开口形态特征 </td><td> KO morphological signals </td><td> 从 KO 图像处理得到                         </td><td> 转为 TF 谱图后交给 CNN 分类        </td></tr></tbody></table><h3 id="">三、方法论</h3><h4 id="31---">3.1 总体流程（诊断 + 检测）</h4><table><thead><tr><th> 阶段                  </th><th> 输入      </th><th> 关键处理                                </th><th> 输出                      </th><th> 目的                                         </th></tr></thead><tbody><tr><td> 3.1.1 诊断（找“疑似孔隙区间”） </td><td> KD 连续信号 </td><td> EEMD 分解与重构 + 阈值判定                   </td><td> 孔隙候选区间（porosity region） </td><td> 利用“塌陷→KD锐减/异常”锁定区间            </td></tr><tr><td> 3.1.2 特征构造（把动态变成谱图） </td><td> KO 图像序列 </td><td> 图像预处理→提取 KO 形态特征序列；滑窗扫描；WPT 转 TF 谱图 </td><td> TF spectrum graphs（谱图）  </td><td> 把 1D 形态序列变成 2D“图像特征”供 CNN 识别  </td></tr><tr><td> 3.1.3 CNN 检测（分类+定位） </td><td> TF 谱图序列 </td><td> CNN 二分类（孔隙/无孔隙）；滑窗回映射到焊缝位置          </td><td> 孔隙标签 + 位置               </td><td> 在线识别与定位（更偏向“大孔隙”可靠）           </td></tr></tbody></table><h4 id="32-kd-eemd">3.2 KD 信号诊断孔隙区间（EEMD）</h4><ul><li>对相干光系统测得的 <strong>KD</strong> 信号进行 <strong>EEMD</strong> 处理后重构；</li><li>发现孔隙往往出现在<strong>重构信号超过特定阈值</strong>的区间；其物理解释是：钥孔塌陷形成孔隙，会导致 <strong>KD 出现“ sharp decrease ”</strong> 的突变行为。</li></ul><p>（这一部分是整篇“机理分析”的核心：先用 KD 把“可能出孔隙”的位置圈出来，再对该位置的 KO 形态做细粒度检测。）</p><h4 id="33-ko-ko-signals">3.3 KO 图像处理与形态特征序列（KO signals）</h4><p>对 KO images 的处理目标：尽量抑制飞溅/噪声影响，稳定提取钥孔开口的几何量。论文给出的典型步骤包括：ROI 提取、形态学操作去飞溅、滤波、二值化、保留最大连通域等。</p><p>提取的 KO 形态特征在论文中用于构建后续 TF 谱图，代表性几何量包含：</p><ul><li>Area（面积）</li><li>Perimeter（周长）</li><li>Length（长度）</li><li>Width（宽度）
（论文将这些作为“钥孔开口形态特征信号”进入滑窗+WPT流程）</li></ul><h4 id="34---wpt--tf-">3.4 滑动窗口 + WPT 生成 TF 谱图</h4><ul><li>在 KO 形态特征序列上做<strong>滑动窗口扫描</strong>（论文示例：窗口 size=20、step=20 的设置出现在流程描述中）。 (<span>[ScienceDirect][2]</span>)</li><li>对每个窗口片段做 <strong>WPT（Wavelet Packet Transform）</strong>，生成对应的 <strong>TF spectrum graph</strong>。</li><li>观察规律：孔隙对应位置往往呈现<strong>“messy TF spectrum graphs”</strong>，指示 KO 在该处出现更强的高频不稳定振荡。</li></ul><h4 id="35-cnn-">3.5 CNN 二分类模型</h4><p>模型结构（论文给的关键点）：深度 6（6 个卷积层），每个卷积层后接池化层，顶部 2 个全连接层 + softmax 二分类。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/780i50w3doqaei3v6a.jpg" alt="The architecture of the constructed CNN model" height="392" width="567"/></p><p>训练设置与环境：</p><ul><li>batch size = 64</li><li>learning rate γ = 0.001</li><li>training iterations = 4000（10 epochs，每个 epoch 400 iterations）</li><li>dropout = 0.5（用于降低过拟合风险）</li><li>软件硬件：TensorFlow 1.14 / Python 3.7 / RTX 3080Ti（等）</li></ul><p>数据增强与划分（Experiment #1）</p><p><strong>表 3：增强后两类谱图数量（Experiment #1）</strong></p><table><thead><tr><th> 状态          </th><th> Label </th><th> 谱图数量 </th></tr></thead><tbody><tr><td> No porosity </td><td>     0 </td><td> 1976 </td></tr><tr><td> Porosity    </td><td>     1 </td><td>  848 </td></tr></tbody></table><p><strong>表 4：训练集/测试集划分（Experiment #1）</strong></p><table><thead><tr><th> 状态          </th><th> Label </th><th> Total </th><th> Train </th><th> Test </th></tr></thead><tbody><tr><td> No porosity </td><td>     0 </td><td>  1976 </td><td>  1500 </td><td>  476 </td></tr><tr><td> Porosity    </td><td>     1 </td><td>   848 </td><td>   600 </td><td>  248 </td></tr></tbody></table><p>（论文解释：对“no porosity”的纯谱图做水平翻转；对“porosity”的 messy 谱图做水平+垂直翻转，以缓解数据偏斜。）</p><h3 id="">四、结论</h3><h4 id="41-">4.1 机理层面的对应关系</h4><ol start="1"><li>KD 经 EEMD 重构后，孔隙倾向出现在重构值超过阈值的区间；原因是钥孔塌陷形成孔隙伴随 KD 的突变下降。</li><li>KO 形态特征序列经滑窗+WPT 后，孔隙发生位置对应 <strong>messy TF spectrum graphs</strong>，反映 KO 在该处高频剧烈振荡。</li><li>构建的 CNN 对包含不同 TF 特征的谱图具有较高识别能力，可在线检测孔隙并定位“大尺寸孔隙”的位置。</li></ol><h4 id="42-">4.2 模型性能与在线检测效果</h4><ul><li>Experiment #1：CNN 在测试集上给出<strong>平均分类准确率 96.13%</strong>（孔隙/无孔隙二分类）。</li><li>Experiment #2：对整条焊缝扫描并在线检测孔隙：成功检测 33 个孔，整体检测准确率 <strong>82.5%</strong>；对“孔隙状态(0/1)”分类准确率 <strong>95.67%</strong>。</li><li>大孔与小孔差异：以 100 μm 为阈值时，CNN 对“大孔隙”检测更可靠（文中示例：large ≈ 90.32%，small ≈ 55.56%）。</li></ul><h3 id="">五、创新</h3><ol start="1"><li><strong>多传感机理诊断 + 深度学习检测</strong>的组合：先用 KD（相干光）把孔隙候选区域圈定，再用 KO 形态 TF 谱图做 CNN 识别，实现“诊断—检测”闭环。</li><li>将 KO 的 <strong>area/perimeter/length/width</strong> 等形态序列通过 <strong>WPT</strong> 转为 TF 谱图，形成一种可迁移的“2D 维度无关特征表示”，增强了方法的可移植性。</li><li>在线定位：通过滑窗扫描把谱图分类结果回映射到焊缝位置，实现孔隙位置的在线指示（尤其对大孔隙更有效）。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li><strong>对小孔隙敏感性不足</strong>：小孔对应的 TF 特征不够“messy/显著”，导致检测准确率明显低于大孔。</li><li><strong>阈值/流程依赖较强</strong>：KD 的 EEMD 重构与阈值判定决定了候选区间，若阈值随材料/工况漂移，可能带来漏检或误检（需要跨工况校准策略）。</li><li><strong>多阶段流水线误差累积</strong>：KO 图像处理（去飞溅/二值化/连通域）→形态量→WPT→CNN，各环节的噪声会层层放大；对实际工业现场光照、飞溅更强场景，鲁棒性需要额外验证。</li></ol><hr/><h2 id="9-optical-coherence-measurement-based-penetration-depth-monitoring-of-stainless-steel-sheets-in-laser-lap-welding-using-long-short-term-memory-networkhttpswwwsciencedirectcomsciencearticlepiis0030399224012696">9. <a href="https://www.sciencedirect.com/science/article/pii/S0030399224012696">Optical coherence measurement-based penetration depth monitoring of stainless steel sheets in laser lap welding using long short-term memory network</a></h2><blockquote><p>2026.1.26</p><p>激光焊接, 不锈钢板, 穿透深度监测, 光学相干测量, 长短期记忆网络</p></blockquote>
<h3 id="">一、研究问题</h3><p>激光搭接焊接薄不锈钢板的工业现场对<strong>穿透深度的绝对水平</strong>与<strong>波动稳定性</strong>提出严格要求，因此需要可靠的在线监测方法。
已有传感监测通常通过“间接特征→熔深”建立映射，但在噪声干扰下相关性不稳定、误差较大；本文聚焦于：如何利用相干光测得的<strong>钥孔深度（KD）信号</strong>实现更准确的穿透深度曲线监测。</p><h3 id="">二、核心思路</h3><p>提出<strong>基于光学相干测量（coherent light / OCM）+ 时序网络</strong>的穿透深度监测框架：</p><ol start="1"><li>用相干光束获取焊接过程中的<strong>钥孔深度 KD 原始信号</strong>。</li><li>通过经验模态分解（EMD）重建 KD 的低频趋势，发现其与穿透深度曲线存在显著关联，但仍存在“监测误差”。</li><li>结合互相关分析与数值模拟，解释误差来源（底部熔化层厚度、滞后特性、多次反射）。</li><li>用 LSTM 记忆 KD 的历史信息，自适应上述误差，从而预测每一时刻的穿透深度。</li></ol><h3 id="">三、方法论</h3><h4 id="31-">3.1 实验与测量系统</h4><ul><li>场景：不锈钢薄板激光搭接焊</li><li>设备与配置（节选）：IPG YLS-10000 光纤激光器（1060 nm），IPG 焊接头，聚焦后光斑约 0.5 mm；KUKA 六轴机器人（重复定位精度 ±0.05 mm）；监测系统采用 IPG 相干测量系统获取 KD。</li></ul><h4 id="32-kd-">3.2 信号处理：KD 重建与误差机理分析</h4><ul><li>KD 信号重建：对相干测量得到的 KD 原始序列做 EMD，提取更贴近穿透深度变化的趋势项（重建 KD）。</li><li><p>误差来源（论文给出的机制解释）</p><ul><li><strong>bottom melt layer thickness（底部熔化层厚度）</strong></li><li><strong>hysteresis property（滞后特性）</strong></li><li><strong>multiple reflections（多次反射）</strong></li></ul></li></ul><h4 id="33-lstm-">3.3 穿透深度预测模型：LSTM 回归</h4><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/un27cnk9wr9tjrwei7.jpg" alt="The typical architecture of LSTM" height="252" width="383"/></p><ul><li>思路：把“重建 KD 序列”作为输入序列，利用 LSTM 的记忆能力预测每一时刻穿透深度（逐点回归）。</li></ul><table><thead><tr><th> Iterations 迭代次数 </th><th> Learning rate 学习率 </th><th> Batch size 批量大小 </th><th> Dropout 丢弃率 </th><th> Optimizer 优化器 </th><th> Hidden neurons 隐藏层神经元数 </th></tr></thead><tbody><tr><td>             500 </td><td>             0.005 </td><td>              32 </td><td>         0.5 </td><td> Adam          </td><td>                     12 </td></tr></tbody></table><h3 id="">四、结论</h3><ul><li>论文结论层面：LSTM 预测模型表现出<strong>高精度与良好泛化</strong>，可实现穿透深度的有效在线监测。</li></ul><p>你笔记里记录的“模型对比误差”可用表格固化为：</p><table><thead><tr><th>  模型    </th><th>   误差指标1 </th><th>   误差指标2 </th></tr></thead><tbody><tr><td> LSTM  </td><td> 77.31 μm </td><td> 23.14 μm </td></tr><tr><td> RNN   </td><td> 80.94 μm </td><td> 27.16 μm </td></tr><tr><td> DBN   </td><td> 83.69 μm </td><td> 29.95 μm </td></tr><tr><td> ANFIS </td><td> 88.13 μm </td><td> 32.52 μm </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>测量—机理—学习</strong>闭环：不仅用 KD 预测熔深，还用互相关 + 数值模拟解释“KD→熔深”误差的物理来源（底部熔化层厚度/滞后/多次反射），把黑盒回归变成“可解释的误差建模问题”。</li><li><strong>用时序记忆去吸收系统性误差</strong>：把“KD 与熔深不一致”视为带滞后与扰动的动态映射，LSTM 通过历史信息对误差进行自适应补偿，而不是仅做静态拟合。</li><li>面向工业需求的定位明确：强调薄板搭接焊对“熔深水平 + 波动稳定性”的要求，直接对准在线监测落地场景。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li><strong>模型侧创新有限</strong>：网络结构以基础 LSTM 为主，更多贡献在“信号重建 + 误差机理解释 + 时序补偿”的系统方案；若写进综述/论文现状，建议明确它是“监测链路设计”贡献，而非“新网络结构”。</li><li><strong>泛化边界可能较窄</strong>：材料（不锈钢）、工况（搭接/薄板）与具体相干测量系统的耦合较强；迁移到铝/镀锌钢/深熔模式时，多次反射与等离子体/蒸汽影响可能改变误差机理，需要再验证。</li></ol><hr/><h2 id="10-real-time-porosity-monitoring-during-laser-welding-of-aluminum-alloys-based-on-keyhole-3d-morphology-characteristicshttpswwwsciencedirectcomsciencearticlepiis027861252200139x">10. <a href="https://www.sciencedirect.com/science/article/pii/S027861252200139X">Real-time porosity monitoring during laser welding of aluminum alloys based on keyhole 3D morphology characteristics</a></h2><blockquote><p>2026.1.26</p><p>铝合金，激光焊接，孔隙率监测，多传感信号，钥匙孔三维形态特征，滑动窗口扫描，EEMD，PCA，反馈-遗传算法优化 ANN</p></blockquote>
<h3 id="">一、研究问题</h3><p>孔隙（尤其是<strong>钥匙孔诱发孔隙 keyhole-induced pores</strong>）及其孔隙率在工业现场仍严重依赖焊后离线检测；而在线监测内部缺陷更难。文中指出：基于光谱（SE）的方法更适合监测“冶金孔”（如低沸点元素相关孔隙），但对钥匙孔诱发孔隙不适用，且等离子体/金属蒸汽的剧烈周期性会干扰 SE。</p><p>因此，本文目标是：在铝合金激光焊接中，建立一种<strong>可在线预测当前焊接位置局部孔隙率（local porosity）</strong>的实时方法。</p><h3 id="">二、核心思路</h3><p>搭建 <strong>CVM（计算机视觉测量）+ OCT（光学相干技术）</strong> 的多传感平台，用“钥匙孔开口（外部）+ 钥匙孔深度（内部）”联合表征钥匙孔三维形态；再用<strong>滑动窗口</strong>量化钥匙孔波动程度（每窗求 STD），构造“钥匙孔三维形态 STD 特征”。</p><p>在建模上，采用 <strong>反馈机制 + GA（遗传算法）优化的 ANN（Feedback-GA-ANN）</strong>，考虑焊接热历史带来的时序/滞后影响，实现局部孔隙率的在线跟踪与预测。</p><h3 id="">三、方法论</h3><h4 id="31-">3.1 实验与数据来源</h4><table><thead><tr><th> 项目   </th><th> 内容                                                                          </th></tr></thead><tbody><tr><td> 焊接装备 </td><td> 光纤激光器 IPG YLS-30000（最大功率 30 kW，波长 1070 nm），光斑直径 0.5 mm  </td></tr><tr><td> 材料   </td><td> Al 6061（8 mm、10 mm）与 Al 7075（10 mm）                     </td></tr><tr><td> 传感平台 </td><td> CVM 获取钥匙孔开口图像 + OCT 测量钥匙孔深度（用于形成“3D 形态特征”）              </td></tr><tr><td> 监测目标 </td><td> 焊缝沿程的<strong>局部孔隙率</strong>在线预测                                      </td></tr></tbody></table><h4 id="32---std">3.2 关键特征构造：滑动窗口 + STD（波动量化）</h4><table><thead><tr><th> 步骤 </th><th> 输入                </th><th> 操作           </th><th> 输出                                   </th></tr></thead><tbody><tr><td> 1  </td><td> 多传感得到的钥匙孔 3D 形态序列 </td><td> 沿焊接方向做滑动窗口扫描 </td><td> 窗序列                                  </td></tr><tr><td> 2  </td><td> 每个窗口内的形态数据        </td><td> 计算 STD（标准差）  </td><td> <strong>3D 形态 STD 特征</strong> </td></tr></tbody></table><blockquote><p>直觉对应：钥匙孔诱发孔隙与“钥匙孔剧烈不稳定波动”相关，STD 用来把这种不稳定程度量化。</p></blockquote>
<h4 id="33-eemd--pca">3.3 局部特征提取与降维：EEMD + PCA</h4><table><thead><tr><th> 模块   </th><th> 目的                        </th><th> 方法                                         </th></tr></thead><tbody><tr><td> EEMD </td><td> 从时序/波动信号中提取与“局部孔隙”对应的局部特征 </td><td> EEMD（集合经验模态分解），常用于时频分析 </td></tr><tr><td> PCA  </td><td> 压缩特征维度、提取主成分              </td><td> PCA 主成分特征作为模型输入之一      </td></tr></tbody></table><h4 id="34-feedback-ga-ann-ann--ga-">3.4 预测模型：Feedback-GA-ANN（考虑热历史的反馈 ANN + GA 优化）</h4><table><thead><tr><th> 组件             </th><th> 作用                    </th><th> 说明                                                         </th></tr></thead><tbody><tr><td> 反馈机制（Feedback） </td><td> 让模型显式利用焊接热历史（滞后/记忆效应） </td><td> 区别于“一次前向”的传统 ANN                       </td></tr><tr><td> GA 优化          </td><td> 自动优化 ANN 参数           </td><td> 采用经典遗传算法搜索最优参数组合                       </td></tr><tr><td> 在线应用方式         </td><td> 先离线建模，再在线逐点预测         </td><td> “离线学到 3D-STD 特征→孔隙率映射”，在线根据当前特征输出局部孔隙率 </td></tr></tbody></table><h3 id="">四、结论</h3><ol start="1"><li>构建了基于 <strong>CVM+OCT 的钥匙孔三维形态特征</strong>监测框架，相比仅用 2D 形态更丰富。</li><li>提出了基于<strong>滑动窗口 STD</strong>的钥匙孔实时波动量化，并将其与局部孔隙率建立映射。</li><li>通过 <strong>Feedback-GA-ANN</strong> 实现焊缝沿程的<strong>局部孔隙率在线跟踪与预测</strong>（文中示例提到用实验 #1 的末 200 个局部孔隙率点作为测试片段进行预测展示）。</li></ol><h3 id="">五、创新</h3><ol start="1"><li><strong>传感层面</strong>：将成熟 CVM 与新兴 OCT 结合，实现钥匙孔<strong>外形开口 + 内部深度</strong>的 3D 形态特征测量，用于孔隙率监测。</li><li><strong>特征层面</strong>：提出“滑动窗口扫描 + STD”的实时波动量化范式，把钥匙孔不稳定性转为可学习特征。</li><li><strong>建模层面</strong>：在 ANN 中引入“反馈机制”显式编码热历史，再用 GA 做参数优化，形成可落地的在线预测链路。</li><li><strong>机理联系</strong>：强调并揭示“孔隙率 ↔ 钥匙孔 3D 形态波动”的对应关系，使模型更接近过程机理而非纯黑箱。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li><strong>硬件门槛与工程复杂度高</strong>：方法依赖 CVM+OCT 的同步测量与标定；一旦现场有强弧光、飞溅、烟尘、反射率变化，OCT/视觉信号质量可能显著波动，系统维护成本较高。</li><li><strong>反馈 ANN 的时序表达能力有限</strong>：反馈机制用于热历史是亮点，但与 LSTM/TCN/Transformer 等端到端时序模型相比，表达上限可能受限（尤其在工况跨度更大时）。</li></ol><hr/><h2 id="11-intelligent-detection-method-for-aluminum-alloy-tig-welding-quality-by-fusing-multimodal-data-featureshttpswwwsciencedirectcomsciencearticlepiis0167865525000042">11. <a href="https://www.sciencedirect.com/science/article/pii/S0167865525000042">Intelligent detection method for aluminum alloy TIG welding quality by fusing multimodal data features</a></h2><blockquote><p>2026.1.27</p><p>TIG, 焊接, 多模态数据融合, 深度学习, 质量监测</p></blockquote>
<h3 id="">一、研究问题</h3><p>铝合金 TIG 焊接过程中存在多源扰动（工况波动、环境干扰、操作变化等），会引发<strong>细微但关键</strong>的质量变化；仅依赖单一传感器/单一模态的传统在线方法难以捕捉这类变化，也难以建立更准确的隐含关联来融合异构数据。</p><h3 id="">二、核心思路</h3><p>提出 <strong>Resnet-Transformer 模型（RTM）</strong> 做多模态特征级融合：</p><ul><li>模态：焊接熔池图像 + 焊接电流 + 焊接速度</li><li>任务：识别 TIG 焊接的 <strong>6 种焊接状态</strong></li><li>融合：采用 <strong>MFB（Multi-modal Factorized Bilinear）</strong> 融合图像与时间序列特征</li><li>配套：设计“熔池图像自动分割/裁剪/增强”算法，去除钨电极等冗余特征，提高输入质量与模型鲁棒性（文中报告准确率提升 8.6%）</li></ul><h3 id="">三、方法论</h3><h4 id="31-">3.1 整体流程</h4><table><thead><tr><th> 阶段         </th><th> 输入             </th><th> 关键处理                                            </th><th> 输出                   </th><th> 目的                                  </th></tr></thead><tbody><tr><td> 3.1.1 数据输入 </td><td> 熔池图像；电流序列；速度序列 </td><td> 图像尺寸 224×224；时序长度 1280（电流/速度）                   </td><td> 两路输入张量               </td><td> 形成“图像 + 时序”双塔输入                     </td></tr><tr><td> 3.1.2 图像增强 </td><td> 原始熔池图像         </td><td> 自动分割/裁剪 + 增强（去冗余、提边缘与细节）                        </td><td> Enhanced pool images </td><td> 减少钨电极干扰、提高学习效率  </td></tr><tr><td> 3.1.3 特征提取 </td><td> 图像；时序          </td><td> 图像塔：RAT（ResNet50-Attention）；时序塔：Transformer 编码器 </td><td> 图像特征、时序特征            </td><td> 分别提取空间与时间表征     </td></tr><tr><td> 3.1.4 特征融合 </td><td> 两路特征           </td><td> MFB 融合（降维→扩展→逐元素乘→池化→归一化）                       </td><td> 融合特征                 </td><td> 建立跨模态隐含关联       </td></tr><tr><td> 3.1.5 分类   </td><td> 融合特征           </td><td> 分类头（CrossEntropy）                               </td><td> 6 类焊接状态              </td><td> 完成质量状态识别                            </td></tr></tbody></table><h4 id="32-">3.2 图像增强处理（自动分割/裁剪/增强）</h4><p>“可复现步骤表”：</p><table><thead><tr><th> 步骤 </th><th> 操作要点                                                          </th><th> 目的                                   </th></tr></thead><tbody><tr><td> 1  </td><td> 二值化获取钨电极轮廓，取轮廓点 <strong>y 最大值</strong> (y<em>{max})，裁掉 (y</em>{max}) 以下区域，仅保留熔池区域 </td><td> 去除电极冗余与干扰        </td></tr><tr><td> 2  </td><td> Sobel 计算梯度幅值与方向                                               </td><td> 强化边缘/细节                              </td></tr><tr><td> 3  </td><td> 双阈值（强/弱边缘）+ 边缘连接（弱边缘需与强边缘连通）                                  </td><td> 获得更完整边缘                              </td></tr><tr><td> 4  </td><td> 获取熔池轮廓的 x/y 极值，确定裁剪范围                                         </td><td> 聚焦熔池区域                               </td></tr><tr><td> 5  </td><td> 直方图均衡化 + 以 Sobel 为核的 2D 卷积滤波                                  </td><td> 增强对比度与细节、稳定纹理表征  </td></tr></tbody></table><blockquote><p>论文动机：钨电极在不同图像中轮廓与灰度高度一致，属于冗余特征；其灰度又与熔池差异显著，会干扰图像优化与模型学习。</p></blockquote>
<h4 id="33-rtmresnet-transformer-mfb-">3.3 RTM（Resnet-Transformer）与 MFB 融合</h4><ul><li><strong>双塔/双分支特征级融合</strong>：一塔处理图像，一塔处理时序（电流+速度）。</li><li><p><strong>图像特征提取：RAT（Resnet50-Attention）</strong></p><ul><li>用 Transformer 的多头自注意力思想改造 ResNet50：在 E 层残差块中用注意力机制替换部分 3×3 卷积（论文表述为对 ResNet50 的 attention 改进）。</li></ul></li><li><p><strong>融合：MFB（Multi-modal Factorized Bilinear）</strong></p><ul><li>两路特征先全连接降维，再扩展维度；</li><li>逐元素相乘；dropout；求和池化；</li><li>幂归一化 + L2 归一化；得到融合表征用于分类。</li></ul></li></ul><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/pyqfl8bpvqfk01upyw.jpg" height="478" width="713"/></p><h4 id="34-">3.4 训练设置</h4><table><thead><tr><th> 参数                                    </th><th>                值 </th></tr></thead><tbody><tr><td> Number of training iterations（训练迭代次数） </td><td>              300 </td></tr><tr><td> Optimization algorithm（优化算法）          </td><td>             Adam </td></tr><tr><td> Learning rate（学习率）                    </td><td>             1e-6 </td></tr><tr><td> Batch size（批量大小）                      </td><td>               32 </td></tr><tr><td> Loss function（损失函数）                   </td><td> CrossEntropyLoss </td></tr></tbody></table><h3 id="">四、结论</h3><ul><li>RTM 完成 6 类焊接状态分类，整体准确率 <strong>98.94%</strong>。</li><li>文中强调：相较仅基于图像的 RAT，RTM（多模态融合）在案例中有明显提升（摘要中给出“相对 RAT 提升 16.13%”的描述）。</li></ul><p>你整理的分类指标可直接用 Markdown 表固化：</p><p><strong>表：RTM 的分类指标（Precision / Recall / F1）</strong></p><table><thead><tr><th> 类别                     </th><th> Precision </th><th> Recall </th><th>    F1 </th></tr></thead><tbody><tr><td> Good welding（良好）       </td><td>     0.985 </td><td>  0.950 </td><td> 0.987 </td></tr><tr><td> Burn through（烧穿）       </td><td>     1.000 </td><td>  0.995 </td><td> 1.000 </td></tr><tr><td> Contamination（污染）      </td><td>     0.956 </td><td>  0.994 </td><td> 0.985 </td></tr><tr><td> Lack fusion（缺乏熔合）      </td><td>     1.000 </td><td>  1.000 </td><td> 0.997 </td></tr><tr><td> Misalignment（错位）       </td><td>     0.975 </td><td>  0.980 </td><td> 0.975 </td></tr><tr><td> Lack penetration（缺乏穿透） </td><td>     0.989 </td><td>  0.990 </td><td> 0.994 </td></tr></tbody></table><p><strong>表：RTM vs HYC 的分类指标对比</strong></p><table><thead><tr><th> 类别                     </th><th> Precision RTM </th><th> Precision HYC </th><th> Recall RTM </th><th> Recall HYC </th><th> F1 RTM </th><th> F1 HYC </th></tr></thead><tbody><tr><td> Good welding（良好）       </td><td>         1.000 </td><td>         0.931 </td><td>      0.975 </td><td>      0.875 </td><td>  0.987 </td><td>  0.902 </td></tr><tr><td> Burn through（烧穿）       </td><td>         0.976 </td><td>         0.979 </td><td>      1.000 </td><td>      0.950 </td><td>  1.000 </td><td>  0.964 </td></tr><tr><td> Contamination（污染）      </td><td>         0.976 </td><td>         0.930 </td><td>      1.000 </td><td>      0.925 </td><td>  0.988 </td><td>  0.927 </td></tr><tr><td> Lack fusion（缺乏熔合）      </td><td>         1.000 </td><td>         0.985 </td><td>      0.995 </td><td>      0.995 </td><td>  0.997 </td><td>  0.993 </td></tr><tr><td> Misalignment（错位）       </td><td>         0.985 </td><td>         0.899 </td><td>      0.985 </td><td>      0.930 </td><td>  0.985 </td><td>  0.914 </td></tr><tr><td> Lack penetration（缺乏穿透） </td><td>         0.995 </td><td>         0.923 </td><td>      1.000 </td><td>      0.965 </td><td>  0.998 </td><td>  0.944 </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>多模态特征级融合落地到 TIG 质量监测</strong>：融合熔池图像 + 电流 + 速度，以 MFB 建立异构数据的隐含关联，提升状态识别性能。</li><li><strong>Resnet + Transformer 
编码器的组合式建模</strong>：利用 ResNet 的局部表征能力与 Transformer 的全局建模能力，在图像与时序两路分别提特征后再融合。</li><li><strong>熔池图像自动分割/裁剪增强</strong>：去除钨电极等冗余特征，提升数据质量与泛化鲁棒性；论文报告该预处理带来 <strong>8.6%</strong> 的准确率增益。</li></ol><h3 id="">六、缺陷（建议写进笔记的“可批判点”）</h3><ol start="1"><li><strong>融合可解释性不足</strong>：MFB 能提升性能，但“哪一模态在何种缺陷起主导作用、为何提升”若缺少可解释分析（如 Grad-CAM / SHAP / 注意力可视化），很难直接指导传感器布局与工艺诊断。</li></ol><hr/><h2 id="12-cross-attention-based-multi-sensing-signals-fusion-for-penetration-state-monitoring-during-laser-welding-of-aluminum-alloyhttpswwwsciencedirectcomsciencearticlepiis0950705122013089">12. <a href="https://www.sciencedirect.com/science/article/pii/S0950705122013089">Cross-attention-based multi-sensing signals fusion for penetration state monitoring during laser welding of aluminum alloy</a></h2><blockquote><p>2026.1.27</p><p>穿透状态监测，激光焊接，交叉注意力，多传感信号融合，深度学习</p></blockquote>
<h3 id="">一、研究问题</h3><p>随着产品与工艺复杂度提升，单一传感器难以全面表征激光焊接过程中的穿透状态波动，亟需更精确的多传感监测策略。
现有多传感方案常依赖手工预处理或时频分析，且不同传感器特征往往“各做各的”，导致跨模态高层信息未能联合利用，从而影响鲁棒性与精度。</p><h3 id="">二、核心思路</h3><p>采用<strong>光电传感器（photodiode）</strong>与<strong>声学传感器（microphone）</strong>同步采集铝合金激光焊接过程的一维时间序列信号；依据焊缝<strong>顶部与背面形貌</strong>将数据划分为 <strong>3 类穿透状态</strong>。
提出<strong>交叉注意力融合网络 CAFNet</strong>，直接在<strong>原始时域信号</strong>上交互式融合光电与声学信息，实现穿透状态分类，避免事先进行时频分析与特征工程。</p><h3 id="">三、方法论</h3><h4 id="31-">3.1 数据与任务定义</h4><table><thead><tr><th> 项目   </th><th> 内容                                                   </th></tr></thead><tbody><tr><td> 传感器  </td><td> photodiode + microphone（光电 + 声学） </td></tr><tr><td> 数据形式 </td><td> 两路原始一维时间序列（时域信号）                 </td></tr><tr><td> 标签   </td><td> 按顶部/背面形貌划分为 3 类穿透状态              </td></tr><tr><td> 核心目标 </td><td> 穿透状态分类（无需时频图/手工特征）               </td></tr></tbody></table><h4 id="32-cafnet-">3.2 CAFNet 总体结构</h4><p>CAFNet 由“两分支 1D-CNN + 交叉注意力（CA）块”组成：两路信号分别提取特征，再通过 CA 模块实现跨模态交互融合。</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/file/g20wwuctx3mjcwmy8q.jpg" alt="Network architecture of the proposed CAFNet" height="625" width="652"/></p><h4 id="33-1d-cnn-">3.3 1D-CNN 分支（两路同构）</h4><p>每个 1D-CNN 分支包含 4 个 Conv 块；每个 Conv 块由以下层组成：</p><table><thead><tr><th> 模块             </th><th> 结构                                                                                          </th></tr></thead><tbody><tr><td> Conv block（×4） </td><td> Conv(3×1, L filters) → BN → LReLU(negative slope=0.001) → MaxPool(2×1)  </td></tr></tbody></table><p>其中 BN 用于缓解内部协变量偏移并加速收敛；LReLU 用于激活归一化后的特征图。</p><h4 id="34-ca">3.4 交叉注意力（CA）融合</h4><p>核心贡献是把 Transformer 的注意力思想用于“光电-声学”两路时域特征的交互式提取与融合；论文在 Highlights 中强调其为“modified self-attention mechanism”以实现 photoacoustic 特征的交互提取。
同时强调模型可直接处理原始时域信号、在不平衡/小样本训练比率下表现更稳健。</p><h3 id="">四、结论</h3><p>论文在摘要中给出两条关键结论：</p><ul><li>完整数据设置下，CAFNet 达到 <strong>mean testing accuracy = 99.73%</strong>、<strong>std = 0.37%</strong>，优于对比 DL 方法。</li><li>在“有限且不平衡数据”条件下，CAFNet 达到<strong>最高 average testing accuracy = 94.34%</strong>，体现更强鲁棒性。</li></ul><table><thead><tr><th> 方法        </th><th>          10% </th><th>          20% </th><th>          30% </th><th>          40% </th><th>          50% </th><th> Average（Mean %） </th></tr></thead><tbody><tr><td> 1D-CNN-A  </td><td> 67.45 ± 1.04 </td><td> 71.55 ± 1.68 </td><td> 74.28 ± 0.72 </td><td> 77.54 ± 0.93 </td><td> 79.09 ± 1.37 </td><td>           73.98 </td></tr><tr><td> 1D-CNN-P  </td><td> 66.96 ± 1.74 </td><td> 70.92 ± 1.10 </td><td> 74.10 ± 1.06 </td><td> 76.31 ± 1.17 </td><td> 77.76 ± 1.77 </td><td>           73.21 </td></tr><tr><td> ResCNN-A  </td><td> 79.67 ± 5.97 </td><td> 88.39 ± 1.90 </td><td> 91.50 ± 1.31 </td><td> 94.58 ± 1.24 </td><td> 95.05 ± 1.17 </td><td>           89.84 </td></tr><tr><td> ResCNN-P  </td><td> 46.27 ± 5.70 </td><td> 58.05 ± 7.23 </td><td> 68.45 ± 8.09 </td><td> 80.80 ± 2.90 </td><td> 86.71 ± 1.55 </td><td>           68.06 </td></tr><tr><td> 1D-CNN-AP </td><td> 78.77 ± 1.12 </td><td> 84.28 ± 0.64 </td><td> 87.74 ± 1.48 </td><td> 89.62 ± 1.41 </td><td> 90.27 ± 1.70 </td><td>           86.14 </td></tr><tr><td> ResCNN-AP </td><td> 76.62 ± 3.19 </td><td> 83.52 ± 2.48 </td><td> 88.88 ± 1.86 </td><td> 93.73 ± 1.15 </td><td> 94.47 ± 1.40 </td><td>           87.44 </td></tr><tr><td> CAFNet    </td><td> 86.70 ± 0.46 </td><td> 91.67 ± 0.57 </td><td> 96.89 ± 0.41 </td><td> 97.94 ± 0.36 </td><td> 98.49 ± 0.39 </td><td>           94.34 </td></tr></tbody></table><h3 id="">五、创新</h3><ol start="1"><li><strong>交叉注意力用于多传感时域信号融合</strong>：用 CA 在特征层进行跨模态交互式提取 photoacoustic 信息，而不是简单拼接/相加。</li><li><strong>无需时频分析与手工特征</strong>：直接对原始时域信号建模，降低特征工程依赖与系统复杂度，更贴近在线监测落地。</li><li><strong>小样本/不平衡鲁棒性</strong>：在训练数据比例受限时仍保持较高平均准确率（表中 CAFNet 的 Average=94.34%，并被论文摘要强调为“stronger robustness”）。</li></ol><h3 id="">六、缺陷</h3><ol start="1"><li><strong>可解释性仍偏弱</strong>：提出了交叉注意力，但若缺少“注意力权重/重要时间片段/跨模态贡献”的可视化与定量解释，难以直接指导传感器布置与闭环控制策略（尤其是工业端更关心“为什么判成 LP/PP/FP”）。</li><li><strong>对同步与采样稳定性的依赖</strong>：跨模态注意力本质依赖两路信号的时间对齐质量；现场采样时钟漂移、传感器安装位置变化、噪声源变化可能导致分布漂移，需额外的对齐/校准策略才能稳健部署。</li></ol><hr/><h3 id="">一、研究问题</h3><h3 id="">二、核心思路</h3><h3 id="">三、方法论</h3><h3 id="">四、结论</h3><h3 id="">五、创新</h3><h3 id="">六、缺陷</h3></div><p style="text-align:right"><a href="https://blog.verxie.org/posts/study/literature_welding#comments">看完了？说点什么呢</a></p></div>]]></description><link>https://blog.verxie.org/posts/study/literature_welding</link><guid isPermaLink="true">https://blog.verxie.org/posts/study/literature_welding</guid><dc:creator><![CDATA[Ver]]></dc:creator><pubDate>Tue, 20 Jan 2026 08:28:33 GMT</pubDate></item><item><title><![CDATA[【学业/工作】停笔的那一刻，天没有塌]]></title><description><![CDATA[<div><blockquote>该渲染由 Shiro API 生成，可能存在排版问题，最佳体验请前往：<a 
href="https://blog.verxie.org/notes/1">https://blog.verxie.org/notes/1</a></blockquote><div><p>纸张声哗啦啦得响着，「从后往前传」，前方传来严厉的声音。</p><p>我没有抬起头，但是我能感受到前后桌的目光都在盯着我，脑子里嗡嗡作响，写还是不写，握着笔的手微微发抖，直到————</p><p>「后面的那个！」我知道老师在盯着我「停笔！」</p><p>那天是我第一次进到第二考场，当我自以为能从从容容地完成考试，没想到...</p><p>那时候的我，真的感觉天塌了，回家闷头趴在被子里面，自责，难过，自暴自弃，混合的负面情绪搅得我一头乱麻。</p><p>现在看来，我觉得这是中国学生普遍的缩影，青春的成长中，『学业』占据了太多的生活，朝九晚五？不不不，我们的时代是属于朝七晚九的。</p><p>也许正因如此，默默的在人们心目中，『未来』的好坏等于『学业』的优劣的观念逐渐生根发芽。</p><p>所以在我刚刚转到国际高中时，不适感不是来源于『语言沟通』，而是我周围人的『学习态度』。</p><p>我不理解，为什么这群人仿佛不担心『未来』一样，悠闲地学，悠闲地玩。</p><p>我不明白，为什么这群人仿佛不在乎『学业』一样，考差考好，都无所谓。</p><p>凭借着在应试教育培养下得到的学习能力，我还算轻松的完成了高中学业出国留学了。</p><p>再往后，在这种轻松的低压力的环境影响下，不出乎意料地，我迎来了一次严重的失利。</p><p>大三的毕业考试，我几乎在纯裸考的情况下完成了考试，不用估分，我也知道这次彻彻底底地考砸了。</p><p>『高中』努力却够不着目标的自责，与『大学』放纵导致碰不到目标的自责，相似却又不同。</p><p>我思考了很久，那个『未来』等于『学业』的观念早就随着环境的变化消散了，那我到底，为什么而『学』？</p><p>为了『过去』？那个长辈因时代而没能够实现的学业梦？</p><p>为了『未来』？一个全员都在卷成绩而忽视生活的社会？</p><p>不妨想想，如果『失去』了学业，我们还『拥有』什么？</p><p>地球还在转着，红绿灯还在亮着，行人仍走着，『我们』还活着。</p><p>为什么卷『学业』？想为『未来』留下保障，更好的活着。</p><p>仅仅看到了『学业』，卷是没有尽头的。</p><p>如果忘掉了『学业』，人会失去主心骨。</p><p>平衡好程度，不应该为了『学』而『学』。</p><p>『学』是一种能力，应用的场景还是『生活』，为了『自己』而『学』。</p><p>『工作』亦是如此，有搞砸的时候，有表现优异的时候，无论如何，先给自己点个外卖，安慰/奖励一下自己吧。</p><p>哭完的我，吃着外卖，思考着。</p></div><p style="text-align:right"><a href="https://blog.verxie.org/notes/1#comments">看完了？说点什么呢</a></p></div>]]></description><link>https://blog.verxie.org/notes/1</link><guid isPermaLink="true">https://blog.verxie.org/notes/1</guid><dc:creator><![CDATA[Ver]]></dc:creator><pubDate>Wed, 07 Jan 2026 14:58:49 GMT</pubDate></item><item><title><![CDATA[机器学习:代码 - Machine Learning Code]]></title><description><![CDATA[<link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/xohrbuh62sme754c3x.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/kevbht0ti0ru3s6mmr.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/kdoh8vvfzdram9zh63.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/yv5zafr7hg3p1c4qcj.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/xp9dknv5kvlaox8fp3.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/vl37pce8fx6ku76wag.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/mxl1jd16omm6elgc74.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/l9kkib1g83od6r9zln.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/8efnlcv2n5aydoqgbs.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/r0m0gdqvxhgqpeh1mt.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/drjbig3y4jrio50czx.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/v4vp6vppkwirvejqjx.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/j5v82aic7wly97vnj4.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/3bddgs8nvbuseos98o.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/lo6phdefwzbxcqj2dw.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/1ffzbhdj5js4i1ngsl.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/vkp5560uuxqbcywlug.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/2nnscve78ch6ohrcjh.png"/><link 
rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/qbmxhnrnx2t27sl3pf.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/10idk3r61grfv31mr8.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/tmxsa4k5t5r839j91b.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/qnvxto6ezq3oi1mfin.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/xaywvteldkuxsfr64f.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/wlfrsyfbkahygg4p68.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/u3zhl00fab82mcqcb9.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/yuqaqjthhsdzmpjr4x.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/kf8xlae9atjp75g39k.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/p4jkl3kvkjwrkrynms.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/ww7smm06s5yiummbwy.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/kiijlzwj6172l5k1xt.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/jqm8j87gc8jkb98zvo.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/6wvbtgtx52oqdzt4xn.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/08rxdlgz687jjrreg6.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/t1x0kzmtexsc7s7o2h.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/4hap1vynsdm5pn15zu.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/hor6l8cu7khkbjfj7h.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/ya87hncodslmscmubi.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/ny6aoldcdews1nq8nr.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/iztxc05s643lstfub3.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/a0rkhqkt4h1rd2cqaw.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/rfscy4i72jb6sl8y7j.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/wj0frop4ll6s50dr5i.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/3jmijx7dav20pp529p.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/6rjxakg7liu6nuuzvq.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/prui72vq9sffh7srdc.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/9k52ibxora5nzb9ngs.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/lk7jooim3sakiotc7f.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/as2pcr91mcyp2acei1.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/ky53oli2ezhfi566x5.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/gduhevjqcq6enep09h.png"/><link rel="preload" as="image" 
href="https://mixapi2.verxie.org/api/v2/objects/icon/q9kopwg1vahegcuopt.png"/><div><blockquote>该渲染由 Shiro API 生成，可能存在排版问题，最佳体验请前往：<a href="https://blog.verxie.org/posts/study/2023-10-30-Machine-Learning-Code">https://blog.verxie.org/posts/study/2023-10-30-Machine-Learning-Code</a></blockquote><div><h2 id="1-linear-regression">1. 线性回归(Linear Regression)</h2><h3 id="11-boston-housing">1.1 Boston Housing的数据</h3><h4 id="111-">1.1.1 导入库</h4><pre class="language-py lang-py"><code class="language-py lang-py"># 导入 NumPy 库：一个用于数值计算的库，提供了大量的数学函数来操作数组。
import numpy as np

# 导入 Matplotlib 的 pyplot 模块：这是一个绘图库，用于创建静态、动态、交互式的可视化图形。
import matplotlib.pyplot as plt

# 从 scikit-learn 库中导入 preprocessing 模块：这个模块提供了几种常用的实用功能，如特征缩放、中心化、标准化和二值化等。
from sklearn import preprocessing
</code></pre>
<h4 id="112-">1.1.2 导入数据</h4><pre class="language-py lang-py"><code class="language-py lang-py"># 导入 pandas 库: 用于处理csv文件
import pandas as pd
data_url = &quot;https://lib.stat.cmu.edu/datasets/boston&quot;
# sep=&quot;\s+&quot;意味着数据列之间由一个或多个空格分隔。skiprows=22表示跳过前22行，header=None表示数据没有列标题。
raw_df = pd.read_csv(data_url, sep=&quot;\s+&quot;, skiprows=22, header=None)
# 第一部分是raw_df的偶数行（从0开始）的所有列，第二部分是raw_df的奇数行的前两列。并且是水平堆叠
boston_data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# raw_df的奇数行的第三列作为目标变量
target = raw_df.values[1::2, 2]
</code></pre>
<p>为了使示例简单，这里仅使用<strong>两个</strong>特征：<code>INDUS</code>和<code>RM</code>。这些及其他特征的说明可在<a href="https://lib.stat.cmu.edu/datasets/boston">数据页</a>上找到。</p><pre class="language-py lang-py"><code class="language-py lang-py"># 获取 Boston 数据
data = boston_data;
# 仅仅处理INDUS和RM, 从data中选择所有行但只选择第3和第6列
x_input = data[:, [2,5]]
y_target = target;
# 对每个样本做归一化（按行缩放到单位 L2 范数），使数值尺度更规整
x_input = preprocessing.normalize(x_input)
</code></pre>
<h4 id="113-visualization">1.1.3 可视化(Visualization)</h4><pre class="language-py lang-py"><code class="language-py lang-py"># 对两个特征单独地画图
plt.title(&#x27;Industrialness vs Med House Price&#x27;)
plt.scatter(x_input[:, 0], y_target)
plt.xlabel(&#x27;Industrialness&#x27;)
plt.ylabel(&#x27;Med House Price&#x27;)
plt.show()

plt.title(&#x27;Avg Num Rooms vs Med House Price&#x27;)
plt.scatter(x_input[:, 1], y_target)
plt.xlabel(&#x27;Avg Num Rooms&#x27;)
plt.ylabel(&#x27;Med House Price&#x27;)
plt.show()
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/xohrbuh62sme754c3x.png" height="455" width="562"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/kevbht0ti0ru3s6mmr.png" height="455" width="562"/></p><h3 id="12-defining-a-linear-regression-model">1.2 定义一个线性回归模型(Defining a Linear Regression Model)</h3><p>线性回归模型为:</p><p>$$</p><pre class=""><code class="">f(x)=\mathbf{w}^\top \mathbf{x}+b=w_{1}x_{1}+w_{2}x_{2}+b,</code></pre><p>$$</p><p><code>np.dot(w, v) for vector dot product</code><br/><code>np.dot(W, V) for matrix dot product</code></p><pre class="language-py lang-py"><code class="language-py lang-py">def linearmodel(w, b, x):
    &#x27;&#x27;&#x27;
    Input: w 是权重, b 是截距, x 是d维的向量
    Output: 预测的输出
    &#x27;&#x27;&#x27;
    return np.dot(w, x) + b
</code></pre>
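<p>作为一个简单的数值检查（数值为任取的示例）：</p><pre class="language-py lang-py"><code class="language-py lang-py"># 取 w = [1, 2], b = 3, x = [4, 5]，预测值应为 1*4 + 2*5 + 3 = 17
w_demo = np.array([1, 2])
x_demo = np.array([4, 5])
print(linearmodel(w_demo, 3, x_demo))  # 输出 17
</code></pre>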
<pre class="language-py lang-py"><code class="language-py lang-py">def linearmat_1(w, b, X):
    &#x27;&#x27;&#x27;
    Input: w 是权重, b 是截距, X 是数据矩阵 (n x d)
    Output: 包含线性模型预测的向量
    &#x27;&#x27;&#x27;
    # n 是训练例子的数量
    n = X.shape[0]
    t = np.zeros(n)
    for i in range(n):
        t[i] = linearmodel(w, b, X[i, :])
    return t
</code></pre>
<h4 id="121-vectorization">1.2.1 向量化(Vectorization)</h4><pre class="language-py lang-py"><code class="language-py lang-py">def linearmat_2(w, X):
    &#x27;&#x27;&#x27;
    linearmat_1的向量化.
    Input: w 是权重(包含截距), and X 数据矩阵 (n x (d+1)) (包含特征)
    Output:包含线性模型预测的向量
    &#x27;&#x27;&#x27;
    return np.dot(X, w)
</code></pre>
<h3 id="13-comparing-speed-of-the-vectorized-vs-unvectorized-code">1.3 向量化和非向量化代码的速度比较(Comparing speed of the vectorized vs unvectorized code)</h3><p>非向量化代码的时间</p><pre class="language-py lang-py"><code class="language-py lang-py">import time
w = np.array([1,1])
b = 1
t0 = time.time()
p1 = linearmat_1(w, b, x_input)
t1 = time.time()
print(&#x27;the time for non-vectorized code is %s&#x27; % (t1 - t0))
</code></pre>
<p>向量化代码的时间</p><pre class="language-py lang-py"><code class="language-py lang-py"># 把截距添加到权重向量
wb = np.array([b, w[0], w[1]])
# 在输入矩阵中添加一列值为 1 的特征（对应截距项）
x_in = np.concatenate([np.ones([np.shape(x_input)[0], 1]), x_input], axis=1)
t0 = time.time()
p2 = linearmat_2(wb, x_in)
t1 = time.time()
print(&#x27;the time for vectorized code is %s&#x27; % (t1 - t0))
</code></pre>
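<p>除了比较运行时间，还可以顺便验证两种实现的预测结果在数值上是否一致（下面是一行简单的检查代码）：</p><pre class="language-py lang-py"><code class="language-py lang-py"># 非向量化与向量化实现应给出几乎相同的预测
print(np.allclose(p1, p2))  # 预期输出 True
</code></pre>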
<h3 id="14-defining-the-cost-function">1.4 定义损失函数(Defining the Cost Function)</h3><p>$$
C(\mathbf{y}, \mathbf{t}) = \frac{1}{2n}(\mathbf{y}-\mathbf{t})^\top (\mathbf{y}-\mathbf{t}).
$$</p><pre class="language-py lang-py"><code class="language-py lang-py">def cost(w, X, y):
    &#x27;&#x27;&#x27;
    评估向量化方法的损失函数
    输入 `X` 和输出 `y`, 在权重 `w`.
    &#x27;&#x27;&#x27;
    residual = y - linearmat_2(w, X)  # 获取差值
    err = np.dot(residual, residual) / (2 * len(y))

    return err
</code></pre>
<p>例如，在上面给定的权重 <code>wb</code> 下计算损失：</p><pre class="language-py lang-py"><code class="language-py lang-py">cost(wb, x_in, y_target)
</code></pre>
<h3 id="15-plotting-cost-in-weight-space">1.5 在权重空间画出损失(Plotting cost in weight space)</h3><pre class="language-py lang-py"><code class="language-py lang-py">w1s = np.arange(-22, -10, 0.01)
w2s = np.arange(0, 12, 0.1)
b = 31.11402451
W1, W2 = np.meshgrid(w1s, w2s)
z_cost = np.zeros([len(w2s), len(w1s)])
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        w = np.array([b, W1[i, j], W2[i, j]])
        z_cost[i, j] = cost(w, x_in, y_target)
CS = plt.contour(W1, W2, z_cost,25)
plt.clabel(CS, inline=1, fontsize=10)
plt.title(&#x27;Costs for various values of w1 and w2 for b=31.11402451&#x27;)
plt.xlabel(&quot;w1&quot;)
plt.ylabel(&quot;w2&quot;)
plt.plot([-16.44307658], [6.79809451], &#x27;o&#x27;)
plt.show()
</code></pre><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/kdoh8vvfzdram9zh63.png" height="455" width="562"/></p><h3 id="16-exact-solution">1.6 精确的解决方法(Exact Solution)</h3><p>$$
\mathbf{w}^*=(X^\top X)^{-1}X^\top y.
$$</p><pre class="language-py lang-py"><code class="language-py lang-py">def solve_exactly(X, y):
    &#x27;&#x27;&#x27;
    精确解决线性回归(完全向量化)

    给出 `X` - n x (d+1) 的输入矩阵 
         `y` - 目标输出
    返回(d+1)维的最佳权重向量
    &#x27;&#x27;&#x27;
    A = np.dot(X.T, X)
    c = np.dot(X.T, y)
    return np.dot(np.linalg.inv(A), c)

w_exact = solve_exactly(x_in, y_target)
print(w_exact)
</code></pre>
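<p>该闭式解来自令梯度为零：由 $X^\top(X\mathbf{w}-\mathbf{y})=0$ 得到正规方程 $X^\top X\mathbf{w}=X^\top \mathbf{y}$。一个小的实现细节：显式求逆在数值上通常不如直接解线性方程组稳定，下面给出一个等价的参考写法（仅作补充示例）：</p><pre class="language-py lang-py"><code class="language-py lang-py">def solve_exactly_stable(X, y):
    &#x27;&#x27;&#x27;
    与 solve_exactly 等价：用 np.linalg.solve 直接求解正规方程
    X^T X w = X^T y，通常比显式调用 np.linalg.inv 在数值上更稳定。
    &#x27;&#x27;&#x27;
    A = np.dot(X.T, X)
    c = np.dot(X.T, y)
    return np.linalg.solve(A, c)

print(solve_exactly_stable(x_in, y_target))  # 应与 w_exact 基本一致
</code></pre>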
<h2 id="2-gradient-descent-for-linear-regression">2. 线性回归的梯度下降(Gradient Descent for Linear Regression)</h2><h3 id="21-boston-housing">2.1 Boston Housing的数据</h3><h4 id="211-">2.1.1 导入库</h4><pre class="language-py lang-py"><code class="language-py lang-py">import matplotlib
import numpy as np
import random
import warnings
import matplotlib.pyplot as plt
from sklearn import preprocessing   # for normalization
</code></pre>
<h4 id="212-">2.1.2 导入数据</h4><pre class="language-py lang-py"><code class="language-py lang-py">import pandas as pd
import numpy as np
data_url = &quot;https://lib.stat.cmu.edu/datasets/boston&quot;
raw_df = pd.read_csv(data_url, sep=&quot;\s+&quot;, skiprows=22, header=None)
boston_data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

data = boston_data;
x_input = data  # a data matrix
y_target = target; # a vector for all outputs
# add a feature 1 to the dataset, then we do not need to consider the bias and weight separately
x_in = np.concatenate([np.ones([np.shape(x_input)[0], 1]), x_input], axis=1)
# we normalize the data so that each has regularity
x_in = preprocessing.normalize(x_in)
</code></pre>
<h3 id="22-linear-model">2.2 线性模型(Linear Model)</h3><p>$$
f(x)=\mathbf{w}^\top \mathbf{x}.
$$</p><pre class="language-py lang-py"><code class="language-py lang-py">def linearmat_2(w, X):
    &#x27;&#x27;&#x27;
    linearmat_1的向量化.
    Input: w 是权重(包含截距), and X 数据矩阵 (n x (d+1)) (包含特征)
    Output:包含线性模型预测的向量
    &#x27;&#x27;&#x27;
    return np.dot(X, w)
</code></pre>
<h3 id="23-cost-function">2.3 损失函数(Cost Function)</h3><p>$$
C(\mathbf{y}, \mathbf{t}) = \frac{1}{2n}(\mathbf{y}-\mathbf{t})^\top (\mathbf{y}-\mathbf{t}).
$$</p><pre class="language-py lang-py"><code class="language-py lang-py">def cost(w, X, y):
    &#x27;&#x27;&#x27;
    评估向量化方法的损失函数
    输入 `X` 和输出 `y`, 在权重 `w`.
    &#x27;&#x27;&#x27;
    residual = y - linearmat_2(w, X)  # 获取差值
    err = np.dot(residual, residual) / (2 * len(y))

    return err
</code></pre>
<h3 id="24-gradient-computation">2.4 梯度计算(Gradient Computation)</h3><p>$$</p><pre class=""><code class="">\nabla C(\mathbf{w}) =\frac{1}{n}X^\top\big(X\mathbf{w}-\mathbf{y}\big)</code></pre><p>$$</p><pre class="language-py lang-py"><code class="language-py lang-py"># 向量化梯度方程
def gradfn(weights, X, y):
    &#x27;&#x27;&#x27;
    给出 `weights` - 当前的对权重的猜想
          `X` - (N,d+1)的包含特征`1`的输入特征矩阵
          `y` - 目标y值
    返回当前数值估计的权重梯度
    &#x27;&#x27;&#x27;

    y_pred = np.dot(X, weights)
    error = y_pred - y
    return np.dot(X.T, error) / len(y)
</code></pre>
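<p>实现解析梯度后，一个常用的检查方法是用有限差分近似梯度并与 <code>gradfn</code> 的输出对比；下面是一段示意性的检查代码（eps 与容差为假设值）：</p><pre class="language-py lang-py"><code class="language-py lang-py">def numerical_grad(weights, X, y, eps=1e-6):
    &#x27;&#x27;&#x27;用中心差分近似损失函数的梯度，用于检查 gradfn 的实现是否正确&#x27;&#x27;&#x27;
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        w_plus = weights.copy()
        w_minus = weights.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (cost(w_plus, X, y) - cost(w_minus, X, y)) / (2 * eps)
    return grad

w_test = np.zeros(x_in.shape[1])
print(np.allclose(gradfn(w_test, x_in, y_target),
                  numerical_grad(w_test, x_in, y_target), atol=1e-4))  # 预期 True
</code></pre>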
<h3 id="25-gradient-descent">2.5 梯度下降(Gradient Descent)</h3><p>$$</p><pre class=""><code class="">\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta\nabla C(\mathbf{w}^{(t)})</code></pre><p>$$</p><pre class="language-py lang-py"><code class="language-py lang-py">def solve_via_gradient_descent(X, y, print_every=100,
                               niter=5000, eta=1):
    &#x27;&#x27;&#x27;
    给出  `X` - (N,D)的输入特征矩阵
          `y` - 目标y值
          `print_every` - 每&#x27;print_every&#x27; 迭代报告一次性能
          `niter` - 迭代数量的限制
          `eta` - 学习率
    用梯度下降解决线性回归

    返回
        `w` - 在`niter`次迭代之后的权重
        `idx_res` - 迭代的索引
        `err_res` - 迭代的索引对应的损失值
    &#x27;&#x27;&#x27;
    N, D = np.shape(X)
    # 初始化所有的权重为0
    w = np.zeros([D])
    idx_res = []
    err_res = []
    for k in range(niter):
        # 计算梯度
        dw = gradfn(w, X, y)
        # 梯度下降
        w = w - eta * dw
        # 每print_every迭代报告一次
        if k % print_every == print_every - 1:
            t_cost = cost(w, X, y)
            print(&#x27;error after %d iteration: %s&#x27; % (k, t_cost))
            idx_res.append(k)
            err_res.append(t_cost)
    return w, idx_res, err_res

w_gd, idx_gd, err_gd = solve_via_gradient_descent( X=x_in, y=y_target)
</code></pre>
<p>Output(partial):</p><pre class=""><code class="">error after 2199 iteration: 26.616940808457816
error after 2299 iteration: 26.475493515509722
error after 2399 iteration: 26.33686272884545
error after 2499 iteration: 26.20095757351077
...
error after 4699 iteration: 23.78096067719028
error after 4799 iteration: 23.692775341901584
error after 4899 iteration: 23.606193772224405
error after 4999 iteration: 23.521184465124133 
</code></pre>
<h3 id="26-minibatch-grident-descent">2.6 小批量梯度下降(Minibatch Grident Descent)</h3><p>$$</p><pre class=""><code class="">C(\mathbf{w})=\frac{1}{n}\sum_{i=1}^nC_i(\mathbf{w}),</code></pre><p>$$
where $C_i(\mathbf{w})$ is the loss of the model $\mathbf{w}$ on the $i$-th example. In our Boston House Price prediction problem, $C_i$ takes the form $C_i(\mathbf{w})=\frac{1}{2}(\mathbf{w}^\top\mathbf{x}^{(i)}-y^{(i)})^2$.</p><pre class="language-py lang-py"><code class="language-py lang-py">def solve_via_minibatch(X, y, print_every=100,
                               niter=5000, eta=1, batch_size=50):
    &#x27;&#x27;&#x27;
    用小批量随机梯度下降（minibatch SGD）求解线性回归权重。
    给出  `X` - (N,D)的输入特征矩阵
          `y` - 目标y值
          `print_every` - 每&#x27;print_every&#x27; 迭代报告一次性能
          `niter` - 迭代数量的限制
          `eta` - 学习率
          `batch_size` - 小批量的大小
    返回
        `w` - 在`niter`次迭代之后的权重
        `idx_res` - 迭代的索引
        `err_res` - 迭代的索引对应的损失值
    &#x27;&#x27;&#x27;
    N, D = np.shape(X)
    # 初始化所有的权重为0
    w = np.zeros([D])
    idx_res = []
    err_res = []
    tset = list(range(N))
    for k in range(niter):
        idx = random.sample(tset, batch_size)
        #sample batch of data
        sample_X = X[idx, :]
        sample_y = y[idx]
        dw = gradfn(w, sample_X, sample_y)
        w = w - eta * dw
        if k % print_every == print_every - 1:
            t_cost = cost(w, X, y)
            print(&#x27;error after %d iteration: %s&#x27; % (k, t_cost))
            idx_res.append(k)
            err_res.append(t_cost)
    return w, idx_res, err_res

w_batch, idx_batch, err_batch = solve_via_minibatch( X=x_in, y=y_target)
</code></pre>
<p>Output(partial):</p><pre class=""><code class="">error after 2199 iteration: 26.693266124289604
error after 2299 iteration: 26.467262454186802
error after 2399 iteration: 27.126877660242872
error after 2499 iteration: 26.318343441629153
...
error after 4699 iteration: 24.030476040027352
error after 4799 iteration: 23.87462298591681
error after 4899 iteration: 23.6754448557431
error after 4999 iteration: 23.54132978738581
</code></pre>
<h3 id="27-comparison-between-minibatch-gradient-descent-and-gradient-descent">2.7 小批量梯度下降和梯度下降的比较(Comparison between Minibatch Gradient Descent and Gradient Descent)</h3><pre class="language-py lang-py"><code class="language-py lang-py">plt.plot(idx_batch, err_batch, color=&quot;red&quot;, linewidth=2.5, linestyle=&quot;-&quot;, label=&quot;minibatch&quot;)
plt.plot(idx_gd, err_gd, color=&quot;blue&quot;, linewidth=2.5, linestyle=&quot;-&quot;, label=&quot;gradient descent&quot;)
plt.legend(loc=&#x27;upper right&#x27;, prop={&#x27;size&#x27;: 12})
plt.title(&#x27;comparison between minibatch gradient descent and gradient descent&#x27;)
plt.xlabel(&quot;number of iterations&quot;)
plt.ylabel(&quot;cost&quot;)
plt.grid()
plt.show()
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/yv5zafr7hg3p1c4qcj.png" height="455" width="611"/></p><h2 id="3-perceptron">3. 感知器(Perceptron)</h2><h3 id="31-">3.1 导入库</h3><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
</code></pre>
<h3 id="32-data-generation">3.2 数据生成(Data Generation)</h3><pre class="language-py lang-py"><code class="language-py lang-py"># `no_points`：表示要生成的数据点的数量。
def generate_data(no_points):
    # 创建一个形状为 (no_points, 2) 的零矩阵
    X = np.zeros(shape=(no_points, 2))
    # 创建一个长度为 no_points 的零向量
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii, 0] = random.randint(0,20)
        X[ii, 1] = random.randint(0,20)
        if X[ii, 0]+X[ii, 1] &gt; 20:
            Y[ii] = 1 
        else:
            Y[ii] = -1
    return X, Y
</code></pre>
<h3 id="33-class">3.3 类(Class)</h3><pre class="language-py lang-py"><code class="language-py lang-py">class Person():
  def __init__(self, name, age):
    self.name = name
    self.age = age
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">p1 = Person(&quot;John&quot;, 36)

print(p1.name)
print(p1.age)
&#x27;&#x27;&#x27;
John
36
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">class Person():
    
  def __init__(self, name, age):
    self.name = name
    self.age = age

  def myfunc(self):
    print(&quot;Hello my name is &quot; + self.name)

p1 = Person(&quot;John&quot;, 36)
p1.myfunc()

&#x27;&#x27;&#x27;
Hello my name is John
&#x27;&#x27;&#x27;
</code></pre>
<h3 id="34-perceptron-algorithm">3.4 感知器逻辑(Perceptron Algorithm)</h3><h4 id="341-perceptron">3.4.1 感知器(Perceptron)</h4><p>$$</p><pre class=""><code class="">\mathbf{x}\mapsto \text{sgn}(\mathbf{w}^\top\mathbf{x}+b)</code></pre><p>$$</p><h4 id="342-perceptron-algorithm">3.4.2 感知器逻辑(Perceptron Algorithm)</h4><p>$$
y(b+y+(\mathbf{w}+y\mathbf{x})^\top\mathbf{x})=yb+y\mathbf{w}^\top\mathbf{x}+y^2+y^2\mathbf{x}^\top\mathbf{x}&gt;
y(b+\mathbf{w}^\top\mathbf{x}).
$$</p><pre class="language-py lang-py"><code class="language-py lang-py">class Perceptron():
    &quot;&quot;&quot;
    Class for performing Perceptron.
    X is the input array with n rows (no_examples) and d columns (no_features)
    Y is a vector containing elements which indicate the class 
        (1 for positive class, -1 for negative class)
    w is the weight vector (d dimensional vector)
    b is the bias value
    &quot;&quot;&quot;
    def __init__(self, b = 0, max_iter = 1000):
        # 最大迭代次数
        self.max_iter = max_iter
        # 权重
        self.w = []
        # 截距/偏置
        self.b = b
        self.no_examples = 0
        self.no_features = 0
    
    def train(self, X, Y):
        &#x27;&#x27;&#x27;
        This function applies the perceptron algorithm to train a model w based on X and Y.
        It changes both w and b of the class.
        &#x27;&#x27;&#x27;
        # we set the number of examples and the number of features according to the matrix X
        self.no_examples, self.no_features = np.shape(X)  
        # we initialize the weight vector as the zero vector
        self.w = np.zeros(self.no_features)
        
        # we only run a limited number of iterations
        for ii in range(0, self.max_iter):
            # at the beginning of each iteration, we set w_updated to False (meaning we have not yet found a misclassified example)
            w_updated = False
            # we traverse all the training examples
            for jj in range(0, self.no_examples):
                # we compute the predicted value and assign it to the variable a
                a = self.b + np.dot(self.w, X[jj])
                # if we find a misclassified example
                if Y[jj] * a &lt;= 0:
                    # we set w_updated = true as we have found a misclassified example at this iteration
                    w_updated = True
                    # we now update w and b
                    self.w += Y[jj] * X[jj]
                    self.b += Y[jj]
            # if we do not find any misclassified example, we can return the model
            if not w_updated:
                print(&quot;Convergence reached in %i iterations.&quot; % ii)
                break
        # after finishing the iterations we can still find a misclassified example
        if w_updated:
            print(
            &quot;&quot;&quot;
            WARNING: convergence not reached in %i iterations.
            Either dataset is not linearly separable, 
            or max_iter should be increased
            &quot;&quot;&quot; % self.max_iter
                )
    def classify_element(self, x_elem):
        &#x27;&#x27;&#x27;
        This function returns the predicted label of the perceptron on an input x_elem
        Input:
            x_elem: an input feature vector
        Output:
            return the predicted label of the model (indicated by w and b) on x_elem
        &#x27;&#x27;&#x27;
        return np.sign(self.b + np.dot(self.w, x_elem))
    
    # To do: insert your code to complete the definition of the function classify a data matrix (n examples)
    def classify(self, X):
        &#x27;&#x27;&#x27;
        This function returns the predicted labels of the perceptron on an input matrix X
        Input:
            X: a data matrix with n rows (no_examples) and d columns (no_features)
        Output:
            return the vector. i-th entry is the predicted label on the i-th example
        &#x27;&#x27;&#x27;
#        predicted_Y = []
#        for ii in range(np.shape(X)[0]):
#            # we predict the label and add the label to the output vector
#            y_elem = self.classify_element(X[ii])
#            predicted_Y.append(y_elem)
#        # we return the output vector
        
        # vectorization
        out = np.dot(X, self.w)
        predicted_Y = np.sign(out + self.b)
        return predicted_Y
</code></pre>
<h3 id="35-experiments">3.5 实验(Experiments)</h3><h4 id="351-data-generation">3.5.1 数据生成(Data Generation)</h4><pre class="language-py lang-py"><code class="language-py lang-py">X, Y = generate_data(100)
</code></pre>
<h4 id="352-visualization-of-the-dataset">3.5.2 数据集的可视化(Visualization of the dataset)</h4><pre class="language-py lang-py"><code class="language-py lang-py">idx_pos = [i for i in np.arange(100) if Y[i]==1]
idx_neg = [i for i in np.arange(100) if Y[i]==-1]
# make a scatter plot
plt.scatter(X[idx_pos, 0], X[idx_pos, 1], color=&#x27;blue&#x27;)
plt.scatter(X[idx_neg, 0], X[idx_neg, 1], color=&#x27;red&#x27;)
plt.show()
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/xp9dknv5kvlaox8fp3.png" height="248" width="378"/></p><h4 id="353-train">3.5.3 训练(Train)</h4><pre class="language-py lang-py"><code class="language-py lang-py"># Create an instance p
p = Perceptron()
# applies the train algorithm to (X,Y) and sets the weight vector and bias
p.train(X, Y)
predicted_Y = p.classify(X)
acc_tr = accuracy_score(predicted_Y, Y)
print(acc_tr)
</code></pre>
<h4 id="354-test">3.5.4 测试(Test)</h4><pre class="language-py lang-py"><code class="language-py lang-py"># we first generate a new dataset
X_test, Y_test = generate_data(100)
predicted_Y_test = p.classify(X_test)
acc = accuracy_score(Y_test, predicted_Y_test)
print(acc)
</code></pre>
<h4 id="355-visulization-of-the-perceptron">3.5.5 感知器的可视化(Visulization of the perceptron)</h4><pre class="language-py lang-py"><code class="language-py lang-py"># we get an array of the first feature
x1 = np.arange(0, 20, 0.1)
# bias
b = p.b
# weight vector
w = p.w
# we now use list comprehension to generate the array of the second feature

x2 = [(-b-w[0]*x)/w[1] for x in x1]
plt.scatter(X[idx_pos, 0], X[idx_pos, 1], color=&#x27;blue&#x27;)
plt.scatter(X[idx_neg, 0], X[idx_neg, 1], color=&#x27;red&#x27;)
# plot the hyperplane corresponding to the perceptron
plt.plot(x1, x2, color=&quot;black&quot;, linewidth=2.5, linestyle=&quot;-&quot;)
plt.show()
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/vl37pce8fx6ku76wag.png" height="248" width="378"/></p><h2 id="4-convolutional-neural-network">4. 卷积神经网络(Convolutional Neural Network)</h2><h3 id="41-training-an-image-classifier">4.1 训练一个图像分类器(Training an image classifier)</h3><ol start="1"><li>加载并且用<code>torchvision</code>标准化 CIFAR10 训练和测试数据集</li><li>定义一个卷积神经网络</li><li>定义一个损失函数</li><li>在训练集上训练网络</li><li>在测试集上测试网络</li></ol><h3 id="42-cifar10load-and-normalize-cifar10">4.2 加载并且标准化CIFAR10(Load and normalize CIFAR10)</h3><pre class="language-py lang-py"><code class="language-py lang-py">import torch
import torchvision
import torchvision.transforms as transforms

device = torch.device(&quot;cuda:0&quot; if torch.cuda.is_available() else &quot;cpu&quot;)
print(f&quot;The current device is {device}&quot;)

transform = transforms.Compose(
[transforms.ToTensor(),
 transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root=&#x27;./data&#x27;, train=True,
                                            download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root=&#x27;./data&#x27;, train=False,
                                           download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=2)

classes = (&#x27;plane&#x27;, &#x27;car&#x27;, &#x27;bird&#x27;, &#x27;cat&#x27;,
           &#x27;deer&#x27;, &#x27;dog&#x27;, &#x27;frog&#x27;, &#x27;horse&#x27;, &#x27;ship&#x27;, &#x27;truck&#x27;)
</code></pre><blockquote>
<p>If running on Windows and you get a BrokenPipeError, try setting the num_workers argument of torch.utils.data.DataLoader() to 0</p></blockquote>
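<p>例如，仅在遇到上述问题时，可以把两个 DataLoader 的 num_workers 改为 0（即在主进程中加载数据）：</p><pre class="language-py lang-py"><code class="language-py lang-py"># Windows 下如遇 BrokenPipeError，可将 num_workers 设为 0
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=0)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=0)
</code></pre>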
<p>显示部分训练图像</p><pre class="language-py lang-py"><code class="language-py lang-py">import matplotlib.pyplot as plt
import numpy as np

# functions to show an image


def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(&#x27; &#x27;.join(&#x27;%5s&#x27; % classes[labels[j]] for j in range(batch_size)))
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/mxl1jd16omm6elgc74.png" height="176" width="543"/></p><h3 id="43-define-a-convolutional-neural-network">4.3 定义一个卷积神经网络(Define a Convolutional Neural Network)</h3><pre class="language-py lang-py"><code class="language-py lang-py">import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # 从3个输入通道到6个输出通道，使用5x5的卷积核
        self.conv1 = nn.Conv2d(3, 6, 5)
        # 使用2x2的窗口大小并且步长为2
        self.pool = nn.MaxPool2d(2, 2)
        # 从6个输入通道到16个输出通道，也使用5x5的卷积核
        self.conv2 = nn.Conv2d(6, 16, 5)
        # 全连接层：将展平后的 16*5*5=400 维特征映射到 120 维
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # ReLU激活函数
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net().to(device)
</code></pre>
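<p>可以用一个随机张量快速检查网络的输出尺寸，帮助理解 <code>fc1</code> 的输入维度为什么是 16*5*5（下面是一段示意性的检查代码）：</p><pre class="language-py lang-py"><code class="language-py lang-py"># 用一个假想的 batch（4 张 3x32x32 的图像）检查前向传播的输出形状
dummy = torch.randn(4, 3, 32, 32).to(device)
with torch.no_grad():
    out = net(dummy)
print(out.shape)  # 预期: torch.Size([4, 10])
# 32x32 经 conv1(5x5) 变为 28x28，池化后 14x14；conv2(5x5) 变为 10x10，池化后 5x5
# 因此展平后的特征维度为 16 * 5 * 5 = 400
</code></pre>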
<h3 id="44-define-a-loss-function-and-optimizer">4.4 定义一个损失函数与优化器(Define a Loss function and optimizer)</h3><pre class="language-py lang-py"><code class="language-py lang-py"># 导入pytorch提供的优化算法
import torch.optim as optim

# 定义了交叉熵损失函数
criterion = nn.CrossEntropyLoss()
# 使用随机梯度下降（SGD）作为优化算法， 设置学习率为0.001， 设置动量为0.9。动量在SGD中用于加速训练并避免陷入局部最小值。
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
</code></pre>
<h3 id="45-train-the-network">4.5 训练网络(Train the network)</h3><pre class="language-py lang-py"><code class="language-py lang-py"># 遍历整个数据集两次
for epoch in range(2): 

    # 用于累积每个批次的损失，以便后续打印平均损失
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels] and move them to
        #the current device
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)

        # 在每次训练步骤之前，将模型中所有参数的梯度设置为零
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics - epoch and loss
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(&#x27;[%d, %5d] loss: %.3f&#x27; %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print(&#x27;Finished Training&#x27;)

PATH = &#x27;./cifar_net.pth&#x27;
torch.save(net.state_dict(), PATH)
</code></pre>
<h3 id="46-test-the-network-on-the-test-data">4.6 在测试数据上测试网络(Test the network on the test data)</h3><p>选择一批测试数据并显示</p><pre class="language-py lang-py"><code class="language-py lang-py">dataiter = iter(testloader)
images, labels = next(dataiter) #Selects a mini-batch and its labels

# print images
imshow(torchvision.utils.make_grid(images))
print(&#x27;GroundTruth: &#x27;, &#x27; &#x27;.join(&#x27;%5s&#x27; % classes[labels[j]] for j in range(4)))
</code></pre>
<p>加载预先保存的模型参数</p><pre class="language-py lang-py"><code class="language-py lang-py">net = Net()
net.load_state_dict(torch.load(PATH))
</code></pre>
<p>在单批数据上预测</p><pre class="language-py lang-py"><code class="language-py lang-py">images = images.to(device)
labels = labels.to(device)
net = net.to(device)
outputs = net(images)
</code></pre>
<p>获取预测结果：</p><pre class="language-py lang-py"><code class="language-py lang-py">_, predicted = torch.max(outputs, 1) #Returns a tuple (max,max indicies), we only need the max indicies.

print(&#x27;Predicted: &#x27;, &#x27; &#x27;.join(&#x27;%5s&#x27; % classes[predicted[j]]
                              for j in range(4)))
</code></pre>
<p>评估整体测试集的准确性：</p><pre class="language-py lang-py"><code class="language-py lang-py">correct = 0
total = 0
# since we&#x27;re not training, we don&#x27;t need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)

        # calculate outputs by running images through the network 
        outputs = net(images)

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(&#x27;Accuracy of the network on the 10000 test images: %d %%&#x27; % (
    100 * correct / total))
</code></pre>
<p>评估每个类的准确性：</p><pre class="language-py lang-py"><code class="language-py lang-py"># prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)

        outputs = net(images)    
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

    
# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print(&quot;Accuracy for class {:5s} is: {:.1f} %&quot;.format(classname, 
                                                   accuracy))
</code></pre>
<h2 id="5-autoencoders">5. 自动编码器(AutoEncoders)</h2><h3 id="51-mnistloading-and-refreshing-mnist">5.1 加载和刷新MNIST(Loading and refreshing MNIST)</h3><pre class="language-py lang-py"><code class="language-py lang-py"># -*- coding: utf-8 -*- 该文件使用UTF-8编码
# The below is for auto-reloading external modules after they are changed, such as those in ./utils.
# Issue: https://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython

# Jupyter Notebook的特定命令，用于自动重新加载外部模块
%load_ext autoreload
%autoreload 2

# `numpy`库用于数组操作，`get_mnist`函数用于获取MNIST数据集
import numpy as np
from utils.data_utils import get_mnist # Helper function. Use it out of the box.

# 常量定义: 数据的存储位置和一个随机种子
DATA_DIR = &#x27;./data/mnist&#x27; # Location we will keep the data.
SEED = 111111

# 使用get_mnist函数从指定的目录加载训练和测试数据。如果数据不在指定位置，它们将被下载
train_imgs, train_lbls = get_mnist(data_dir=DATA_DIR, train=True, download=True)
test_imgs, test_lbls = get_mnist(data_dir=DATA_DIR, train=False, download=True)

# 输出训练和测试数据的相关信息，如其类型、形状、数据类型以及类标签
print(&quot;[train_imgs] Type: &quot;, type(train_imgs), &quot;|| Shape:&quot;, train_imgs.shape, &quot;|| Data type: &quot;, train_imgs.dtype )
print(&quot;[train_lbls] Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot;|| Data type: &quot;, train_lbls.dtype )
print(&#x27;Class labels in train = &#x27;, np.unique(train_lbls))

print(&quot;[test_imgs] Type: &quot;, type(test_imgs), &quot;|| Shape:&quot;, test_imgs.shape, &quot; || Data type: &quot;, test_imgs.dtype )
print(&quot;[test_lbls] Type: &quot;, type(test_lbls), &quot;|| Shape:&quot;, test_lbls.shape, &quot; || Data type: &quot;, test_lbls.dtype )
print(&#x27;Class labels in test = &#x27;, np.unique(test_lbls))

# 定义了一些与数据集相关的其他常量，如训练图像的数量、图像的高度、图像的宽度和类别的数量
N_tr_imgs = train_imgs.shape[0] # N hereafter. Number of training images in database.
H_height = train_imgs.shape[1] # H hereafter
W_width = train_imgs.shape[2] # W hereafter
C_classes = len(np.unique(train_lbls)) # C hereafter

&#x27;&#x27;&#x27;
[train_imgs] Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (60000, 28, 28) || Data type:  uint8
[train_lbls] Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (60000,) || Data type:  int16
Class labels in train =  [0 1 2 3 4 5 6 7 8 9]
[test_imgs] Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (10000, 28, 28)  || Data type:  uint8
[test_lbls] Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (10000,)  || Data type:  int16
Class labels in test =  [0 1 2 3 4 5 6 7 8 9]
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Jupyter Notebook特定的命令，保证matplotlib库生成的图像都直接在Notebook内显示
%matplotlib inline

# 导入库，将多个图像绘制在一个网格上
from utils.plotting import plot_grid_of_images # Helper functions, use out of the box.

# 绘制了train_imgs中的前100个图像。图像被组织成一个10x10的网格，每行显示10个图像
plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/l9kkib1g83od6r9zln.png" height="423" width="515"/></p><h3 id="52-data-pre-processing">5.2 数据预处理(Data pre-processing)</h3><h4 id="521--c10--one-hot-change-representation-of-labels-to-one-hot-vectors-of-length-c10">5.2.1 将标签的表示更改为长度 C=10 的 one-hot 向量(Change representation of labels to one-hot vectors of length C=10)</h4><pre class="language-py lang-py"><code class="language-py lang-py"># 为训练和测试标签初始化一个全为0的矩阵。每个标签都将在对应的独热编码向量中有一个值为1的元素
# 对于每个训练标签，我们找到其对应的独热编码向量中应该为1的位置，并将该位置的值设置为1
train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )
train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1
test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )
test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1

# 打印了转换前后标签的类型、形状和数据类型
print(&quot;BEFORE: [train_lbls]        Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot; || Data type: &quot;, train_lbls.dtype )
print(&quot;AFTER : [train_lbls_onehot] Type: &quot;, type(train_lbls_onehot), &quot;|| Shape:&quot;, train_lbls_onehot.shape, &quot; || Data type: &quot;, train_lbls_onehot.dtype )

&#x27;&#x27;&#x27;
BEFORE: [train_lbls]        Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (60000,)  || Data type:  int16
AFTER : [train_lbls_onehot] Type:  &lt;class &#x27;numpy.ndarray&#x27;&gt; || Shape: (60000, 10)  || Data type:  float64
&#x27;&#x27;&#x27;
</code></pre>
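<p>For reference, the same one-hot encoding can be written more compactly with <code>np.eye</code> indexing; a minimal equivalent sketch (using the arrays defined above):</p><pre class="language-py lang-py"><code class="language-py lang-py"># Row i of the CxC identity matrix is exactly the one-hot vector for class i,
# so indexing np.eye with the integer labels builds the one-hot matrix in one step.
train_lbls_onehot_alt = np.eye(C_classes)[train_lbls]
assert np.array_equal(train_lbls_onehot_alt, train_lbls_onehot)
</code></pre>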
<h4 id="522--0255---1-1re-scale-image-intensities-from-0255-to--1-1">5.2.2 Re-scale image intensities, from [0,255] to [-1, +1]</h4><pre class="language-py lang-py"><code class="language-py lang-py"># This commonly facilitates learning:
# A zero-centered signal with small magnitude allows avoiding exploding/vanishing problems easier.
from utils.data_utils import normalize_int_whole_database # Helper function. Use out of the box.
# Image intensities are normalized to the range [-1, +1]
train_imgs = normalize_int_whole_database(train_imgs, norm_type=&quot;minus_1_to_1&quot;)
test_imgs = normalize_int_whole_database(test_imgs, norm_type=&quot;minus_1_to_1&quot;)

# Lets plot one image.
from utils.plotting import plot_image # Helper function, use out of the box.
index = 0  # Try any, up to 60000
print(&quot;Plotting image of index: [&quot;, index, &quot;]&quot;)
print(&quot;Class label for this image is: &quot;, train_lbls[index])
print(&quot;One-hot label representation: [&quot;, train_lbls_onehot[index], &quot;]&quot;)
plot_image(train_imgs[index])
# Notice the magnitude of intensities. Black is now negative and white is positive float.
# Compare with intensities of figure further above.

&#x27;&#x27;&#x27;
Plotting image of index: [ 0 ]
Class label for this image is:  5
One-hot label representation: [ [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] ]
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/8efnlcv2n5aydoqgbs.png" height="417" width="524"/></p><h4 id="523--2d--1d--mlp--2d-flatten-the-images-from-2d-matrices-to-1d-vectors-mlps-take-feature-vectors-as-input-not-2d-images">5.2.3 Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images</h4><pre class="language-py lang-py"><code class="language-py lang-py"># Flatten the image data:
#     The pixels of each image are flattened into a single 1D vector. This is typically required before feeding images into a fully-connected network, because fully-connected layers expect 1D input vectors.
train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (S = num Samples), flatten others.
test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])
print(&quot;Shape of numpy array holding the training database:&quot;)
print(&quot;Original : [N, H, W] = [&quot;, train_imgs.shape , &quot;]&quot;)
print(&quot;Flattened: [N, H*W]  = [&quot;, train_imgs_flat.shape , &quot;]&quot;)

&#x27;&#x27;&#x27;
Shape of numpy array holding the training database:
Original : [N, H, W] = [ (60000, 28, 28) ]
Flattened: [N, H*W]  = [ (60000, 784) ]
&#x27;&#x27;&#x27;
</code></pre>
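<p>Flattening only changes the shape of the array, not the pixel values, so the original 2D image layout can always be recovered with another reshape; the training code below does exactly this before plotting reconstructions. A minimal sketch:</p><pre class="language-py lang-py"><code class="language-py lang-py"># Reshape the flattened vectors back into [N, H, W] images; nothing is lost by flattening.
train_imgs_back = train_imgs_flat.reshape([-1, H_height, W_width])
assert train_imgs_back.shape == train_imgs.shape
</code></pre>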
<h3 id="53-aesgdunsupervised-training-with-sgd-for-auto-encoders">5.3 Unsupervised training with SGD for Auto-Encoders</h3><pre class="language-py lang-py"><code class="language-py lang-py">from utils.plotting import plot_train_progress_1, plot_grids_of_images  # Use out of the box

# Sample a random batch of images from the training data
def get_random_batch(train_imgs, train_lbls, batch_size, rng):
    # train_imgs: Images. Numpy array of shape [N, H, W]
    # train_lbls: Labels of images. None, or Numpy array of shape [N, C_classes], one hot label for each image.
    # batch_size: integer. Size that the batch should have.
    
    ####### Sample a random batch of images for training (the fill-in-the-blanks exercise, completed) #########
    indices = rng.randint(low=0, high=train_imgs.shape[0], size=batch_size, dtype=&#x27;int32&#x27;)
    ##############################################################################################
    
    train_imgs_batch = train_imgs[indices]
    if train_lbls is not None:  # Enables function to be used both for supervised and unsupervised learning
        train_lbls_batch = train_lbls[indices]
    else:
        train_lbls_batch = None
    return [train_imgs_batch, train_lbls_batch]

def unsupervised_training_AE(net,
                             loss_func,
                             rng,
                             train_imgs_all,
                             batch_size,
                             learning_rate,
                             total_iters,
                             iters_per_recon_plot=-1):
    # net: Instance of a model (the auto-encoder to train). See classes: Autoencoder, MLPClassifier, etc further below
    # loss_func: Function that computes the loss. See functions: reconstruction_loss or cross_entropy.
    # rng: numpy random number generator
    # train_imgs_all: All the training images. Numpy array, shape [N_tr, H, W]
    # batch_size: Size of the batch that should be processed per SGD iteration by the model.
    # learning_rate: Learning rate of the optimizer.
    # total_iters: How many SGD iterations to perform.
    # iters_per_recon_plot: Integer. Every that many iterations the model predicts the training images ...
    #                      ... and we plot their reconstructions, for visual observation. Default -1 means no plotting.

    # Initialize an empty list to store the loss values during training.
    loss_values_to_plot = []
    
    # Create an Adam optimizer that will update the parameters of net.
    optimizer = optim.Adam(net.params, lr=learning_rate)  # Will use PyTorch&#x27;s Adam optimizer out of the box
    
    # Stochastic Gradient Descent (SGD) loop
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        x_imgs, _ = get_random_batch(train_imgs_all, None, batch_size, rng)
        
        # Forward pass: run the network to obtain the reconstructed images and the encoded latent representations.
        x_pred, z_codes = net.forward_pass(x_imgs)

        # Compute loss: the loss between the reconstructed and the original images.
        loss = loss_func(x_pred, x_imgs)
        
        # Pytorch way
        # Zero the accumulated gradients before each update.
        optimizer.zero_grad()
        # Backward pass: compute the gradients of the loss w.r.t. the network parameters.
        _ = net.backward_pass(loss)
        # Apply the gradients to update the network parameters.
        optimizer.step()
        
        # ==== Report training loss and accuracy ======
        loss_np = loss if isinstance(loss, float) else loss.item()  # Pytorch returns a tensor. Cast to float.
        print(&quot;[iter:&quot;, t, &quot;]: Training Loss: {0:.2f}&quot;.format(loss_np))
        loss_values_to_plot.append(loss_np)
        
        # =============== Every few iterations, show reconstructions ================#
        if t==total_iters-1 or t%iters_per_recon_plot == 0:
            # Reconstruct all images, to plot reconstructions.
            x_pred_all, z_codes_all = net.forward_pass(train_imgs_all)
            # Cast tensors to numpy arrays
            x_pred_all_np = x_pred_all if type(x_pred_all) is np.ndarray else x_pred_all.detach().numpy()
            
            # Predicted reconstructions have vector shape. Reshape them to original image shape.
            train_imgs_resh = train_imgs_all.reshape([train_imgs_all.shape[0], H_height, W_width])
            x_pred_all_np_resh = x_pred_all_np.reshape([train_imgs_all.shape[0], H_height, W_width])
            
            # Plot a few images, originals and predicted reconstructions.
            plot_grids_of_images([train_imgs_resh[0:100], x_pred_all_np_resh[0:100]],
                                  titles=[&quot;Real&quot;, &quot;Reconstructions&quot;],
                                  n_imgs_per_row=10,
                                  dynamically=True)
            
    # In the end of the process, plot loss.
    plot_train_progress_1(loss_values_to_plot, iters_per_point=1)
</code></pre>
<h3 id="54-auto-encoder">5.4 Auto-Encoder</h3><pre class="language-py lang-py"><code class="language-py lang-py"># -*- coding: utf-8 -*-
import torch
import torch.optim as optim
import torch.nn as nn

# Defines a basic network base class with back-propagation
class Network():
    
    def backward_pass(self, loss):
        # Performs back propagation and computes gradients
        # With PyTorch, we do not need to compute gradients analytically for parameters where requires_grad=True.
        # Calling loss.backward(), torch&#x27;s Autograd automatically computes grads of loss wrt each parameter p,...
        # ... and **puts them in p.grad**. Return them in a list.
        loss.backward()
        grads = [param.grad for param in self.params]
        return grads

# Defines a four-layer auto-encoder: input layer, encoder hidden layer, bottleneck, decoder hidden layer and output layer
class Autoencoder(Network):
    def __init__(self, rng, D_in, D_hid_enc, D_bottleneck, D_hid_dec):
        # Construct and initialize network parameters
        D_in = D_in # Dimension of input feature-vectors. Length of a vectorised image.
        D_hid_1 = D_hid_enc # Dimension of Encoder&#x27;s hidden layer
        D_hid_2 = D_bottleneck
        D_hid_3 = D_hid_dec  # Dimension of Decoder&#x27;s hidden layer
        D_out = D_in # Dimension of Output layer.
        
        self.D_bottleneck = D_bottleneck  # Keep track of it, we will need it.
        
        ##### TODO: Initialize the Auto-Encoder&#x27;s parameters. Also see forward_pass(...)) ########
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        # Initialize the weight values from a normal random distribution
        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        w2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w3_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_hid_3))
        w4_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_3+1, D_out))
        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the Pytorch weights.
        # The weights are converted to torch tensors
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=True)
        w3 = torch.tensor(w3_init, dtype=torch.float, requires_grad=True)
        w4 = torch.tensor(w4_init, dtype=torch.float, requires_grad=True)
        # Keep track of all trainable parameters:
        self.params = [w1, w2, w3, w4]
        ###########################################################################
        
    # Defines how the input is processed through the layers of the auto-encoder. ReLU activations (applied to the pre-activations) are used for the encoder and decoder hidden layers, and tanh for the output layer.
    def forward_pass(self, batch_imgs):
        # Get parameters
        [w1, w2, w3, w4] = self.params
        
        # Convert the input data to a torch tensor
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float)  # Makes the numpy array a pytorch tensor.
        
        # Add the bias unit
        unary_feature_for_bias = torch.ones(size=(batch_imgs.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        #### TODO: Implement the operations at each layer #####
        # Layer 1
        h1_preact = x.mm(w1) # Compute pre-activations
        h1_act = h1_preact.clamp(min=0) # Apply ReLU activation
        # Layer 2 (bottleneck): 
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1) # Append the bias unit to hidden layer 1
        h2_preact = h1_ext.mm(w2) # Compute pre-activations
        h2_act = h2_preact.clamp(min=0) # Apply ReLU activation
        # Layer 3:
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1) # Append the bias unit to hidden layer 2
        h3_preact = h2_ext.mm(w3) # Compute pre-activations
        h3_act = h3_preact.clamp(min=0) # Apply ReLU activation
        # Layer 4 (output):
        h3_ext = torch.cat((h3_act, unary_feature_for_bias), dim=1) # Append the bias unit to hidden layer 3
        h4_preact = h3_ext.mm(w4) # Compute pre-activations
        h4_act = torch.tanh(h4_preact) # Apply tanh activation
        # Output layer
        x_pred = h4_act
        #######################################################
        
        ### TODO: Get bottleneck&#x27;s activations ######
        # Bottleneck activations
        acts_bottleneck = h2_act
        #############################################
                
        return (x_pred, acts_bottleneck)
        
# Computes the mean squared error between the reconstructed and the original images. This loss is used during training to adjust the network weights
def reconstruction_loss(x_pred, x_real, eps=1e-7):
    # Mean squared reconstruction error.
    # x_pred: [N, D_out] Prediction returned by forward_pass. Numpy array of shape [N, D_out]
    # x_real: [N, D_in]
    
    # If a numpy array is given, change it to a Torch tensor.
    x_pred = torch.tensor(x_pred, dtype=torch.float) if type(x_pred) is np.ndarray else x_pred
    x_real = torch.tensor(x_real, dtype=torch.float) if type(x_real) is np.ndarray else x_real
    
    ######## TODO: Complete the calculation of Reconstruction loss for each sample ###########
    loss_recon = torch.mean(torch.square(x_pred - x_real), dim=1)
    # NOTE: Notice a difference from theory in Lecture: In implementations, we often calculate...
    # the *mean* square error over output&#x27;s dimensions, rather than the *sum* as often shown in theory.
    # This makes the loss independent of the dimensionality of the input/output, so it can be used
    # without any change for different architectures and image sizes.
    # Otherwise we&#x27;d have to adapt the Learning rate whenever we use a network for different image sizes
    # to account for the change of the loss&#x27;s scale.
    ##########################################################################################
    
    cost = torch.mean(loss_recon, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return cost


# Create the network
rng = np.random.RandomState(seed=SEED)
autoencoder_thin = Autoencoder(rng=rng,
                               D_in=H_height*W_width,
                               D_hid_enc=256,
                               D_bottleneck=2,
                               D_hid_dec=256)
# Start training
# Run the training loop of the auto-encoder. The process:
#       Sample a random batch of images.
#       Process the images with the auto-encoder to obtain the reconstructed outputs.
#       Compute the loss between the reconstructions and the original images.
#       Back-propagate to update the network weights.
#       Optionally, plot the reconstructed images every few iterations for visualization.
unsupervised_training_AE(autoencoder_thin,
                         reconstruction_loss,
                         rng,
                         train_imgs_flat,
                         batch_size=40,
                         learning_rate=3e-3,
                         total_iters=1000,
                         iters_per_recon_plot=50)

&#x27;&#x27;&#x27;
[iter: 0 ]: Training Loss: 0.94
[iter: 1 ]: Training Loss: 0.92
[iter: 2 ]: Training Loss: 0.88
[iter: 3 ]: Training Loss: 0.78
[iter: 4 ]: Training Loss: 0.60
[iter: 5 ]: Training Loss: 0.42
[iter: 6 ]: Training Loss: 0.30
[iter: 7 ]: Training Loss: 0.33
[iter: 8 ]: Training Loss: 0.31
[iter: 9 ]: Training Loss: 0.32
[iter: 10 ]: Training Loss: 0.32
[iter: 11 ]: Training Loss: 0.28
[iter: 12 ]: Training Loss: 0.32
[iter: 13 ]: Training Loss: 0.32
[iter: 14 ]: Training Loss: 0.32
[iter: 15 ]: Training Loss: 0.31
[iter: 16 ]: Training Loss: 0.32
[iter: 17 ]: Training Loss: 0.34
[iter: 18 ]: Training Loss: 0.33
[iter: 19 ]: Training Loss: 0.32
[iter: 20 ]: Training Loss: 0.29
[iter: 21 ]: Training Loss: 0.30
[iter: 22 ]: Training Loss: 0.31
[iter: 23 ]: Training Loss: 0.33
[iter: 24 ]: Training Loss: 0.32
[iter: 25 ]: Training Loss: 0.29
...
[iter: 47 ]: Training Loss: 0.26
[iter: 48 ]: Training Loss: 0.27
[iter: 49 ]: Training Loss: 0.28
[iter: 50 ]: Training Loss: 0.27
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/r0m0gdqvxhgqpeh1mt.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/drjbig3y4jrio50czx.png" height="464" width="552"/></p><h3 id="55-encode-all-training-samples-in-the-latent-bottleneck-representation">5.5 Encode all training samples in the latent (bottleneck) representation</h3><pre class="language-py lang-py"><code class="language-py lang-py">import matplotlib.pyplot as plt

# Arguments: a network, a set of flattened images, their labels, the batch size, the total number of iterations, and a boolean deciding whether to plot the 2D embedding
def encode_and_get_min_max_z(net,
                             imgs_flat,
                             lbls,
                             batch_size,
                             total_iterations=None,
                             plot_2d_embedding=True):
    # This function encodes images, plots the first 2 dimensions of the codes in a plot, and finally...
    # ... returns the minimum and maximum values of the codes for each dimensions of Z.
    # ... We will use this at a later task.
    # Arguments:
    # imgs_flat: Numpy array of shape [Number of images, H * W]
    # lbls: Numpy array of shape [number of images], with 1 integer per image. The integer is the class (digit).
    # total_iterations: How many batches to encode. We will use this so that we don&#x27;t encode and plot ...
    # ... the whole training database, because the plot would get cluttered with 60000 points.
    # Returns:
    # min_z: numpy array, vector with [dimensions-of-z] elements. Minimum value per dimension of z.
    # max_z: numpy array, vector with [dimensions-of-z] elements. Maximum value per dimension of z.
    
    # If total iterations is None, the function will just iterate over all data, by breaking them into batches.    
    if total_iterations is None:
        total_iterations = (imgs_flat.shape[0] - 1) // batch_size + 1  # Use the images passed to this function, not the globals.
    
    z_codes_all = []
    lbls_all = []
    for t in range(total_iterations):
        # Sample batch for this SGD iteration
        x_batch = imgs_flat[t*batch_size: (t+1)*batch_size]
        lbls_batch = lbls[t*batch_size: (t+1)*batch_size]
        
        # Forward pass: obtain the predicted (reconstructed) images and the z codes
        x_pred, z_codes = net.forward_pass(x_batch)

        # If the codes are not a numpy array, convert them to one
        z_codes_np = z_codes if type(z_codes) is np.ndarray else z_codes.detach().numpy()
        
        # Store the codes and the labels in lists
        z_codes_all.append(z_codes_np)  # List of np.arrays
        lbls_all.append(lbls_batch)
    
    z_codes_all = np.concatenate(z_codes_all)  # Make list of arrays in one array by concatenating along dim=0 (image index)
    lbls_all = np.concatenate(lbls_all)
    
    if plot_2d_embedding:
        # Plot the codes with different color per class in a scatter plot:
        plt.scatter(z_codes_all[:,0], z_codes_all[:,1], c=lbls_all, alpha=0.5)  # Plot the first 2 dimensions.
        plt.show()
    
    # Compute and return the min and max of the codes, for each dimension of z
    min_z = np.min(z_codes_all, axis=0)  # min and max for each dimension of z, over all samples.
    max_z = np.max(z_codes_all, axis=0)  # Numpy array (vector) of shape [number of z dimensions]
    
    return min_z, max_z


# Encode training samples, and get the min and max values of the z codes (for each dimension)
min_z, max_z = encode_and_get_min_max_z(autoencoder_thin,
                                        train_imgs_flat,
                                        train_lbls,
                                        batch_size=100,
                                        total_iterations=100)
print(&quot;Min Z value per dimension of bottleneck:&quot;, min_z)
print(&quot;Max Z value per dimension of bottleneck:&quot;, max_z)

&#x27;&#x27;&#x27;
Min Z value per dimension of bottleneck: [0. 0.]
Max Z value per dimension of bottleneck: [87.92656 64.17436]
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/v4vp6vppkwirvejqjx.png" height="417" width="546"/></p><h3 id="56-aetrain-an-auto-encoder-with-a-larger-bottleneck-layer">5.6 Train an Auto-Encoder with a larger bottleneck layer</h3><pre class="language-py lang-py"><code class="language-py lang-py"># The goal of this task is to examine how a wider auto-encoder trains and how well it performs
# The below is a copy paste from Task 2.

# Create the network
rng = np.random.RandomState(seed=SEED)
autoencoder_wide = Autoencoder(rng=rng,
                               D_in=H_height*W_width,
                               D_hid_enc=256,
                               D_bottleneck=32,
                               D_hid_dec=256)
# Start training
unsupervised_training_AE(autoencoder_wide,
                         reconstruction_loss,
                         rng,
                         train_imgs_flat,
                         batch_size=40,
                         learning_rate=3e-3,
                         total_iters=1000,
                         iters_per_recon_plot=50)

&#x27;&#x27;&#x27;
[iter: 968 ]: Training Loss: 0.10
[iter: 969 ]: Training Loss: 0.11
[iter: 970 ]: Training Loss: 0.10
[iter: 971 ]: Training Loss: 0.12
[iter: 972 ]: Training Loss: 0.11
[iter: 973 ]: Training Loss: 0.10
[iter: 974 ]: Training Loss: 0.12
[iter: 975 ]: Training Loss: 0.12
...
[iter: 996 ]: Training Loss: 0.11
[iter: 997 ]: Training Loss: 0.11
[iter: 998 ]: Training Loss: 0.11
[iter: 999 ]: Training Loss: 0.11
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/j5v82aic7wly97vnj4.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/3bddgs8nvbuseos98o.png" height="464" width="552"/></p><h3 id="57-is-basic-auto-encoder-appropriate-for-synthesizing-new-data">5.7 Is a basic Auto-Encoder appropriate for synthesizing new data?</h3><pre class="language-py lang-py"><code class="language-py lang-py">class Decoder():
    def __init__(self, pretrained_ae):
        ############ TODO: Fill in the gaps. The aim is: ... ############
        # ... to use the weights of the pre-trained AE&#x27;s decoder,... ####
        # ... to initialize this Decoder.                            ####
        # Reminder: pretrained_ae.params[LAYER] contains the params of the corresponding layer. See Task 2.

        # Extract the decoder weights from the pre-trained auto-encoder and turn them into Pytorch tensors
        w1 = torch.tensor(pretrained_ae.params[2], dtype=torch.float, requires_grad=False)
        w2 = torch.tensor(pretrained_ae.params[3], dtype=torch.float, requires_grad=False)
        self.params = [w1, w2]
        ###########################################################################
        
        
    def decode(self, z_batch):
        # Reconstruct a batch of images from a batch of z codes.
        # z_batch: Random codes. Numpy array of shape: [batch size, number of z dimensions]
        [w1, w2] = self.params
        
        z_batch_t = torch.tensor(z_batch, dtype=torch.float)  # Making a Pytorch tensor from Numpy array.
        # Adding an activation with value 1, for the bias. Similar to Task 2.
        unary_feature_for_bias = torch.ones(size=(z_batch_t.shape[0], 1)) # [N, 1] column vector.
        
        ##### TODO: Fill in the gaps, to REPLICATE the decoder of the AE from Task 4 #####
        # Hidden Layer of Decoder:
        z_batch_act_ext = torch.cat((z_batch_t, unary_feature_for_bias), dim=1) # Append the bias unit to the input codes
        h1_preact = z_batch_act_ext.mm(w1) # Compute pre-activations
        h1_act = h1_preact.clamp(min=0) # Apply ReLU activation
        # Output Layer:
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1) # Append the bias unit to hidden layer 1
        h2_preact = h1_ext.mm(w2) # Compute pre-activations
        h2_act = torch.tanh(h2_preact) # Apply tanh activation (same as the AE&#x27;s output layer)
        ##################################################################################
        # Output
        x_pred = h2_act
        
        return x_pred
        
# Lets instantiate this Decoder, using the pre-trained AE with 32-dims (&quot;wider&quot;) bottleneck:
net_decoder_pretrained = Decoder(autoencoder_wide)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Find the range of z values produced by the auto-encoder with the wider bottleneck
# NOTE: This function was implemented in Task 3. We simply call it again, but for a different AE, the wider.

# Encode training samples, and get the min and max values of the z codes (for each dimension)
min_z_wider, max_z_wider = encode_and_get_min_max_z(autoencoder_wide,
                                                    train_imgs_flat,
                                                    train_lbls,
                                                    batch_size=100,
                                                    total_iterations=None,  # So that it runs over all data.
                                                    plot_2d_embedding=False)  # Code is 32-Dims. Cant plot in 2D
print(&quot;Min Z value per dimension:&quot;, min_z_wider)
print(&quot;Max Z value per dimension:&quot;, max_z_wider)

&#x27;&#x27;&#x27;
Min Z value per dimension: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Max Z value per dimension: [ 0.        0.       42.286503  0.       41.441685  0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.       49.99855  39.896553 37.461414  0.       39.914013  0.
 35.367657  0.       41.54573  36.36666   0.        0.       43.863304
  0.        0.        0.       38.31358 ]
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">def synthesize(net_decoder,
               rng,
               z_min,
               z_max,
               n_samples):
    # net_decoder: decoder with pre-trained weights
    # z_min: numpy array (vector) of shape [dimensions-of-z]
    # z_max: numpy array (vector) of shape [dimensions-of-z]
    # n_samples: how many samples to produce.
    
    assert len(z_min.shape) == 1 and len(z_max.shape) == 1
    assert z_min.shape[0] == z_max.shape[0]
    
    z_dims = z_min.shape[0]  # Dimensionality of z codes (and input to decoder).
    
    # Generate random latent codes: sample z uniformly at random from [0, 1) per dimension
    z_samples = rng.random_sample([n_samples, z_dims])  # Use the given rng for reproducibility. Returns samples from uniform([0, 1))
    z_samples = z_samples * (z_max - z_min)  # Scales [0,1] range ==&gt; [0,(max-min)] range
    z_samples = z_samples + z_min  # Puts the [0,(max-min)] range ==&gt; [min, max] range
    
    # Use the pre-trained decoder network net_decoder to decode the z samples into x samples
    x_samples = net_decoder.decode(z_samples)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
       
    
# Lets finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

synthesize(net_decoder_pretrained,
           rng,
           min_z_wider,  # From further above
           max_z_wider,  # From further above
           n_samples=20)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/lo6phdefwzbxcqj2dw.png" height="417" width="524"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/1ffzbhdj5js4i1ngsl.png" height="417" width="524"/></p><h3 id="58--ae-learning-from-unlabelled-data-with-ae-to-complement-supervised-classifier-when-labelled-data-are-limited-lets-first-train-a-supervised-classifier-from-scratch">5.8 Learning from Unlabelled data with an AE, to complement a Supervised Classifier when Labelled data are limited: let&#x27;s first train a supervised Classifier &#x27;from scratch&#x27;</h3><pre class="language-py lang-py"><code class="language-py lang-py">class Classifier_3layers(Network):
    # Hidden layers use ReLU activations; the output layer uses a softmax to compute class probabilities
    def __init__(self, D_in, D_hid_1, D_hid_2, D_out, rng):
        D_in = D_in
        D_hid_1 = D_hid_1
        D_hid_2 = D_hid_2
        D_out = D_out
        
        # === NOTE: Notice that this is exactly the same architecture as encoder of AE in Task 4 ====
        w_1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        w_2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(w_1_init, dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(w_2_init, dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_1, w_2, w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_imgs_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # === NOTE: This is the same architecture as encoder of AE in Task 4, with extra classification layer ===
        # Layer 1
        h1_preact = x.mm(w_1)
        h1_act = h1_preact.clamp(min=0)
        # Layer 2 (corresponds to bottleneck of the AE):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w_2)
        h2_act = h2_preact.clamp(min=0)
        # Output classification layer
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        h_out = h2_ext.mm(w_out)
        
        logits = h_out
        
        # === Addition of a softmax activation function for the output layer ===
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) # Softmax turns the logits into class probabilities
        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
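        # NOTE (added): exp() of large logits can overflow. A numerically safer softmax subtracts the per-row max
        # from the logits first, e.g. exp_logits = torch.exp(logits - logits.max(dim=1, keepdim=True).values),
        # which leaves the resulting probabilities unchanged.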
        
        return y_pred

# Computes the cross-entropy loss between the predicted class probabilities and the true class labels
# A small epsilon (eps) is added inside the log for numerical stability
def cross_entropy(y_pred, y_real, eps=1e-7):
    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Numpy array of shape [N, D_out]
    # y_real: One-hot representation of real training labels. Same shape as y_pred.
    
    # If a numpy array is given, change it to a Torch tensor.
    y_pred = torch.tensor(y_pred, dtype=torch.float) if type(y_pred) is np.ndarray else y_pred
    y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real
    
    x_entr_per_sample = - torch.sum( y_real*torch.log(y_pred+eps), dim=1)  # Sum over classes, axis=1
    
    loss = torch.mean(x_entr_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return loss



from utils.plotting import plot_train_progress_2

def train_classifier(classifier,
                     pretrained_AE,
                     loss_func,
                     rng,
                     train_imgs,
                     train_lbls,
                     test_imgs,
                     test_lbls,
                     batch_size,
                     learning_rate,
                     total_iters,
                     iters_per_test=-1):
    # Arguments:
    # classifier: A classifier network. It will be trained by this function using labelled data.
    #             Its input will be either original data (if pretrained_AE=None), ...
    #             ... or the output of the feature extractor if one is given.
    # pretrained_AE: A pretrained AutoEncoder that will *not* be trained here.
    #      It will be used to encode input data.
    #      The classifier will take as input the output of this feature extractor.
    #      If pretrained_AE = None: The classifier will simply receive the actual data as input.
    # train_imgs: Vectorized training images
    # train_lbls: One hot labels
    # test_imgs: Vectorized testing images, to compute generalization accuracy.
    # test_lbls: One hot labels for test data.
    # batch_size: batch size
    # learning_rate: come on...
    # total_iters: how many SGD iterations to perform.
    # iters_per_test: We will &#x27;test&#x27; the model on test data every few iterations as specified by this.
    
    values_to_plot = {&#x27;loss&#x27;:[], &#x27;acc_train&#x27;: [], &#x27;acc_test&#x27;: []}
    
    optimizer = optim.Adam(classifier.params, lr=learning_rate)
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        # Randomly sample a batch of data
        train_imgs_batch, train_lbls_batch = get_random_batch(train_imgs, train_lbls, batch_size, rng)
        
        # Forward pass to obtain predictions
        if pretrained_AE is None:
            inp_to_classifier = train_imgs_batch
        else:
            _, z_codes = pretrained_AE.forward_pass(train_imgs_batch)  # AE encodes. Output will be given to Classifier
            inp_to_classifier = z_codes
            
        y_pred = classifier.forward_pass(inp_to_classifier)
        
        # Compute loss: use the cross-entropy function
        y_real = train_lbls_batch
        loss = loss_func(y_pred, y_real)  # Cross entropy
        
        # Backprop and updates: compute the gradients via back-propagation and apply them
        optimizer.zero_grad()
        grads = classifier.backward_pass(loss)
        optimizer.step()
        
        
        # ==== Report training loss and accuracy ======
        # y_pred and loss can be either np.array, or torch.tensor (see later). If tensor, make it np.array.
        y_pred_numpy = y_pred if type(y_pred) is np.ndarray else y_pred.detach().numpy()
        y_pred_lbls = np.argmax(y_pred_numpy, axis=1) # y_pred is soft/probability. Make it a hard one-hot label.
        y_real_lbls = np.argmax(y_real, axis=1)
        
        acc_train = np.mean(y_pred_lbls == y_real_lbls) * 100. # percentage
        
        loss_numpy = loss if isinstance(loss, float) else loss.item()  # Pytorch returns a tensor. Cast to float.
        print(&quot;[iter:&quot;, t, &quot;]: Training Loss: {0:.2f}&quot;.format(loss_numpy), &quot;\t Accuracy: {0:.2f}&quot;.format(acc_train))
        
        # =============== Every few iterations, show reconstructions ================#
        if t==total_iters-1 or t%iters_per_test == 0:
            if pretrained_AE is None:
                inp_to_classifier_test = test_imgs
            else:
                _, z_codes_test = pretrained_AE.forward_pass(test_imgs)
                inp_to_classifier_test = z_codes_test
                
            y_pred_test = classifier.forward_pass(inp_to_classifier_test)
            
            # ==== Report test accuracy ======
            y_pred_test_numpy = y_pred_test if type(y_pred_test) is np.ndarray else y_pred_test.detach().numpy()
            
            y_pred_lbls_test = np.argmax(y_pred_test_numpy, axis=1)
            y_real_lbls_test = np.argmax(test_lbls, axis=1)
            acc_test = np.mean(y_pred_lbls_test == y_real_lbls_test) * 100.
            print(&quot;\t\t\t\t\t\t\t\t Testing Accuracy: {0:.2f}&quot;.format(acc_test))
            
            # Keep list of metrics to plot progress.
            values_to_plot[&#x27;loss&#x27;].append(loss_numpy)
            values_to_plot[&#x27;acc_train&#x27;].append(acc_train)
            values_to_plot[&#x27;acc_test&#x27;].append(acc_test)
                
    # In the end of the process, plot loss accuracy on training and testing data.
    plot_train_progress_2(values_to_plot[&#x27;loss&#x27;], values_to_plot[&#x27;acc_train&#x27;], values_to_plot[&#x27;acc_test&#x27;], iters_per_test)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Train Classifier from scratch (initialized randomly)

# Create the network
rng = np.random.RandomState(seed=SEED)
net_classifier_from_scratch = Classifier_3layers(D_in=H_height*W_width,
                                                 D_hid_1=256, # TODO: Use same as layer 1 of encoder of wide AE (Task 4)
                                                 D_hid_2=32,  # TODO: Use same as layer 2 (the bottleneck) of the encoder of the wide AE (Task 4)
                                                 D_out=C_classes,
                                                 rng=rng)
# Start training
train_classifier(net_classifier_from_scratch,
                 None,  # No pretrained AE
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

&#x27;&#x27;&#x27;
[iter: 16 ]: Training Loss: 1.93      Accuracy: 32.50
[iter: 17 ]: Training Loss: 2.04      Accuracy: 30.00
[iter: 18 ]: Training Loss: 1.91      Accuracy: 27.50
[iter: 19 ]: Training Loss: 1.77      Accuracy: 32.50
[iter: 20 ]: Training Loss: 1.71      Accuracy: 40.00
                                 Testing Accuracy: 30.05
[iter: 21 ]: Training Loss: 1.67      Accuracy: 42.50
[iter: 22 ]: Training Loss: 1.64      Accuracy: 57.50
...
[iter: 997 ]: Training Loss: 0.00      Accuracy: 100.00
[iter: 998 ]: Training Loss: 0.00      Accuracy: 100.00
[iter: 999 ]: Training Loss: 0.00      Accuracy: 100.00
                                 Testing Accuracy: 55.60
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/vkp5560uuxqbcywlug.png" height="464" width="562"/></p><h3 id="59--ae-use-unsupervised-ae-as-pre-trained-feature-extractor-for-a-supervised-classifier-when-labels-are-limited">5.9 Use an Unsupervised AE as a &#x27;pre-trained feature-extractor&#x27; for a supervised Classifier when labels are limited</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Train classifier on top of pre-trained AE encoder

class Classifier_1layer(Network):
    # Classifier with just 1 layer, the classification layer
    def __init__(self, D_in, D_out, rng):
        # D_in: dimensions of input
        # D_out: dimension of output (number of classes)
        
        #### TODO: Fill in the blanks ######################
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_out))
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        ####################################################
        self.params = [w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_inp_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_inp_t.shape[0], 1)) # [N, 1] column vector.
        batch_inp_ext = torch.cat((batch_inp_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias. Lec5, slide 4.
        
        # Output classification layer
        logits = batch_inp_ext.mm(w_out)
        
        # Output layer activation function
        # Softmax activation function. See Lecture 5, slide 18.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
        
        return y_pred
    
    
    
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
# As input, it will be getting z-codes from the AE with 32-neurons bottleneck from Task 4.
classifier_1layer = Classifier_1layer(autoencoder_wide.D_bottleneck,  # Input dimension is dimensions of AE&#x27;s Z
                                      C_classes,
                                      rng=rng)

########### TODO: Fill in the gaps to start training ####################
# Give to the function the 1-layer classifier, as well as the pre-trained AE that will work as feature extractor.
# For the pre-trained AE, give the instance of &#x27;wide&#x27; AE that has 32-neurons bottleneck, which you trained in Task 4.
train_classifier(classifier_1layer,  # The 1-layer classifier to be trained
                 autoencoder_wide,  # The pre-trained AE, used as a frozen feature extractor
                 cross_entropy,  # Function that computes the loss
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,   # 5e-3 is best for the 1-layer classifier when using all the data.
                 total_iters=1000,
                 iters_per_test=20)

&#x27;&#x27;&#x27;
[iter: 0 ]: Training Loss: 2.42      Accuracy: 0.00
                                 Testing Accuracy: 10.79
[iter: 1 ]: Training Loss: 2.25      Accuracy: 12.50
[iter: 2 ]: Training Loss: 2.08      Accuracy: 20.00
[iter: 3 ]: Training Loss: 2.16      Accuracy: 7.50
[iter: 4 ]: Training Loss: 1.99      Accuracy: 35.00
[iter: 5 ]: Training Loss: 1.78      Accuracy: 60.00
[iter: 6 ]: Training Loss: 1.90      Accuracy: 32.50
[iter: 7 ]: Training Loss: 1.79      Accuracy: 47.50
[iter: 8 ]: Training Loss: 1.74      Accuracy: 45.00
[iter: 9 ]: Training Loss: 1.67      Accuracy: 50.00
[iter: 10 ]: Training Loss: 1.57      Accuracy: 62.50
[iter: 11 ]: Training Loss: 1.51      Accuracy: 67.50
[iter: 12 ]: Training Loss: 1.37      Accuracy: 72.50
[iter: 13 ]: Training Loss: 1.43      Accuracy: 77.50
[iter: 14 ]: Training Loss: 1.37      Accuracy: 65.00
[iter: 15 ]: Training Loss: 1.33      Accuracy: 60.00
[iter: 16 ]: Training Loss: 1.44      Accuracy: 65.00
[iter: 17 ]: Training Loss: 1.41      Accuracy: 72.50
[iter: 18 ]: Training Loss: 1.39      Accuracy: 70.00
[iter: 19 ]: Training Loss: 1.17      Accuracy: 72.50
[iter: 20 ]: Training Loss: 0.99      Accuracy: 90.00
                                 Testing Accuracy: 60.76
[iter: 21 ]: Training Loss: 1.09      Accuracy: 87.50
[iter: 22 ]: Training Loss: 1.01      Accuracy: 80.00
...
[iter: 997 ]: Training Loss: 0.04      Accuracy: 100.00
[iter: 998 ]: Training Loss: 0.09      Accuracy: 100.00
[iter: 999 ]: Training Loss: 0.05      Accuracy: 100.00
                                 Testing Accuracy: 67.65
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/2nnscve78ch6ohrcjh.png" height="464" width="562"/></p><h3 id="510--ae-use-parameters-of-an-unsupervised-aes-encoder-to-initialize-weights-of-a-supervised-classifier-followed-by-refinement-using-limited-labels">5.10 Use parameters of an Unsupervised AE&#x27;s encoder to initialize weights of a supervised Classifier, followed by refinement using limited labels</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Pre-train a classifier.

# The below classifier has THE SAME architecture as the 3-layer Classifier that we trained...
# ... in a purely supervised manner in Task-6.
# This is done by inheriting the class (Classifier_3layers), therefore uses THE SAME forward_pass() function.
# THE ONLY DIFFERENCE is in the construction __init__.
# This &#x27;pretrained&#x27; classifier receives as input a pretrained autoencoder (pretrained_AE) from Task 4.
# It then uses the parameters of the AE&#x27;s encoder to initialize its own parameters, rather than random initialization.
# The model is then trained all together.
class Classifier_3layers_pretrained(Classifier_3layers):
    def __init__(self, pretrained_AE, D_in, D_out, rng):
        D_in = D_in
        D_hid_1 = 256
        D_hid_2 = 32
        D_out = D_out

        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(pretrained_AE.params[0], dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(pretrained_AE.params[1], dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
classifier_3layers_pretrained = Classifier_3layers_pretrained(autoencoder_wide,  # The AE pre-trained in Task 4.
                                                              train_imgs_flat.shape[1],
                                                              C_classes,
                                                              rng=rng)

# Start training
# NOTE: Only the 3-layer pretrained classifier is used, and will be trained all together.
# No frozen feature extractor.
train_classifier(classifier_3layers_pretrained,  # classifier that will be trained.
                 None,  # No pretrained AE to act as &#x27;frozen&#x27; feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

&#x27;&#x27;&#x27;
[iter: 0 ]: Training Loss: 2.42      Accuracy: 0.00
                                 Testing Accuracy: 12.06
[iter: 1 ]: Training Loss: 2.19      Accuracy: 12.50
[iter: 2 ]: Training Loss: 2.04      Accuracy: 22.50
[iter: 3 ]: Training Loss: 2.11      Accuracy: 15.00
[iter: 4 ]: Training Loss: 1.89      Accuracy: 50.00
[iter: 5 ]: Training Loss: 1.68      Accuracy: 67.50
[iter: 6 ]: Training Loss: 1.71      Accuracy: 42.50
[iter: 7 ]: Training Loss: 1.58      Accuracy: 57.50
[iter: 8 ]: Training Loss: 1.55      Accuracy: 60.00
[iter: 9 ]: Training Loss: 1.36      Accuracy: 65.00
[iter: 10 ]: Training Loss: 1.18      Accuracy: 65.00
[iter: 11 ]: Training Loss: 1.12      Accuracy: 67.50
[iter: 12 ]: Training Loss: 0.88      Accuracy: 70.00
[iter: 13 ]: Training Loss: 0.98      Accuracy: 62.50
[iter: 14 ]: Training Loss: 0.86      Accuracy: 75.00
[iter: 15 ]: Training Loss: 0.72      Accuracy: 80.00
[iter: 16 ]: Training Loss: 0.93      Accuracy: 77.50
[iter: 17 ]: Training Loss: 0.84      Accuracy: 82.50
[iter: 18 ]: Training Loss: 0.78      Accuracy: 80.00
[iter: 19 ]: Training Loss: 0.44      Accuracy: 92.50
[iter: 20 ]: Training Loss: 0.55      Accuracy: 87.50
                                 Testing Accuracy: 62.03
[iter: 21 ]: Training Loss: 0.46      Accuracy: 87.50
[iter: 22 ]: Training Loss: 0.48      Accuracy: 90.00
...
[iter: 997 ]: Training Loss: 0.00      Accuracy: 100.00
[iter: 998 ]: Training Loss: 0.00      Accuracy: 100.00
[iter: 999 ]: Training Loss: 0.00      Accuracy: 100.00
                                 Testing Accuracy: 69.20
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/qbmxhnrnx2t27sl3pf.png" height="464" width="562"/></p><p>Compared with <code>Classifier_1layer</code>, <code>Classifier_3layers_pretrained</code> differs in the following main ways:</p><ol start="1"><li><p><strong>Number of layers and complexity</strong>:</p><ul><li><code>Classifier_1layer</code> has only an output layer. It goes directly from the input to the output, so it consists of a single layer.</li><li><code>Classifier_3layers_pretrained</code> has three layers: two hidden layers and an output layer. This increases the complexity of the network.</li></ul></li><li><p><strong>Parameter initialization</strong>:</p><ul><li><code>Classifier_1layer</code> initializes its weights randomly.</li><li><code>Classifier_3layers_pretrained</code> initializes the weights of its first two layers with those of the first two layers of the pre-trained auto-encoder (AE). The output-layer weights are still randomly initialized.</li></ul></li><li><p><strong>Number of parameters</strong>:</p><ul><li><code>Classifier_1layer</code> only has the weights of one layer.</li><li><code>Classifier_3layers_pretrained</code> has the weights of three layers, so it has more parameters and needs more computation and storage.</li></ul></li><li><p><strong>Inheritance</strong>:</p><ul><li><code>Classifier_1layer</code> is a standalone network class.</li><li><code>Classifier_3layers_pretrained</code> inherits from <code>Classifier_3layers</code>, which means it reuses some of its methods, in particular <code>forward_pass()</code>. Their main difference is in the constructor <code>__init__</code>.</li></ul></li><li><p><strong>Usage</strong>:</p><ul><li><code>Classifier_1layer</code> is mainly used to classify on top of the pre-trained AE encoder, which serves as a frozen feature extractor.</li><li><code>Classifier_3layers_pretrained</code> is trained end-to-end: even though its weights are initialized from the pre-trained AE, all of its weights are updated during training.</li></ul></li></ol><h2 id="6--vae-variational-auto-encodersvaes">6. Variational Auto-Encoders (VAEs)</h2><h3 id="61--mnist-loading-and-refresshing-mnist">6.1 Loading and Refreshing MNIST</h3><pre class="language-py lang-py"><code class="language-py lang-py"># -*- coding: utf-8 -*-
# The below is for auto-reloading external modules after they are changed, such as those in ./utils.
# Issue: https://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import numpy as np
from utils.data_utils import get_mnist # Helper function. Use it out of the box.

# Constants
DATA_DIR = &#x27;./data/mnist&#x27; # Location we will keep the data.
SEED = 111111

# If datasets are not at specified location, they will be downloaded.
train_imgs, train_lbls = get_mnist(data_dir=DATA_DIR, train=True, download=True)
test_imgs, test_lbls = get_mnist(data_dir=DATA_DIR, train=False, download=True)

print(&quot;[train_imgs] Type: &quot;, type(train_imgs), &quot;|| Shape:&quot;, train_imgs.shape, &quot;|| Data type: &quot;, train_imgs.dtype )
print(&quot;[train_lbls] Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot;|| Data type: &quot;, train_lbls.dtype )
print(&#x27;Class labels in train = &#x27;, np.unique(train_lbls))

print(&quot;[test_imgs] Type: &quot;, type(test_imgs), &quot;|| Shape:&quot;, test_imgs.shape, &quot; || Data type: &quot;, test_imgs.dtype )
print(&quot;[test_lbls] Type: &quot;, type(test_lbls), &quot;|| Shape:&quot;, test_lbls.shape, &quot; || Data type: &quot;, test_lbls.dtype )
print(&#x27;Class labels in test = &#x27;, np.unique(test_lbls))

N_tr_imgs = train_imgs.shape[0] # N hereafter. Number of training images in database.
H_height = train_imgs.shape[1] # H hereafter
W_width = train_imgs.shape[2] # W hereafter
C_classes = len(np.unique(train_lbls)) # C hereafter
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">%matplotlib inline
from utils.plotting import plot_grid_of_images # Helper functions, use out of the box.
plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)
</code></pre>
<h3 id="62--data-pre-processing">6.2 Data pre-processing</h3><pre class="language-py lang-py"><code class="language-py lang-py"># a) Change representation of labels to one-hot vectors of length C=10.
train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )
train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1
test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )
test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1
print(&quot;BEFORE: [train_lbls]        Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot; || Data type: &quot;, train_lbls.dtype )
print(&quot;AFTER : [train_lbls_onehot] Type: &quot;, type(train_lbls_onehot), &quot;|| Shape:&quot;, train_lbls_onehot.shape, &quot; || Data type: &quot;, train_lbls_onehot.dtype )
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># b) Re-scale image intensities, from [0,255] to [-1, +1].
# This commonly facilitates learning:
# A zero-centered signal with small magnitude allows avoiding exploding/vanishing problems easier.
from utils.data_utils import normalize_int_whole_database # Helper function. Use out of the box.
train_imgs = normalize_int_whole_database(train_imgs, norm_type=&quot;minus_1_to_1&quot;)
test_imgs = normalize_int_whole_database(test_imgs, norm_type=&quot;minus_1_to_1&quot;)

# Lets plot one image.
from utils.plotting import plot_image, plot_images # Helper function, use out of the box.
index = 0  # Try any, up to 60000
print(&quot;Plotting image of index: [&quot;, index, &quot;]&quot;)
print(&quot;Class label for this image is: &quot;, train_lbls[index])
print(&quot;One-hot label representation: [&quot;, train_lbls_onehot[index], &quot;]&quot;)
plot_image(train_imgs[index])
# Notice the magnitude of intensities. Black is now negative and white is positive float.
# Compare with intensities of figure further above.
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># c) Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images.
train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (S = num Samples), flatten others.
test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])
print(&quot;Shape of numpy array holding the training database:&quot;)
print(&quot;Original : [N, H, W] = [&quot;, train_imgs.shape , &quot;]&quot;)
print(&quot;Flattened: [N, H*W]  = [&quot;, train_imgs_flat.shape , &quot;]&quot;)
</code></pre>
<h3 id="63--variational-auto-encoder">6.3 Variational Auto-Encoder</h3><pre class="language-py lang-py"><code class="language-py lang-py">import torch
import torch.optim as optim
import torch.nn as nn

class Network():
    def backward_pass(self, loss):
        # Performs back propagation and computes gradients
        # With PyTorch, we do not need to compute gradients analytically for parameters where requires_grad=True.
        # Calling loss.backward(), torch&#x27;s Autograd automatically computes grads of loss wrt each parameter p,...
        # ... and **puts them in p.grad**. Return them in a list.
        loss.backward()
        grads = [param.grad for param in self.params]
        return grads

class VAE(Network):
    def __init__(self, rng, D_in, D_hid_enc, D_bottleneck, D_hid_dec):
        # Construct and initialize network parameters
        D_in = D_in # Dimension of input feature-vectors. Length of a vectorised image.
        D_hid_1 = D_hid_enc # Dimension of Encoder&#x27;s hidden layer
        D_hid_2 = D_bottleneck
        D_hid_3 = D_hid_dec  # Dimension of Decoder&#x27;s hidden layer
        D_out = D_in # Dimension of Output layer.
        
        self.D_bottleneck = D_bottleneck  # Keep track of it, we will need it.
        
        ##### TODO: Initialize the VAE&#x27;s parameters. Also see forward_pass(...)) ########
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        # -- (Encoder) layer 1
        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        # -- (Encoder) layer 2, predicting p(z|x)
        w2_mu_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w2_std_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        # -- (Decoder) layer 3
        w3_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_hid_3))
        # -- (Decoder) layer 4, the output layer
        w4_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_3+1, D_out))
        
        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the Pytorch weights.
        # (Encoder)
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        # (Encoder) Layer 2, predicting p(z|x)
        w2_mu = torch.tensor(w2_mu_init, dtype=torch.float, requires_grad=True)
        w2_std = torch.tensor(w2_std_init, dtype=torch.float, requires_grad=True)
        # (Decoder)
        w3 = torch.tensor(w3_init, dtype=torch.float, requires_grad=True)
        w4 = torch.tensor(w4_init, dtype=torch.float, requires_grad=True)
        # Keep track of all trainable parameters:
        self.params = [w1, w2_mu, w2_std, w3, w4]
        ###########################################################################
        
    
    def encode(self, batch_imgs):
        # batch_imgs: Numpy array or Pytorch tensor of shape: [number of inputs, dimensionality of x]
        [w1, w2_mu, w2_std, w3, w4] = self.params
        
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float) if type(batch_imgs) is np.ndarray else batch_imgs
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # ========== TODO: Fill in the gaps with the correct parameters of the VAE ========
        # Encoder&#x27;s Layer 1
        h1_preact = x.mm(w1)
        h1_act = h1_preact.clamp(min=0)
        # Encoder&#x27;s Layer 2 (predicting p(z|x) of Z coding):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        # ... mu
        h2_mu_preact = h1_ext.mm(w2_mu)   # &lt;------------- ????????
        h2_mu_act = h2_mu_preact
        # ... log(std). Ask yourselves: Why do we do this, instead of directly predicting std deviation?
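        #     (One reason: the linear output is unconstrained in (-inf, +inf). Exponentiating log(std) later always
        #     gives a strictly positive standard deviation, which is more stable than predicting std directly.)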
        h2_logstd_preact = h1_ext.mm(w2_std)  # &lt;-------------- ???????
        h2_logstd_act = h2_logstd_preact  # No (linear) activation function in this tutorial, but can use any.
        # ==============================================================================
        
        z_coding = (h2_mu_act, h2_logstd_act)
        
        return z_coding
        
        
    def decode(self, z_codes):
        # z_codes: numpy array or pytorch tensor, shape [N, dimensionality of Z]
        [w1, w2_mu, w2_std, w3, w4] = self.params
        
        z_codes_t = torch.tensor(z_codes, dtype=torch.float) if type(z_codes) is np.ndarray else z_codes
        
        unary_feature_for_bias = torch.ones(size=(z_codes_t.shape[0], 1)) # [N, 1] column vector.
        
        # ========== TODO: Fill in the gaps with the correct parameters of the VAE ========
        # Decoder&#x27;s 1st layer (Layer 3 of whole VAE):
        h2_ext = torch.cat((z_codes_t, unary_feature_for_bias), dim=1)
        h3_preact = h2_ext.mm(w3)  # &lt; ----------------------------------
        h3_act = h3_preact.clamp(min=0)
        # Decoder&#x27;s 2nd layer (Layer 4 of whole VAE): The output layer.
        h3_ext = torch.cat((h3_act, unary_feature_for_bias), dim=1)
        h4_preact = h3_ext.mm(w4)
        h4_act = torch.tanh(h4_preact)
        # ==============================================================================
        
        # Output
        x_pred = h4_act
        
        return x_pred
        
        
    def sample_with_reparameterization(self, z_mu, z_logstd):
        # Reparameterization trick to sample from N(mu, var) using N(0,1) as intermediate step.
        # param z_mu: Tensor. Mean of the predicted Gaussian p(z|x). Shape: [Num samples, Dimensionality of Z]
        # param z_logstd: Tensor. Log of standard deviation of predicted Gaussian p(z|x). [Num samples, Dim of Z]
        # return: Tensor. [Num samples, Dim of Z]
        
        N_samples = z_mu.shape[0]
        Z_dims = z_mu.shape[1]

        # ========== TODO: Fill in the gaps to complete the reparameterization trick ========
        z_std = torch.exp(z_logstd)       #   &lt;--------------- ?????????
        eps = torch.randn(size=[N_samples, Z_dims])  # Samples from N(0,I)
        z_samples = z_mu + z_std * eps    #           &lt;--------------- ?????????
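        # Why this works: sampling z directly from N(mu, std^2) is not differentiable w.r.t. mu and std.
        # With z = mu + std * eps, eps ~ N(0, I) carries all the randomness, so gradients of the loss
        # can flow back through mu and std into the encoder weights.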
        # ==============================================================================
        
        return z_samples
        
        
    def forward_pass(self, batch_imgs):
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float)  # Makes numpy array to pytorch tensor.
        
        # ========== TODO: Call the appropriate functions, as you defined them above ========
        # Encoder
        z_mu, z_logstd = self.encode(batch_imgs_t)  # &lt;------------- ????????????
        z_samples = self.sample_with_reparameterization(z_mu, z_logstd)  # &lt;------------- ????????????
        # Decoder
        x_pred = self.decode(z_samples)  # &lt;------------- ????????????
        # ===================================================================================
        
        return (x_pred, z_mu, z_logstd, z_samples)

def reconstruction_loss(x_pred, x_real, eps=1e-7):
    # x_pred: [N, D_out] Prediction returned by forward_pass. Numpy array of shape [N, D_out]
    # x_real: [N, D_in]
    
    # If a numpy array is given, change it to a Torch tensor.
    x_pred = torch.tensor(x_pred, dtype=torch.float) if type(x_pred) is np.ndarray else x_pred
    x_real = torch.tensor(x_real, dtype=torch.float) if type(x_real) is np.ndarray else x_real
    
    ######## TODO: Complete the calculation of Reconstruction loss for each sample ###########
    loss_recon = torch.mean(torch.square(x_pred - x_real), dim=1)
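    # (Mean squared error per sample, averaged over the D_out output dimensions; the decoder uses tanh,
    #  so predictions lie in [-1, 1].)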
    ##########################################################################################
    
    cost = torch.mean(loss_recon, dim=0) # Expectation of loss: Mean over samples (axis=0).
    
    return cost


def regularizer_loss(mu, log_std):
    # mu: Tensor, [number of samples, dimensionality of Z]. Predicted means per z dimension
    # log_std: Tensor, [number of samples, dimensionality of Z]. Predicted log(std.dev.) per z dimension.
    
    ######## TODO: Complete the calculation of the Regularizer (KL) loss for each sample ###########
    std = torch.exp(log_std)  # Compute std.dev. from log(std.dev.)
    reg_loss_per_sample = 0.5 * torch.sum(mu**2 + std**2 - 2 * log_std - 1, dim = 1)  # &lt;----------
    reg_loss = torch.mean(reg_loss_per_sample, dim = 0)  # Mean over samples.
    ##########################################################################################
    
    return reg_loss


def vae_loss(x_real, x_pred, z_mu, z_logstd, lambda_rec=1., lambda_reg=0.005, eps=1e-7):
    
    rec_loss = reconstruction_loss(x_pred, x_real, eps=eps)
    reg_loss = regularizer_loss(z_mu, z_logstd)
    
    ################### TODO: compute the total loss: #####################################
    # ...by weighting the reconstruction loss by lambda_rec, and the Regularizer by lambda_reg
    weighted_rec_loss = lambda_rec * rec_loss
    weighted_reg_loss = lambda_reg * reg_loss
    total_loss = weighted_rec_loss + weighted_reg_loss
    #######################################################################################
    
    return total_loss, weighted_rec_loss, weighted_reg_loss
    
</code></pre>
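<p>A note on the regularizer defined above: it is the closed-form KL divergence between the predicted Gaussian q(z|x) = N(mu, std^2) and the standard normal prior N(0, I), i.e. 0.5 * sum(mu^2 + std^2 - 2*log(std) - 1) over the z dimensions. A minimal sanity check of that formula against torch.distributions (illustration only; the values below are arbitrary):</p><pre class="language-py lang-py"><code class="language-py lang-py">import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(4, 2)             # pretend means for 4 samples, 2 z-dimensions
log_std = 0.1 * torch.randn(4, 2)  # pretend log(std.dev.) values
std = torch.exp(log_std)

# Closed form used in regularizer_loss above:
closed_form = 0.5 * torch.sum(mu**2 + std**2 - 2 * log_std - 1, dim=1)
# The same quantity computed by PyTorch:
via_torch = kl_divergence(Normal(mu, std), Normal(0., 1.)).sum(dim=1)
print(torch.allclose(closed_form, via_torch, atol=1e-5))  # True
</code></pre>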
<h3 id="64-vae-unsupervised-training-of-vae">6.4 VAE的无监督训练 Unsupervised training of VAE</h3><pre class="language-py lang-py"><code class="language-py lang-py">from utils.plotting import plot_train_progress_VAE, plot_grids_of_images  # Use out of the box


def get_random_batch(train_imgs, train_lbls, batch_size, rng):
    # train_imgs: Images. Numpy array of shape [N, H * W]
    # train_lbls: Labels of images. None, or Numpy array of shape [N, C_classes], one hot label for each image.
    # batch_size: integer. Size that the batch should have.
    
    # Sample batch_size random indices from the training set:
    indices = rng.randint(low=0, high=train_imgs.shape[0], size=batch_size, dtype=&#x27;int32&#x27;)
    
    train_imgs_batch = train_imgs[indices]
    if train_lbls is not None:  # Enables function to be used both for supervised and unsupervised learning
        train_lbls_batch = train_lbls[indices]
    else:
        train_lbls_batch = None
    return [train_imgs_batch, train_lbls_batch]


def unsupervised_training_VAE(net,
                             loss_func,
                             lambda_rec,
                             lambda_reg,
                             rng,
                             train_imgs_all,
                             batch_size,
                             learning_rate,
                             total_iters,
                             iters_per_recon_plot=-1):
    # net: Instance of a model. See classes: Autoencoder, MLPClassifier, etc further below
    # loss_func: Function that computes the loss. See functions: reconstruction_loss or cross_entropy.
    # lambda_rec: weighing of reconstruction loss in total loss. Total = lambda_rec * rec_loss + lambda_reg * reg_loss
    # lambda_reg: same as above, but for regularizer
    # rng: numpy random number generator
    # train_imgs_all: All the training images. Numpy array, shape [N_tr, H, W]
    # batch_size: Size of the batch that should be processed per SGD iteration by a model.
    # learning_rate: self explanatory.
    # total_iters: how many SGD iterations to perform.
    # iters_per_recon_plot: Integer. Every that many iterations the model predicts training images ...
    #                      ...and we plot their reconstruction. For visual observation of the results.
    loss_total_to_plot = []
    loss_rec_to_plot = []
    loss_reg_to_plot = []
    
    optimizer = optim.Adam(net.params, lr=learning_rate)  # Will use PyTorch&#x27;s Adam optimizer out of the box
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        x_batch, _ = get_random_batch(train_imgs_all, None, batch_size, rng)
        
        ################### TODO: compute the total loss: ################################################
        # Pass parameters of the predicted distribution per x (mean mu and log(std.dev) to the loss function
        
        # Forward pass: Encodes, samples via reparameterization trick, decodes
        x_pred, z_mu, z_logstd, z_codes = net.forward_pass(x_batch)

        # Compute loss:
        total_loss, rec_loss, reg_loss = loss_func(x_batch, x_pred, z_mu, z_logstd, lambda_rec, lambda_reg) # &lt;-------------
        ####################################################################################################
        # Pytorch way
        optimizer.zero_grad()
        _ = net.backward_pass(total_loss)
        optimizer.step()
        
        # ==== Report training losses ======
        total_loss_np = total_loss if type(total_loss) is float else total_loss.item()  # Pytorch returns tensor. Cast to float.
        rec_loss_np = rec_loss if type(rec_loss) is float else rec_loss.item()
        reg_loss_np = reg_loss if type(reg_loss) is float else reg_loss.item()
        if t%10==0:  # Print every 10 iterations
            print(&quot;[iter:&quot;, t, &quot;]: Total training Loss: {0:.2f}&quot;.format(total_loss_np))
        loss_total_to_plot.append(total_loss_np)
        loss_rec_to_plot.append(rec_loss_np)
        loss_reg_to_plot.append(reg_loss_np)
        
        # =============== Every few iterations, show reconstructions ================#
        if t==total_iters-1 or t%iters_per_recon_plot == 0:
            # Reconstruct all images, to plot reconstructions.
            x_pred_all, z_mu_all, z_logstd_all, z_codes_all = net.forward_pass(train_imgs_all)
            # Cast tensors to numpy arrays
            x_pred_all_np = x_pred_all if type(x_pred_all) is np.ndarray else x_pred_all.detach().numpy()
            
            # Predicted reconstructions have vector shape. Reshape them to original image shape.
            train_imgs_resh = train_imgs_all.reshape([train_imgs_all.shape[0], H_height, W_width])
            x_pred_all_np_resh = x_pred_all_np.reshape([train_imgs_all.shape[0], H_height, W_width])
            
            # Plot a few images, originals and predicted reconstructions.
            plot_grids_of_images([train_imgs_resh[0:100], x_pred_all_np_resh[0:100]],
                                  titles=[&quot;Real&quot;, &quot;Reconstructions&quot;],
                                  n_imgs_per_row=10,
                                  dynamically=True)
            
    # In the end of the process, plot loss.
    plot_train_progress_VAE(loss_total_to_plot, loss_rec_to_plot, loss_reg_to_plot, iters_per_point=1, y_lims=[1., 1., None])
    
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">##################### TODO: Fill in the blank ##############################
# Create the network
rng = np.random.RandomState(seed=SEED)
vae = VAE(rng=rng,
          D_in=H_height*W_width,
          D_hid_enc=256,
          D_bottleneck=2,  # &lt;--- Set to correct value for instantiating VAE shown &amp; implemented in Task 1. Note: We treat D as dimensionality of Z, rather than number of neurons.
          D_hid_dec=256)
########################################################################
# Start training
unsupervised_training_VAE(vae,
                          vae_loss,
                          lambda_rec=1.0,  # &lt;-------- lambda_rec, weight on reconstruction loss.
                          lambda_reg=0.005,  # &lt;------- lambda_reg, weight on regularizer. 0.005 works ok.
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)

</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/10idk3r61grfv31mr8.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/tmxsa4k5t5r839j91b.png" height="464" width="566"/></p><h3 id="65--z--encode-training-data-in-z-representation-and-examine">6.5 以 Z 表示形式对训练数据进行编码并检查 Encode training data in Z representation and examine</h3><pre class="language-py lang-py"><code class="language-py lang-py">import matplotlib.pyplot as plt

def encode_training_images(net,
                           imgs_flat,
                           lbls,
                           batch_size,
                           total_iterations=None,
                           plot_2d_embedding=True,
                           plot_hist_mu_std_for_dim=0):
    # This function encodes images, plots the first 2 dimensions of the codes in a plot, and finally...
    # ... returns the minimum and maximum values of the codes for each dimension of Z.
    # ... We will use this in a later task.
    # Arguments:
    # imgs_flat: Numpy array of shape [Number of images, H * W]
    # lbls: Numpy array of shape [number of images], with 1 integer per image. The integer is the class (digit).
    # total_iterations: How many batches to encode. We will use this so that we do not encode and plot ...
    # ... the whole training database, because the plot would get cluttered with 60000 points.
    # Returns:
    # min_z: numpy array, vector with [dimensions-of-z] elements. Minimum value per dimension of z.
    # max_z: numpy array, vector with [dimensions-of-z] elements. Maximum value per dimension of z.
    
    # If total iterations is None, the function will just iterate over all data, by breaking them into batches.    
    if total_iterations is None:
        total_iterations = (imgs_flat.shape[0] - 1) // batch_size + 1
    
    z_mu_all = []
    z_std_all = []
    lbls_all = []
    for t in range(total_iterations):
        # Sample batch for this SGD iteration
        x_batch = imgs_flat[t*batch_size: (t+1)*batch_size]
        lbls_batch = lbls[t*batch_size: (t+1)*batch_size]  # Just to color the embeddings (z codes) in the plot.
        
        ####### TODO: Fill in the blank ##################################
        # Encode a batch of x inputs:
        z_mu, z_logstd = net.encode(x_batch)  # &lt;------------------------
        #################################################################
        z_mu_np = z_mu if type(z_mu) is np.ndarray else z_mu.detach().numpy()
        z_logstd_np = z_logstd if type(z_logstd) is np.ndarray else z_logstd.detach().numpy()
        
        z_mu_all.append(z_mu_np)
        z_std_all.append(np.exp(z_logstd_np))
        lbls_all.append(lbls_batch)
        
    z_mu_all = np.concatenate(z_mu_all)  # Make list of arrays in one array by concatenating along dim=0 (image index)
    z_std_all = np.concatenate(z_std_all)
    lbls_all = np.concatenate(lbls_all)
    
    if plot_2d_embedding:
        print(&quot;Z-Space and the MEAN of the predicted p(z|x) for each sample (std.devs not shown)&quot;)
        # Plot the codes with different color per class in a scatter plot:
        plt.scatter(z_mu_all[:,0], z_mu_all[:,1], c=lbls_all, alpha=0.5)  # Plot the first 2 dimensions.
        plt.show()
    
    print(&quot;Histogram of values of the predicted MEANS&quot;)
    plt.hist(z_mu_all[:,plot_hist_mu_std_for_dim], bins=20)
    plt.show()
    print(&quot;Histogram of values of the predicted STANDARD DEVIATIONS&quot;)
    plt.hist(z_std_all[:,plot_hist_mu_std_for_dim], bins=20)
    plt.show()
    
    min_z = np.min(z_mu_all, axis=0)  # Minimum value per dimension of Z (see docstring above).
    max_z = np.max(z_mu_all, axis=0)  # Maximum value per dimension of Z.
    return min_z, max_z
    
    


# Encode and plot
encode_training_images(vae,
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=0)


</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/qnvxto6ezq3oi1mfin.png" height="417" width="550"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/xaywvteldkuxsfr64f.png" height="418" width="567"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/wlfrsyfbkahygg4p68.png" height="417" width="567"/></p><h3 id="66--1--2--vae-train-vae-from-task-1-and-2-only-with-reconstruction-loss">6.6 仅使用重建损失从任务 1 和 2 训练 VAE Train VAE from Task 1 and 2 only with Reconstruction loss</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Create the network
rng = np.random.RandomState(seed=SEED)
vae_2 = VAE(rng=rng,
            D_in=H_height*W_width,
            D_hid_enc=256,
            D_bottleneck=2,
            D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_2,
                          vae_loss,
                          lambda_rec=1.0,
                          lambda_reg=0.0,  # &lt;------- No regularization loss. Just reconstruction.
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/u3zhl00fab82mcqcb9.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/yuqaqjthhsdzmpjr4x.png" height="464" width="566"/></p><pre class="language-py lang-py"><code class="language-py lang-py"># Encode and plot
encode_training_images(vae_2, # The second VAE, trained only with Reconstruction loss.
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=0)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/kf8xlae9atjp75g39k.png" height="423" width="560"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/p4jkl3kvkjwrkrynms.png" height="417" width="567"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/ww7smm06s5yiummbwy.png" height="417" width="580"/></p><h3 id="67--1--2--vae--train-vae-from-task-1-and-2-to-minimize-only-the-regularizer">6.7 从任务 1 和 2 训练 VAE 以仅最小化正则化器 Train VAE from Task 1 and 2 to minimize only the Regularizer</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Create the network
rng = np.random.RandomState(seed=SEED)
vae_3 = VAE(rng=rng,
            D_in=H_height*W_width,
            D_hid_enc=256,
            D_bottleneck=2,
            D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_3,
                          vae_loss,
                          lambda_rec=0.0,  # &lt;------- No reconstruction loss. Only regularizer
                          lambda_reg=0.005,
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/kiijlzwj6172l5k1xt.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/jqm8j87gc8jkb98zvo.png" height="464" width="566"/></p><pre class="language-py lang-py"><code class="language-py lang-py"># Encode and plot
encode_training_images(vae_3, # The third VAE, trained only with the Regularizer loss.
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=0)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/6wvbtgtx52oqdzt4xn.png" height="417" width="586"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/08rxdlgz687jjrreg6.png" height="417" width="567"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/t1x0kzmtexsc7s7o2h.png" height="417" width="567"/></p><h3 id="68--vae-train-a-vae-with-a-larger-bottleneck-layer">6.8 训练具有更大瓶颈层的 VAE Train a VAE with a larger bottleneck layer</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Same as in Task 2, but using a bottle neck with 32 dimension

# Create the network
rng = np.random.RandomState(seed=SEED)
vae_wide = VAE(rng=rng,
          D_in=H_height*W_width,
          D_hid_enc=256,
          D_bottleneck=32,  # &lt;-----------------------------------
          D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_wide,
                          vae_loss,
                          1.0,  # lambda_rec, weight on the reconstruction loss.
                          0.005,  # lambda_reg. 0.005 works well for synthesis; 0.0005 gives smoother z values with the 32-dim bottleneck.
                          rng,
                          train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,  # 3e-3
                          total_iters=1000,
                          iters_per_recon_plot=50)

</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/4hap1vynsdm5pn15zu.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/hor6l8cu7khkbjfj7h.png" height="464" width="566"/></p><h3 id="69--vae--synthesizing-generating-new-data-with-a-vae">6.9 使用 VAE 合成（生成）新数据 Synthesizing (generating) new data with a VAE</h3><pre class="language-py lang-py"><code class="language-py lang-py">def synthesize(enc_dec_net,
               rng,
               n_samples):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # n_samples: how many samples to produce.
    
    z_dims = enc_dec_net.D_bottleneck  # Dimensionality of z codes (and input to decoder).
    
    ############################## TODO: Fill in the blanks #############################
    # Create samples of z from Gaussian N(0,I), where means are 0 and standard deviations are 1 in all dimensions.
    z_samples = rng.normal(loc=0.0, scale=1.0, size=[n_samples, z_dims])  # Use the provided rng for reproducibility.
    #####################################################################################
    
    z_samples_t = torch.tensor(z_samples, dtype=torch.float)
    x_samples = enc_dec_net.decode(z_samples_t)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
       
    
# Lets finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

synthesize(vae_wide,
           rng,
           n_samples=20)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/ya87hncodslmscmubi.png" height="417" width="513"/></p><h3 id="610--x-pzx--for-a-given-x-reconstruct-random-samples-from-the-predicted-posterior-pzx">6.10 对于给定的 x，根据预测的后验 p(z|x) 重建随机样本 For a given x, reconstruct random samples from the predicted posterior p(z|x)</h3><pre class="language-py lang-py"><code class="language-py lang-py">def sample_variations_of_x(enc_dec_net,
                           imgs_flat,
                           idx_img_x,
                           rng,
                           n_samples):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # imgs_flat: Numpy array of shape [number of images, H * W]
    # idx_img_x: integer, index of the image x that will be encoded and then reconstructed.
    # n_samples: how many samples to produce.
    
    # Extract the image with index idx_img_x from the dataset, keeping just one sample.
    img_x_nparray = imgs_flat[idx_img_x:idx_img_x+1]  # Shape: [num samples = 1, H * W]
    
    # Encode: use the VAE encoder to map the image to the mean and log(std.dev.) of its latent distribution.
    z_mu, z_logstd = enc_dec_net.encode(img_x_nparray)  # expects array shape [N, dims_z]
    
    z_dims = z_mu.shape[1]  # Dimensionality of z codes (and input to decoder).
    z_mu = z_mu.detach().numpy()  # Make pytorch tensor a numpy array
    z_logstd = z_logstd.detach().numpy()
    
    ############# TODO: Fill in the blanks ##################################
    # Samples z values from the predicted probability of z for this sample x: p(z|x) = N(mu(x), std^2(x))
    z_std = np.exp(z_logstd)   # &lt;------------------------------------------------------------------
    z_samples = np.random.normal(loc=z_mu, scale=z_std, size=[n_samples, z_dims]) #&lt;------------------
    #########################################################################
    
    x_samples = enc_dec_net.decode(z_samples)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    print(&quot;Real input to encoder:&quot;)
    plot_image(img_x_nparray.reshape([H_height, W_width]))   
    print(&quot;Reconstructions based on samples from p(z|x=input):&quot;)
    plot_grid_of_images(x_samples_np.reshape([n_samples, H_height, W_width]),
                        n_imgs_per_row=10,
                        dynamically=False)
    print(&quot;Going to plot all the reconstructed variations one by one, for easier visual investigation:&quot;)
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
    
    diff = img_x_nparray[0] - x_samples_np[0]
    
# Lets finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

sample_variations_of_x(vae_wide,  # The VAE with 32 dimensional Z.
                       train_imgs_flat,
                       idx_img_x=1,  # We will encode the image with index 1, and then reconstruct it.
                       rng=rng,
                       n_samples=100)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/ny6aoldcdews1nq8nr.png" height="423" width="534"/></p><h3 id="611--z--x1--x2--interpolate-between-x1-and-x2-in-space-z">6.11 在空间 Z 中的 x<em>1 和 x</em>2 之间进行插值 Interpolate between x<em>1 and x</em>2 in space Z</h3><pre class="language-py lang-py"><code class="language-py lang-py">def interpolate_between_x1_x2(enc_dec_net,
                              imgs_flat,
                              idx_x1,
                              idx_x2,
                              rng):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # imgs_flat: [number of images, H * W]
    # idx_x1: index of x1: x1 = imgs_flat[idx_x1]
    # idx_x2: index of x2: x2 = imgs_flat[idx_x2]
    # n_samples: how many samples to produce.
    
    img_x1_nparray = imgs_flat[idx_x1]
    img_x2_nparray = imgs_flat[idx_x2]
    z_mus, z_logstds = enc_dec_net.encode(np.array([img_x1_nparray, img_x2_nparray]))
    z_mus = z_mus.detach().numpy()
    
    z_mu1 = z_mus[0]  # np vector with [z-dims] elements
    z_mu2 = z_mus[1]
    
    z_dims = z_mu1.shape[0]  # Dimensionality of z codes (and input to decoder).
    
    # Reconstruct x1 and x2 based on mu codes:
    x_samples = enc_dec_net.decode(np.array([z_mu1, z_mu2]))
    x_samples = x_samples.detach().numpy()
    x1_rec = x_samples[0]
    x2_rec = x_samples[1]
    
    # Interpolate:
    alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    
    alphas_np = np.ones([11, z_dims], dtype=&quot;float16&quot;)  # [number of interpolated samples = 11, z-dimensions]
    for row_idx in range(alphas_np.shape[0]):
        alphas_np[row_idx] = alphas_np[row_idx] * alphas[row_idx]  # now whole 1st row == 0.0, 2nd row == 0.1, ...
    
    # Interpolate new z values
    zs_to_decode = z_mu1 + alphas_np * (z_mu2 - z_mu1)
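    # Note: z_mu1 + alpha * (z_mu2 - z_mu1) == (1 - alpha) * z_mu1 + alpha * z_mu2,
    # i.e. alpha=0 gives the code of x1 and alpha=1 the code of x2, with a straight line in Z in between.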
    
    x_samples= enc_dec_net.decode(zs_to_decode)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    print(&quot;Inputs to encoder:&quot;)
    plot_images([img_x1_nparray.reshape([H_height, W_width]), img_x2_nparray.reshape([H_height, W_width])],
               titles=[&quot;Real x1&quot;, &quot;Real x2&quot;])
    print(&quot;Reconstructions of x1 and x2 based on their most likely predicted z codes (corresponding mus):&quot;)
    plot_images([x1_rec.reshape([H_height, W_width]), x2_rec.reshape([H_height, W_width])],
               titles=[&quot;Recon of x1&quot;, &quot;Recon of x2&quot;])
    print(&quot;Decodings based on z samples interpolated between mu(x1) and mu(x2) predicted by encoder:&quot;)
    plot_grid_of_images(x_samples_np.reshape([11, H_height, W_width]),
                        n_imgs_per_row=11,
                        dynamically=False)
    print(&quot;Going to plot all the reconstructed variations one by one, for easier visual investigation:&quot;)
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
    
    
# Lets finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

interpolate_between_x1_x2(vae_wide,
                          train_imgs_flat,
                          idx_x1=1,
                          idx_x2=3,
                          rng=rng)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/iztxc05s643lstfub3.png" height="417" width="567"/></p><h3 id="612--vae--learning-from-unlabelled-data-with-a-vae-to-complement-supervised-classifier-when-labelled-data-are-limited-lets-first-train-a-supervised-classifier-from-scratch">6.12 使用 VAE 从未标记数据中学习，以在标记数据有限时补充监督分类器：让我们首先“从头开始”训练一个监督分类器 Learning from Unlabelled data with a VAE, to complement Supervised Classifier when Labelled data are limited: Lets first train a supervised Classifier &#x27;from scratch&#x27;</h3><pre class="language-py lang-py"><code class="language-py lang-py">class Classifier_3layers(Network):
    def __init__(self, D_in, D_hid_1, D_hid_2, D_out, rng):
        D_in = D_in
        D_hid_1 = D_hid_1
        D_hid_2 = D_hid_2
        D_out = D_out
        
        # === NOTE: Notice that this is exactly the same architecture as encoder of AE in Task 4 ====
        w_1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        w_2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(w_1_init, dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(w_2_init, dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_1, w_2, w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_imgs_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # === NOTE: This is the same architecture as encoder of AE in Task 4, with extra classification layer ===
        # Layer 1
        h1_preact = x.mm(w_1)
        h1_act = h1_preact.clamp(min=0)
        # Layer 2 (corresponds to bottleneck of the AE):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w_2)
        h2_act = h2_preact.clamp(min=0)
        # Output classification layer
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        h_out = h2_ext.mm(w_out)
        
        logits = h_out
        
        # === Addition of a softmax activation, to turn the logits into class-posterior probabilities ===
        # Softmax activation function.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_h2 (which is [N,D_out]).
        
        return y_pred

    
def cross_entropy(y_pred, y_real, eps=1e-7):
    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Numpy array of shape [N, D_out]
    # y_real: One-hot representation of real training labels. Same shape as y_pred.
    
    # If a numpy array is given, change it to a Torch tensor.
    y_pred = torch.tensor(y_pred, dtype=torch.float) if type(y_pred) is np.ndarray else y_pred
    y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real
    
    x_entr_per_sample = - torch.sum( y_real*torch.log(y_pred+eps), dim=1)  # Sum over classes, axis=1
    
    loss = torch.mean(x_entr_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return loss



from utils.plotting import plot_train_progress_2

def train_classifier(classifier,
                     pretrained_VAE,
                     loss_func,
                     rng,
                     train_imgs,
                     train_lbls,
                     test_imgs,
                     test_lbls,
                     batch_size,
                     learning_rate,
                     total_iters,
                     iters_per_test=-1):
    # Arguments:
    # classifier: A classifier network. It will be trained by this function using labelled data.
    #             Its input will be either original data (if pretrained_VAE=0), ...
    #             ... or the output of the feature extractor if one is given.
    # pretrained_VAE: A pretrained AutoEncoder that will *not* be trained here.
    #      It will be used to encode input data.
    #      The classifier will take as input the output of this feature extractor.
    #      If pretrained_VAE = None: The classifier will simply receive the actual data as input.
    # train_imgs: Vectorized training images
    # train_lbls: One hot labels
    # test_imgs: Vectorized testing images, to compute generalization accuracy.
    # test_lbls: One hot labels for test data.
    # batch_size: batch size
    # learning_rate: come on...
    # total_iters: how many SGD iterations to perform.
    # iters_per_test: We will &#x27;test&#x27; the model on test data every few iterations as specified by this.
    
    values_to_plot = {&#x27;loss&#x27;:[], &#x27;acc_train&#x27;: [], &#x27;acc_test&#x27;: []}
    
    optimizer = optim.Adam(classifier.params, lr=learning_rate)
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        train_imgs_batch, train_lbls_batch = get_random_batch(train_imgs, train_lbls, batch_size, rng)
        
        # Forward pass
        if pretrained_VAE is None:
            inp_to_classifier = train_imgs_batch
        else:
            ############### TODO FOR TASK-11 #########################################
            # FILL IN THE BLANK, to provide as input to the classifier the predicted MEAN of p(z|x) for each x.
            # Why? Because the mean is the most likely (probable) code z for x!!
            #
            z_codes_mu, z_codes_logstd = pretrained_VAE.encode(train_imgs_batch)  # AE encodes. Output will be given to Classifier
            inp_to_classifier = z_codes_mu  # &lt;---------------------------- z_codes_???????
            ############################################################################
            
        y_pred = classifier.forward_pass(inp_to_classifier)
        
        # Compute loss:
        y_real = train_lbls_batch
        loss = loss_func(y_pred, y_real)  # Cross entropy
        
        # Backprop and updates.
        optimizer.zero_grad()
        grads = classifier.backward_pass(loss)
        optimizer.step()
        
        
        # ==== Report training loss and accuracy ======
        # y_pred and loss can be either np.array, or torch.tensor (see later). If tensor, make it np.array.
        y_pred_numpy = y_pred if type(y_pred) is np.ndarray else y_pred.detach().numpy()
        y_pred_lbls = np.argmax(y_pred_numpy, axis=1) # y_pred is soft/probability. Make it a hard one-hot label.
        y_real_lbls = np.argmax(y_real, axis=1)
        
        acc_train = np.mean(y_pred_lbls == y_real_lbls) * 100. # percentage
        
        loss_numpy = loss if type(loss) is float else loss.item()
        if t%10 == 0:
            print(&quot;[iter:&quot;, t, &quot;]: Training Loss: {0:.2f}&quot;.format(loss_numpy), &quot;\t Accuracy: {0:.2f}&quot;.format(acc_train))
        
        # =============== Every few iterations, test accuracy ================#
        if t==total_iters-1 or t%iters_per_test == 0:
            if pretrained_VAE is None:
                inp_to_classifier_test = test_imgs
            else:
                z_codes_test_mu, z_codes_test_logstd = pretrained_VAE.encode(test_imgs)
                inp_to_classifier_test = z_codes_test_mu
                
            y_pred_test = classifier.forward_pass(inp_to_classifier_test)
            
            # ==== Report test accuracy ======
            y_pred_test_numpy = y_pred_test if type(y_pred_test) is np.ndarray else y_pred_test.detach().numpy()
            
            y_pred_lbls_test = np.argmax(y_pred_test_numpy, axis=1)
            y_real_lbls_test = np.argmax(test_lbls, axis=1)
            acc_test = np.mean(y_pred_lbls_test == y_real_lbls_test) * 100.
            print(&quot;\t\t\t\t\t\t\t\t Testing Accuracy: {0:.2f}&quot;.format(acc_test))
            
            # Keep list of metrics to plot progress.
            values_to_plot[&#x27;loss&#x27;].append(loss_numpy)
            values_to_plot[&#x27;acc_train&#x27;].append(acc_train)
            values_to_plot[&#x27;acc_test&#x27;].append(acc_test)
                
    # In the end of the process, plot loss accuracy on training and testing data.
    plot_train_progress_2(values_to_plot[&#x27;loss&#x27;], values_to_plot[&#x27;acc_train&#x27;], values_to_plot[&#x27;acc_test&#x27;], iters_per_test)
    
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Train Classifier from scratch (initialized randomly)

# Create the network
rng = np.random.RandomState(seed=SEED)
net_classifier_from_scratch = Classifier_3layers(D_in=H_height*W_width,
                                                 D_hid_1=256,
                                                 D_hid_2=32,
                                                 D_out=C_classes,
                                                 rng=rng)
# Start training
train_classifier(net_classifier_from_scratch,
                 None,  # No pretrained AE
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

</code></pre><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/a0rkhqkt4h1rd2cqaw.png" height="464" width="562"/></p><h3 id="613--vae--use-pre-trained-vae-as-feature-extractor-for-supervised-classifier-when-labels-are-limited">6.13 当标签有限时，使用预训练的 VAE 作为监督分类器的“特征提取器” Use pre-trained VAE as &#x27;feature-extractor&#x27; for supervised Classifier when labels are limited</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Train classifier on top of pre-trained AE encoder

class Classifier_1layer(Network):
    # Classifier with just 1 layer, the classification layer
    def __init__(self, D_in, D_out, rng):
        # D_in: dimensions of input
        # D_out: dimension of output (number of classes)
        
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_out))
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        self.params = [w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_inp_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_inp_t.shape[0], 1)) # [N, 1] column vector.
        batch_inp_ext = torch.cat((batch_inp_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # Output classification layer
        logits = batch_inp_ext.mm(w_out)
        
        # Output layer activation function
        # Softmax activation function.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_h2 (which is [N,D_out]).
        
        return y_pred
    
    
    
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
# As input, it will be getting z-codes from the AE with 32-neurons bottleneck from Task 4.
classifier_1layer = Classifier_1layer(vae_wide.D_bottleneck,  # Input dimension is dimensions of AE&#x27;s Z
                                      C_classes,
                                      rng=rng)

train_classifier(classifier_1layer,
                 vae_wide,  # Pretrained AE, to use as feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/rfscy4i72jb6sl8y7j.png" height="464" width="562"/></p><h3 id="614--vae--use-parameters-of-vaes-encoder-to-initialize-weights-of-a-supervised-classifier-followed-by-refine-ment-using-limited-labels">6.14 使用 VAE 编码器的参数来初始化监督分类器的权重，然后使用有限的标签进行细化 Use parameters of VAE&#x27;s encoder to initialize weights of a supervised Classifier, followed by refine ment using limited labels</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Pre-train a classifier.

# The below classifier has THE SAME architecture as the 3-layer Classifier that we trained...
# ... in a purely supervised manner in Task-10.
# This is done by inheriting the class (Classifier_3layers), therefore uses THE SAME forward_pass() function.
# THE ONLY DIFFERENCE is in the construction __init__.
# This &#x27;pretrained&#x27; classifier receives as input a pretrained autoencoder (pretrained_VAE) from Task 6.
# It then uses the parameters of the AE&#x27;s encoder to initialize its own parameters, rather than random initialization.
# The model is then trained all together.
class Classifier_3layers_pretrained(Classifier_3layers):
    def __init__(self, pretrained_VAE, D_in, D_out, rng):
        D_in = D_in
        D_hid_1 = 256
        D_hid_2 = 32
        D_out = D_out

        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        [vae_w1, vae_w2_mu, vae_w2_std, vae_w3, vae_w4] = pretrained_VAE.params  # Pre-trained parameters of pre-trained VAE.
        
        w_1 = vae_w1.clone().detach().requires_grad_(True)     # Copy of the pretrained encoder weights (layer 1)
        w_2 = vae_w2_mu.clone().detach().requires_grad_(True)  # Copy of the pretrained encoder mu-weights (layer 2)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
classifier_3layers_pretrained = Classifier_3layers_pretrained(vae_wide,  # The AE pre-trained in Task 4.
                                                              train_imgs_flat.shape[1],
                                                              C_classes,
                                                              rng=rng)

# Start training
# NOTE: Only the 3-layer pretrained classifier is used, and will be trained all together.
# No frozen feature extractor.
train_classifier(classifier_3layers_pretrained,  # classifier that will be trained.
                 None,  # No pretrained AE to act as &#x27;frozen&#x27; feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/wj0frop4ll6s50dr5i.png" height="464" width="562"/></p><h2 id="7--generative-adversarial-networksgans">7. 生成对抗网络 Generative Adversarial Networks(GANs)</h2><h3 id="71--mnist-loading-and-refresshing-mnist">7.1 加载并刷新 MNIST Loading and Refresshing MNIST</h3><pre class="language-py lang-py"><code class="language-py lang-py"># -*- coding: utf-8 -*-
# The below is for auto-reloading external modules after they are changed, such as those in ./utils.
# Issue: https://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import numpy as np
from utils.data_utils import get_mnist # Helper function. Use it out of the box.

# Constants
DATA_DIR = &#x27;./data/mnist&#x27; # Location we will keep the data.
SEED = 111111

# If datasets are not at specified location, they will be downloaded.
train_imgs, train_lbls = get_mnist(data_dir=DATA_DIR, train=True, download=True)
test_imgs, test_lbls = get_mnist(data_dir=DATA_DIR, train=False, download=True)

print(&quot;[train_imgs] Type: &quot;, type(train_imgs), &quot;|| Shape:&quot;, train_imgs.shape, &quot;|| Data type: &quot;, train_imgs.dtype )
print(&quot;[train_lbls] Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot;|| Data type: &quot;, train_lbls.dtype )
print(&#x27;Class labels in train = &#x27;, np.unique(train_lbls))

print(&quot;[test_imgs] Type: &quot;, type(test_imgs), &quot;|| Shape:&quot;, test_imgs.shape, &quot; || Data type: &quot;, test_imgs.dtype )
print(&quot;[test_lbls] Type: &quot;, type(test_lbls), &quot;|| Shape:&quot;, test_lbls.shape, &quot; || Data type: &quot;, test_lbls.dtype )
print(&#x27;Class labels in test = &#x27;, np.unique(test_lbls))

N_tr_imgs = train_imgs.shape[0] # N hereafter. Number of training images in database.
H_height = train_imgs.shape[1] # H hereafter
W_width = train_imgs.shape[2] # W hereafter
C_classes = len(np.unique(train_lbls)) # C hereafter
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">%matplotlib inline
from utils.plotting import plot_grid_of_images # Helper functions, use out of the box.
plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)
</code></pre>
<h3 id="72--data-pre-processing">7.2 数据预处理 Data pre-processing</h3><pre class="language-py lang-py"><code class="language-py lang-py"># a) Change representation of labels to one-hot vectors of length C=10.
train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )
train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1
test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )
test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1
print(&quot;BEFORE: [train_lbls]        Type: &quot;, type(train_lbls), &quot;|| Shape:&quot;, train_lbls.shape, &quot; || Data type: &quot;, train_lbls.dtype )
print(&quot;AFTER : [train_lbls_onehot] Type: &quot;, type(train_lbls_onehot), &quot;|| Shape:&quot;, train_lbls_onehot.shape, &quot; || Data type: &quot;, train_lbls_onehot.dtype )
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># b) Re-scale image intensities, from [0,255] to [-1, +1].
# This commonly facilitates learning:
# A zero-centered signal with small magnitude makes it easier to avoid exploding/vanishing gradients.
from utils.data_utils import normalize_int_whole_database # Helper function. Use out of the box.
train_imgs = normalize_int_whole_database(train_imgs, norm_type=&quot;minus_1_to_1&quot;)
test_imgs = normalize_int_whole_database(test_imgs, norm_type=&quot;minus_1_to_1&quot;)

# Lets plot one image.
from utils.plotting import plot_image, plot_images # Helper function, use out of the box.
index = 0  # Try any, up to 60000
print(&quot;Plotting image of index: [&quot;, index, &quot;]&quot;)
print(&quot;Class label for this image is: &quot;, train_lbls[index])
print(&quot;One-hot label representation: [&quot;, train_lbls_onehot[index], &quot;]&quot;)
plot_image(train_imgs[index])
# Notice the magnitude of intensities. Black is now negative and white is positive float.
# Compare with intensities of figure further above.
</code></pre>
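<p>The normalize_int_whole_database helper is used out of the box here. Conceptually, a minus_1_to_1 rescaling of intensities in [0, 255] amounts to something like the sketch below (a hypothetical rescale_minus_1_to_1, not the helper itself):</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

def rescale_minus_1_to_1(imgs):
    # imgs: array with intensities in [0, 255]. Returns float32 values in [-1.0, +1.0].
    return imgs.astype(np.float32) / 255.0 * 2.0 - 1.0
</code></pre>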
<pre class="language-py lang-py"><code class="language-py lang-py"># c) Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images.
train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (S = num Samples), flatten others.
test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])
print(&quot;Shape of numpy array holding the training database:&quot;)
print(&quot;Original : [N, H, W] = [&quot;, train_imgs.shape , &quot;]&quot;)
print(&quot;Flattened: [N, H*W]  = [&quot;, train_imgs_flat.shape , &quot;]&quot;)
</code></pre>
<h3 id="73--gan-implementing-a-gan">7.3 实施 GAN Implementing a GAN</h3><pre class="language-py lang-py"><code class="language-py lang-py"># -*- coding: utf-8 -*-
import torch
import torch.optim as optim
import torch.nn as nn

lrelu = nn.LeakyReLU(0.2)

class Network():
    
    def backward_pass(self, loss):
        # Performs back propagation and computes gradients
        # With PyTorch, we do not need to compute gradients analytically for parameters were requires_grads=True, 
        # Calling loss.backward(), torch&#x27;s Autograd automatically computes grads of loss wrt each parameter p,...
        # ... and **puts them in p.grad**. Return them in a list.
        loss.backward()
        grads = [param.grad for param in self.params]
        return grads
    
    
class Generator(Network):
    def __init__(self, rng, D_z, D_hid1, D_hid2, D_data):
        self.D_z = D_z  # Keep track of it, we may need it.
        # Initialize weight matrices
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        # -- First 2 hidden layers, randomly initialized
        w1_init = rng.normal(loc=0.0, scale=np.sqrt(2./(D_z * D_hid1)), size=(D_z + 1, D_hid1))
        w2_init = rng.normal(loc=0.0, scale=np.sqrt(2./(D_hid1 * D_hid2)), size=(D_hid1 + 1, D_hid2))
        # -- Output layer, predicting p(real|x)
        wout_init = rng.normal(loc=0.0, scale=np.sqrt(2./(D_hid2 * D_data)), size=(D_hid2 + 1, D_data))

        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the Pytorch weights.
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=True)
        wout = torch.tensor(wout_init, dtype=torch.float, requires_grad=True)
        
        # Keep track of all trainable parameters:
        self.params = [w1, w2, wout]
        
        
    def forward(self, batch_z):
        # batch_z: numpy array or pytorch tensor, shape [N, D_z] (dimensionality of Z)
        [w1, w2, wout] = self.params
        # make numpy to pytorch tensor
        batch_z_t = torch.tensor(batch_z, dtype=torch.float) if type(batch_z) is np.ndarray else batch_z
        # add 1 element for bias
        unary_feature_for_bias = torch.ones(size=(batch_z_t.shape[0], 1))  # [N, 1] column vector.
        
        # ========== TODO: Fill in the gaps ========
        # hidden layer:
        z_ext = torch.cat((batch_z_t, unary_feature_for_bias), dim=1)
        h1_preact = z_ext.mm(w1)
        h1_act = lrelu(h1_preact)  
        # l2
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w2)
        h2_act = lrelu(h2_preact)
        # output layer.
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        hout_preact = h2_ext.mm(wout)
        hout_act = torch.tanh(hout_preact)
        # ==========================================
        
        # Output
        x_generated = hout_act  # [N_samples, dimensionality of data]
        
        return x_generated
                        
        
class Discriminator(Network):
    def __init__(self, rng, D_data, D_hid1, D_hid2):
        # Initialize weight matrices
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        # -- 2 hidden layers, randomly initialized
        w1_init = rng.normal(loc=0.0, scale=np.sqrt(2. / (D_data * D_hid1)), size=(D_data + 1, D_hid1))
        w2_init = rng.normal(loc=0.0, scale=np.sqrt(2. / (D_hid1 * D_hid2)), size=(D_hid1 + 1, D_hid2))
        # -- Output layer, predicting p(real|x)
        wout_init = rng.normal(loc=0.0, scale=np.sqrt(2. / D_hid2), size=(D_hid2 + 1, 1))
        
        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the Pytorch weights.
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=True)
        wout = torch.tensor(wout_init, dtype=torch.float, requires_grad=True)
        
        # Keep track of all trainable parameters:
        self.params = [w1, w2, wout]
        
        
    def forward(self, batch_x):
        # batch_x: numpy array or pytorch tensor, shape [N, dimensionality of data]
        [w1, w2, wout] = self.params
        # make numpy to pytorch tensor
        batch_x_t = torch.tensor(batch_x, dtype=torch.float) if type(batch_x) is np.ndarray else batch_x
        # Add 1 element for bias
        unary_feature_for_bias = torch.ones(size=(batch_x_t.shape[0], 1)) # [N, 1] column vector.
        
        # ========== TODO: Fill in the gaps ========
        # hidden layer:
        x_ext = torch.cat((batch_x_t, unary_feature_for_bias), dim=1)
        h1_preact = x_ext.mm(w1)
        h1_act = lrelu(h1_preact)
        # layer 2
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w2)
        h2_act = lrelu(h2_preact)
        # output layer.
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        hout_preact = h2_ext.mm(wout)
        hout_act = torch.sigmoid(hout_preact)
        # ===========================================
        
        # Output
        p_real = hout_act
        
        return p_real
    

def generator_loss_practical(p_generated_x_is_real):
    # p_generated_x_is_real: Tensor, [number of samples, 1]. Predicted probability D(G(z)) that fake data are real.
    
    ######## TODO: Complete the gap ###########
    loss_per_sample = - torch.log(p_generated_x_is_real)
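    # This is the practical, non-saturating generator loss: minimizing -log D(G(z)) rather than
    # log(1 - D(G(z))). It gives stronger gradients early in training, when D easily rejects fakes.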
    ###########################################
    expected_loss = torch.mean(loss_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return expected_loss


def discriminator_loss(p_real_x_is_real, p_generated_x_is_real):
    # p_real_x_is_real: [N] Predicted probability D(x) for x~training_data that real data are real. 
    # p_generated_x_is_real: [N]. Predicted probability D(x) for x=G(z) where z~N(0,I) that fake data are real.
    
    ######## TODO: Complete the calculation of the Discriminator loss ###########
    loss_per_real_x = - torch.log(p_real_x_is_real)
    exp_loss_reals = torch.mean(loss_per_real_x)
    
    loss_per_fake_x = - torch.log(1 - p_generated_x_is_real)
    exp_loss_fakes = torch.mean(loss_per_fake_x)
    ##########################################################################################
    
    total_loss = exp_loss_reals + exp_loss_fakes  # Expectation of loss: Mean over samples (axis=0).
    return total_loss

</code></pre>
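<p>Note that the discriminator objective above is simply binary cross-entropy with target 1 for real inputs and target 0 for generated ones. A small check against torch.nn.functional.binary_cross_entropy (illustration only; it assumes the discriminator_loss defined above is in scope):</p><pre class="language-py lang-py"><code class="language-py lang-py">import torch
import torch.nn.functional as F

p_real = torch.tensor([[0.9], [0.7]])  # D(x) for two real samples
p_fake = torch.tensor([[0.2], [0.4]])  # D(G(z)) for two generated samples

bce = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) \
      + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
print(torch.isclose(discriminator_loss(p_real, p_fake), bce))  # tensor(True)
</code></pre>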
<h3 id="74--gan--implement-unsupervised-training-of-a-gan">7.4 实施 GAN 的无监督训练 Implement unsupervised training of a GAN</h3><pre class="language-py lang-py"><code class="language-py lang-py">from utils.plotting import plot_train_progress_GAN, plot_grids_of_images  # Use out of the box


def get_batch_reals(train_imgs, train_lbls, batch_size, rng):
    # train_imgs: Images. Numpy array of shape [N, H * W]
    # train_lbls: Labels of images. None, or Numpy array of shape [N, C_classes], one hot label for each image.
    # batch_size: integer. Size that the batch should have.
    
    # Sample batch_size random indices from the training set:
    indices = rng.randint(low=0, high=train_imgs.shape[0], size=batch_size, dtype=&#x27;int32&#x27;)
    
    train_imgs_batch = train_imgs[indices]
    if train_lbls is not None:  # Enables function to be used both for supervised and unsupervised learning
        train_lbls_batch = train_lbls[indices]
    else:
        train_lbls_batch = None
    return [train_imgs_batch, train_lbls_batch]



def unsupervised_training_GAN(generator,
                              discriminator,
                              loss_func_g,
                              loss_func_d,
                              rng,
                              train_imgs_all,
                              batch_size_g,
                              batch_size_d_fakes,
                              batch_size_d_reals,
                              learning_rate_g,
                              learning_rate_d,
                              total_iters_g,
                              inner_iters_d,
                              iters_per_gen_plot=-1):
    # generator: Instance of a Generator.
    # discriminator: Instance of a Discriminator.
    # loss_func_g: Loss functions of G
    # loss_func_d: Loss functions of D
    # rng: numpy random number generator
    # train_imgs_all: All the training images. Numpy array, shape [N_tr, H, W]
    # batch_size_g: Size of the batch for G when it is its turn to get updated.
    # batch_size_d_fakes: Size of batch of fake samples for D when it is its turn to get updated.
    # batch_size_d_reals: Size of batch of real samples for D when it is its turn to get updated.
    # learning_rate_g: Learning rate for G.
    # learning_rate_d: learning rate for D.
    # total_iters_g: how many SGD iterations to perform for G in total (outer loop).
    # inner_iters_d: how many SGD iterations to perform for D before every 1 SGD iteration of G.
    # iters_per_gen_plot: Integer. Every that many iterations the model generates few examples and we plot them.
    loss_g_to_plot = []
    loss_d_to_plot = []
    loss_g_mom_to_plot = []
    loss_d_mom_to_plot = []
    loss_g_mom = None
    loss_d_mom = None
    
    optimizer_g = optim.Adam(generator.params, lr=learning_rate_g, betas=[0.5, 0.999], eps=1e-07, weight_decay=0)  # Will use PyTorch&#x27;s Adam optimizer out of the box
    optimizer_d = optim.Adam(discriminator.params, lr=learning_rate_d, betas=[0.5, 0.99], eps=1e-07, weight_decay=0)  # Will use PyTorch&#x27;s Adam optimizer out of the box
    
    for t in range(total_iters_g):
        
        for k in range(inner_iters_d):
            # Train Discriminator for inner_iters_d SGD iterations...
            
            ################## TODO: Fill in the gaps #######################
            # Generate Fake samples with G
            z_batch = np.random.normal(loc=0., scale=1., size=[batch_size_d_fakes, generator.D_z])
            x_gen_batch = generator.forward(z_batch)
            # Forward pass of fake samples through D
            p_gen_x_are_real = discriminator.forward(x_gen_batch)
            
            # Forward pass of real samples through D
            x_reals_batch, _ = get_batch_reals(train_imgs_all, None, batch_size_d_reals, rng)
            p_real_x_are_real = discriminator.forward(x_reals_batch)
            
            # Compute D loss:
            loss_d = loss_func_d(p_real_x_are_real, p_gen_x_are_real)
            ####################################################################
            
            # Backprop to D
            optimizer_d.zero_grad()
            _ = discriminator.backward_pass(loss_d)
            optimizer_d.step()
            
        ############## Train Generator for 1 SGD iteration ############
        
        ########## TODO: Fill in the gaps ##################################
        # Generate Fake samples with G
        z_batch = np.random.normal(loc=0., scale=1., size=[batch_size_g, generator.D_z])
        x_gen_batch = generator.forward(z_batch)
        # Forward pass of fake samples through D
        p_gen_x_are_real = discriminator.forward(x_gen_batch)
        ####################################################################
        
        # Compute G loss:
        loss_g = loss_func_g(p_gen_x_are_real)
        
        # Backprop to G
        optimizer_g.zero_grad()
        _ = generator.backward_pass(loss_g)
        optimizer_g.step()
        
        # ==== Report training loss and accuracy ======
        loss_g_np = loss_g if isinstance(loss_g, float) else loss_g.item()
        loss_d_np = loss_d if isinstance(loss_d, float) else loss_d.item()
        if t % 10 == 0:  # Print every 10 iterations
            print(&quot;[iter:&quot;, t, &quot;]: Loss G: {0:.2f}&quot;.format(loss_g_np), &quot; Loss D: {0:.2f}&quot;.format(loss_d_np))

        loss_g_mom = loss_g_np if loss_g_mom is None else loss_g_mom * 0.9 + 0.1 * loss_g_np
        loss_d_mom = loss_d_np if loss_d_mom is None else loss_d_mom * 0.9 + 0.1 * loss_d_np

        loss_g_to_plot.append(loss_g_np)
        loss_d_to_plot.append(loss_d_np)
        loss_g_mom_to_plot.append(loss_g_mom)
        loss_d_mom_to_plot.append(loss_d_mom)
        
        # =============== Every few iterations, plot loss ================#
        if t == total_iters_g - 1 or (iters_per_gen_plot &gt; 0 and t % iters_per_gen_plot == 0):
            
            ########## TODO: Fill in the gaps #############################
            # Generate Fake samples with G
            n_samples_to_gen = 100
            z_plot = np.random.normal(loc=0., scale=1., size=[n_samples_to_gen, generator.D_z])
            x_gen_plot = generator.forward(z_plot)
            # Cast tensors to numpy arrays
            x_gen_plot_np = x_gen_plot if type(x_gen_plot) is np.ndarray else x_gen_plot.detach().numpy()
            ###############################################################
            
            # Generated images have vector shape. Reshape them to original image shape.
            x_gen_plot_resh = x_gen_plot_np.reshape([n_samples_to_gen, H_height, W_width])
            
            train_imgs_resh = train_imgs_all.reshape([train_imgs_all.shape[0], H_height, W_width])
            
            
            # Plot a few generated images.
            plot_grids_of_images([x_gen_plot_resh[0:100], train_imgs_resh[0:100]],
                                  titles=[&quot;Generated&quot;, &quot;Real&quot;],
                                  n_imgs_per_row=10,
                                  dynamically=True)
            
    # In the end of the process, plot loss.
    plot_train_progress_GAN(loss_g_to_plot, loss_d_to_plot,
                            loss_g_mom_to_plot, loss_d_mom_to_plot,
                            iters_per_point=1, y_lims=[3., 3.])
    
</code></pre>
<h3 id="75--gan-instantiate-and-train-your-gan">7.5 实例化并训练您的 GAN Instantiate and Train your GAN</h3><pre class="language-py lang-py"><code class="language-py lang-py"># Create the network
rng = np.random.RandomState(seed=SEED)
generator = Generator(rng=rng,
                      D_z=128,
                      D_hid1=256,
                      D_hid2=512,
                      D_data=H_height*W_width)
discriminator = Discriminator(rng=rng,
                              D_data=H_height*W_width,
                              D_hid1=256,
                              D_hid2=512)

# Start training
unsupervised_training_GAN(generator,
                          discriminator,
                          loss_func_g=generator_loss_practical,
                          loss_func_d=discriminator_loss,
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size_g=32,
                          batch_size_d_fakes=64,
                          batch_size_d_reals=64,
                          learning_rate_g=1e-3,
                          learning_rate_d=1e-3,
                          total_iters_g=5000,
                          inner_iters_d=1,
                          iters_per_gen_plot=100)

</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/3jmijx7dav20pp529p.png" height="295" width="557"/></p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/6rjxakg7liu6nuuzvq.png" height="464" width="565"/></p><h3 id="76--gan--generate-new-images-using-your-gan">7.6 使用 GAN 生成新图像 Generate new images using your GAN</h3><pre class="language-py lang-py"><code class="language-py lang-py">def synthesize(generator, n_samples):
    
        # Generate Fake samples with G
        z_plot = np.random.normal(loc=0., scale=1., size=[n_samples, generator.D_z])
        x_gen_plot = generator.forward(z_plot)
        # Cast tensors to numpy arrays
        x_gen_plot_np = x_gen_plot if type(x_gen_plot) is np.ndarray else x_gen_plot.detach().numpy()

        # Generated images have vector shape. Reshape them to original image shape.
        x_gen_plot_resh = x_gen_plot_np.reshape([n_samples, H_height, W_width])

        for i in range(n_samples):
            plot_image(x_gen_plot_resh[i])
            
synthesize(generator, 100)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/prui72vq9sffh7srdc.png" height="424" width="524"/></p>
<h2 id="8--recurrent-neural-networksrnn">8. 循环神经网络 Recurrent Neural Networks(RNN)</h2><h3 id="81--rnn-">8.1 示例一：使用 RNN 生成句子中的下一个单词</h3><h4 id="811-">8.1.1 导入库</h4><pre class="language-py lang-py"><code class="language-py lang-py">import torch
from torch import nn
import numpy as np
</code></pre>
<h4 id="812--data-generation">8.1.2 数据生成 Data Generation</h4><pre class="language-py lang-py"><code class="language-py lang-py">text = [&#x27;hey we are teaching deep learning&#x27;,&#x27;hey how are you&#x27;, &#x27;have a nice day&#x27;, &#x27;nice to meet you&#x27;]

# Join all the sentences together and extract the unique characters from the combined sentences
# 筛选出句子中出现的字母
chars = set(&#x27;&#x27;.join(text))

# Creating a dictionary that maps integers to the characters
# 一个int 转 char的字典
int2char = dict(enumerate(chars))

# Creating another dictionary that maps characters to integers
# 一个char 转 int的字典
char2int = {char: ind for ind, char in int2char.items()}
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">char2int
&#x27;&#x27;&#x27;
{&#x27;i&#x27;: 0,
 &#x27;e&#x27;: 1,
 &#x27;o&#x27;: 2,
 &#x27;v&#x27;: 3,
 &#x27;p&#x27;: 4,
 &#x27;t&#x27;: 5,
 &#x27;a&#x27;: 6,
 &#x27;r&#x27;: 7,
 &#x27;d&#x27;: 8,
 &#x27; &#x27;: 9,
 &#x27;u&#x27;: 10,
 &#x27;y&#x27;: 11,
 &#x27;m&#x27;: 12,
 &#x27;c&#x27;: 13,
 &#x27;g&#x27;: 14,
 &#x27;l&#x27;: 15,
 &#x27;h&#x27;: 16,
 &#x27;n&#x27;: 17,
 &#x27;w&#x27;: 18}
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Finding the length of the longest string in our data
maxlen = len(max(text, key=len))

# Padding
# 把所有句子padding为maxlen+1的长度
# maxlen+=1 means adding a &#x27; &#x27; at each sentence, which helps to predict last word of sentences
maxlen+=1
# A simple loop that loops through the list of sentences and adds a &#x27; &#x27; whitespace until the length of
# the sentence matches the length of the longest sentence
for i in range(len(text)):
  while len(text[i])&lt;maxlen:
      text[i] += &#x27; &#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Creating lists that will hold our input and target sequences
input_seq = []
target_seq = []

for i in range(len(text)):
    # Remove last character for input sequence
    input_seq.append(text[i][:-1])

    # Remove first character for target sequence
    target_seq.append(text[i][1:])
    print(&quot;Input Sequence: {}\nTarget Sequence: {}&quot;.format(input_seq[i], target_seq[i]))
</code></pre><blockquote>
<p>注意：这里其实有点问题——句子已经被 padding 到 maxlen，所以 <code>text[i][:-1]</code> 去掉的只是末尾补上的空格，而不是句子本身的最后一个字符。若不做 padding，期望得到的结果应为:</p><ul><li>Input Sequence: hey how are yo</li><li>Target Sequence: ey how are you</li></ul></blockquote>
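<p>一种可能的修正思路（示例代码，非原笔记实现，变量名均为假设）：先在未 padding 的句子上做去头/去尾，再分别右侧补空格到统一长度，这样去掉的才是句子真正的首/尾字符。</p><pre class="language-py lang-py"><code class="language-py lang-py"># 示例：一种可能的修正写法（假设 raw_text 为未 padding 的原始句子）
raw_text = [&#x27;hey we are teaching deep learning&#x27;, &#x27;hey how are you&#x27;,
            &#x27;have a nice day&#x27;, &#x27;nice to meet you&#x27;]
maxlen_fix = max(len(s) for s in raw_text)

input_seq_fix, target_seq_fix = [], []
for s in raw_text:
    # 先去掉最后/第一个字符, 再右侧补空格到统一长度
    input_seq_fix.append(s[:-1].ljust(maxlen_fix - 1))
    target_seq_fix.append(s[1:].ljust(maxlen_fix - 1))

print(input_seq_fix[1])   # hey how are yo（后面跟若干空格）
print(target_seq_fix[1])  # ey how are you（后面跟若干空格）
</code></pre>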
<pre class="language-py lang-py"><code class="language-py lang-py">for i in range(len(text)):
    input_seq[i] = [char2int[character] for character in input_seq[i]]
    target_seq[i] = [char2int[character] for character in target_seq[i]]
</code></pre>
<h4 id="813--one-hot-encoding">8.1.3 独热编码 One-Hot Encoding</h4><pre class="language-py lang-py"><code class="language-py lang-py">dict_size = len(char2int)
seq_len = maxlen - 1
batch_size = len(text)

def one_hot_encode(sequence, dict_size, seq_len, batch_size):
    # Creating a multi-dimensional array of zeros with the desired output shape
    # 创造一个多维初始化为0的数组
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    # 替换掉对应位置为1
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Input shape --&gt; (Batch Size, Sequence Length, One-Hot Encoding Size)
input_seq = one_hot_encode(input_seq, dict_size, seq_len, batch_size)
print(input_seq.shape)

&#x27;&#x27;&#x27;
(4, 34, 19)
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">input_seq = torch.from_numpy(input_seq)
target_seq = torch.Tensor(target_seq)
</code></pre>
<h4 id="814-rnn-defining-rnn-model">8.1.4 定义RNN模型 Defining RNN Model</h4><pre class="language-py lang-py"><code class="language-py lang-py">class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)   
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    def forward(self, x):
        
        batch_size = x.size(0)

        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        # 将输入和隐藏状态传递给 RNN 层，获取输出和更新后的隐藏状态
        out, hidden = self.rnn(x, hidden)
        
        # Reshaping the outputs such that it can be fit into the fully connected layer
        # 将 RNN 输出的形状重新调整，以便能够输入到全连接层
        out = out.contiguous().view(-1, self.hidden_dim)
        # 通过全连接层获取最终输出
        out = self.fc(out)
        
        return out, hidden
    
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we&#x27;ll use in the forward pass
        # We&#x27;ll send the tensor holding the hidden state to the device we specified earlier as well
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
        return hidden
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># Instantiate the model with hyperparameters
model = Model(input_size=dict_size, output_size=dict_size, hidden_dim=12, n_layers=1)


# Define hyperparameters
n_epochs = 200
lr=0.01

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
</code></pre>
<h4 id="815-rnn-testing-rnn">8.1.5 测试RNN Testing RNN</h4><pre class="language-py lang-py"><code class="language-py lang-py"># This function takes in the model and character as arguments and returns the next character prediction and hidden state
def predict(model, character):
    # One-hot encoding our input to fit into the model
    character = np.array([[char2int[c] for c in character]])
    character = one_hot_encode(character, dict_size, character.shape[1], 1)
    character = torch.from_numpy(character)
    
    
    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1], dim=0).data
    # Taking the class with the highest probability score from the output
    char_ind = torch.max(prob, dim=0)[1].item()

    return int2char[char_ind], hidden
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">def sample(model, start=&#x27;hey&#x27;):
    model.eval() # eval mode
    start = start.lower()
    # First off, run through the starting characters
    chars = [ch for ch in start]
    size = maxlen
    # Now pass in the previous characters and get a new one
    c=0
    for ii in range(size):
        char, h = predict(model, chars)
        c+=1
        if char==&#x27; &#x27; and c&gt;1:
            break
        chars.append(char)

    return &#x27;&#x27;.join(chars)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">sample(model, &#x27;hey we are teaching deep&#x27;)
&#x27;&#x27;&#x27;
&#x27;hey we are teaching deep learning&#x27;
&#x27;&#x27;&#x27;

sample(model, &#x27;hey how are&#x27;)
&#x27;&#x27;&#x27;
&#x27;hey how are you&#x27;
&#x27;&#x27;&#x27;

sample(model, &#x27;nice to meet&#x27;)
&#x27;&#x27;&#x27;
&#x27;nice to meet you&#x27;
&#x27;&#x27;&#x27;

sample(model, &#x27;have a nice&#x27;)
&#x27;&#x27;&#x27;
&#x27;have a nice day&#x27;
&#x27;&#x27;&#x27;
</code></pre>
<h3 id="82-rnn-example-two-sentiment-analysis-with-an-rnn">8.2 示例二：用RNN进行情感分析 Example Two: Sentiment analysis with an RNN</h3><h4 id="821-">8.2.1 导入库</h4><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np
from string import punctuation
from collections import Counter
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
</code></pre>
<h4 id="822--load-in-and-visualize-the-data">8.2.2 加载并且可视化数据 Load in and visualize the data</h4><pre class="language-py lang-py"><code class="language-py lang-py"># read data from text files
with open(&#x27;data/reviews.txt&#x27;, &#x27;r&#x27;) as f:
    reviews = f.read()
with open(&#x27;data/labels.txt&#x27;, &#x27;r&#x27;) as f:
    labels = f.read()
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">print(reviews[:100])
print()
print(labels[:20])
&#x27;&#x27;&#x27;
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life

positive
negative
po
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="823--data-pre-procesing">8.2.3 数据预处理 Data pre-processing</h4><pre class="language-py lang-py"><code class="language-py lang-py"># get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = &#x27;&#x27;.join([c for c in reviews if c not in punctuation])


# split by new lines and spaces
reviews_split = all_text.split(&#x27;\n&#x27;)
all_text = &#x27; &#x27;.join(reviews_split)

# create a list of words
words = all_text.split()
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">all_text[:40]
&#x27;&#x27;&#x27;
&#x27;bromwell high is a cartoon comedy  it ra&#x27;
&#x27;&#x27;&#x27;

words[:30]
&#x27;&#x27;&#x27;
[&#x27;bromwell&#x27;,
 &#x27;high&#x27;,
 &#x27;is&#x27;,
 &#x27;a&#x27;,
 &#x27;cartoon&#x27;,
 &#x27;comedy&#x27;,
 &#x27;it&#x27;,
 &#x27;ran&#x27;,
 &#x27;at&#x27;,
 &#x27;the&#x27;,
 &#x27;same&#x27;,
 &#x27;time&#x27;,
 &#x27;as&#x27;,
 &#x27;some&#x27;,
 &#x27;other&#x27;,
 &#x27;programs&#x27;,
 &#x27;about&#x27;,
 &#x27;school&#x27;,
 &#x27;life&#x27;,
 &#x27;such&#x27;,
 &#x27;as&#x27;,
 &#x27;teachers&#x27;,
 &#x27;my&#x27;,
 &#x27;years&#x27;,
 &#x27;in&#x27;,
 &#x27;the&#x27;,
 &#x27;teaching&#x27;,
 &#x27;profession&#x27;,
 &#x27;lead&#x27;,
 &#x27;me&#x27;]
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="824--encoding-the-words">8.2.4 编码单词 Encoding the words</h4><pre class="language-py lang-py"><code class="language-py lang-py">## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># stats about vocabulary
print(&#x27;Unique words: &#x27;, len((vocab_to_int))) 
print(&#x27;Original review: &#x27;, reviews_split[1])
print()

# print tokens in first review
print(&#x27;Tokenized review: \n&#x27;, reviews_ints[:1])
&#x27;&#x27;&#x27;
Unique words:  74072
Original review:  story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly   

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="825--encoding-the-labels">8.2.5 编码标签 Encoding the labels</h4><pre class="language-py lang-py"><code class="language-py lang-py"># 1=positive, 0=negative label conversion
labels_split = labels.split(&#x27;\n&#x27;)
encoded_labels = np.array([1 if label == &#x27;positive&#x27; else 0 for label in labels_split])
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print(&quot;Zero-length reviews: {}&quot;.format(review_lens[0]))
print(&quot;Maximum review length: {}&quot;.format(max(review_lens)))
&#x27;&#x27;&#x27;
Zero-length reviews: 1
Maximum review length: 2514
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">print(&#x27;Number of reviews before removing outliers: &#x27;, len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print(&#x27;Number of reviews after removing outliers: &#x27;, len(reviews_ints))
&#x27;&#x27;&#x27;
Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">seq_length = 200

# getting the correct rows x cols shape
features = np.zeros((len(reviews_ints), seq_length), dtype=int)

# for each review, take up to seq_length tokens and right-align them (left pad with zeros)
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_length]


## test statements - do not change - ##
assert len(features)==len(reviews_ints), &quot;Your features should have as many rows as reviews.&quot;
assert len(features[0])==seq_length, &quot;Each feature row should contain seq_length values.&quot;

# print first 10 values of the first 30 batches 
print(features[:30,:10])

&#x27;&#x27;&#x27;
[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   307 10399  2069  1565  6202  6528  3288 17946 10628]
 [    0     0     0     0     0     0     0     0     0     0]
 [   21   122  2069  1565   515  8181    88     6  1325  1182]
 [    1    20     6    76    40     6    58    81    95     5]
 [   54    10    84   329 26230 46427    63    10    14   614]
 [   11    20     6    30  1436 32317  3769   690 15100     6]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   40    26   109 17952  1422     9     1   327     4   125]
...
 [   10   499     1   307 10399    55    74     8    13    30]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]]
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="826--training-validation-test-dataloaders-and-batching">8.2.6 训练，验证，测试数据加载器和批次 Training, Validation, Test DataLoaders and Batching</h4><pre class="language-py lang-py"><code class="language-py lang-py">split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print(&quot;\t\t\tFeature Shapes:&quot;)
print(&quot;Train set: \t\t{}&quot;.format(train_x.shape), 
      &quot;\nValidation set: \t{}&quot;.format(val_x.shape),
      &quot;\nTest set: \t\t{}&quot;.format(test_x.shape))

&#x27;&#x27;&#x27;
            Feature Shapes:
Train set:         (20000, 200) 
Validation set:     (2500, 200) 
Test set:         (2500, 200)
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure the SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = dataiter.__next__()

print(&#x27;Sample input size: &#x27;, sample_x.size()) # batch_size, seq_length
print(&#x27;Sample input: \n&#x27;, sample_x)
print()
print(&#x27;Sample label size: &#x27;, sample_y.size()) # batch_size
print(&#x27;Sample label: \n&#x27;, sample_y)

&#x27;&#x27;&#x27;
Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   0,    0,    0,  ...,   76,  771,  243],
        [   0,    0,    0,  ...,   10,  377,    8],
        [  84,  123,   10,  ..., 8505,  509,    1],
        ...,
        [ 596,  251,   36,  ...,   11,   18,   32],
        [   0,    0,    0,  ...,  104,   22,  261],
        [   0,    0,    0,  ...,    3,  246,  816]], dtype=torch.int32)

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
        0, 0], dtype=torch.int32)
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="827-rnn-sentiment-network-with-an-rnn">8.2.7 RNN的情感网络 Sentiment Network with An RNN</h4><pre class="language-py lang-py"><code class="language-py lang-py"># First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print(&#x27;Training on GPU.&#x27;)
else:
    print(&#x27;No GPU available, training on CPU.&#x27;)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">
class SentimentRNN(nn.Module):
    &quot;&quot;&quot;
    The RNN model that will be used to perform Sentiment analysis.
    &quot;&quot;&quot;

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        &quot;&quot;&quot;
        Initialize the model by setting up the layers.
        &quot;&quot;&quot;
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            batch_first=True)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        &quot;&quot;&quot;
        Perform a forward pass of our model on some input and hidden state.
        &quot;&quot;&quot;
        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        lstm_out = lstm_out[:, -1, :] # getting the last time step output
        
        # fully-connected layer
        out = self.fc(lstm_out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        &#x27;&#x27;&#x27; Initializes hidden state &#x27;&#x27;&#x27;
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        
</code></pre>
<h4 id="828--instantiate-the-network">8.2.8 实例化网络 Instantiate the network</h4><pre class="language-py lang-py"><code class="language-py lang-py"># Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 1

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)
&#x27;&#x27;&#x27;
SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, batch_first=True)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="829--training">8.2.9 训练 Training</h4><pre class="language-py lang-py"><code class="language-py lang-py"># loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):

    # batch loop
    for inputs, labels in train_loader:
        counter += 1
        
        # initialize hidden state
        h = net.init_hidden(inputs.size(0))
        

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we&#x27;d backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            
            # Get validation loss
            
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                val_h = net.init_hidden(inputs.size(0))

                # Creating new variables for the hidden state, otherwise
                # we&#x27;d backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print(&quot;Epoch: {}/{}...&quot;.format(e+1, epochs),
                  &quot;Step: {}...&quot;.format(counter),
                  &quot;Loss: {:.6f}...&quot;.format(loss.item()),
                  &quot;Val Loss: {:.6f}&quot;.format(np.mean(val_losses)))

&#x27;&#x27;&#x27;
Epoch: 1/4... Step: 100... Loss: 0.725903... Val Loss: 0.657001
Epoch: 1/4... Step: 200... Loss: 0.647649... Val Loss: 0.649977
Epoch: 1/4... Step: 300... Loss: 0.581160... Val Loss: 0.598053
Epoch: 1/4... Step: 400... Loss: 0.554029... Val Loss: 0.623836
Epoch: 2/4... Step: 500... Loss: 0.436622... Val Loss: 0.632159
Epoch: 2/4... Step: 600... Loss: 0.512016... Val Loss: 0.540165
Epoch: 2/4... Step: 700... Loss: 0.523522... Val Loss: 0.556815
Epoch: 2/4... Step: 800... Loss: 0.448128... Val Loss: 0.539433
Epoch: 3/4... Step: 900... Loss: 0.272073... Val Loss: 0.514724
Epoch: 3/4... Step: 1000... Loss: 0.324500... Val Loss: 0.496553
Epoch: 3/4... Step: 1100... Loss: 0.419444... Val Loss: 0.493266
Epoch: 3/4... Step: 1200... Loss: 0.285082... Val Loss: 0.524629
Epoch: 4/4... Step: 1300... Loss: 0.134475... Val Loss: 0.490992
Epoch: 4/4... Step: 1400... Loss: 0.174407... Val Loss: 0.511377
Epoch: 4/4... Step: 1500... Loss: 0.170249... Val Loss: 0.534627
Epoch: 4/4... Step: 1600... Loss: 0.176381... Val Loss: 0.495451
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="8210--testing">8.2.10 测试 Testing</h4><pre class="language-py lang-py"><code class="language-py lang-py"># Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0



net.eval()
# iterate over test data

for inputs, labels in test_loader:

    # init hidden state for this batch (fresh zeros, so there is no training history to backprop through)
    h = net.init_hidden(inputs.size(0))

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()

    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print(&quot;Test loss: {:.3f}&quot;.format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print(&quot;Test accuracy: {:.3f}&quot;.format(test_acc))
&#x27;&#x27;&#x27;
Test loss: 0.488
Test accuracy: 0.796
&#x27;&#x27;&#x27;
</code></pre>
<h4 id="8211--inference-on-a-test-review">8.2.11 对测试评论的推断 Inference on a test review</h4><pre class="language-py lang-py"><code class="language-py lang-py"># negative test review
test_review_neg = &#x27;The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = &#x27;&#x27;.join([c for c in test_review if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens
    test_ints = []
    test_ints.append([vocab_to_int.get(word, 0) for word in test_words])

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)
&#x27;&#x27;&#x27;
[[1, 247, 18, 10, 28, 108, 113, 14, 388, 2, 10, 181, 60, 273, 144, 11, 18, 68, 76, 113, 2, 1, 410, 14, 539]]
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># test sequence padding
seq_length=200
features = np.zeros((len(test_ints), seq_length), dtype=int)

#For reviews shorter than seq_length words, left pad with 0s. For reviews longer than seq_length, use only the first seq_length words as the feature vector.
for i, row in enumerate(test_ints):
    features[i, -len(row):] = np.array(row)[:seq_length]

print(features)
&#x27;&#x27;&#x27;
[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   1 247  18  10  28
  108 113  14 388   2  10 181  60 273 144  11  18  68  76 113   2   1 410
   14 539]]
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">#test conversion to tensor and pass into your model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())
&#x27;&#x27;&#x27;
torch.Size([1, 200])
&#x27;&#x27;&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">def predict(net, test_review, sequence_length=200):
    
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    
    features = np.zeros((len(test_ints), seq_length), dtype=int)
    # For reviews shorter than seq_length words, left pad with 0s. For reviews longer than seq_length, use only the first seq_length words as the feature vector.
    for i, row in enumerate(test_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print(&#x27;Prediction value, pre-rounding: {:.6f}&#x27;.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print(&quot;Positive review detected!&quot;)
    else:
        print(&quot;Negative review detected.&quot;)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># positive test review
test_review_pos = &#x27;This movie had the best acting and the dialogue was so good. I loved it.&#x27;
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py"># call function
seq_length=200 # good to use the length that was trained on

predict(net, test_review_pos, seq_length)
&#x27;&#x27;&#x27;
Prediction value, pre-rounding: 0.989194
Positive review detected!
&#x27;&#x27;&#x27;
</code></pre>
<h2 id="9-attention">9. 注意力(Attention)</h2><h3 id="91-">9.1 导入库</h3><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np
import matplotlib.pyplot as plt  # 后面的绘图代码会用到 plt
</code></pre>
<h3 id="92-data-initialization">9.2 Data Initialization</h3><pre class="language-py lang-py"><code class="language-py lang-py">def make_data(keyLength = 8, valueLength = 16, items = 32, seed = 42, numItems = 64):
    np.random.seed(seed)
    keys = []
    values = []
    queries = []

    for i in range(numItems):
        if i%8 == 0:
            baseKeyQuery = np.random.randn(keyLength)*0.5 
            baseValue = np.random.rand(valueLength)*1 -0.5
        key = baseKeyQuery + np.random.randn(keyLength)*0.2
        query = baseKeyQuery + np.random.randn(keyLength)*0.2
        value = baseValue + np.random.rand(valueLength)*5 -2.5
        keys.append(key)
        queries.append(query)
        values.append(value)
    return keys,values,queries
    
    
keys, values, queries = make_data(keyLength = 8, valueLength = 16, items = 32, seed = 42, numItems = 64)
</code></pre>
<h3 id="93-implement-attention-for-single-query">9.3 Implement attention for single query</h3><pre class="language-py lang-py"><code class="language-py lang-py">def attentionQuery(query, keys, values):
    
    attention = []
    norm = np.sqrt(len(keys[0]))
    
    for k in keys:
        
        a = (query*k).sum() / norm
        a = np.exp(a)
        attention.append(a)
    attention = np.array(attention)
    attention /= attention.sum() 

    result = np.zeros(len(values[0]))
    for a,v in zip(attention,values):
        result = result + a*v
        
    return attention, result
</code></pre>
<h3 id="94-apply-the-function-and-plot-the-results">9.4 Apply the function and plot the results</h3><pre class="language-py lang-py"><code class="language-py lang-py">att, result =  attentionQuery(queries[0], keys, values)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">plt.bar(x = np.arange(len(att)), height = att)
plt.xlabel(&#x27;Key-index&#x27;)
plt.ylabel(&#x27;Attention-score&#x27;)
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/9k52ibxora5nzb9ngs.png" height="435" width="584"/></p><pre class="language-py lang-py"><code class="language-py lang-py">plt.bar(x = np.arange(len(result)), height = result)
plt.xlabel(&#x27;Element-index&#x27;)
plt.ylabel(&#x27;Element-value&#x27;)

&#x27;&#x27;&#x27;
Text(0, 0.5, &#x27;Element-value&#x27;)
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/lk7jooim3sakiotc7f.png" height="432" width="582"/></p><h3 id="95-matrix-based-implementation">9.5 Matrix-based implementation</h3><pre class="language-py lang-py"><code class="language-py lang-py">keys_mat = np.array(keys)
values_mat = np.array(values)
queries_mat = np.array(queries)
</code></pre>
<pre class="language-py lang-py"><code class="language-py lang-py">def attentionQueryMatrix(queries, keys, values):
    norm = np.sqrt(len(keys[0]))
    attention = np.matmul(queries,keys.transpose())/norm
    attention = np.exp(attention)
    attention = attention/attention.sum(axis = 1, keepdims=True)
    return attention, np.matmul(attention,values)
</code></pre>
<h3 id="96-apply-the-function-and-plot-the-results">9.6 Apply the function and plot the results</h3><pre class="language-py lang-py"><code class="language-py lang-py">attentions, results = attentionQueryMatrix(queries_mat, keys_mat, values_mat)
plt.imshow(attentions)
plt.xlabel(&#x27;Key-index&#x27;)
plt.ylabel(&#x27;Query-index&#x27;)
plt.colorbar()
plt.show()
keys_mat.shape

&#x27;&#x27;&#x27;
(64, 8)
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/as2pcr91mcyp2acei1.png" height="435" width="529"/></p><pre class="language-py lang-py"><code class="language-py lang-py">plt.bar(x = np.arange(len(result)), height = results[0])
plt.xlabel(&#x27;Element-index&#x27;)
plt.ylabel(&#x27;Element-value&#x27;)

&#x27;&#x27;&#x27;
Text(0, 0.5, &#x27;Element-value&#x27;)
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/ky53oli2ezhfi566x5.png" height="432" width="582"/></p><pre class="language-py lang-py"><code class="language-py lang-py">attentions = np.zeros((len(keys),len(keys)))
for i in range (len(keys)):
    a, _ = attentionQuery(queries[i], keys, values)
    attentions[i,:] = a
</code></pre>
<h3 id="97-masked-attention">9.7 Masked attention</h3><pre class="language-py lang-py"><code class="language-py lang-py">def maskedAttentionQueryMatrix(queries, keys, values):
    norm = np.sqrt(len(keys[0]))
    attention = np.matmul(queries,keys.transpose())/norm
    attention = np.exp(attention)
    xs, ys = np.meshgrid(np.arange(attention.shape[1]), np.arange(attention.shape[0]))
    attention[ys&lt;xs] = 0
    attention = attention/attention.sum(axis = 1, keepdims=True)
    return attention, np.matmul(attention,values)
</code></pre>
<h3 id="98-apply-the-masked-function-and-plot-the-results">9.8 Apply the masked function and plot the results</h3><pre class="language-py lang-py"><code class="language-py lang-py">plt.bar(x = np.arange(len(result)), height = results[0])
plt.xlabel(&#x27;Element-index&#x27;)
plt.ylabel(&#x27;Element-value&#x27;)

&#x27;&#x27;&#x27;
Text(0, 0.5, &#x27;Element-value&#x27;)
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/gduhevjqcq6enep09h.png" height="432" width="582"/></p><pre class="language-py lang-py"><code class="language-py lang-py">attentions, results = maskedAttentionQueryMatrix(queries_mat, keys_mat, values_mat)
plt.imshow(attentions)
plt.xlabel(&#x27;Key-index&#x27;)
plt.ylabel(&#x27;Query-index&#x27;)
plt.colorbar()
plt.show()
keys_mat.shape

&#x27;&#x27;&#x27;
(64, 8)
&#x27;&#x27;&#x27;
</code></pre>
<p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/q9kopwg1vahegcuopt.png" height="438" width="511"/></p></div><p style="text-align:right"><a href="https://blog.verxie.org/posts/study/2023-10-30-Machine-Learning-Code#comments">看完了？说点什么呢</a></p></div>]]></description><link>https://blog.verxie.org/posts/study/2023-10-30-Machine-Learning-Code</link><guid isPermaLink="true">https://blog.verxie.org/posts/study/2023-10-30-Machine-Learning-Code</guid><dc:creator><![CDATA[Ver]]></dc:creator><pubDate>Fri, 19 Dec 2025 07:08:25 GMT</pubDate></item><item><title><![CDATA[机器学习:笔记 - Machine Learning Note]]></title><description><![CDATA[<link rel="preload" as="image" href="https://jason-chen-1992.weebly.com/uploads/1/0/8/5/108557741/cover-forcrossv_orig.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/njcvdlyq25sqn922nl.png"/><link rel="preload" as="image" href="https://static.plob.org/wp-content/uploads/2018/03/1520544674-8099-hIctrUHUtViaBKLvxsjBTGpYNnxA.jpg"/><link rel="preload" as="image" href="https://static.plob.org/wp-content/uploads/2018/03/1520544674-4151-iaHGhXgItibJEyUXepqn78ibBn0g.png"/><link rel="preload" as="image" href="https://pic3.zhimg.com/v2-91390d9916a43c4df285584b5d22db8e_b.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/vtqy7wvq1clseow2sr.png"/><link rel="preload" as="image" href="https://pic2.zhimg.com/80/v2-a61b0da779db353ec33a1cde707ec041_720w.webp"/><link rel="preload" as="image" href="https://pic1.zhimg.com/80/v2-0b966b50db09699533d722668ed35708_720w.webp"/><link rel="preload" as="image" href="https://i.loli.net/2018/10/17/5bc7222348519.png"/><link rel="preload" as="image" href="https://i.loli.net/2018/10/17/5bc722234b09f.png"/><link rel="preload" as="image" href="https://mixapi2.verxie.org/api/v2/objects/icon/e6iay6yij1qzwxo26k.png"/><div><blockquote>该渲染由 Shiro API 生成，可能存在排版问题，最佳体验请前往：<a href="https://blog.verxie.org/posts/study/2023-08-02-Machine-Learning-Note">https://blog.verxie.org/posts/study/2023-08-02-Machine-Learning-Note</a></blockquote><div><h2 id="2-">2. 模型评估与选择</h2><h3 id="21-">2.1 经验误差与过拟合</h3><p>在m个样本中有a个样本分类错误</p><ul><li><p><strong>错误率(Error rate)</strong>: $E = \frac{a}{m}$</p><blockquote><p>在训练集上的误差称<code>训练误差(training error)</code>或<code>经验误差(empirical error)</code><br/>在新样本上的误差称<code>泛化误差(generalization error)</code></p></blockquote>
</li><li><p><strong>精度(Accuracy)</strong>: $1 - E$</p></li><li><p><strong>过拟合(Overfitting)</strong>: 模型在训练数据上表现很好,但在测试数据上表现糟糕,这通常是因为模型过于复杂,以至于“记住”了训练数据的噪声.</p></li></ul><blockquote><p><strong>与NP的关系:</strong><br/><em>若可以彻底避免过拟合，则通过经验误差最小化就可以获最优解，这就意味着我们构造性的证明了P = NP</em></p></blockquote>
<h3 id="22-">2.2 评估方法</h3><h4 id="221-">2.2.1 留出法</h4><p><strong>留出法(Hold-out Method)</strong> 留出法通常将数据集分为训练集(S)和测试集(T)，有时还可以进一步将训练集分为训练集和验证集。在使用留出法时, 一般要采用若干次随机划分、重复进行实验评估后取平均值作为留出法的评估结果</p><blockquote><p><strong>注意</strong>:</p><ol start="1"><li>训练/测试集的划分要尽可能<code>保持数据分布的一致性</code>, 避免因数据划分过程引入额外的偏差而对最终结果产生影响, 例如在分类任务中至少要保持样本的类别比例相似</li><li>若令训练集S包含绝大多数样本, 则训练出的模型可能更接近于用D训练出的模型, 但由于T比较小，评估结果可能不够稳定准确; 若令测试集T多包含一些样本, 则训练集S与D差别更大了, 被评估的模型与用D训练出的模型相比可较大差别, 从而降低了评估结果的保真(fidelity)</li></ol></blockquote>
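<p>下面给出留出法的一段示意代码（非原笔记内容，函数与变量名均为假设），按类别分层采样以尽量保持训练/测试集中类别比例一致：</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

def stratified_holdout(X, y, test_ratio=0.3, seed=0):
    # 分层采样: 对每个类别分别抽取 test_ratio 比例的样本作为测试集
    rng = np.random.RandomState(seed)
    test_idx = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_ratio))
        test_idx.extend(idx[:n_test])
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[test_idx] = True
    return X[~test_mask], y[~test_mask], X[test_mask], y[test_mask]

# 用法示例（随机数据）
X = np.random.randn(100, 5)
y = np.array([0] * 70 + [1] * 30)
X_tr, y_tr, X_te, y_te = stratified_holdout(X, y)
print(y_tr.mean(), y_te.mean())  # 两者的正例比例都应接近 0.3
</code></pre>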
<h4 id="222-">2.2.2 交叉验证法</h4><p>交叉验证法(<strong>cross validation</strong>/<strong>k-fold cross validation</strong>)先将数据集D划分为k个大小相似的互斥子集.</p><ol start="1"><li>每个子集 $D_i$ 都尽可能保持数据分布的一致性, 即从D中通过分层采样得到.</li><li>每次用k-1个子集的并集作为训练集, 余下的那个子集作为测试集</li><li>这样就可获得k组训练/测试集, 从而可进行k次训练和测试, 最终返回的是这k个测试结果的均值.</li></ol><p><img src="https://jason-chen-1992.weebly.com/uploads/1/0/8/5/108557741/cover-forcrossv_orig.png" height="558" width="880"/></p><p><strong>缺点</strong>: 测试样本数量大时计算开销极大</p><blockquote><p>k最常用取值是10, 此时称为&quot;10折交叉验证&quot;<br/>数据集D划分为k个子集同样存在多种划分方式, 为减小因样本划分不同而引入的差别, k折交叉验证通常要随机使用不同的划分重复p次, 最终的评估结果是这p次k折交叉验证结果的均值, 例如常见的有&quot;10次10折交叉验证&quot;</p></blockquote>
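<p>k 折交叉验证的划分逻辑可以用几行代码说明（示意代码，非原笔记内容；这里只演示索引划分，训练与评估部分可代入任意模型）：</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    # 打乱后均分为 k 个互斥子集, 每次取其中一份作为测试集, 其余作为训练集
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# 用法示例: 5 折时每折训练/测试集的大小
for tr, te in k_fold_indices(25, k=5):
    print(len(tr), len(te))   # 20 5
</code></pre>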
<h4 id="223-">2.2.3 自助法</h4><p>自助法(bootstrapping)在给定包含m个样本的数据集D时, 对其采样产生数据集D&#x27;:</p><ol start="1"><li>每次随机从D中挑选一个样本, 将其拷贝放入D&#x27;</li><li>再将该样本放回初始数据集D中，使其仍有机会被再次采样</li><li>重复m次后，会获得包含m个样本的数据集D&#x27;
 &gt; 样本始终不被采到的概率为 $(1-\frac{1}{m})^m$ , 取极限为 0.368</li><li>约有36.8%样本未出现在采样数据集D&#x27;中, 这时用D&#x27;作为训练集, D\D&#x27;用作测试集
 &gt; 这样的测试结果称为&quot;包外估计&quot;(out-of-bag estimate)</li></ol><p><strong>优点</strong>: 在数据集较小、难以有效划分训练/测试集时很有用; 能从初始数据集中产生多个不同的训练集, 对集成学习有好处<br/><strong>缺点</strong>: 改变了初始数据集的分布, 可能引入估计偏差</p><h4 id="224-">2.2.4 调参和最终模型</h4><p>在进行模型估计和选择时, 除了要对所使用的学习算法进行选择, 还需对算法参数进行设定, 这就是通常所说的&quot;参数调节&quot;/&quot;调参&quot;(parameter tuning)</p><h3 id="23-">2.3 性能度量</h3><p>对学习器的泛化性能进行评估, 不仅需要有效可行的实验估计方法, 还需要有衡量模型泛化能力的评价标准，即性能度量(performance measure)</p><p>在预测任务中, 给定样例集 $D = \{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$ , 其中 $y_i$ 是实例 $x_i$ 的真实标记. 要估计学习器 $f$ 的性能, 就要把学习器预测结果 $f(x)$ 与真实标记 $y$ 进行比较</p><blockquote><p><strong>均方误差</strong>(mean squared error):</p><p>$$ E(f;D) = \frac{1}{m}\sum\limits^m_{i=1}(f(x_i) - y_i)^2 $$</p><p>对于数据分布$D$和概率密度函数$p(·)$，均方误差可描述为:</p><p>$$ E(f;D) = \int_{x\thicksim D}(f(x)-y)^2p(x)dx $$</p></blockquote>
<h4 id="231-">2.3.1 错误率与精度</h4><p>错误率定义为:</p><p>$$ E(f;D) = \frac{1}{m}\sum\limits^m_{i=1}I(f(x_i)\neq y_i) $$</p><p>精度定义为:</p><p>$$ acc(f;D) = \frac{1}{m}\sum\limits^m_{i=1}I(f(x_i) = y_i) = 1 - E(f;D) $$</p><blockquote><p>更一般的，对于数据分布$D$和概率密度函数$p(·)$, 错误率和精度可分别表示为:</p><p>$$ E(f;D) = \int_{x\thicksim D}I(f(x_i)\neq y_i)p(x) dx$$</p><p>$$ acc(f;D) = \int_{x\thicksim D}I(f(x_i) =  y_i)p(x) dx = 1 - E(f;D) $$</p></blockquote>
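<p>按上面的定义，错误率、精度以及回归任务的均方误差用 numpy 可以直接写出来（示意片段，非原笔记内容，数据为随意构造）：</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

error_rate = np.mean(y_pred != y_true)   # E(f;D)
accuracy = 1 - error_rate                # acc(f;D) = 1 - E(f;D)
print(error_rate, accuracy)              # 0.333... 0.666...

# 回归任务的均方误差
f_x = np.array([2.5, 0.0, 2.1])
y = np.array([3.0, -0.5, 2.0])
print(np.mean((f_x - y) ** 2))           # 约 0.17
</code></pre>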
<h4 id="232-f1">2.3.2 查准率、查全率与F1</h4><table><thead><tr><th> 真实情况 </th><th> 预测结果: 正例 </th><th> 预测结果: 反例 </th></tr></thead><tbody><tr><td> 正例     </td><td> TP             </td><td> FN             </td></tr><tr><td> 反例     </td><td> FP             </td><td> TN             </td></tr></tbody></table><p>查准率(precision):</p><p>$$ P = \frac{TP}{TP+FP} $$</p><p>查全率(recall):</p><p>$$ R = \frac{TP}{TP + FN} $$</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/njcvdlyq25sqn922nl.png" height="1134" width="1390"/></p><p>平衡点(Break-Even Point)是&quot;查准率=查全率&quot;时的取值。更常用的是F1度量:</p><p>$$ F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{样例总数 + TP - TN} $$</p><p>$F_\beta$ 能够表达对查准率/查全率的不同偏好, 定义为:</p><p>$$  F_\beta = \frac{(1 + \beta^2) \times P \times R}{(\beta^2 \times P) + R} $$</p><p>$\beta &gt; 1$ 时查全率有更大影响， $\beta &lt; 1$时查准率有更大影响</p><p>多次训练/测试时:<br/>先在各混淆矩阵上分别计算出查准率和查全率, 记为 $(P_1, R_1), (P_2, R_2),...,(P_n,R_n)$ , 再计算平均值, 这样就得到&quot;宏查准率&quot;(macro-P)、&quot;宏查全率&quot;(macro-R), 以及相应的&quot;宏F1&quot;(macro-F1):</p><p>$$   macro-P = \frac{1}{n}\sum\limits^n_{i=1}P_i $$</p><p>$$  macro-R = \frac{1}{n}\sum\limits^n_{i=1}R_i $$</p><p>$$  macro-F1 = \frac{2 \times macro-P \times macro-R}{macro-P + macro-R} $$</p><blockquote><p>也可先将各混淆矩阵的对应元素进行平均, 得到 $TP、FP、TN、FN$ 的平均值，分别记为 $\overline{TP}、\overline{FP}、\overline{TN}、\overline{FN}$ , 再基于这些平均值算出&quot;微查准率&quot;(micro-P)、&quot;微查全率&quot;(micro-R)和&quot;微F1&quot;(micro-F1):</p><p>$$micro-P = \frac{\overline{TP}}{\overline{TP}+\overline{FP}}$$</p><p>$$micro-R = \frac{\overline{TP}}{\overline{TP}+\overline{FN}}$$</p><p>$$micro-F1 = \frac{2 \times micro-P \times micro-R}{micro-P + micro-R}$$</p></blockquote>
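<p>查准率、查全率与 F_beta 的计算示例如下（示意代码，非原笔记内容，输入为二分类的真实/预测标签）：</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

def precision_recall_f(y_true, y_pred, beta=1.0):
    # 先统计混淆矩阵中的 TP/FP/FN, 再按公式计算 P、R 与 F_beta
    tp = np.sum((y_pred == 1) &amp; (y_true == 1))
    fp = np.sum((y_pred == 1) &amp; (y_true == 0))
    fn = np.sum((y_pred == 0) &amp; (y_true == 1))
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f_beta = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f_beta

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
print(precision_recall_f(y_true, y_pred))             # beta=1 即 F1
print(precision_recall_f(y_true, y_pred, beta=2.0))   # beta&gt;1 时更看重查全率
</code></pre>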
<h4 id="233-rocauc">2.3.3 ROC与AUC</h4><p>ROC全称&quot;受试者工作特征&quot;(Receiver Operating Characteristic)曲线。ROC曲线的纵轴是&quot;真正例率&quot;(True Positive Rate, 简称TPR), 横轴是&quot;假正例率&quot;(False Positive Rate, 简称FPR), 定义为:</p><p>$$TPR = \frac{TP}{TP + FN}$$</p><p>$$FPR = \frac{FP}{TN + FP}$$</p><p><img src="https://static.plob.org/wp-content/uploads/2018/03/1520544674-8099-hIctrUHUtViaBKLvxsjBTGpYNnxA.jpg" height="592" width="651"/></p><blockquote><p>与P-R图类似，若一个学习器的ROC曲线被另一个学习器的曲线完全包裹, 则可以断言后者的性能优于前者.若两个学习器的ROC曲线发生交叉, 则难以一般性地断言两者孰优孰劣. 此时应比较ROC曲线下的面积, 即AUC(Area Under ROC Curve), 可估算为</p><p>$$AUC = \frac{1}{2}\sum\limits^{m-1}_{i=1}(x_{i+1}-x_i)·(y_i+y_{i+1})$$</p><p>AUC考虑的是样本预测的排序质量, 因此它与排序误差有紧密联系. 给定 $m^+$ 个正例和 $m^-$ 个反例, 令 $D^+$ 和 $D^-$ 分别表示正反例集合, 则排序损失(loss)定义为:</p><p>$$l_{rank} = \frac{1}{m^+m^-}\sum\limits_{x^+\in D^+}\sum\limits_{x^-\in D^-}(I(f(x^+)&lt;f(x^-))+\frac{1}{2}I(f(x^+)=f(x^-)))$$</p><p>$$AUC = 1 - l_{rank}$$</p></blockquote>
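<p>上面的 AUC 估算式本质上是对 ROC 点列做梯形求和，下面是一个按得分排序直接计算的示例（示意代码，非原笔记内容；未单独处理并列得分的细节）：</p><pre class="language-py lang-py"><code class="language-py lang-py">import numpy as np

def roc_auc(scores, labels):
    # labels: 1 为正例, 0 为反例; scores 越大越倾向于判为正例
    order = np.argsort(-scores)
    labels = labels[order]
    m_pos = labels.sum()
    m_neg = len(labels) - m_pos
    tpr = np.concatenate(([0.0], np.cumsum(labels) / m_pos))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / m_neg))
    # AUC = 1/2 * sum (x_{i+1}-x_i)(y_i+y_{i+1}), 即梯形法则
    return 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
labels = np.array([1, 1, 0, 1, 0, 0])
print(roc_auc(scores, labels))   # 8/9 ≈ 0.889, 与按正反例配对计数的结果一致
</code></pre>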
<p><img src="https://static.plob.org/wp-content/uploads/2018/03/1520544674-4151-iaHGhXgItibJEyUXepqn78ibBn0g.png" height="281" width="568"/></p><h4 id="234-">2.3.4 代价敏感错误率与代价曲线</h4><p>为权衡不同类型错误所造成的不同损失, 可为错误赋予&quot;非均等代价&quot;(unequal cost)</p><table><thead><tr><th> 真实类别 </th><th> 预测类别: 第0类 </th><th> 预测类别: 第1类 </th></tr></thead><tbody><tr><td> 第0类    </td><td> 0               </td><td> $cost_{01}$     </td></tr><tr><td> 第1类    </td><td> $cost_{10}$     </td><td> 0               </td></tr></tbody></table><p>在<strong>非均等代价</strong>下，我们所希望的不再是简单的最小化错误次数, 而是希望最小化&quot;总体代价&quot;(total cost). &quot;代价敏感&quot;(cost-sensitive)错误率为:</p><p>$$E(f;D;cost)=\frac{1}{m}(\sum\limits_{x_i\in D^+}I(f(x_i)\neq y_i) \times cost_{01}+ \sum\limits_{x_i\in D^-}I(f(x_i)\neq y_i) \times cost_{10})$$</p><p>在非均等代价下, ROC曲线不能直接反映出学习器的期望总体代价, 而&quot;代价曲线&quot;(cost curve)则可达到该目的. 代价曲线图的横轴是取值为[0,1]的正例概率代价:</p><p>$$P(+)cost = \frac{p \times cost_{01}}{p \times cost_{01} + (1 - p)\times cost_{10}}$$</p><p>其中 $p$ 是样例为正例的概率, 纵轴是取值为[0,1]的归一化代价</p><p>$$cost_{norm} = \frac{FNR \times p \times cost_{01}+ FPR \times (1-p)\times cost_{10}}{p \times cost_{01} + (1-p)\times cost_{10}}$$</p><p><img src="https://pic3.zhimg.com/v2-91390d9916a43c4df285584b5d22db8e_b.png" height="264" width="341"/></p><h3 id="24-">2.4 比较检验</h3><h4 id="241-hypothesis-test">2.4.1 假设检验(hypothesis test)</h4><p>在包含m个样本的测试集上, 泛化错误率为 $\epsilon$ 的学习器被测得测试错误率为 $\hat{\epsilon}$ 的概率:</p><p>$$P(\hat{\epsilon};\epsilon) = \begin{pmatrix} m \\ \hat{\epsilon} \times m \end{pmatrix} \epsilon^{\hat{\epsilon} \times m}(1 - \epsilon)^{m - \hat{\epsilon} \times m}$$</p><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/vtqy7wvq1clseow2sr.png" height="591" width="720"/></p><blockquote><p> $\alpha$ 的常用取值有0.05, 0.1</p></blockquote>
<p>这里 $1-\alpha$ 反映了结论的&quot;置信度&quot;(confidence),直观来看, 即非阴影部分的面积</p><p>$$\bar{\epsilon} = \min \epsilon \space s.t. \space \sum\limits^m_{i = \epsilon \times m + 1} \begin{pmatrix} m \\ i \end{pmatrix}\epsilon^i_0(1 - \epsilon_0)^{m-i} &lt; \alpha$$</p><p>通过多次重复留出法或是交叉验证法等进行多次训练/测试, 这样会得到多个测试错误率, 此时可使用&quot;t检验&quot;(t-test). 假定我们得到了k个测试错误率,  $\hat{\epsilon}_1, \hat{\epsilon}_2, ..., \hat{\epsilon}_k$ , 则平均测试错误率 $\mu$ 和方差 $\sigma^2$ 为</p><p>$$\mu = \frac{1}{k}\sum\limits^k_{i=1}\hat{\epsilon}_i$$</p><p>$$\sigma^2 = \frac{1}{k - 1}\sum\limits^k_{i=1}(\hat{\epsilon}_i-\mu)^2$$</p><p>考虑到这k个测试错误率可看作泛化错误率 $\epsilon_0$ 的独立采样, 则变量</p><p>$$\tau_t = \frac{\sqrt{k}(\mu - \epsilon_0)}{\sigma}$$</p><p><img src="https://pic2.zhimg.com/80/v2-a61b0da779db353ec33a1cde707ec041_720w.webp" height="493" width="720"/></p><table><thead><tr><th> $\alpha$ </th><th> k = 2  </th><th> k = 5 </th><th> k = 10 </th><th> k = 20 </th><th> k = 30 </th></tr></thead><tbody><tr><td> 0.05       </td><td> 12.706 </td><td> 2.776 </td><td> 2.262  </td><td> 2.093  </td><td> 2.045  </td></tr><tr><td> 0.10       </td><td> 6.314  </td><td> 2.132 </td><td> 1.833  </td><td> 1.729  </td><td> 1.699  </td></tr></tbody></table><h4 id="242-t">2.4.2 交叉验证t检验</h4><p>对两个学习器A和B, 若我们使用k折交叉验证法得到的测试错误率分别为 $\epsilon^A_1, \epsilon^A_2, ..., \epsilon^A_k$ 和 $\epsilon^B_1, \epsilon^B_2, ..., \epsilon^B_k$ , 其中 $\epsilon^A_i$ 和 $\epsilon^B_i$ 是在相同的第i折训练/测试集上得到的结果, 则可用k折交叉验证&quot;成对t检验&quot;(paired t-tests)来进行比较检验</p><blockquote><p>这里的基本思想是若两个学习器的性能相同, 则它们使用相同的训练/测试集得到的测试错误率应相同, 即 $\epsilon^A_i = \epsilon^B_i$</p></blockquote>
<h4 id="242-t">2.4.2 Cross-validated t-test</h4><p>For two learners A and B, if k-fold cross-validation gives test error rates $\epsilon^A_1, \epsilon^A_2, ..., \epsilon^A_k$ and $\epsilon^B_1, \epsilon^B_2, ..., \epsilon^B_k$, where $\epsilon^A_i$ and $\epsilon^B_i$ are obtained on the same i-th fold of training/test data, then a k-fold cross-validated "paired t-test" can be used to compare them.</p><blockquote><p>The basic idea is that if the two learners perform identically, their test error rates on the same training/test splits should be equal, i.e. $\epsilon^A_i = \epsilon^B_i$</p></blockquote>
<p>Concretely, for the k pairs of test error rates produced by k-fold cross-validation:</p><ol start="1"><li>First take the difference of each pair, $\Delta_i = \epsilon^A_i - \epsilon^B_i$.</li><li>If the two learners perform identically, the mean of the differences should be 0.</li><li><p>We can therefore run a t-test on the differences for the hypothesis that A and B perform identically: compute the mean $\mu$ and variance $\sigma^2$ of the differences; at significance level $\alpha$, if the variable</p><p> $$\tau_t = \left\vert\frac{\sqrt{k}\mu}{\sigma}\right\vert$$</p><p> is smaller than the critical value $t_{\alpha/2, k-1}$, the hypothesis cannot be rejected, i.e. the two learners are considered to have no significant difference in performance.</p></li><li><p>Otherwise the two learners are considered significantly different, and the one with the smaller mean error rate is the better one (a code sketch of this procedure follows the note below).</p></li></ol><blockquote><p>Because samples are limited, the training sets of different rounds of cross-validation overlap to some extent, which violates the independence assumption of the t-test; "5x2 cross-validation" can be used to alleviate this.</p></blockquote>
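<p>A minimal sketch of the cross-validated paired t-test, assuming SciPy; the per-fold error rates below are hypothetical.</p><pre class=""><code class="">import numpy as np
from scipy import stats

def paired_cv_t_test(err_a, err_b, alpha=0.05):
    """k-fold cross-validated paired t-test for two learners.

    err_a, err_b: per-fold test error rates of learners A and B on the same folds.
    Returns (tau_t, critical value, True if no significant difference).
    """
    d = np.asarray(err_a) - np.asarray(err_b)
    k = len(d)
    mu, sigma = d.mean(), d.std(ddof=1)
    tau_t = abs(np.sqrt(k) * mu / sigma)
    crit = stats.t.ppf(1 - alpha / 2, df=k - 1)   # two-sided critical value t_{alpha/2, k-1}
    return tau_t, crit, tau_t &lt; crit

err_a = [0.10, 0.12, 0.09, 0.11, 0.10]            # hypothetical fold error rates
err_b = [0.12, 0.13, 0.11, 0.12, 0.13]
print(paired_cv_t_test(err_a, err_b))</code></pre>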
<h4 id="243-mcnemar">2.4.3 McNemar检验</h4><p>对于二分问题, 使用留出法不仅可估计出学习器A和B的测试错误率, 还可获得两学习器分类结果的差别, 即两者都正确、都错误、一个正确一个错误的样本数:</p><table><thead><tr><th> 算法B </th><th> 算法A: 正确 </th><th> 算法A: 错误 </th></tr></thead><tbody><tr><td> 正确  </td><td>  $e_{00} $  </td><td>  $e_{01} $  </td></tr><tr><td> 错误  </td><td>  $e_{10} $  </td><td>  $e_{11} $  </td></tr></tbody></table><p>若我们做的假设是两学习器性能相同, 则应有 $e<em>{01} = e</em>{10} $, 那么变量 $|e<em>{01} - e</em>{10}| $应当服从正态分布. McNemar检验考虑变量</p><p>$$\tau<em>{X^2} = \frac{(|e</em>{01} - e<em>{10}| - 1)^2}{e</em>{01} + e_{10}}$$</p><p>服从自由度为1的 $X^2 $分布, 即标准正态分布变量的平方.</p><ol start="1"><li>当以上变量值小于临界值 $X^2_\alpha $时, 不能拒绝假设, 即认为两学习器的性能没有显著差别</li><li>否则拒绝假设, 即认为两学习器性能有显著差别, 且平均错误率较小的那个学习器性能较优</li></ol><h4 id="244-friedmannemenyi">2.4.4 Friedman检验与Nemenyi后续检验</h4><table><thead><tr><th> 数据集   </th><th> 算法A </th><th> 算法B </th><th> 算法C </th></tr></thead><tbody><tr><td>  $D_1 $  </td><td> 1     </td><td> 2     </td><td> 3     </td></tr><tr><td>  $D_2 $  </td><td> 1     </td><td> 2.5   </td><td> 2.5   </td></tr><tr><td>  $D_3 $  </td><td> 1     </td><td> 2     </td><td> 3     </td></tr><tr><td>  $D_4 $  </td><td> 1     </td><td> 2     </td><td> 3     </td></tr><tr><td> 平均序值 </td><td> 1     </td><td> 2.125 </td><td> 2.875 </td></tr></tbody></table><p>使用Friedman检验来判断这些算法是否性能相同.若相同, 则他们的平均序列应当相同.假定我们在N个数据集上比较k个算法, 令 $r<em>i $表示第i个算法的平均序值, (不考虑平分序值)则 $r</em>i $的均值和方差分别为 $(k+1)/2 $和 $(k^2-1)/12N $. 变量</p><p>$$\tau<em>{X^2} = \frac{k -1}{k}·\frac{12N}{k^2 - 1} \sum\limits^k</em>{i = 1}(r_i - \frac{k + 1}{2})^2$$</p><p>$$=\frac{12N}{k(k+1)}(\sum\limits^k<em>{i=1}r</em>i^2-\frac{k(k+1)^2}{4})$$</p><p>在k和N都较大时, 服从自由度为k-1的 $X^2 $分布</p><p>然而上述的圆石Friedman检验过于保守, 现在通常使用变量</p><p>$$\tau<em>F = \frac{(N - 1)\tau</em>{X^2}}{N(k - 1) - \tau_{X^2}}$$</p><p><img src="https://pic1.zhimg.com/80/v2-0b966b50db09699533d722668ed35708_720w.webp" height="564" width="720"/></p><p>若&quot;所有算法的性能相同&quot;这个假设被拒绝, 则说明算法的性能显著不同.这时需进行&quot;后续检验&quot;(post-hoc test)来进一步区分个算法. 常用的有Nemenyi后续检验.</p><p>Nemenyi检验计算出平均序值差别的临界值域</p><p>$$CD = q_\alpha\sqrt{\frac{k(k+1)}{6N}}$$</p><p><img src="https://i.loli.net/2018/10/17/5bc7222348519.png" height="157" width="745"/></p><h3 id="25-">2.5 偏方与方差</h3><p>在回归任务中</p><p>学习算法的期望预测:</p><p>$$\bar{f}(x) = E_D[f(x;D)]$$</p><p>使用样本数相同的不同训练集产生的方差为:</p><p>$$var(x) = E_D[(f(x;D)-\bar{f}(x))^2]$$</p><p>噪声为:</p><p>$$\varepsilon^2 = E<em>D[(y</em>D - y)^2]$$</p><p>期望输出与真是标记的差别成为偏差(bias), 即:</p><p>$$bias^2(x) = (\bar{f}(x)-y)^2$$</p><p>泛化误差可分解为偏差、方差、与噪声之和</p><p>$$E(f;D) = bias^2(x) + var(x) + \varepsilon^2$$</p><p><img src="https://i.loli.net/2018/10/17/5bc722234b09f.png" height="241" width="524"/></p><h2 id="3-">3. 线性模型</h2><h3 id="31-">3.1 基本形式</h3><p>给定有d个属性描述的实例 $\vec{x} = (x<em>1;x</em>2;...;x<em>d) $, 其中 $x</em>i $是x在第i个属性上的取值, 线性模型(linear model)试图学得一个通过属性的线性组合来进行预测的函数, 即</p><p>$$ f(\vec{x}) = w<em>1x</em>1 + w<em>2x</em>2 + ... + w<em>dx</em>d + b， $$</p><p>一般用向量形式写成</p><p>$$ f(\vec{x}) = \vec{w}^T\vec{x} + b$$</p><h3 id="32-">3.2 线性回归</h3><p>给定数据集 $D = {(x<em>1, y</em>1), (x<em>2, y</em>2),...,(x<em>m,y</em>m)} $, 其中 $x<em>i = (x</em>{i1};x<em>{i2};...;x</em>{id}), y_i \in R $</p><p>线性回归试图学得</p><p>$$f(x<em>i) = wx</em>i + b, 使得f(x<em>i)\simeq y</em>i$$</p><p>均方误差最小化:</p><p>$$(w^<em>, b^</em>) = \arg min\sum\limits^m<em>{i=1}(f(x</em>i) - y_i)^2$$</p><p>$$ = \arg min \sum\limits^m<em>{i=1}(y</em>i - wx_i - b)^2$$</p><blockquote><p> $w^<em>, b^</em> $表示 $w $和 $b $的解</p></blockquote>
<h4 id="244-friedmannemenyi">2.4.4 Friedman test and Nemenyi post-hoc test</h4><table><thead><tr><th> Dataset </th><th> Algorithm A </th><th> Algorithm B </th><th> Algorithm C </th></tr></thead><tbody><tr><td> $D_1$ </td><td> 1 </td><td> 2 </td><td> 3 </td></tr><tr><td> $D_2$ </td><td> 1 </td><td> 2.5 </td><td> 2.5 </td></tr><tr><td> $D_3$ </td><td> 1 </td><td> 2 </td><td> 3 </td></tr><tr><td> $D_4$ </td><td> 1 </td><td> 2 </td><td> 3 </td></tr><tr><td> Average rank </td><td> 1 </td><td> 2.125 </td><td> 2.875 </td></tr></tbody></table><p>The Friedman test is used to judge whether these algorithms perform identically; if they do, their average ranks should be the same. Suppose we compare k algorithms on N datasets and let $r_i$ denote the average rank of the i-th algorithm (ignoring tied ranks); then the mean and variance of $r_i$ are $(k+1)/2$ and $(k^2-1)/12N$ respectively. The variable</p><p>$$\tau_{\chi^2} = \frac{k -1}{k}\cdot\frac{12N}{k^2 - 1} \sum\limits^k_{i = 1}\left(r_i - \frac{k + 1}{2}\right)^2 =\frac{12N}{k(k+1)}\left(\sum\limits^k_{i=1}r_i^2-\frac{k(k+1)^2}{4}\right)$$</p><p>follows a $\chi^2$ distribution with k-1 degrees of freedom when both k and N are large.</p><p>However, this original Friedman test is too conservative, so nowadays the variable</p><p>$$\tau_F = \frac{(N - 1)\tau_{\chi^2}}{N(k - 1) - \tau_{\chi^2}}$$</p><p>is commonly used instead; it follows an F distribution with k-1 and (k-1)(N-1) degrees of freedom.</p><p><img src="https://pic1.zhimg.com/80/v2-0b966b50db09699533d722668ed35708_720w.webp" height="564" width="720"/></p><p>If the hypothesis "all algorithms perform identically" is rejected, the algorithms differ significantly in performance, and a "post-hoc test" is needed to further distinguish them. A commonly used one is the Nemenyi post-hoc test.</p><p>The Nemenyi test computes the critical difference of average ranks:</p><p>$$CD = q_\alpha\sqrt{\frac{k(k+1)}{6N}}$$</p><p><img src="https://i.loli.net/2018/10/17/5bc7222348519.png" height="157" width="745"/></p>
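<p>A minimal sketch of the Friedman test (F-variant) and the Nemenyi critical difference, using the rank table above; the $q_\alpha$ constant is read from the standard Nemenyi table (approximately 2.344 for k = 3, $\alpha$ = 0.05), and SciPy is assumed.</p><pre class=""><code class="">import numpy as np
from scipy import stats

def friedman_nemenyi(ranks, alpha=0.05, q_alpha=2.344):
    """Friedman test (F-variant) plus the Nemenyi critical difference.

    ranks: N x k matrix of per-dataset ranks of k algorithms.
    q_alpha: value from the Nemenyi table for the chosen k and alpha.
    Returns (tau_F, F critical value, CD).
    """
    ranks = np.asarray(ranks, dtype=float)
    N, k = ranks.shape
    r = ranks.mean(axis=0)                                   # average rank per algorithm
    tau_chi2 = 12 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)
    tau_F = (N - 1) * tau_chi2 / (N * (k - 1) - tau_chi2)    # improved Friedman statistic
    crit = stats.f.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))            # Nemenyi critical difference
    return tau_F, crit, cd

# ranks from the table above (4 datasets, 3 algorithms)
ranks = [[1, 2, 3], [1, 2.5, 2.5], [1, 2, 3], [1, 2, 3]]
print(friedman_nemenyi(ranks))</code></pre>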
<h3 id="25-">2.5 Bias and variance</h3><p>For a regression task:</p><p>The expected prediction of the learning algorithm is</p><p>$$\bar{f}(x) = E_D[f(x;D)]$$</p><p>The variance induced by using different training sets of the same size is</p><p>$$var(x) = E_D[(f(x;D)-\bar{f}(x))^2]$$</p><p>The noise is</p><p>$$\varepsilon^2 = E_D[(y_D - y)^2]$$</p><p>The difference between the expected output and the true label is called the bias:</p><p>$$bias^2(x) = (\bar{f}(x)-y)^2$$</p><p>The generalization error can be decomposed into the sum of bias, variance and noise:</p><p>$$E(f;D) = bias^2(x) + var(x) + \varepsilon^2$$</p><p><img src="https://i.loli.net/2018/10/17/5bc722234b09f.png" height="241" width="524"/></p>
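<p>A small simulation (not from the book) that estimates $bias^2(x)$ and $var(x)$ for a deliberately simple model by refitting it on many training sets of the same size; the target function, model class and noise level are all made up for the demo.</p><pre class=""><code class="">import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)                         # hypothetical ground-truth function

def fit_poly(x, y, degree):
    return np.poly1d(np.polyfit(x, y, degree))

# estimate bias^2 and variance of a degree-1 fit over many same-size training sets
x_test = np.linspace(0, np.pi, 50)
noise_sd, n_sets = 0.2, 200
preds = []
for _ in range(n_sets):                      # each round: a fresh training set of 20 samples
    x_tr = rng.uniform(0, np.pi, 20)
    y_tr = true_fn(x_tr) + rng.normal(0, noise_sd, 20)
    preds.append(fit_poly(x_tr, y_tr, degree=1)(x_test))
preds = np.array(preds)

f_bar = preds.mean(axis=0)                   # expected prediction \bar{f}(x)
bias2 = np.mean((f_bar - true_fn(x_test)) ** 2)
var = np.mean(preds.var(axis=0))
print(bias2, var, noise_sd ** 2, bias2 + var + noise_sd ** 2)</code></pre>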
<h2 id="3-">3. Linear models</h2><h3 id="31-">3.1 Basic form</h3><p>Given an instance described by d attributes, $\vec{x} = (x_1;x_2;...;x_d)$, where $x_i$ is the value of x on the i-th attribute, a linear model tries to learn a prediction function that is a linear combination of the attributes, i.e.</p><p>$$ f(\vec{x}) = w_1x_1 + w_2x_2 + ... + w_dx_d + b $$</p><p>which is generally written in vector form as</p><p>$$ f(\vec{x}) = \vec{w}^T\vec{x} + b$$</p><h3 id="32-">3.2 Linear regression</h3><p>Given a dataset $D = \{(x_1, y_1), (x_2, y_2),...,(x_m,y_m)\}$, where $x_i = (x_{i1};x_{i2};...;x_{id})$ and $y_i \in R$,</p><p>linear regression tries to learn</p><p>$$f(x_i) = wx_i + b, \quad \text{such that} \quad f(x_i)\simeq y_i$$</p><p>by minimizing the mean squared error:</p><p>$$(w^*, b^*) = \arg\min\sum\limits^m_{i=1}(f(x_i) - y_i)^2 = \arg\min \sum\limits^m_{i=1}(y_i - wx_i - b)^2$$</p><blockquote><p>$w^*, b^*$ denote the solutions for $w$ and $b$</p></blockquote>
<p>The mean squared error has a nice geometric interpretation: it corresponds to the commonly used Euclidean distance. The method of solving a model by minimizing the mean squared error is called the "<code>least squares method</code>".</p><p>The process of solving for $w$ and $b$ that minimize $E_{(w,b)} = \sum^m_{i=1}(y_i - wx_i- b)^2$ is called the least-squares "parameter estimation" of the linear regression model. Taking the derivatives of $E_{(w,b)}$ with respect to $w$ and $b$ gives</p><p>$$\frac{\partial E_{(w,b)}}{\partial w} = 2 \left(w\sum\limits^m_{i=1}x_i^2 - \sum\limits^m_{i = 1}(y_i - b)x_i\right)$$</p><p>$$\frac{\partial E_{(w,b)}}{\partial b} = 2 \left(mb - \sum\limits^m_{i = 1}(y_i - wx_i)\right)$$</p><p>Setting both to zero yields the closed-form solutions for $w$ and $b$:</p><p>$$w = \frac{\sum\limits^m_{i=1}y_i(x_i - \bar{x})}{\sum\limits^m_{i = 1}x^2_i - \frac{1}{m}\left(\sum\limits^m_{i = 1}x_i\right)^2}$$</p><p>$$b = \frac{1}{m}\sum\limits^m_{i = 1}(y_i - wx_i)$$</p><blockquote><p>where $\bar{x} = \frac{1}{m}\sum\limits^m_{i = 1}x_i$ is the mean of $x$</p></blockquote>
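<p>A minimal sketch of the closed-form solution above for single-attribute linear regression; the data are synthetic.</p><pre class=""><code class="">import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least squares for f(x) = w*x + b (single attribute)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return w, b

# hypothetical data generated from y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2 * x + 1 + rng.normal(0, 0.1, x.shape)
print(simple_linear_regression(x, y))   # should be close to (2, 1)</code></pre>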
<p>The more general form is</p><p>$$f(\vec{x}_i) = \vec{w}^T\vec{x}_i + b, \quad \text{such that} \quad f(\vec{x}_i)\simeq y_i$$</p><p>This is called "<code>multivariate linear regression</code>".</p><p>The least squares method can likewise be used to estimate $\vec{w}$ and $b$. For convenience, we absorb $\vec{w}$ and $b$ into the vector $\hat{\vec{w}}=(\vec{w};b)$, and correspondingly represent the dataset D as an $m\times(d+1)$ matrix $\vec{X}$, in which each row corresponds to one example: the first $d$ elements of a row are the example's $d$ attribute values and the last element is fixed to 1, i.e.</p><p>$$ \vec{X} = \begin{pmatrix} x_{11} &amp; x_{12} &amp; ... &amp; x_{1d} &amp; 1 \\ x_{21} &amp; x_{22} &amp; ... &amp; x_{2d} &amp; 1 \\ ... &amp; ... &amp; ... &amp; ... &amp; ... \\ x_{m1} &amp; x_{m2} &amp; ... &amp; x_{md} &amp; 1 \end{pmatrix} = \begin{pmatrix} \vec{x}^T_1 &amp; 1 \\ \vec{x}^T_2 &amp; 1 \\ ... &amp; ... \\ \vec{x}^T_m &amp; 1 \end{pmatrix}$$</p><p>Writing the labels in vector form as well, $\vec{y} = (y_1;y_2;...;y_m)$, we have</p><p>$$\hat{\vec{w}}^* = \arg\min\limits_{\hat{\vec{w}}}(\vec{y} -\vec{X}\hat{\vec{w}})^T(\vec{y} - \vec{X}\hat{\vec{w}})$$</p><p>Letting $E_{\hat{\vec{w}}} = (\vec{y} -\vec{X}\hat{\vec{w}})^T(\vec{y} - \vec{X}\hat{\vec{w}})$ and differentiating with respect to $\hat{\vec{w}}$ gives</p><p>$$\frac{\partial E_{\hat{\vec{w}}}}{\partial\hat{\vec{w}}} = 2 \vec{X}^T(\vec{X}\hat{\vec{w}} - \vec{y})$$</p><p>When $X^TX$ is a full-rank matrix or a positive definite matrix, setting this to zero gives</p><p>$$\hat{\vec{w}}^* = (\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y}$$</p><p>where $(\vec{X}^T\vec{X})^{-1}$ is the inverse of $(\vec{X}^T\vec{X})$. Letting $\hat{\vec{x}}_i = (\vec{x}_i,1)$, the multivariate linear regression model finally learned is</p><p>$$f(\hat{x}_i) = \hat{x}^T_i(X^TX)^{-1}X^Ty$$</p><blockquote><p>If we believe that the output labels of the examples vary on an exponential scale, we can take the logarithm of the output label as the target that the linear model approximates, i.e.</p><p>$$\ln y = \vec{w}^T\vec{x} + b$$</p><p>This is "log-linear regression", which is actually trying to make $e^{\vec{w}^T\vec{x}+b}$ approximate $y$.</p><p>Its effect is to turn a nonlinear relationship into a linear one through the logarithm.</p><p>More generally, consider a monotone differentiable function $g(\cdot)$ and let</p><p>$$y = g^{-1}(\vec{w}^T\vec{x}+b)$$</p><p>The resulting model is called a "<code>generalized linear model</code>", and the function $g(\cdot)$ is called the "<code>link function</code>". The link function turns a nonlinear relationship into a linear one; $\ln(\cdot)$ is one example.</p></blockquote>
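<p>A minimal sketch of multivariate linear regression via the normal equation $\hat{\vec{w}}^* = (X^TX)^{-1}X^Ty$; it solves the linear system instead of forming the inverse explicitly and assumes $X^TX$ is full rank. The data are synthetic.</p><pre class=""><code class="">import numpy as np

def multivariate_linreg(X, y):
    """Multivariate linear regression via the normal equation.

    Appends a constant-1 column so that w_hat = (w; b), then solves
    (X^T X) w_hat = X^T y, assuming X^T X is full rank.
    """
    X = np.asarray(X, dtype=float)
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])    # last column fixed to 1
    w_hat = np.linalg.solve(Xh.T @ Xh, Xh.T @ y)     # numerically nicer than an explicit inverse
    return w_hat[:-1], w_hat[-1]                     # (w, b)

# hypothetical data: y = 1*x1 - 2*x2 + 3 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + 3 + rng.normal(0, 0.05, 100)
print(multivariate_linreg(X, y))</code></pre>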
<h3 id="33-">3.3 对数几率回归(逻辑回归)</h3><p>线性回归产生的<code>预测值</code>:</p><p>$$</p><pre class=""><code class="">Z = \vec{w}^T\vec{x}+b</code></pre><p>$$</p><p><code>对数几率函数</code>(logistic function):</p><p>$$</p><pre class=""><code class="">y = \frac{1}{1 + e^{-z}}\

\ln\frac{y}{1-y} = \vec{w}^T\vec{x}+b</code></pre><p>$$</p><blockquote><p>对数几率函数(逻辑回归函数)通过函数把需要分类的值(这里指二元分类)压缩成一个<strong>连续</strong>的<em>阶跃函数</em></p></blockquote>
<p>The <code>odds</code>:</p><p>$$\frac{y}{1-y}$$</p><p>The <code>log odds</code> (logit):</p><p>$$\ln\frac{y}{1-y}$$</p><p>From the expressions above we get:</p><p>$$p(y = 1|x) = \frac{e^{\vec{w}^T\vec{x}+b}}{1+e^{\vec{w}^T\vec{x}+b}}$$</p><p>$$p(y = 0|x) = \frac{1}{1+e^{\vec{w}^T\vec{x}+b}}$$</p><p>$w$ and $b$ can be estimated by the "<code>maximum likelihood method</code>". Given the dataset $\{(\vec{x}_i,y_i)\}^m_{i = 1}$, logistic regression maximizes the "log-likelihood"</p><p>$$\ell(\vec{w}, b) = \sum\limits^m_{i=1}\ln p(y_i|\vec{x}_i;\vec{w},b)$$</p><p>i.e. <code>the larger the probability of each sample belonging to its true label, the better</code>.</p><blockquote><p>For convenience,</p><p>let $\beta = (\vec{w};b)$ and $\hat{\vec{x}} = (\vec{x};1)$,</p><p>so that</p><p>$$p(y_i|\vec{x}_i;\vec{w}, b) = y_ip_1(\hat{\vec{x}}_i; \beta) + (1 - y_i) p_0 (\hat{\vec{x}}_i; \beta)$$</p></blockquote>
<p>Substituting the above, maximizing the log-likelihood is equivalent to minimizing:</p><p>$$\ell(\beta) = \sum^m_{i=1}\left(-y_i\beta^T\hat{\vec{x}}_i+ \ln(1+e^{\beta^T\hat{\vec{x}}_i})\right)$$</p><p>By convex optimization theory, numerical optimization algorithms give:</p><p>$$\beta^* = \arg\min\limits_\beta \ell(\beta)$$</p><p>Taking Newton's method as an example:</p><p>$$\beta&#x27; = \beta - \left(\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial \ell(\beta)}{\partial\beta}$$</p><p>$$\frac{\partial\ell(\beta)}{\partial\beta} = - \sum^m_{i = 1} \hat{\vec{x}}_i(y_i - p_1(\hat{\vec{x}}_i; \beta))$$</p><p>$$\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = \sum^m_{i=1}\hat{\vec{x}}_i\hat{\vec{x}}_i^T\,p_1(\hat{\vec{x}}_i; \beta)(1 - p_1(\hat{\vec{x}}_i; \beta))$$</p>
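<p>A minimal sketch of fitting logistic regression with the Newton updates above; a tiny ridge term is added to the Hessian purely for numerical stability, and the data are synthetic.</p><pre class=""><code class="">import numpy as np

def logreg_newton(X, y, n_iter=20):
    """Fit logistic regression by Newton's method, following the update above.

    X: m x d attribute matrix, y: 0/1 labels. A constant-1 column is appended
    so that beta = (w; b).
    """
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    beta = np.zeros(Xh.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-Xh @ beta))            # p_1(x_hat_i; beta)
        grad = -Xh.T @ (y - p1)                          # first derivative of l(beta)
        hess = (Xh * (p1 * (1 - p1))[:, None]).T @ Xh    # second derivative of l(beta)
        hess = hess + 1e-9 * np.eye(Xh.shape[1])         # tiny ridge for numerical stability
        beta = beta - np.linalg.solve(hess, grad)        # Newton update
    return beta[:-1], beta[-1]                           # (w, b)

# hypothetical binary-labelled data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(0, 0.5, 200) &gt; 0).astype(float)
print(logreg_newton(X, y))</code></pre>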
<h3 id="34-">3.4 Linear discriminant analysis</h3><p><img src="https://mixapi2.verxie.org/api/v2/objects/icon/e6iay6yij1qzwxo26k.png" height="273" width="381"/></p><p>To make the projected points of same-class examples as close as possible, we can make the covariance of the projected points of same-class examples as small as possible, i.e. make $\vec{w}^T\Sigma_0\vec{w} + \vec{w}^T\Sigma_1\vec{w}$ as small as possible.</p><p>To make the projected points of different-class examples as far apart as possible, we can make the distance between the class centers as large as possible, i.e. make $\|\vec{w}^T\mu_0 - \vec{w}^T\mu_1\|^2_2$ as large as possible.</p><p>Considering both at once gives the objective to maximize:</p><p>$$\begin{aligned}J&amp;=\frac{\|\boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_0-\boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_1\|_2^2}{\boldsymbol{w}^\mathrm{T}\boldsymbol{\Sigma}_0\boldsymbol{w}+\boldsymbol{w}^\mathrm{T}\boldsymbol{\Sigma}_1\boldsymbol{w}}\\&amp;=\frac{\boldsymbol{w}^\mathrm{T}(\boldsymbol{\mu}_0-\boldsymbol{\mu}_1)(\boldsymbol{\mu}_0-\boldsymbol{\mu}_1)^\mathrm{T}\boldsymbol{w}}{\boldsymbol{w}^\mathrm{T}(\boldsymbol{\Sigma}_0+\boldsymbol{\Sigma}_1)\boldsymbol{w}}\end{aligned}$$</p><p>Define the "within-class scatter matrix":</p><p>$$\begin{aligned}\mathbf{S}_{w}&amp;=\boldsymbol{\Sigma}_0+\boldsymbol{\Sigma}_1 \\ &amp;=\sum_{x\in X_{0}}\left(\boldsymbol{x}-\boldsymbol{\mu}_{0}\right)\left(\boldsymbol{x}-\boldsymbol{\mu}_{0}\right)^{\mathrm{T}}+\sum_{x\in X_{1}}\left(\boldsymbol{x}-\boldsymbol{\mu}_{1}\right)\left(\boldsymbol{x}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}}\end{aligned}$$</p>
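<p>The notes stop at the within-class scatter matrix; as a sketch, the snippet below uses the standard closed-form LDA direction $w = S_w^{-1}(\mu_0 - \mu_1)$, which maximizes $J$ up to scale (this closed form is not derived in the notes above). The data are synthetic.</p><pre class=""><code class="">import numpy as np

def lda_direction(X0, X1):
    """Two-class LDA projection direction w proportional to S_w^{-1} (mu0 - mu1).

    X0, X1: rows are the examples of class 0 and class 1.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)   # within-class scatter
    w = np.linalg.solve(S_w, mu0 - mu1)
    return w / np.linalg.norm(w)

# hypothetical 2-D data: two Gaussian blobs
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 0.5, size=(50, 2))
X1 = rng.normal([2, 1], 0.5, size=(50, 2))
print(lda_direction(X0, X1))</code></pre>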
<blockquote>
<p>参考书籍: 《机器学习》- 周志华著</p></blockquote></div><p style="text-align:right"><a href="https://blog.verxie.org/posts/study/2023-08-02-Machine-Learning-Note#comments">看完了？说点什么呢</a></p></div>]]></description><link>https://blog.verxie.org/posts/study/2023-08-02-Machine-Learning-Note</link><guid isPermaLink="true">https://blog.verxie.org/posts/study/2023-08-02-Machine-Learning-Note</guid><dc:creator><![CDATA[Ver]]></dc:creator><pubDate>Fri, 19 Dec 2025 07:07:28 GMT</pubDate></item><item><title><![CDATA[崩坏3rd - 第一部剧情全流程]]></title><description><![CDATA[<div><blockquote>该渲染由 Shiro API 生成，可能存在排版问题，最佳体验请前往：<a href="https://blog.verxie.org/posts/entertainment/2025-12-19-Honkai3rd">https://blog.verxie.org/posts/entertainment/2025-12-19-Honkai3rd</a></blockquote><div><blockquote><p>欢迎进入这个美好的世界</p></blockquote>
<p><a href="https://www.bilibili.com/video/BV1LV411Z7fz/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">崩坏三全剧情-不含战斗</a></p><ul><li><input readOnly="" type="checkbox" checked=""/> 1到4章正常打<br/><ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV13N411d7DA/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">编年史《永世回忆篇》；</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 5章结束
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1012">漫画《第二次崩坏》</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 6章结束
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV13N411d7DA/?p=2&amp;share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">编年史《轩辕篇》、编年史《蚩尤篇》</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 7章结束
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1001">漫画《逃离长空篇》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/bangumi/play/ep280146/?share_source=copy_web">《女武神的餐桌》第一季</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 8、9章连打，9间章</li><li><input readOnly="" type="checkbox" checked=""/> 10章结束，
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1015">漫画《双子：起源》</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1004">漫画《绀海篇》</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 11、11间章结束
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1010">漫画《神之键秘话》的其中一话：地藏御魂</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 12章结束</li><li><input readOnly="" type="checkbox" checked=""/> 13、14章结束
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV1U7411N7vr/">【《崩坏3》动画短片「天穹流星」】</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://zh.moegirl.org.cn/zh-hans/%E7%90%AA%E4%BA%9A%E5%A8%9C%C2%B7%E5%8D%A1%E6%96%AF%E5%85%B0%E5%A8%9C(%E5%B4%A9%E5%9D%8F3)/%E9%A2%86%E5%9F%9F%E8%A3%85%C2%B7%E7%99%BD%E7%BB%83">琪亚娜装甲「领域装·白练」的队长技能名称</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1017">《starglow》（记得看看歌词）</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://comic.bh3.com/book/1017">漫画《蛇之篇》</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 15、16、17章连着
<ul><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV1iz4y1X7uB/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">《雷鸣的剧场——女孩内心的碎梦在此重演》</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV1H54y1y7wJ/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">《四周年——于未尽的硝烟之中，希望永续》</a></li></ul></li><li><input readOnly="" type="checkbox" checked=""/> 18、19结束后，
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1010">漫画《神之键秘话》的其中一话：原初之翼</a></li><li><input readOnly="" type="checkbox" checked=""/>  <a href="https://www.bilibili.com/video/BV13N411d7DA/?p=2&amp;share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">编年史《遗忘之人》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1019">漫画《年岁》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1022">漫画《云墨剑心》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://webstatic.mihoyo.com/bh3/event/novel-7swords/index.html#/">视觉小说《神州折剑录》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1008">漫画《月影篇》</a></li></ul></li><li><input readOnly="" type="checkbox"/> 20、21、22章结束后，
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV14N411o7Ut/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">4.6版本宣传PV</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://event.bh3.com/avgAntiEntropy/indexAntiEntropy.php?from=ipz">视觉小说《逆熵》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/bangumi/play/ep332280/?share_source=copy_web">《女武神的餐桌》第二季</a></li></ul></li><li><input readOnly="" type="checkbox"/> 23章结束后
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1fh411U7oe/?share_source=copy_web">希儿的一日冒险——今天也猜不透另一个我在想什么</a></li></ul></li><li><input readOnly="" type="checkbox"/> 24章结束后
<ul><li><input readOnly="" type="checkbox"/>  有时间的话建议24、25连着做完，25章大概4小时的样子。如果不能一次过完，那建议把24章第三部分留着和25章一起做，这是怕你当天晚上睡不着觉</li></ul></li><li><input readOnly="" type="checkbox"/> 25章结束后，
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV14X4y1w7P6/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">5.0版本宣传PV</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1rQ4y127oT/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">《星火的燃起——琪亚娜的成长之路》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1Mh411Y7UT/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">《「流星的旅途」——「薪炎***」**幕后花絮》</a></li></ul></li></ul><p>---</p><ul><li>分割
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://event.bh3.com/avgAntiEntropy/indexDurandal.php">视觉小说《幽兰黛尔》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1dA411j7kE/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">4.4版本宣传PV</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1018">漫画《雾都假日》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1021">漫画《传承》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1023">漫画《异乡》</a></li><li><input readOnly="" type="checkbox"/>  <a href="https://comic.bh3.com/book/1009">漫画《紫鸢篇》</a></li></ul></li></ul><p>---</p><ul><li><input readOnly="" type="checkbox"/> 25间章</li><li><input readOnly="" type="checkbox"/> 26、27
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1hP4y1E7F3/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">「初心永恒」——***********「天**英」宣传PV</a></li></ul></li><li><input readOnly="" type="checkbox"/> 28结束后
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1kF411p7KE/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">《时光作证》</a></li></ul></li><li>接着不要打主线，去打往世乐土，
<ul><li><input readOnly="" type="checkbox"/>  <a href="https://www.bilibili.com/video/BV1vg411Y7si/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">往世乐土</a></li></ul></li><li><input readOnly="" type="checkbox"/> 接主线29、30、31
<ul><li><input readOnly="" type="checkbox"/>  动画《黄金庭院》，没找到中配 待定</li></ul></li><li><input readOnly="" type="checkbox"/> 接着打主线32～35章</li></ul><p>---
至此第一部完结（一次性打完体验更佳）</p><p>如果还有兴致，可以去看看之前的活动剧情，比如<a href="https://www.bilibili.com/video/BV1Ck4y1r7HG/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">崩坏国记</a>，<a href="https://www.bilibili.com/video/BV1xQ4y1P7eq/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">星与你消失之日</a>，之类的。</p><p>---</p><ul><li>可以观看的二创</li></ul><blockquote><p>参考视频: <a href="https://www.bilibili.com/video/BV1ZT4m1U7Kb/?share_source=copy_web&amp;vd_source=e8fc69104b94ef7031623d697eff8c11">【崩3剧情观看流程推荐】</a></p></blockquote></div><p style="text-align:right"><a href="https://blog.verxie.org/posts/entertainment/2025-12-19-Honkai3rd#comments">看完了？说点什么呢</a></p></div>]]></description><link>https://blog.verxie.org/posts/entertainment/2025-12-19-Honkai3rd</link><guid isPermaLink="true">https://blog.verxie.org/posts/entertainment/2025-12-19-Honkai3rd</guid><dc:creator><![CDATA[Ver]]></dc:creator><pubDate>Fri, 19 Dec 2025 07:02:47 GMT</pubDate></item></channel></rss>