Startup Data Self-Healing System: Bug Analysis 5

SHI XIAOLONG

15 Feb 2026 — 10 min read

Severe Bugs in the Data Self-Healing System: Round 4 Causal-Chain Analysis

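Analysis date: 2026-02-15
Files involved: orchestrator.py, repair_executor.py, test_basic.py
Background: the round 3 rewrite fixed the 7 major defects found in the first two rounds, but the current code still contains 4 bugs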

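Overview

The round 3 rewrite introduced the Diagnosis dataclass, the make_interval SQL fix, and a freshness check, and it resolved the earlier fatal problems. Code review shows that the current version still contains 4 bugs. In certain scenarios BUG #1 and BUG #2 can combine to make the self-healing system fail silently (no error is reported, yet no data is repaired).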

BUG #1 (Severe 🔴): `_diagnose()` generates empty repair targets for the triple anomaly "discontinuous + stale + insufficient"

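Causal chain

Stage	Details
Input	The system restarts after more than 24 days of downtime and `heal_and_prepare(required_count=144)` begins to run
State change	`_load_zscore_history(144)` loads 10 records (the 48h window can hold at most 12 records of 4h data)
Call path	`_diagnose(records=[10 records], required_count=144)`
Failure point	`orchestrator.py:251-263`: the conditions of the three checks are mutually exclusive, so the stale and shortfall target lists both stay empty
Root cause	`stale_targets` requires `is_continuous AND len >= 144`, and `shortfall_targets` requires `is_continuous AND not gap_times`, but the actual data is discontinuous, insufficient, and stale all at once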

Detailed walkthrough

# orchestrator.py, lines 250-263
# 1. Continuity check
is_continuous, gap_times, completeness = \
    self.checker.check_continuity(records, required_count)
# Result: there are only 10 records, and they may well be continuous (correct spacing between them)
# is_continuous = True, gap_times = [], completeness = 6.9%

# 2. Freshness check
is_fresh, staleness_min = self._check_freshness(records)
# Result: the newest record is 24 days old → staleness = 34560min
# is_fresh = False

stale_targets = []
if not is_fresh and is_continuous and len(records) >= required_count:
    # ❌ 10 < 144 → condition not met → stale_targets stays empty
    stale_targets = self._generate_stale_targets(records)

# 3. Shortfall check
shortfall_targets = []
if is_continuous and not gap_times and len(records) < required_count:
    # ✅ True and True and True → condition met!
    shortfall_targets = self._generate_shortfall_targets(records, 144)
    # Generates 134 targets extending toward earlier times

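In this walkthrough the condition is_continuous and not gap_times and len(records) < required_count is in fact satisfied, so the shortfall targets are generated. A more dangerous scenario is the following: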

Scenario: 10 records with 1 time gap among them
  is_continuous = False
  gap_times = [3 missing timestamps]
  len(records) = 10 < 144
  is_fresh = False

Check 2 (stale_targets):
  Condition: not is_fresh AND is_continuous AND len >= 144
  Result: True AND False AND ... → not met → stale_targets = []

Check 3 (shortfall_targets):
  Condition: is_continuous AND not gap_times AND len < 144
  Result: False AND ... → not met → shortfall_targets = []

Final Diagnosis:
  gap_targets = [3 timestamps]    ← only these 3
  stale_targets = []
  shortfall_targets = []
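
The two scenarios above can be replayed with a minimal sketch of the branch conditions. It assumes only the boolean conditions quoted from `_diagnose()`; the function name `diagnose_flags` and the `TargetFlags` dataclass are hypothetical and are not part of orchestrator.py.

```python
from dataclasses import dataclass


@dataclass
class TargetFlags:
    """Which target lists the two checks would populate (hypothetical helper)."""
    stale: bool
    shortfall: bool


def diagnose_flags(is_continuous: bool, gap_count: int, is_fresh: bool,
                   record_count: int, required_count: int) -> TargetFlags:
    # Same boolean conditions as the excerpt from _diagnose() quoted above.
    stale = (not is_fresh) and is_continuous and record_count >= required_count
    shortfall = is_continuous and gap_count == 0 and record_count < required_count
    return TargetFlags(stale=stale, shortfall=shortfall)


# Scenario 1: 10 continuous but stale records -> shortfall targets are generated.
continuous_case = diagnose_flags(True, 0, False, 10, 144)
# Scenario 2: 10 stale records with gaps -> neither list is populated.
gapped_case = diagnose_flags(False, 3, False, 10, 144)
print(continuous_case, gapped_case)
```

In the gapped case only the gap targets remain, which is why the first repair round handles just the 3 missing timestamps.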

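Consequence chain

_merge_repair_targets(diagnosis)
  → all_targets = gap_targets (only the 3 missing timestamps)
  → executor.repair(): repairs only the 3 gaps
  → reload: 13 records (10 + 3)
  → round 2 diagnosis: is_continuous = True, len = 13 < 144
  → the shortfall_targets condition is met, generating 131 targets extending toward earlier times
  → executor.repair(): attempts to repair the 131 targets
  → because the K-line window needs 130 bars, most of the 131 targets require even earlier K-lines
  → may partially succeed

→ In practice, after running the 3-round loop the data can be repaired to some extent (not fatal, but inefficient)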
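
The multi-round behavior traced above can be sketched as a small loop. This is an illustration under assumptions, not the code in orchestrator.py: `heal_loop`, its two callbacks, and the "healed" status are hypothetical stand-ins, and only the 3-round limit, the "no progress → break" rule, and the 'failed' status come from the analysis.

```python
from typing import Callable


def heal_loop(load_count: Callable[[], int],
              repair: Callable[[int], int],
              required_count: int,
              max_rounds: int = 3) -> str:
    """Repeat diagnose/repair rounds until enough records exist or no progress is made.

    load_count() returns how many records are currently stored.
    repair(missing) tries to fill `missing` records and returns how many it repaired.
    """
    for _ in range(max_rounds):
        have = load_count()
        if have >= required_count:
            return "healed"
        repaired = repair(required_count - have)
        if repaired == 0:
            # No progress: give up instead of retrying the same targets.
            break
    return "healed" if load_count() >= required_count else "failed"


# Toy store with 10 records; each round can repair at most 100 records.
store = {"count": 10}


def repair_some(missing: int) -> int:
    fixed = min(missing, 100)
    store["count"] += fixed
    return fixed


partial_then_full = heal_loop(lambda: store["count"], repair_some, 144)
# A repairer that never succeeds ends the loop after one round.
no_progress = heal_loop(lambda: 0, lambda missing: 0, 144)
print(partial_then_full, no_progress)
```

The second call shows the failure mode that matters for the rest of this section: a single round with zero repaired records ends the loop.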

However, the bug is triggered in the following edge case:

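Scenario: records = [] (the database has no records at all for this symbol)
  → _diagnose takes the line 238 branch: records is empty
  → returns shortfall_targets = _generate_full_timeline(144)
  → all_targets = 144 targets
  → executor.repair(144 targets)
  → _find_kline_gaps tries to look up K-lines...

  But: the timestamps generated by _generate_full_timeline(144) are computed backward from NOW()
  → they reach back to roughly NOW()-576h
  → the exchange API may not provide 4h K-lines that old
  → fill_missing_data_precise goes beyond the API's data range
  → many K-line backfills fail
  → the window in _repair_from_klines is too short → zscore calculation fails
  → repaired_count = 0
  → "no progress" → break
  → final status = 'failed'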

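⚠️ WARNING: This bug has a large impact in the records=[] scenario. When the database has no data at all, the 144 timestamps generated by _generate_full_timeline span 576 hours (24 days), but most exchange APIs only provide the most recent few days to few weeks of 4h K-lines. Many of the earliest timestamps will fail to repair because the API data is unavailable.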
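
The arithmetic behind this warning can be checked with a short sketch. The 7-day API window used here is an assumption for illustration (real exchange limits differ), and the helper name `targets_outside_window` is hypothetical.

```python
from datetime import timedelta


def targets_outside_window(required_count: int, interval_minutes: int,
                           api_lookback: timedelta) -> int:
    """Count timeline points (now - i * interval) that lie further back than api_lookback."""
    interval = timedelta(minutes=interval_minutes)
    return sum(1 for i in range(required_count) if i * interval > api_lookback)


# 144 points at a 4h (240 min) interval.
span = timedelta(minutes=144 * 240)
unreachable = targets_outside_window(144, 240, timedelta(days=7))
print(span, unreachable)
```

Under that assumed window, 101 of the 144 targets lie beyond what the API could serve, so most of the batch would fail before any zscore is computed.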

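Fix direction

# _generate_full_timeline should only generate timestamps within the range the exchange API can serve.
# Alternatively, repair_executor should repair from the newest timestamp first (rather than the oldest),
# so that at least the recent data gets repaired.

from datetime import datetime, timedelta
from typing import List

def _generate_full_timeline(self, required_count: int) -> List[datetime]:
    """With no history, count backward from the current time, limited to the range the API can serve."""
    interval = timedelta(minutes=self.interval_minutes)
    now = self._align_time(self._get_db_now())

    # Cap the maximum lookback (exchange APIs usually serve only the most recent N days)
    max_lookback_hours = 168  # 7 days; adjust to the exchange's actual limit
    max_count = min(required_count, int(max_lookback_hours * 60 / self.interval_minutes))

    times = [now - i * interval for i in range(max_count)]
    times.sort()
    return times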
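
A standalone version of the same idea, runnable outside the class, might look like the sketch below. `capped_timeline` and `newest_first` are hypothetical names, the caller is assumed to pass an already aligned `now`, and the 168-hour cap is the same placeholder value as above rather than a known exchange limit.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def capped_timeline(now: datetime, required_count: int,
                    interval_minutes: int = 240,
                    max_lookback_hours: int = 168) -> List[datetime]:
    """Build up to required_count timestamps going back from now, within the lookback cap."""
    interval = timedelta(minutes=interval_minutes)
    max_count = min(required_count, max_lookback_hours * 60 // interval_minutes)
    return sorted(now - i * interval for i in range(max_count))


def newest_first(targets: List[datetime]) -> List[datetime]:
    """Order repair targets so the most recent timestamps are attempted first."""
    return sorted(targets, reverse=True)


now = datetime(2026, 2, 15, 12, 0, tzinfo=timezone.utc)
timeline = capped_timeline(now, required_count=144)
print(len(timeline), timeline[0], newest_first(timeline)[0])
```

With a 4h interval the cap yields 42 timestamps instead of 144. Ordering them newest first covers the second suggestion in the comments: recent data is repaired even if older targets fail.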

BUG #2 (Severe 🟠): `_build_analysis_record` writes `zscore_4h` unconditionally, polluting data during non-4h repairs

Causal chain

Stage	Details
Input	`repair_executor.repair()` computes the zscore value with `timeframe='4h'`
State change	`_build_analysis_record()` builds the DB record
Call path	`repair_executor.py:166` → `_build_analysis_record(missing_time, symbol, base, timeframe, zscore, corr)`
Failure point	`repair_executor.py:320`: `'zscore_4h': zscore` is written unconditionally, without checking `timeframe`
Root cause	`zscore_5m` and `zscore_1h` are written conditionally on timeframe, but the condition was left out for `zscore_4h`

Detailed walkthrough

# repair_executor.py, lines 313-329
def _build_analysis_record(..., timeframe, zscore, ...):
    return {
        'zscore_5m': zscore if timeframe == '5m' else None,   # ← conditional
        'zscore_1h': zscore if timeframe == '1h' else None,   # ← conditional
        'zscore_4h': zscore,                                   # ❌ unconditional!
        'corr_5m_7d': corr if timeframe == '5m' else None,    # ← conditional
        'corr_1h_30d': corr if timeframe == '1h' else None,   # ← conditional
        'corr_4h_60d': corr if timeframe == '4h' else None,   # ← conditional
    }

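Current risk: _run_data_healing() currently hardcodes repair_timeframe='4h', so the bug does not trigger today. If the system is later extended to multi-timeframe repair (for example repair_timeframe='5m'), then: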

When timeframe = '5m':
  'zscore_5m': zscore ✅
  'zscore_1h': None   ✅
  'zscore_4h': zscore  ❌ Pollution! The 5m zscore overwrites the 4h field

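⚠️ IMPORTANT: Although it does not trigger today, this is a time bomb. The code is plainly inconsistent: of the three zscore fields, two have a condition and one does not.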

Fix

'zscore_4h': zscore if timeframe == '4h' else None,  # ← consistent with the other two lines
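
One way to keep such fields from drifting apart is to derive them all from a single mapping, so no field can miss its condition. The sketch below is only an illustration: `build_timeframe_fields` and the two mapping constants are hypothetical, while the field names are the ones quoted from repair_executor.py.

```python
from typing import Dict, Optional

ZSCORE_FIELDS = {"5m": "zscore_5m", "1h": "zscore_1h", "4h": "zscore_4h"}
CORR_FIELDS = {"5m": "corr_5m_7d", "1h": "corr_1h_30d", "4h": "corr_4h_60d"}


def build_timeframe_fields(timeframe: str, zscore: float,
                           corr: float) -> Dict[str, Optional[float]]:
    """Fill only the fields that belong to `timeframe`; all others are None."""
    if timeframe not in ZSCORE_FIELDS:
        raise ValueError(f"unsupported timeframe: {timeframe!r}")
    fields: Dict[str, Optional[float]] = {}
    for tf, name in ZSCORE_FIELDS.items():
        fields[name] = zscore if tf == timeframe else None
    for tf, name in CORR_FIELDS.items():
        fields[name] = corr if tf == timeframe else None
    return fields


five_min = build_timeframe_fields("5m", zscore=1.8, corr=0.92)
four_hour = build_timeframe_fields("4h", zscore=-0.4, corr=0.75)
print(five_min["zscore_4h"], four_hour["zscore_4h"])
```

A 5m repair then leaves `zscore_4h` as None, and an unknown timeframe fails loudly instead of writing into the wrong column.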

BUG #3 (Severe 🟠): `_build_analysis_record` hardcodes `cointegration_passed=True`

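Causal chain

Stage	Details
Input	After computing the zscore, the repairer calls `_build_analysis_record()`
State change	Writes to the `analysis_results` table
Failure point	`repair_executor.py:328`: `'cointegration_passed': True`
Root cause	The repairer never runs a cointegration test; it simply assumes the test passed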

Detailed walkthrough

# repair_executor.py, line 328
'cointegration_passed': True,  # ❌ hardcoded to True!

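How the real-time analysis path (realtime_kline_service_base.py:1203) writes it:

'cointegration_passed': multi_period_result['cointegration_count'] >= COINTEGRATION_THRESHOLD,

The real-time path runs an actual cointegration test through analyze_multi_period(), whereas the repair path simply assumes True.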

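Consequences

Records written by the repairer have cointegration_passed = True
The strategy engine may rely on the cointegration_passed field when it queries analysis_results
For pairs that do not actually pass cointegration, the strategy may make wrong trading decisions based on false "passed" records
Especially dangerous: the repairer batch-writes 144 records, all marked cointegration_passed=True, which seriously undermines the credibility of the historical data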

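⚠️ CAUTION: If the strategy engine uses the cointegration_passed field to filter trading decisions, this bug can lead to positions being opened on pairs that do not satisfy the cointegration condition.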

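Fix

# Option A: conservative, mark as None (unknown)
'cointegration_passed': None,  # the repair path did not run a cointegration test

# Option B: actually compute it
# Call the cointegration test inside _repair_from_klines (requires extra dependencies)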
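
If Option A is adopted, readers of analysis_results need to treat None as "not verified" rather than as a pass. A minimal sketch of such a check follows; `cointegration_confirmed` is a hypothetical helper, and how the real strategy engine reads this field is not shown in the source.

```python
from typing import Any, Mapping


def cointegration_confirmed(record: Mapping[str, Any]) -> bool:
    """Return True only for an explicit True; None (unknown) and False both fail."""
    return record.get("cointegration_passed") is True


# A record as the repair path would write it under Option A.
repaired_record = {"zscore_4h": 1.2, "cointegration_passed": None}
# A record as the real-time path would write it after a passing test.
realtime_record = {"zscore_4h": 1.2, "cointegration_passed": True}
print(cointegration_confirmed(repaired_record), cointegration_confirmed(realtime_record))
```

Comparing with `is True` avoids the trap of a plain truthiness test, where a future non-boolean marker such as a string could be read as a pass.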

BUG #4 (Low 🟢): `test_basic.py:291`: the misspelled `AsertionError` can never catch an assertion failure

Causal chain

Stage	Details
Input	Run `python test_basic.py`
State change	An `assert` statement fails and raises `AssertionError`
Call path	`test_basic.py:274 main()` → `test_xxx()` → `assert ... ❌`
Failure point	`test_basic.py:291`: `except AsertionError as e:`
Root cause	`AsertionError` is a misspelling; the correct spelling is `AssertionError`

Detailed walkthrough

# test_basic.py, lines 280-301
try:
    test_continuity_checker()
    test_quality_assessor()
    ...
except AsertionError as e:       # ❌ misspelled: one "s" is missing
    print(f"\nTEST FAILED: {e}")  # correct spelling: AssertionError
    ...
    return 1
except Exception as e:           # ← never reached: looking up the misspelled name above raises NameError first
    print(f"\nTEST ERROR: {e}")
    ...
    return 1

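Note: AsertionError is not a built-in exception name in Python, so one might expect this line to raise NameError as soon as the code is defined. It does not: the expression in an except clause is evaluated only after the try block has raised an exception. If no test fails, the name AsertionError is never looked up.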

But if a test really does fail:

1. assert fails → AssertionError is raised
2. except AsertionError → Python looks up "AsertionError" → NameError!
3. The NameError is not caught by except Exception, even though NameError is a subclass of Exception:
   it is raised while the except clauses are being matched, so it propagates as an unhandled exception

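Actual behavior (Python 3): except AsertionError raises NameError. This NameError is not caught by the later except Exception (the two are sibling except clauses of the same try); it propagates straight up and surfaces as a baffling NameError: name 'AsertionError' is not defined.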
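
This behavior is easy to reproduce in isolation. The sketch below uses a deliberately misspelled, undefined name in the first except clause; `run_tests` is a stand-in for the main() in test_basic.py, not a copy of it.

```python
def run_tests() -> str:
    try:
        # Equivalent to a failing assert statement.
        raise AssertionError("expected 1, got 2")
    except AsertionError as e:   # deliberately misspelled, undefined name
        return f"TEST FAILED: {e}"
    except Exception as e:       # never consulted: the clause above raises first
        return f"TEST ERROR: {e}"


try:
    outcome = run_tests()
except NameError as exc:
    outcome = f"NameError: {exc}"
    # The original AssertionError is kept as the implicit exception context.
    original = type(exc.__context__).__name__

print(outcome)
print(original)
```

Neither "TEST FAILED" nor "TEST ERROR" is ever returned. The original assertion survives only as the context of the NameError, which is why the real failure is hard to spot in the traceback.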

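Consequences

When all tests pass: no impact
When any assert fails: "TEST FAILED" is not shown; a hard-to-understand NameError appears instead
The test suite's error reporting is completely broken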

Fix

except AssertionError as e:  # ← correct spelling

Combined effect of the bugs

graph TD
    A["System restarts after a long downtime<br>repair_timeframe='4h'<br>required_count=144"] --> B["_load_zscore_history(144)"]
    B --> C{"Does the database have data?"}
    C -->|"no records"| D["_diagnose: records=[]<br>_generate_full_timeline(144)"]
    D --> E["generates 144 targets<br>span 576h = 24 days"]
    E --> F["executor.repair(144 targets)"]
    F --> G{"Does the exchange API<br>have 24 days of 4h K-lines?"}
    G -->|"usually not"| H["BUG #1<br>many K-line backfills fail"]
    H --> I["repaired_count ≈ 0<br>'no progress' → break"]
    I --> J["status = 'failed'"]

    C -->|"a few records<br>and discontinuous"| K["_diagnose: only a few gap_targets"]
    K --> L["repair a few gaps"]
    L --> M["reload"]
    M --> N{"144 records reached?"}
    N -->|"not yet"| O["shortfall_targets<br>extend toward earlier times"]
    O --> P["write to DB during repair"]
    P --> Q["BUG #3: everything marked<br>cointegration_passed=True"]
    Q --> R["strategy engine<br>trusts false cointegration results"]

    P --> S["BUG #2: zscore_4h<br>written unconditionally (currently safe)"]

    style H fill:#ff6b6b,color:#fff
    style J fill:#ff6b6b,color:#fff
    style Q fill:#ffa07a,color:#fff
    style S fill:#ffe4b5,color:#000

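Priority order

Priority	BUG	Severity	Scenario	Fix complexity
P0	BUG #1: `_generate_full_timeline` has no range limit	Severe 🔴	First onboarding of a new symbol / database wiped	Low
P1	BUG #3: `cointegration_passed` hardcoded to `True`	Severe 🟠	Triggers on every repair	Low
P2	BUG #2: `zscore_4h` written unconditionally	Severe 🟠	Does not trigger today; triggers once repair is extended to other timeframes	Very low (one-line change)
P3	BUG #4: `AsertionError` misspelling	Low 🟢	When a test fails	Very low (one-letter change)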

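Relationship to the bugs from the first three rounds

Round	Findings	Status
Round 1 (BUG1.md)	Conceptual confusion in `_find_kline_gaps`	✅ Fixed in round 3 (`_generate_complete_timeline`)
Round 2 (BUG2.md)	Hardcoded interval / no freshness check / time-window mismatch	✅ Fixed in round 3 (`Diagnosis` + `make_interval` + `_check_freshness`)
Round 3 (BUG3.md)	SQL parameterization / variable conflict / window boundary / test incompatibility	✅ Fixed in round 3
Round 4 (this round)	Unbounded timeline / unconditional zscore_4h / hardcoded cointegration / misspelling	❌ To be fixed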
