# Binance P2P Data Collection — Detailed Implementation Spec > **Purpose:** This document is the single source of truth for the data collection phase. Every field, every endpoint, every edge case is specified so a coder can implement without ambiguity. > > **Status:** Phase 1 — Data Collection only. No ML. No trading. No algorithm decisions yet. --- ## 1. The Core Loop (Exact Pseudocode) ``` while True: try: buy_snap = fetch_all_ads(tradeType="BUY", asset="USDT", fiat="VES") sell_snap = fetch_all_ads(tradeType="SELL", asset="USDT", fiat="VES") flat_buy = [normalize_ad(ad, "BUY", now_utc) for ad in buy_snap] flat_sell = [normalize_ad(ad, "SELL", now_utc) for ad in sell_snap] validate_snapshot(flat_buy + flat_sell) store_parquet(flat_buy, base_path / "raw" / "buy_ads" / date_partition) store_parquet(flat_sell, base_path / "raw" / "sell_ads" / date_partition) log_success(len(flat_buy), len(flat_sell), elapsed) except Exception as e: log_error(e, consecutive_failures) consecutive_failures += 1 if consecutive_failures >= 5: write_alert_file() # human needs to check sleep(jitter(interval_seconds)) # default 300s ± 10% ``` --- ## 2. API Client — Exact Implementation ### 2.1 Endpoint ``` POST https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search ``` **No API key.** This is fully public. ### 2.2 Headers | Header | Value | |---|---| | `Content-Type` | `application/json` | | `User-Agent` | `Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36` | | `Accept` | `*/*` | | `Origin` | `https://p2p.binance.com` | | `Referer` | `https://p2p.binance.com/` | ### 2.3 Request Body (BUY example) ```json { "asset": "USDT", "fiat": "VES", "tradeType": "BUY", "page": 1, "rows": 20, "payTypes": [], "countries": [], "publisherType": null, "classify": "personal", "filter": {} } ``` **Key notes for the coder:** - `tradeType: "BUY"` = advertiser wants to **give you VES** in exchange for your USDT. They are *buying* USDT from you. - `tradeType: "SELL"` = advertiser wants to **give you USDT** in exchange for your VES. They are *selling* USDT to you. - `payTypes: []` = no filter, return all payment methods - `rows: 20` = Binance's max per page (do not change) - `publisherType: null` = both merchants and regular users - `classify: "personal"` = personal ads (not business) — covers the P2P marketplace ### 2.4 Pagination Logic ```python def fetch_all_ads(trade_type, asset, fiat, max_pages=10): all_ads = [] for page in range(1, max_pages + 1): body = { "asset": asset, "fiat": fiat, "tradeType": trade_type, "page": page, "rows": 20, "payTypes": [], "countries": [], "publisherType": None, "classify": "personal", "filter": {} } resp = httpx.post(URL, json=body, headers=HEADERS, timeout=15) resp.raise_for_status() data = resp.json() if not data.get("success"): raise APIError(f"API returned success=false: {data}") ads = data.get("data", []) total = data.get("total", 0) all_ads.extend(ads) # Stop if we've collected all available ads if len(all_ads) >= total: break # Don't request a page that starts beyond total ads if page * 20 >= total: break if page < max_pages: time.sleep(0.5) # 500ms between pages return all_ads ``` ### 2.5 Rate Limiting — Defensive Strategy | Event | Wait time | Notes | |---|---|---| | Between pages (same snapshot) | 500 ms | Fixed | | Between snapshots (BUY → SELL) | 1 second | Fixed | | Between full cycles | 300 s ± 30s | Jittered to avoid clock sync | | HTTP 429 (rate limited) | 60s → 120s → 240s → 480s | Exponential backoff, cap at 480s | | Connection error | 30s retry | Transient network issues | | 5xx server error | 60s retry | Binance server-side issues | **Important:** After a 429, reset the backoff after one successful full snapshot. ### 2.6 Proxy Support (Optional — keep simple first) Start with **no proxy**, direct from VPS. Only add proxy rotation if we hit rate limits. Binance rarely rate-limits P2P at 1 request/5min. --- ## 3. Normalization — Exact Field Mapping ### 3.1 The Flattened Schema (one row = one ad) | # | Output field | Type | JSON path | Notes | |---|---|---|---|---| | 1 | `snapshot_id` | string | auto: `{fetch_ts_iso}_{trade_type}` | e.g. `"20260605T133000Z_BUY"` | | 2 | `fetched_at` | datetime | auto: now_utc | Always UTC | | 3 | `fetched_date` | string | auto: YYYY-MM-DD | Partition column | | 4 | `trade_type` | string | `adv.tradeType` | "BUY" or "SELL" | | 5 | `adv_no` | string | `adv.advNo` | Unique ad ID | | 6 | `asset` | string | `adv.asset` | "USDT" | | 7 | `fiat` | string | `adv.fiatUnit` | "VES" | | 8 | `price` | float | `adv.price` | Parse as float | | 9 | `surplus_amount` | float | `adv.surplusAmount` | Remaining USDT | | 10 | `min_amount` | float | `adv.minSingleTransAmount` | Min USDT per trade | | 11 | `max_amount` | float | `adv.maxSingleTransAmount` | Max USDT per trade | | 12 | `tradable_quantity` | float | `adv.tradableQuantity` | Same as surplus? | | 13 | `advertiser_no` | string | `advertiser.userNo` | **Stable ID** — use this | | 14 | `advertiser_name` | string | `advertiser.nickName` | For reference only | | 15 | `advertiser_type` | string | `advertiser.userType` | "merchant" or "user" | | 16 | `month_order_count` | int | `advertiser.monthOrderCount` | | | 17 | `month_finish_rate` | float | `advertiser.monthFinishRate` | 0.0 to 1.0 | | 18 | `positive_rate` | float | `advertiser.positiveRate` | 0.0 to 1.0 | | 19 | `user_positive_rate` | float | `advertiser.userPositiveRate` | older field, same idea | | 20 | `payment_methods` | list[str] | `adv.tradeMethods[].payType` | e.g. `["BANESCO", "PAGO_MOVIL"]` | | 21 | `payment_method_ids` | list[str] | `adv.tradeMethods[].identifier` | e.g. `["Banco_Banesco", "Pago_Movil"]` | | 22 | `ad_created_at` | datetime | `adv.createTime` | Unix millisecond → datetime | | 23 | `price_type` | string | `adv.priceType` | Usually "FIXED" | ### 3.2 JSON Path Details (nested structure) The API response has this structure: ```json { "data": [ { "adv": { "advNo": "6f8b2e...", "tradeType": "BUY", "asset": "USDT", "fiatUnit": "VES", "price": "58.50", "surplusAmount": "1520.43", "maxSingleTransAmount": "5000.00", "minSingleTransAmount": "100.00", "tradableQuantity": "1520.43", "createTime": 1749128400000, "fiatSymbol": "Bs", "priceType": "FIXED", "tradeMethods": [ { "identifier": "Banco_Banesco", "payType": "BANESCO", "payMethodId": "BANESCO" }, { "identifier": "Pago_Movil", "payType": "PAGO_MOVIL", "payMethodId": "PAGO_MOVIL" } ] }, "advertiser": { "userNo": "ABC123", "nickName": "CryptoTraderVE", "userType": "merchant", "monthOrderCount": 342, "monthFinishRate": 0.97, "positiveRate": 0.99, "userPositiveRate": 0.99 } } ], "total": 156, "pageSize": 20, "success": true } ``` ### 3.3 Normalization Code Sketch ```python def normalize_ad(raw_ad: dict, trade_type: str, fetched_at: datetime) -> dict: adv = raw_ad["adv"] adver = raw_ad["advertiser"] payment_methods = [m["payType"] for m in adv.get("tradeMethods", [])] payment_method_ids = [m["identifier"] for m in adv.get("tradeMethods", [])] return { "snapshot_id": f"{fetched_at.strftime('%Y%m%dT%H%M%SZ')}_{trade_type}", "fetched_at": fetched_at, "fetched_date": fetched_at.strftime("%Y-%m-%d"), "trade_type": trade_type, "adv_no": adv["advNo"], "asset": adv["asset"], "fiat": adv["fiatUnit"], "price": float(adv["price"]), "surplus_amount": float(adv.get("surplusAmount", 0)), "min_amount": float(adv.get("minSingleTransAmount", 0)), "max_amount": float(adv.get("maxSingleTransAmount", 0)), "tradable_quantity": float(adv.get("tradableQuantity", 0)), "advertiser_no": adver["userNo"], "advertiser_name": adver["nickName"], "advertiser_type": adver.get("userType", "user"), "month_order_count": adver.get("monthOrderCount", 0), "month_finish_rate": float(adver.get("monthFinishRate", 0)), "positive_rate": float(adver.get("positiveRate", 0)), "user_positive_rate": float(adver.get("userPositiveRate", 0)), "payment_methods": payment_methods, # e.g. ["BANESCO", "PAGO_MOVIL"] "payment_method_ids": payment_method_ids, # e.g. ["Banco_Banesco", "Pago_Movil"] "ad_created_at": datetime.fromtimestamp( adv["createTime"] / 1000, tz=timezone.utc ), "price_type": adv.get("priceType", "FIXED"), } ``` --- ## 4. Payment Methods — The Critical Column ### 4.1 Known Payment Method Identifiers for Venezuela | `payType` value | `identifier` value | Common name | |---|---|---| | `BANESCO` | `Banco_Banesco` | Banesco bank transfer | | `MERCANTIL` | `Banco_Mercantil` | Mercantil bank transfer | | `PROVINCIAL` | `Banco_Provincial` | Banco Provincial (BBVA) | | `VENEZUELA` | `Banco_De_Venezuela` | Banco de Venezuela (BDV) | | `BANCO_NACIONAL_CREDITO` | `Banco_Nacional_De_Credito` | BNC | | `SOFITASA` | `Sofitasa` | Sofitasa | | `BANCAMIGA` | `Bancamiga` | Bancamiga | | `BANCO_EXTERIOR` | `Banco_Exterior` | Banco Exterior | | `BANCO_OCCIDENTE` | `Banco_Occidente` | Banco Occidental de Descuento (BOD) | | `BANCO_PLATA` | `Banco_Plata` | Banco Plaza | | `BANESCO_PERSONAL` | `Banesco_Personal` | Banesco personal account | | `PAGO_MOVIL` | `Pago_Movil` | Mobile payment (inter-bank) | | `BANCANET` | `Bancanet` | Bancanet | | `BANPLUS` | `Banplus` | Banplus | | `ZELLE` | `Zelle` | Zelle (USD, not VES) | | `PAYPAL` | `Paypal` | PayPal (USD) | | `CASH_VEF` | `Efectivo_VEF` | Cash in VES | | `CASH_USD` | `Efectivo_USD` | Cash in USD | | `PAGO_MOVIL` | `Pago_Movil_Banco_Venezuela` | Mobile payment at specific bank | ### 4.2 Why This Matters for Bank Arbitrage ```python # Example analysis query after ~1 week of data: # For each snapshot, find the best path: # # Best BUY price (sell USDT → get VES): Banesco, 60.50 VES/USDT # Best SELL price (buy USDT → give VES): Mercantil, 62.30 VES/USDT # Gross arbitrage: 62.30 - 60.50 = 1.80 VES/USDT = ~2.9% spread # # If same bank: you lose 0% on internal transfer # If different banks: you lose bank transfer fee (maybe 0.5%) # Net profit = 2.9% - 0.5% = 2.4% per round trip ``` ### 4.3 Storage Consideration `payment_methods` is a **list of strings** — this is fine in Parquet (stored as a repeated field). For CSV it would need to be JSON-encoded or one-hot encoded later. --- ## 5. Storage — Exact File Layout ``` /path/to/data/ ├── raw/ │ ├── buy_ads/ │ │ └── year=2026/ │ │ └── month=06/ │ │ └── day=05/ │ │ ├── snapshot_20260605_133000.parquet │ │ ├── snapshot_20260605_133500.parquet │ │ └── ... │ ├── sell_ads/ │ │ └── year=2026/ │ │ └── month=06/ │ │ └── day=05/ │ │ ├── snapshot_20260605_133000.parquet │ │ └── ... │ └── daily_merged/ <-- OPTIONAL: daily combined view │ └── year=2026/ │ └── month=06/ │ └── 2026-06-05.parquet │ ├── logs/ │ └── collector_20260605.log │ ├── alerts/ <-- alert marker files go here │ └── (empty if no issues) │ └── checkpoint.json <-- for restart resilience ``` ### 5.1 File Naming Convention **Snapshot files:** `snapshot_{YYYYMMDD}_{HHMMSS}.parquet` - Time used: the start timestamp of the snapshot (UTC) - Example: `snapshot_20260605_133000.parquet` **Why no UUIDs?** The timestamp + trade_type partition is already unique. No repeated names unless you run two collectors (don't). ### 5.2 Atomic Writes (No Partial Files) ```python def store_parquet(rows, base_dir, fetched_at): if not rows: return # Build partition path from timestamp year = fetched_at.strftime("%Y") month = fetched_at.strftime("%m") day = fetched_at.strftime("%d") filename = f"snapshot_{fetched_at.strftime('%Y%m%d_%H%M%S')}.parquet" dest_dir = Path(base_dir) / f"year={year}" / f"month={month}" / f"day={day}" dest_dir.mkdir(parents=True, exist_ok=True) # Write to temp file first tmp_path = dest_dir / (filename + ".tmp") final_path = dest_dir / filename df = pd.DataFrame(rows) df.to_parquet(tmp_path, index=False, engine="pyarrow") # Atomic rename tmp_path.rename(final_path) ``` ### 5.3 Schema Consistency Check Each snapshot should write a schema marker file once: ```python # After first successful write per partition, write schema.parquet as a reference schema_path = dest_dir / "_schema.parquet" if not schema_path.exists(): df.iloc[:0].to_parquet(schema_path) # empty DataFrame with same schema ``` This allows downstream readers to discover the schema without reading a full snapshot. --- ## 6. Data Validation During Collection ### 6.1 Row-Level Rejection Rules Reject (skip, don't crash) individual ads if: | Condition | Why | Action | |---|---|---| | `price` is None or ≤ 0 | Bad data | Log warning, skip | | `surplusAmount` is None or ≤ 0 | Ad has no USDT left | Log debug, skip | | `monthFinishRate` is 0.0 and `monthOrderCount` > 0 | Merchant hasn't completed any orders (suspicious) | Log warning, skip | | `price` < 1.0 or `price` > 500.0 | Way outside VES/USDT normal range (should be ~50–150) | Log warning, skip this ad | | Empty `advNo` | Missing identifier | Log error, skip | | Duplicate `advNo` within same snapshot | Possible API glitch | Log warning, keep first occurrence | ### 6.2 Snapshot-Level Validation After collecting all ads for one snapshot: ``` ✅ TOTAL ADS: BUY=47 SELL=53 (should be 20-200 each) ✅ PRICE RANGE: BUY [54.20 - 62.80] SELL [58.00 - 68.50] (SELL should be consistently higher than BUY) If not: LOG WARNING "BUY/SELL overlap detected" ✅ SPREAD: SELL_min - BUY_max = 58.00 - 62.80 = -4.80 (If negative: spread is inverted — unusual but possible) Log: "Current spread: {spread:.2f} VES/USDT" ✅ MEDIAN PRICE: BUY=58.30 SELL=63.50 ✅ AD STALENESS: 0 ads with createTime > 7 days old (If any: they're stale, still keep them, but log it) ✅ EMPTY SNAPSHOT: If BUY=0 AND SELL=0 → CRITICAL ALERT ``` ### 6.3 Snapshot Summary Log Line (one line per snapshot) ``` 2026-06-05 13:30:00 UTC | BUY=47 ads [54.20–62.80] SELL=53 ads [58.00–68.50] | spread= -4.80 | took 3.2s | methods=[BANESCO,PAGO_MOVIL,MERCANTIL,...] ``` --- ## 7. Scheduling & Lifecycle ### 7.1 Startup Behavior ``` 1. Read checkpoint.json (if exists) → "last_completed_snapshot": "2026-06-05T13:25:00Z" → Wait until (last_completed + interval) before starting → If checkpoint is missing or corrupted, start immediately 2. Verify data directory is writable → Try writing a test file, then delete it 3. Log: "Starting collector. Interval=300s. Pairs=USDT/VES" ``` ### 7.2 Graceful Shutdown ```python import signal running = True def handle_signal(sig, frame): global running logging.info("Received signal %s, finishing current snapshot...", sig) running = False signal.signal(signal.SIGINT, handle_signal) signal.signal(signal.SIGTERM, handle_signal) # In main loop: while running: # ... do snapshot ... # Write checkpoint after each successful snapshot write_checkpoint({"last_completed_snapshot": now_utc.isoformat()}) ``` ### 7.3 Checkpoint File Format ```json { "last_completed_snapshot": "2026-06-05T13:30:00Z", "last_buy_ad_count": 47, "last_sell_ad_count": 53, "consecutive_failures": 0, "total_snapshots": 284, "first_snapshot": "2026-06-01T00:00:00Z", "version": "1.0" } ``` ### 7.4 Alert Marker File After 5 consecutive failures, write: ``` /path/to/data/alerts/20260605_133000_5_failures.alert ``` Content: ```json { "timestamp": "2026-06-05T13:30:00Z", "error": "HTTP 500 after 3 retries", "consecutive_failures": 5, "traceback": "..." } ``` --- ## 8. First-Run Verification Protocol After the collector writes its **first snapshot**, the coder should manually verify: ### Step 1: Read the Parquet file back ```python import pandas as pd df = pd.read_parquet("data/raw/buy_ads/year=2026/month=06/day=05/snapshot_20260605_133000.parquet") df.info() df.head() ``` Check: - [ ] All columns present (23 columns from spec) - [ ] No null values in critical fields (price, adv_no, advertiser_no) - [ ] `price` is float type, not string - [ ] `fetched_at` is datetime type - [ ] `payment_methods` is a proper list column ### Step 2: Verify BUY vs SELL logic ```python buy_ads = df[df["trade_type"] == "BUY"] sell_ads = df[df["trade_type"] == "SELL"] print(f"BUY ads count: {len(buy_ads)}") print(f"SELL ads count: {len(sell_ads)}") print(f"BUY price range: {buy_ads['price'].min():.2f} - {buy_ads['price'].max():.2f}") print(f"SELL price range: {sell_ads['price'].min():.2f} - {sell_ads['price'].max():.2f}") ``` Expected: SELL prices are higher than BUY prices (advertiser selling USDT charges a premium vs. buying USDT). ### Step 3: Verify payment methods are captured ```python all_methods = set() for methods in df["payment_methods"]: all_methods.update(methods) print(f"Payment methods found: {sorted(all_methods)}") ``` Expected: At least BANESCO and PAGO_MOVIL will appear. Possibly 5–15 different banks. ### Step 4: Verify advertiser diversity ```python print(f"Unique advertisers: {df['advertiser_no'].nunique()}") print(f"Merchants: {(df['advertiser_type'] == 'merchant').sum()}") print(f"Users: {(df['advertiser_type'] == 'user').sum()}") ``` ### Step 5: Run the collector for 1 hour (~12 snapshots) and verify: ```bash ls data/raw/buy_ads/year=2026/month=06/day=05/ | wc -l # Should be ~12 ``` - [ ] No duplicate timestamps - [ ] No gaps > 6 minutes - [ ] No crash/restart in the logs --- ## 9. File & Module Structure (Exact) ``` p2p-collector/ ├── collect_p2p.py # Entry point: argument parsing, main loop ├── config.yaml # All configurable settings ├── binance_client.py # fetch_all_ads(), pagination, rate limiting ├── normalizer.py # normalize_ad(), flatten schema ├── storage.py # store_parquet(), atomic writes, checkpoint ├── validator.py # validate_row(), validate_snapshot() ├── scheduler.py # main loop, sleep/jitter, signal handling ├── alert.py # write_alert_file(), logging setup ├── utils.py # jitter(), datetime helpers ├── requirements.txt # pinned versions ├── Makefile # setup, run, clean, test commands ├── tests/ │ ├── test_normalizer.py # Test with sample API response │ ├── test_storage.py # Test atomic writes │ └── test_validator.py # Test rejection rules ├── sample_responses/ │ ├── response_buy.json # One real-ish API response for tests │ └── response_sell.json └── README.md # Run instructions ``` ### requirements.txt ``` httpx>=0.27,<1.0 pandas>=2.0,<3.0 pyarrow>=14.0,<16.0 pyyaml>=6.0,<7.0 ``` Note: `httpx` over `requests` because it has native timeout support, cleaner API. Fall back to `requests` if the coder prefers. --- ## 10. `config.yaml` — Complete Reference ```yaml binance: base_url: "https://p2p.binance.com/bapi/c2c/v2/friendly/c2c/adv/search" user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" timeout_seconds: 15 max_pages: 10 request_delay_seconds: 0.5 collection: pairs: - asset: "USDT" fiat: "VES" interval_seconds: 300 output_dir: "./data/raw" retry_attempts: 3 retry_delay_base_seconds: 10 validation: price_min: 1.0 price_max: 500.0 reject_zero_finish_rate: true reject_zero_surplus: true logging: level: "INFO" file: "./data/logs/collector.log" max_bytes: 10485760 # 10 MB backup_count: 5 format: "%(asctime)s | %(levelname)s | %(message)s" alerts: consecutive_failure_threshold: 5 alert_dir: "./data/alerts" ``` --- ## 11. Run Modes ### Mode 1: One-shot test ```bash python collect_p2p.py --once ``` - Fetches one BUY snapshot + one SELL snapshot - Writes to disk - Prints summary - Exits - Used for: first run, testing, debugging ### Mode 2: Daemon (continuous) ```bash python collect_p2p.py ``` - Runs forever - Loop with interval - Graceful shutdown on Ctrl+C ### Mode 3: Backfill (future) ```bash python collect_p2p.py --backfill --start=2026-06-01 --end=2026-06-03 ``` - Not needed now - Architecture supports it later ### Mode 4: Validate-only ```bash python collect_p2p.py --validate data/raw/buy_ads/year=2026/month=06/day=05/ ``` - Reads Parquet files - Runs validation checks - Prints report - No API calls --- ## 12. Testing the Coder's Work Hand this checklist to the coder when they say "it's done": | # | Test | How | |---|---|---| | 1 | **API connectivity** | `python collect_p2p.py --once` returns ads without error | | 2 | **Pagination works** | Inspect: total ads fetched vs `total` field from API | | 3 | **Both BUY and SELL** | Both directories have at least one file after `--once` | | 4 | **Schema correct** | `pd.read_parquet(file)` → 23 columns, correct dtypes | | 5 | **Payment methods populated** | At least 3 payment methods in the first snapshot | | 6 | **Atomic write** | Kill the process mid-write (SIGKILL), no partial files remain. Only `.tmp` files | | 7 | **Graceful shutdown** | Ctrl+C during a snapshot → clean exit, last snapshot saved | | 8 | **Restart resilience** | Start collector, kill it, restart → resumes without duplicate timestamps | | 9 | **Rate limiting** | No HTTP 429 in logs after 1 hour of continuous running | | 10 | **Storage efficiency** | 1 hour of data ≤ 3 MB total on disk | --- ## 13. Post-Collection — What the Data Will Look Like After One Week | Metric | Expected value | |---|---| | Snapshots collected | ~2,016 (7 days × 288 snapshots/day) | | Total raw ads | ~200,000–400,000 rows | | Storage used | ~20–100 MB | | Unique advertisers | 100–500 | | Unique payment methods | 10–20 | | Price range (BUY) | ~55–65 VES/USDT (fluctuates with parallel dollar) | | Price range (SELL) | ~58–70 VES/USDT | | Typical spread | ~2–6 VES/USDT (3–10%) | **After 1 week of collection, we stop and do EDA before any ML decisions.** --- ## 14. Known Gotchas / FAQ for the Coder **Q: What if the API returns different fields than documented?** A: The normalizer should use `.get()` with defaults for every field. Log a warning if a field is missing that we expected. Don't crash. **Q: What if `tradeMethods` is empty?** A: Some ads have no payment methods listed. Store as empty list `[]`. Continue. This is valid data. **Q: What timezone should I use?** A: **Everything in UTC.** The user is in VET (UTC-4), but all stored timestamps are UTC. Timezone conversion is only for display. **Q: What if the VPS reboots?** A: systemd `Restart=always` handles this. The collector reads the last checkpoint and continues after the appropriate delay. **Q: Should I use asyncio?** A: No. Simple synchronous code. The delay between requests (5 minutes) means async provides zero benefit and adds complexity. **Q: Can I use SQLite instead of Parquet?** A: You could, but Parquet is more storage-efficient and directly loadable into ML frameworks (Pandas, Polars, PyTorch). Stick with Parquet. --- *End of data collection spec. Hand this to the coding agent as the single source of truth.*