binance-p2p-market-history/p2p-collector/README.md
Gabriel Ramos 2c41a7a6b3 feat: implement binance p2p collector daemon
Set up continuous P2P VES/USDT market history data collection, normalization, validation, and date-partitioned Parquet storage.
2026-06-05 14:40:05 -04:00


# Binance P2P Data Collector
This tool continuously collects public peer-to-peer (P2P) market advertisements from Binance P2P for Venezuela (VES/USDT). It normalizes and validates each snapshot, then saves it as an atomically written, date-partitioned Parquet file for subsequent exploratory data analysis and arbitrage modeling.
## Project Structure
```
p2p-collector/
├── collect_p2p.py # Entry point: argument parsing, validation/daemon modes
├── config.yaml # Application configuration (endpoints, delays, validation limits)
├── binance_client.py # HTTP client, pagination logic, retry, and 429 backoff
├── normalizer.py # Converts raw nested API responses into a flat 23-column schema
├── validator.py # Row-level filtering and snapshot-level integrity checks
├── storage.py # Atomic Parquet writes, schema references, and checkpoints
├── scheduler.py # Loop executor, initial start offsets, signal handling
├── alert.py # Logger setup; writes alert marker files after 5 consecutive failures
├── utils.py # Time and sleep/jitter helpers
├── requirements.txt # Package dependencies (httpx, pandas, pyarrow, pyyaml)
├── Makefile # Automation targets (setup, test, run, clean)
└── tests/ # Suite of unit tests for all components
```
## Prerequisites
- **Python 3.8+** (Developed and tested with Python 3.14)
- **Make** (utility for running Makefile targets)
## Installation & Setup
Set up the Python virtual environment and install all dependencies:
```bash
make setup
```
## Running the Collector
### Mode 1: Continuous Daemon Mode
Runs indefinitely, fetching snapshots at the configured interval (default: 5 minutes). Each sleep is jittered by ±10% so that requests do not arrive on a fixed, predictable schedule. Shuts down gracefully on SIGINT/SIGTERM.
```bash
make run
```
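The ±10% jitter on the sleep interval could look roughly like the following. This is a minimal sketch, not the code in `utils.py`: the function name `jittered_interval`, its parameters, and the choice of a uniform distribution are all assumptions made for illustration.

```python
import random
from typing import Optional


def jittered_interval(base_seconds: float,
                      jitter_fraction: float = 0.10,
                      rng: Optional[random.Random] = None) -> float:
    """Scale base_seconds by a random factor in [1 - jitter, 1 + jitter].

    Hypothetical helper: the real collector may compute jitter differently.
    """
    source = rng or random  # fall back to the module-level generator
    factor = 1.0 + source.uniform(-jitter_fraction, jitter_fraction)
    return base_seconds * factor


if __name__ == "__main__":
    # With the default 5-minute interval, each sleep lands
    # somewhere between roughly 270 and 330 seconds.
    for _ in range(3):
        print(round(jittered_interval(300.0), 1))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in unit tests.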
### Mode 2: One-shot Mode (Test/Debug)
Runs exactly one cycle (one BUY snapshot and one SELL snapshot), writes the results to disk, and exits immediately:
```bash
make run-once
```
### Mode 3: Validate-Only Mode
Validates existing Parquet files without making any network calls. It prints statistics (row count, min/max prices, payment methods) and checks for critical schema issues:
```bash
make validate PATH_TO_VALIDATE=data/raw/buy_ads/year=2026/month=06/day=05/
```
## Running Tests
Run the test suite to verify the client, normalizer, storage, and validation behaviors:
```bash
make test
```
## Output Directory Structure
Data is saved under the `./data/` folder inside the project root:
```
data/
├── raw/
│   ├── buy_ads/
│   │   └── year=YYYY/month=MM/day=DD/
│   │       ├── _schema.parquet                       # Empty schema reference
│   │       └── snapshot_YYYYMMDD_HHMMSS.parquet      # Atomic snapshot data
│   └── sell_ads/
│       └── year=YYYY/month=MM/day=DD/
│           ├── _schema.parquet
│           └── snapshot_YYYYMMDD_HHMMSS.parquet
├── logs/
│   └── collector.log                                 # Rotating logs
├── alerts/
│   └── YYYYMMDD_HHMMSS_5_failures.alert              # Alert marker JSON file (only on failures)
└── checkpoint.json                                   # Restart resilience marker
```
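The partitioned layout and the atomic snapshot writes can be illustrated with a short sketch. It is not the code in `storage.py`: the helper names `snapshot_path` and `atomic_write_bytes` are hypothetical, partitioning by UTC date is an assumption, and the write-to-temp-then-rename pattern is one common way to make a write atomic. Raw bytes stand in for the Parquet payload to keep the example dependency-free.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_path(root: Path, side: str, ts: datetime) -> Path:
    """Build the date-partitioned path for one snapshot (UTC assumed)."""
    ts = ts.astimezone(timezone.utc)
    partition = f"year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    name = f"snapshot_{ts:%Y%m%d_%H%M%S}.parquet"
    return root / "raw" / f"{side}_ads" / partition / name


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write to a temp file in the target directory, then rename it."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push data to disk before the rename
        os.replace(tmp_name, target)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # never leave a partial file behind
        raise


if __name__ == "__main__":
    when = datetime(2026, 6, 5, 13, 30, 0, tzinfo=timezone.utc)
    print(snapshot_path(Path("data"), "buy", when).as_posix())
    # data/raw/buy_ads/year=2026/month=06/day=05/snapshot_20260605_133000.parquet
```

Because the temporary file lives in the same directory as the final file, a reader scanning the partition sees either the complete snapshot or nothing at all.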
## Checkpoint Format
The checkpoint file is updated after every successful snapshot, so a restarted daemon does not query the API again until the configured interval has elapsed since the last completed snapshot:
```json
{
  "last_completed_snapshot": "2026-06-05T13:30:00Z",
  "last_buy_ad_count": 47,
  "last_sell_ad_count": 53,
  "consecutive_failures": 0,
  "total_snapshots": 284,
  "first_snapshot": "2026-06-01T00:00:00Z",
  "version": "1.0"
}
```
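As a rough illustration of how a restart might use this file, the sketch below computes how long to wait before the next snapshot. The function name `seconds_until_next_snapshot` is hypothetical and the logic is an assumption about the restart behavior, not the code in `scheduler.py`; only the `last_completed_snapshot` field and its timestamp format come from the example above.

```python
import json
from datetime import datetime, timedelta, timezone


def seconds_until_next_snapshot(checkpoint_text: str,
                                interval_seconds: float,
                                now: datetime) -> float:
    """Return the remaining wait before the next snapshot is due.

    Hypothetical helper: returns 0.0 when the interval has already passed.
    """
    checkpoint = json.loads(checkpoint_text)
    last = datetime.strptime(
        checkpoint["last_completed_snapshot"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    due = last + timedelta(seconds=interval_seconds)
    return max(0.0, (due - now).total_seconds())


if __name__ == "__main__":
    text = '{"last_completed_snapshot": "2026-06-05T13:30:00Z"}'
    restart_time = datetime(2026, 6, 5, 13, 32, 0, tzinfo=timezone.utc)
    # Restarting 2 minutes after the last snapshot with a 5-minute interval:
    print(seconds_until_next_snapshot(text, 300, restart_time))  # 180.0
```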