Technical Architecture

An overview of the architecture, algorithms, and engineering patterns behind CFHelper's competitive programming analysis platform

Python 3.10+ Django 4.2 MongoDB Google Gemini AI REST API WebSockets Multi-threading

System Architecture

A multi-layered architecture combining layered caching, parallel fetching, and asynchronous processing for high performance and scalability

Web Client
Responsive SPA
Vanilla JS + HTML5
Django REST API
Rate Limiting
CORS + CSRF Protection
Solution Fetcher
Parallel Processing
ThreadPoolExecutor
Problem Scraper
Firecrawl + BeautifulSoup
Anti-bot Bypass
AI Engine
Gemini 2.0 Flash
Hint Generation
MongoDB Cache
Solutions + Hints
1 Year TTL
SQLite DB
Query Logs
Health Metrics
Codeforces API
200K+ Submissions
Google AI
Gemini 2.0
Firecrawl
Web Scraping

Request Processing Pipeline

Optimized request flow with parallel processing, intelligent caching, and performance monitoring at every stage

1
Request Reception & Validation
URL parsing with regex pattern matching, input sanitization, rate limit checking (30 req/min per IP), CSRF token validation, and parameter normalization
~2ms Security: CSRF + Rate Limiting
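As a concrete illustration, the validation step might parse and whitelist problem URLs like this (a minimal sketch; the exact regex patterns and accepted URL shapes in CFHelper are assumptions, not the production code):

```python
import re

# Two common Codeforces problem URL shapes (illustrative patterns)
CONTEST_RE = re.compile(
    r"^https?://codeforces\.com/contest/(\d+)/problem/([A-Z]\d?)$")
PROBLEMSET_RE = re.compile(
    r"^https?://codeforces\.com/problemset/problem/(\d+)/([A-Z]\d?)$")

def parse_problem_url(url):
    """Return (contest_id, problem_index) or None for unrecognized URLs."""
    url = url.strip()
    for pattern in (CONTEST_RE, PROBLEMSET_RE):
        m = pattern.match(url)
        if m:
            return int(m.group(1)), m.group(2)
    return None  # reject anything that is not a Codeforces problem URL
```

Anchoring the patterns with `^...$` doubles as input sanitization: anything that is not exactly a Codeforces problem URL is rejected before any fetching starts.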
2
Multi-Tier Cache Lookup
MD5-based cache key generation with version control, MongoDB query with indexed lookups, pickle deserialization, TTL validation (1 year), and filtered response building
~50-100ms (cache hit) Hit Rate: ~80%
3
Parallel Data Fetching
ThreadPoolExecutor with 3 workers: Contest metadata fetch, Problem info + statement scraping (Firecrawl primary, BeautifulSoup fallback), Submissions batch fetch (200K limit). All operations run concurrently with timeout protection
~3-5s (parallel) 3x faster than sequential
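The parallel-fetch stage above can be sketched with ThreadPoolExecutor; the three fetcher callables here are stand-ins for the real implementations, which are not shown in this document:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetch_contest, fetch_problem, fetch_submissions, timeout=30):
    """Run the three fetchers concurrently; raise if any exceeds the timeout."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f) for f in
                   (fetch_contest, fetch_problem, fetch_submissions)]
        # .result(timeout=...) gives each future its own deadline
        return tuple(f.result(timeout=timeout) for f in futures)
```

Because the three calls are I/O-bound, total latency collapses to roughly the slowest single fetch instead of the sum of all three.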
4
User Intelligence Gathering
Extract unique handles from submissions, sort by submission speed to prioritize high-rated users, batch API calls (500 handles/request), rate limit respect (0.3s delays), intelligent fallback for missing data
Max 1000 users Batched API calls
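The batching logic can be sketched as follows; `api_call` is a hypothetical wrapper around the Codeforces `user.info` endpoint, which accepts semicolon-separated handles:

```python
import time

def fetch_user_info(handles, api_call, batch_size=500, delay=0.3):
    """Fetch user profiles in batches, pausing between calls for rate limits."""
    info = {}
    for i in range(0, len(handles), batch_size):
        batch = handles[i:i + batch_size]
        # user.info takes many handles joined by ';' in a single request
        for user in api_call(";".join(batch)):
            info[user["handle"]] = user
        if i + batch_size < len(handles):
            time.sleep(delay)  # 0.3s between batches
    return info
```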
5
Statistical Analysis & Selection
Language distribution with collections.Counter, performance metrics (fastest/average using top 500), exceptional solution detection (50% threshold algorithm), high-rated user prioritization, diversity-aware selection
Single-pass O(n) List comprehensions
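A minimal sketch of this stage (field names follow the Codeforces API; note that the counting is single-pass, while selecting the fastest 500 requires a sort):

```python
from collections import Counter

def summarize(submissions, top_n=500):
    """Language distribution in one pass; timing stats over the fastest top_n."""
    languages = Counter(s["programmingLanguage"] for s in submissions)
    times = sorted(s["timeConsumedMillis"] for s in submissions)[:top_n]
    return {
        "languages": dict(languages),
        "fastest_ms": times[0],
        "average_ms": sum(times) / len(times),  # average of fastest top_n only
    }
```

Averaging over only the fastest 500 submissions keeps slow outliers from inflating the "average" metric shown to users.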
6
AI Hint Generation (Async)
Separate endpoint for non-blocking UX, MongoDB hint cache check, problem statement extraction, Gemini 2.0 API call with structured prompts, hint caching for future requests
~1-3s (first time) ~50ms (cached)
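The cache-first hint flow can be sketched as below; `hint_cache` and `generate` are hypothetical stand-ins for the MongoDB collection and the Gemini API call:

```python
def get_hint(problem_id, statement, hint_cache, generate):
    """Cache-first hint lookup; `generate` wraps the Gemini call."""
    cached = hint_cache.get(problem_id)
    if cached is not None:
        return cached           # cached path (~50ms)
    hint = generate(statement)  # first-time path (~1-3s)
    hint_cache[problem_id] = hint
    return hint
```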
7
Response Assembly & Caching
JSON serialization, two-tier caching (specific request + base data with raw submissions for filtering), query logging to SQLite, cache statistics update, response compression
Total: 3-8s uncached ~100ms cached

Intelligent Algorithms

Production-grade algorithms optimized for performance, accuracy, and scalability

Solution Selection Algorithm O(n log n)
# Priority-based selection with rating and performance optimization
def select_high_rated_solutions(submissions, user_info, language_filter, num_solutions):
    # 0. Apply the optional language filter up front
    if language_filter:
        submissions = [s for s in submissions
                       if language_filter.lower() in s['programmingLanguage'].lower()]

    # 1. Find absolute fastest submission (O(n))
    fastest = min(submissions, key=lambda x: x['timeConsumedMillis'])

    # 2. Map users to their best submission - single pass (O(n))
    best_by_user = {}
    for sub in submissions:
        handle = sub['author']['members'][0]['handle']
        if handle not in best_by_user or \
           sub['timeConsumedMillis'] < best_by_user[handle]['timeConsumedMillis']:
            best_by_user[handle] = sub

    # 3. Sort by rating descending (O(n log n))
    user_submissions = [(h, user_info.get(h, {}).get('rating', -1), sub)
                        for h, sub in best_by_user.items()]
    user_submissions.sort(key=lambda x: x[1], reverse=True)

    # 4. Build result: fastest + high-rated users (O(k) where k = num_solutions)
    results = [fastest]  # Always include fastest
    selected_handles = {fastest['author']['members'][0]['handle']}

    for handle, rating, sub in user_submissions:
        if len(results) >= num_solutions:
            break
        if handle not in selected_handles:
            results.append(sub)
            selected_handles.add(handle)

    return results
Exceptional Solution Detection Statistical Analysis
# Finds standout solutions using statistical thresholds
def find_exceptional_solutions(submissions, user_info):
    if not submissions:
        return []

    # Calculate statistical baselines
    avg_time = sum(s['timeConsumedMillis'] for s in submissions) / len(submissions)
    avg_memory = sum(s['memoryConsumedBytes'] for s in submissions) / len(submissions)

    exceptional = []

    # 1. Exceptionally fast: 50% faster than average
    fastest = min(submissions, key=lambda x: x['timeConsumedMillis'])
    if fastest['timeConsumedMillis'] < avg_time * 0.5:
        exceptional.append({'submission': fastest, 'reason': 'Exceptionally Fast'})

    # 2. Memory efficient: 50% less memory than average
    efficient = min(submissions, key=lambda x: x['memoryConsumedBytes'])
    if efficient['memoryConsumedBytes'] < avg_memory * 0.5:
        exceptional.append({'submission': efficient, 'reason': 'Memory Efficient'})

    # 3. Hidden gem: Low-rated user with excellent performance
    for sub in submissions:
        handle = sub['author']['members'][0]['handle']
        rating = user_info.get(handle, {}).get('rating', 0)
        if rating < 1600 and sub['timeConsumedMillis'] < avg_time * 0.7:
            exceptional.append({'submission': sub, 'reason': 'Lower-Rated Excellence'})
            break

    return exceptional[:3]  # Top 3 exceptional solutions
Cache Key Generation with Versioning MD5 Hashing
# Intelligent cache invalidation with version control
import hashlib

def _generate_cache_key(url, num_solutions, language_filter):
    CACHE_VERSION = "v2"  # Increment to invalidate all old cache
    key_data = f"{CACHE_VERSION}:{url}:{num_solutions}:{language_filter}"
    return hashlib.md5(key_data.encode()).hexdigest()

# Two-tier caching strategy:
# 1. Specific request cache (url + filters + count)
# 2. Base cache with raw submissions (allows filtering without re-fetch)

Advanced Engineering Features

Production-ready implementations showcasing software engineering best practices

Intelligent Multi-Tier Caching Strategy
Implements two-tier caching with versioning support. First tier caches specific requests (URL + filters + count), while second tier stores base data with raw submissions, enabling dynamic filtering without re-fetching. Cache invalidation via version bumping ensures consistency across deployments.
MongoDB Pickle Serialization MD5 Hashing TTL Management
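A minimal sketch of the pickle + TTL tier, using a plain dict as a stand-in for the MongoDB collection (the field names here are illustrative, not CFHelper's actual schema):

```python
import pickle
import time

ONE_YEAR = 365 * 24 * 3600  # the 1-year TTL described above

def cache_set(collection, key, value):
    """Serialize an arbitrary Python object and stamp it with a creation time."""
    collection[key] = {"blob": pickle.dumps(value), "created_at": time.time()}

def cache_get(collection, key, ttl=ONE_YEAR):
    """Return the deserialized value, or None on a miss or expired entry."""
    doc = collection.get(key)
    if doc is None or time.time() - doc["created_at"] > ttl:
        return None
    return pickle.loads(doc["blob"])
```

Storing the creation timestamp alongside the blob lets the read path validate the TTL itself, independent of any database-level expiry.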
Concurrent Request Processing with ThreadPoolExecutor
Fetches contest metadata, problem information, and submissions in parallel using Python's ThreadPoolExecutor. Cuts total fetch time to roughly a third of sequential execution while maintaining proper error handling and timeout protection for each thread.
concurrent.futures Parallel I/O Timeout Protection Exception Handling
Automatic Background Health Monitoring
Custom Django management command integrated with WSGI lifecycle. Launches daemon thread on application start, monitors external services (Codeforces API, MongoDB, Gemini) every 30 minutes, and stores metrics in SQLite. Automatic cleanup of old records (7-day retention) prevents database bloat.
Django Management Commands Daemon Threads WSGI Hooks Service Monitoring
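A sketch of the monitor loop, with `checks` as hypothetical health probes; the production version runs inside a Django management command and persists results to SQLite:

```python
import threading
import time

def check_services(checks):
    """Run each health probe once; a probe returns truthy when healthy."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a crashing probe counts as unhealthy
    return results

def start_monitor(checks, record, interval=1800):
    """Daemon thread polling every `interval` seconds (30 min in production)."""
    def loop():
        while True:
            record(check_services(checks))
            time.sleep(interval)
    thread = threading.Thread(target=loop, daemon=True, name="health-monitor")
    thread.start()
    return thread
```

Marking the thread as a daemon means it never blocks process shutdown, which is what allows it to piggyback on the WSGI lifecycle.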
Intelligent User Prioritization Algorithm
When analyzing 200K+ submissions, limits user profile fetches to 1000 by prioritizing users with fastest submissions (higher correlation with rating). Batches API calls (500 handles per request), implements rate limiting (0.3s delays), and provides graceful fallbacks for missing data.
Optimization Algorithm Batch Processing API Rate Limiting Priority Queue
Statistical Performance Analysis Engine
Single-pass O(n) analysis using collections.Counter for language distribution. Calculates metrics using only fastest 500 submissions to avoid outlier bias. Identifies exceptional solutions using statistical thresholds (50% faster/more efficient than average).
O(n) Complexity Statistical Analysis Outlier Detection Performance Metrics
AI Hint Generation with Caching
Separate asynchronous endpoint prevents blocking main request flow. Implements hint-specific caching in MongoDB with 1-year TTL. Structured prompt engineering ensures consistent, non-spoiler hints. Fallback to problem tags when statement scraping fails.
Async Processing Prompt Engineering Gemini 2.0 API Intelligent Caching
Dual-Method Web Scraping with Fallback
Primary: Firecrawl API with markdown/HTML extraction for anti-bot bypass. Fallback: Direct requests with BeautifulSoup for reliability. Implements thread-based timeout (15s) to prevent hanging. Persistent session management for connection reuse.
Firecrawl API BeautifulSoup4 Thread Timeout Session Management
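The fallback pattern can be sketched with a single-worker executor; `primary` and `fallback` are stand-ins for the Firecrawl and BeautifulSoup scrapers. Note that `shutdown(wait=False)` lets the caller move on even if the primary thread is still hanging:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_with_fallback(primary, fallback, timeout=15):
    """Try the primary scraper under a hard timeout; fall back on any failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(primary).result(timeout=timeout)
    except Exception:
        return fallback()  # timeout, network error, or scraper exception
    finally:
        pool.shutdown(wait=False)  # don't block on a hung primary thread
```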
Comprehensive Logging & Monitoring
Structured logging with separate file and console handlers. Query logging tracks URL, filters, response time, cache status, and IP address with database indexes. Real-time health metrics for all external services with historical trend analysis.
Python Logging Database Indexing Metrics Collection Performance Tracking

Data Flow Visualization

End-to-end data transformation pipeline from user request to final response

User Request
URL + Parameters
Validation
Regex + Sanitization
Cache Key
MD5(v2:url:params)
MongoDB Query
Indexed Lookup
Cache Check
Hit: Return | Miss: Fetch
Pickle Deserialize
Binary → Python Object
Parallel Fetch
ThreadPool(3 workers)
Data Aggregation
Combine Results
User Info Batch
API Calls (500/batch)
Analysis
Statistics + Selection
Format Response
JSON Serialization
Cache + Log
Store for Future
Client Response
JSON + HTTP 200
Async Hint
Separate Endpoint
Complete
Full Analysis Ready

Production Deployment

Enterprise-grade deployment architecture with high availability and security

Nginx
Reverse Proxy
SSL Termination
Static Files
Gunicorn
(CPU×2+1) Workers
WSGI Server
Load Balancing
Django App
Business Logic
Multi-threading
Background Tasks
Systemd
Process Manager
Auto-restart
Health Monitor
Let's Encrypt
SSL Certificates
Auto-renewal
HTTPS
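The worker formula above maps to a Gunicorn config file along these lines (an illustrative `gunicorn.conf.py`; the actual production settings are not shown in this document):

```python
# gunicorn.conf.py (illustrative sketch)
import multiprocessing

# Gunicorn's commonly recommended sizing: (CPU cores x 2) + 1
workers = multiprocessing.cpu_count() * 2 + 1
timeout = 300              # matches the 300s request timeout above
bind = "127.0.0.1:8000"    # Nginx proxies to this address
```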
99.5%
Target Uptime
Auto-restart on failure
100+
Concurrent Users
Multi-worker support
300s
Request Timeout
Configurable
1GB
Memory Limit
Per service

Performance Metrics

Production metrics demonstrating system efficiency and optimization success

80%
Cache Hit Rate
MongoDB + Pickle
50ms
Cached Response
96% faster
3-5s
Fresh Fetch
Parallel processing
200K
Submissions Analyzed
Per request
1000
User Profiles
Batched API calls
30
Requests/Min
Rate limit per IP
O(n)
Analysis Complexity
Single-pass algorithms
1 Year
Cache TTL
Auto-cleanup

Technology Stack

Production-grade technologies chosen for reliability, performance, and maintainability

Backend Framework
  • Django 4.2.7 (Python Web Framework)
  • WSGI with Gunicorn
  • Multi-worker architecture
  • Built-in ORM for database operations
  • Middleware for security & logging
Data Storage
  • MongoDB (NoSQL cache storage)
  • SQLite (Relational metadata)
  • Pickle serialization
  • Indexed queries for fast lookups
  • Automatic TTL management
AI & Machine Learning
  • Google Gemini 2.0 Flash
  • Structured prompt engineering
  • Context-aware hint generation
  • Rate limit management
  • Error handling & fallbacks
Web Scraping
  • Firecrawl API (primary)
  • BeautifulSoup4 (fallback)
  • Anti-bot bypass techniques
  • Timeout protection (15s)
  • Persistent session management
Performance
  • ThreadPoolExecutor (parallel processing)
  • collections.Counter (O(n) counting)
  • List comprehensions
  • Single-pass algorithms
  • WhiteNoise static file compression
Security
  • CSRF token validation
  • CORS policy enforcement
  • Rate limiting (django-ratelimit)
  • Input sanitization & validation
  • SSL/TLS encryption (production)

Ready to Explore?

Experience the platform firsthand and see these technologies in action

Try CFHelper Now