Performance Guide

This guide covers performance optimization strategies and best practices for using git-pandas effectively with large repositories and datasets.

Overview

git-pandas provides several mechanisms to optimize performance:

  • Caching: Multiple cache backends for reusing expensive computations

  • Parallelization: Multi-threading support for repository operations

  • Data Filtering: Glob patterns and limits to reduce dataset size

  • Memory Management: Efficient data structures and memory usage patterns

Caching System

The caching system is the most important performance optimization in git-pandas. It stores the results of expensive Git operations and analysis computations.

Cache Backends

git-pandas supports three cache backends:

EphemeralCache (In-Memory)
  • Best for: Single-session analysis, development, testing

  • Pros: Fast access, no disk I/O, automatic cleanup

  • Cons: Data lost when process ends, limited by available RAM

  • Use case: Interactive analysis, Jupyter notebooks

DiskCache (Persistent)
  • Best for: Multi-session analysis, CI/CD pipelines, long-running processes

  • Pros: Survives process restarts, configurable size limits, compression

  • Cons: Slower than memory cache, disk space usage

  • Use case: Regular analysis workflows, automated reporting

RedisDFCache (Distributed)
  • Best for: Multi-user environments, distributed analysis, shared cache

  • Pros: Shared across processes/machines, TTL support, Redis features

  • Cons: Requires Redis server, network latency, additional complexity

  • Use case: Team environments, production deployments

Cache Configuration

Basic cache setup:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache, DiskCache, RedisDFCache

# In-memory cache (fastest for single session)
cache = EphemeralCache(max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)

# Persistent cache (best for repeated analysis)
cache = DiskCache('/tmp/gitpandas_cache.gz', max_keys=500)
repo = Repository('/path/to/repo', cache_backend=cache)

# Redis cache (best for shared environments)
cache = RedisDFCache(host='localhost', max_keys=2000, ttl=3600)
repo = Repository('/path/to/repo', cache_backend=cache)

Cache Warming

Pre-populate cache for better performance:

# Warm cache with commonly used methods
result = repo.warm_cache(
    methods=['commit_history', 'branches', 'blame', 'file_detail'],
    limit=100,  # Reasonable limit for cache warming
    ignore_globs=['*.log', '*.tmp']
)

print(f"Cache entries created: {result['cache_entries_created']}")
print(f"Execution time: {result['execution_time']:.2f} seconds")

Cache Management

Monitor and manage cache performance:

# Get cache statistics
stats = repo.get_cache_stats()
print(f"Repository entries: {stats['repository_entries']}")
if stats['global_cache_stats']:
    global_stats = stats['global_cache_stats']
    print(f"Cache usage: {global_stats['cache_usage_percent']:.1f}%")
    print(f"Average entry age: {global_stats['average_entry_age_hours']:.2f} hours")

# Invalidate specific cache entries
repo.invalidate_cache(keys=['commit_history'])

# Clear old cache entries by pattern
repo.invalidate_cache(pattern='blame*')

# Clear all cache for repository
repo.invalidate_cache()

Performance Benchmarks

Typical performance improvements with caching:

Operation         No Cache    With Cache    Speedup
commit_history    2.5s        0.05s         50x
blame             4.2s        0.08s         52x
file_detail       1.8s        0.03s         60x
branches/tags     0.3s        0.01s         30x
bus_factor        3.1s        0.06s         51x

Benchmarks are based on a medium-sized repository (~5000 commits, 500 files).
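
Exact numbers will vary with repository size, hardware, and cache backend. To measure the effect on your own repository, you can time the same call before and after its result lands in the cache; the sketch below uses only the commit_history call shown earlier (the repository path is a placeholder):

import time

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Repository with an in-memory cache (path is a placeholder)
repo = Repository('/path/to/repo', cache_backend=EphemeralCache(max_keys=100))

def timed_commit_history():
    start = time.time()
    repo.commit_history(limit=500)
    return time.time() - start

cold = timed_commit_history()  # first call hits Git directly
warm = timed_commit_history()  # second call is served from the cache
print(f"Cold: {cold:.2f}s, warm: {warm:.2f}s, speedup: {cold / max(warm, 1e-9):.0f}x")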

Data Filtering and Limits

Reduce dataset size to improve performance:

Glob Patterns

Use glob patterns to focus analysis on relevant files:

# Analyze only Python files
commits = repo.commit_history(include_globs=['*.py'])

# Exclude test and build files
blame = repo.blame(ignore_globs=['test_*.py', 'build/*', '*.pyc'])

# Multiple patterns
rates = repo.file_change_rates(
    include_globs=['*.py', '*.js', '*.html'],
    ignore_globs=['*/tests/*', '*/node_modules/*']
)

Limits and Time Windows

Limit analysis scope for faster results:

# Limit to recent commits
recent_commits = repo.commit_history(limit=500)

# Analyze last 90 days only
recent_changes = repo.file_change_rates(days=90)

# Combine limits with filtering
python_commits = repo.commit_history(
    limit=1000,
    include_globs=['*.py']
)

Branch-Specific Analysis

Analyze specific branches for better performance:

# Analyze main development branch only
main_commits = repo.commit_history(branch='main')

# Compare specific branches
feature_commits = repo.commit_history(branch='feature/new-ui')

Parallelization

git-pandas uses parallel processing when joblib is available:

Installation

pip install joblib
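
If you want to confirm that the parallel code paths can be used, a quick import check is enough (when joblib is missing, the operations below simply run serially):

# Check whether joblib is importable; without it, operations run serially
try:
    import joblib
    print(f"joblib {joblib.__version__} detected; parallel operations are available")
except ImportError:
    print("joblib not installed; run `pip install joblib` to enable parallelism")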

Parallel Operations

Several operations automatically use parallelization:

from gitpandas import ProjectDirectory

# Parallel analysis across repositories
project = ProjectDirectory('/path/to/projects')

# These operations run in parallel when joblib is available:
commits = project.commit_history()  # Parallel across repos
branches = project.branches()       # Parallel across repos
blame = project.cumulative_blame()  # Parallel across commits

# Control parallelization explicitly (repository-level method)
blame = repo.parallel_cumulative_blame(
    workers=4,  # Number of parallel workers
    limit=100   # Limit for better performance
)

Memory Management

Optimize memory usage for large repositories:

DataFrame Memory Usage

Monitor and optimize DataFrame memory:

import pandas as pd

# Check memory usage
commits = repo.commit_history(limit=10000)
print(f"Memory usage: {commits.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")

# Optimize data types
commits['insertions'] = commits['insertions'].astype('int32')
commits['deletions'] = commits['deletions'].astype('int32')

# Use categorical for repeated strings
commits['committer'] = commits['committer'].astype('category')

Chunked Processing

Process large datasets in chunks:

import pandas as pd

def analyze_in_chunks(repo, chunk_size=1000):
    """Analyze repository in chunks to manage memory."""
    all_results = []  # one aggregated result per chunk
    offset = 0

    while True:
        # Get chunk of commits
        chunk = repo.commit_history(limit=chunk_size, skip=offset)
        if chunk.empty:
            break

        # Process chunk with your own aggregation logic (process_chunk is user-defined)
        result = process_chunk(chunk)
        all_results.append(result)

        offset += chunk_size

        # Optional: Clear cache periodically
        if offset % 10000 == 0:
            repo.invalidate_cache(pattern='commit_history*')

    return pd.concat(all_results, ignore_index=True)

Large Repository Strategies

Special considerations for large repositories:

Repository Size Guidelines

Size          Commits/Files       Recommended Strategy
Small         <1K commits         Any cache, no limits
Medium        1K-10K commits      DiskCache, reasonable limits
Large         10K-100K commits    DiskCache + filtering
Very Large    >100K commits       Redis + chunking + limits
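
The guidelines above can also be applied programmatically. The helper below is a hypothetical sketch (the function name and thresholds are illustrative, not part of git-pandas) that maps an approximate commit count to one of the cache backends described earlier:

from gitpandas.cache import DiskCache, EphemeralCache, RedisDFCache

def pick_cache_backend(approx_commits, cache_path='/tmp/gitpandas_cache.gz'):
    """Hypothetical helper: choose a cache backend from the size guidelines."""
    if approx_commits < 1_000:
        return EphemeralCache(max_keys=1000)  # small: in-memory is enough
    if approx_commits < 100_000:
        return DiskCache(cache_path, max_keys=5000)  # medium/large: persistent cache
    return RedisDFCache(host='localhost', max_keys=10000, ttl=3600)  # very large

# Example: a repository estimated at roughly 50K commits
cache = pick_cache_backend(50_000)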

Configuration for Large Repositories

# Large repository configuration
cache = DiskCache('/fast/disk/cache.gz', max_keys=10000)
repo = Repository(
    '/path/to/large/repo',
    cache_backend=cache,
    default_branch='main'
)

# Use aggressive filtering
analysis = repo.commit_history(
    limit=5000,  # Reasonable limit
    days=365,    # Last year only
    include_globs=['*.py', '*.js'],  # Core files only
    ignore_globs=['*/tests/*', '*/vendor/*']  # Exclude bulk dirs
)

Monitoring Performance

Track performance metrics:

import time

import psutil  # third-party; one way to measure process memory (pip install psutil)

def get_memory_usage():
    """Return the current process's resident memory usage in MB."""
    return psutil.Process().memory_info().rss / 1024 / 1024

def benchmark_operation(func, *args, **kwargs):
    """Benchmark any git-pandas operation."""
    start_time = time.time()
    start_memory = get_memory_usage()

    result = func(*args, **kwargs)

    end_time = time.time()
    end_memory = get_memory_usage()

    print(f"Execution time: {end_time - start_time:.2f}s")
    print(f"Memory delta: {end_memory - start_memory:.1f}MB")
    print(f"Result size: {len(result)} rows")

    return result

# Example usage
commits = benchmark_operation(
    repo.commit_history,
    limit=1000,
    include_globs=['*.py']
)

Performance Anti-Patterns

Avoid these common performance issues:

No Caching

# Slow: No cache means repeated expensive Git operations
repo = Repository('/path/to/repo')  # No cache_backend
for branch in ['main', 'develop', 'feature']:
    commits = repo.commit_history(branch=branch)  # Repeated work

With Caching

# Fast: Cache reuses expensive operations
cache = DiskCache('/tmp/analysis_cache.gz', max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)
for branch in ['main', 'develop', 'feature']:
    commits = repo.commit_history(branch=branch)  # Cached after first

No Filtering

# Slow: Processes all files including irrelevant ones
blame = repo.blame()  # Includes build files, logs, etc.

With Filtering

# Fast: Only analyzes relevant source files
blame = repo.blame(
    include_globs=['*.py', '*.js', '*.html'],
    ignore_globs=['*/build/*', '*/logs/*', '*.pyc']
)

Unlimited Analysis

# Slow: Processes entire repository history
commits = repo.commit_history()  # Could be millions of commits

Limited Analysis

# Fast: Focuses on recent, relevant commits
commits = repo.commit_history(
    limit=1000,  # Last 1000 commits
    days=90       # Last 90 days
)

Memory Leaks

# Memory issues: Large DataFrames accumulating
results = []
for i in range(100):
    commits = repo.commit_history(limit=10000)
    results.append(commits)  # Accumulating large DataFrames

Memory Management

# Memory efficient: Process and aggregate
total_commits = 0
for i in range(100):
    commits = repo.commit_history(limit=1000)  # Smaller chunks
    total_commits += len(commits)
    # Don't accumulate DataFrames
print(f"Total commits processed: {total_commits}")

Best Practices Summary

  1. Always use caching for any analysis beyond one-off queries

  2. Choose the right cache backend for your use case:
     • EphemeralCache: Development, interactive analysis
     • DiskCache: Regular workflows, CI/CD
     • RedisDFCache: Team environments, production

  3. Warm your cache before intensive analysis sessions

  4. Use glob patterns to filter relevant files only

  5. Set reasonable limits on commit history and time windows

  6. Monitor cache performance and invalidate when needed

  7. Profile memory usage for large repositories

  8. Process in chunks when dealing with very large datasets

  9. Use parallelization with ProjectDirectory for multiple repositories

  10. Avoid anti-patterns that lead to repeated expensive operations
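
Taken together, a typical workflow might look like the sketch below, which combines a persistent cache, cache warming, and filtered, limited queries using only the calls shown earlier in this guide (paths and globs are placeholders):

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Persistent cache so repeated sessions reuse results
cache = DiskCache('/tmp/gitpandas_cache.gz', max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache, default_branch='main')

# Warm the cache once up front
repo.warm_cache(methods=['commit_history', 'branches'], limit=500)

# Filtered, limited queries stay fast and are served from the cache afterwards
commits = repo.commit_history(
    limit=1000,
    days=180,
    include_globs=['*.py'],
    ignore_globs=['*/tests/*']
)
print(f"Commits analyzed: {len(commits)}")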

For more detailed examples, see the performance examples in the examples/ directory:

  • examples/cache_management.py - Cache management and monitoring

  • examples/remote_fetch_and_cache_warming.py - Cache warming strategies

  • examples/parallel_blame.py - Parallel processing examples