Performance Guide
This guide covers performance optimization strategies and best practices for using git-pandas effectively with large repositories and datasets.
Overview
Git-pandas provides several mechanisms to optimize performance:
- Caching: Multiple cache backends for reusing expensive computations
- Parallelization: Multi-threading support for repository operations
- Data Filtering: Glob patterns and limits to reduce dataset size
- Memory Management: Efficient data structures and memory usage patterns
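A quick sketch of how these mechanisms combine in practice (the path, cache settings, and glob pattern here are illustrative):

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Persistent cache plus filtering and limits in a single analysis call
cache = DiskCache('/tmp/gitpandas_cache.gz', max_keys=500)
repo = Repository('/path/to/repo', cache_backend=cache)
commits = repo.commit_history(limit=1000, include_globs=['*.py'])

Each mechanism is covered in detail in the sections below.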
Caching System
The caching system is the most important performance optimization in git-pandas. It stores the results of expensive Git operations and analysis computations.
Cache Backends
git-pandas supports three cache backends:
- EphemeralCache (In-Memory)
Best for: Single-session analysis, development, testing
Pros: Fast access, no disk I/O, automatic cleanup
Cons: Data lost when process ends, limited by available RAM
Use case: Interactive analysis, Jupyter notebooks
- DiskCache (Persistent)
Best for: Multi-session analysis, CI/CD pipelines, long-running processes
Pros: Survives process restarts, configurable size limits, compression
Cons: Slower than memory cache, disk space usage
Use case: Regular analysis workflows, automated reporting
- RedisDFCache (Distributed)
Best for: Multi-user environments, distributed analysis, shared cache
Pros: Shared across processes/machines, TTL support, Redis features
Cons: Requires Redis server, network latency, additional complexity
Use case: Team environments, production deployments
Cache Configuration
Basic cache setup:
from gitpandas import Repository
from gitpandas.cache import EphemeralCache, DiskCache, RedisDFCache
# In-memory cache (fastest for single session)
cache = EphemeralCache(max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)
# Persistent cache (best for repeated analysis)
cache = DiskCache('/tmp/gitpandas_cache.gz', max_keys=500)
repo = Repository('/path/to/repo', cache_backend=cache)
# Redis cache (best for shared environments)
cache = RedisDFCache(host='localhost', max_keys=2000, ttl=3600)
repo = Repository('/path/to/repo', cache_backend=cache)
Cache Warming
Pre-populate cache for better performance:
# Warm cache with commonly used methods
result = repo.warm_cache(
    methods=['commit_history', 'branches', 'blame', 'file_detail'],
    limit=100,  # Reasonable limit for cache warming
    ignore_globs=['*.log', '*.tmp']
)
print(f"Cache entries created: {result['cache_entries_created']}")
print(f"Execution time: {result['execution_time']:.2f} seconds")
Cache Management
Monitor and manage cache performance:
# Get cache statistics
stats = repo.get_cache_stats()
print(f"Repository entries: {stats['repository_entries']}")
if stats['global_cache_stats']:
    global_stats = stats['global_cache_stats']
    print(f"Cache usage: {global_stats['cache_usage_percent']:.1f}%")
    print(f"Average entry age: {global_stats['average_entry_age_hours']:.2f} hours")
# Invalidate specific cache entries
repo.invalidate_cache(keys=['commit_history'])
# Clear old cache entries by pattern
repo.invalidate_cache(pattern='blame*')
# Clear all cache for repository
repo.invalidate_cache()
Performance Benchmarks
Typical performance improvements with caching:
| Operation | No Cache | With Cache | Speedup |
|---|---|---|---|
| commit_history | 2.5s | 0.05s | 50x |
| blame | 4.2s | 0.08s | 52x |
| file_detail | 1.8s | 0.03s | 60x |
| branches/tags | 0.3s | 0.01s | 30x |
| bus_factor | 3.1s | 0.06s | 51x |
Benchmarks based on a medium-sized repository (~5,000 commits, 500 files)
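These numbers will vary with repository size and hardware. A minimal sketch for measuring the cold-versus-warm difference on your own repository (the path is illustrative):

import time
from gitpandas import Repository
from gitpandas.cache import EphemeralCache

repo = Repository('/path/to/repo', cache_backend=EphemeralCache())

start = time.time()
repo.commit_history(limit=500)   # First call runs the underlying Git operations
cold = time.time() - start

start = time.time()
repo.commit_history(limit=500)   # Second identical call is served from the cache
warm = time.time() - start

print(f"Cold: {cold:.2f}s, warm: {warm:.2f}s, speedup: {cold / warm:.0f}x")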
Data Filtering and Limits
Reduce dataset size to improve performance:
Glob Patterns
Use glob patterns to focus analysis on relevant files:
# Analyze only Python files
commits = repo.commit_history(include_globs=['*.py'])
# Exclude test and build files
blame = repo.blame(ignore_globs=['test_*.py', 'build/*', '*.pyc'])
# Multiple patterns
rates = repo.file_change_rates(
    include_globs=['*.py', '*.js', '*.html'],
    ignore_globs=['*/tests/*', '*/node_modules/*']
)
Limits and Time Windows
Limit analysis scope for faster results:
# Limit to recent commits
recent_commits = repo.commit_history(limit=500)
# Analyze last 90 days only
recent_changes = repo.file_change_rates(days=90)
# Combine limits with filtering
python_commits = repo.commit_history(
    limit=1000,
    include_globs=['*.py']
)
Branch-Specific Analysis
Analyze specific branches for better performance:
# Analyze main development branch only
main_commits = repo.commit_history(branch='main')
# Compare specific branches
feature_commits = repo.commit_history(branch='feature/new-ui')
Parallelization
git-pandas uses parallel processing when joblib is available:
Installation
pip install joblib
Parallel Operations
Several operations automatically use parallelization:
from gitpandas import ProjectDirectory
# Parallel analysis across repositories
project = ProjectDirectory('/path/to/projects')
# These operations run in parallel when joblib is available:
commits = project.commit_history() # Parallel across repos
branches = project.branches() # Parallel across repos
blame = project.cumulative_blame() # Parallel across commits
# Control parallelization
blame = repo.parallel_cumulative_blame(
    workers=4,  # Number of parallel workers
    limit=100   # Limit for better performance
)
Memory Management
Optimize memory usage for large repositories:
DataFrame Memory Usage
Monitor and optimize DataFrame memory:
import pandas as pd
# Check memory usage
commits = repo.commit_history(limit=10000)
print(f"Memory usage: {commits.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
# Optimize data types
commits['insertions'] = commits['insertions'].astype('int32')
commits['deletions'] = commits['deletions'].astype('int32')
# Use categorical for repeated strings
commits['committer'] = commits['committer'].astype('category')
Chunked Processing
Process large datasets in chunks:
def analyze_in_chunks(repo, chunk_size=1000):
    """Analyze repository in chunks to manage memory."""
    all_results = []
    offset = 0
    while True:
        # Get next chunk of commits
        chunk = repo.commit_history(limit=chunk_size, skip=offset)
        if chunk.empty:
            break
        # Process chunk with a user-defined aggregation function
        result = process_chunk(chunk)
        all_results.append(result)
        offset += chunk_size
        # Optional: Clear cache periodically
        if offset % 10000 == 0:
            repo.invalidate_cache(pattern='commit_history*')
    return pd.concat(all_results, ignore_index=True)
Large Repository Strategies
Special considerations for large repositories:
Repository Size Guidelines
| Size | Commits/Files | Recommended Strategy |
|---|---|---|
| Small | <1K commits | Any cache, no limits |
| Medium | 1K-10K commits | DiskCache, reasonable limits |
| Large | 10K-100K commits | DiskCache + filtering |
| Very Large | >100K commits | Redis + chunking + limits |
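For the "Very Large" row, a rough sketch of the combined strategy (assuming a local Redis server and the analyze_in_chunks helper from the Chunked Processing section above):

from gitpandas import Repository
from gitpandas.cache import RedisDFCache

# Shared cache so repeated queries across sessions/machines are reused
cache = RedisDFCache(host='localhost', max_keys=5000, ttl=6 * 3600)
repo = Repository('/path/to/huge/repo', cache_backend=cache)

# Chunked traversal keeps peak memory bounded
summary = analyze_in_chunks(repo, chunk_size=2000)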
Configuration for Large Repositories
# Large repository configuration
cache = DiskCache('/fast/disk/cache.gz', max_keys=10000)
repo = Repository(
    '/path/to/large/repo',
    cache_backend=cache,
    default_branch='main'
)
# Use aggressive filtering
analysis = repo.commit_history(
    limit=5000,                                # Reasonable limit
    days=365,                                  # Last year only
    include_globs=['*.py', '*.js'],            # Core files only
    ignore_globs=['*/tests/*', '*/vendor/*']   # Exclude bulk dirs
)
Monitoring Performance
Track performance metrics:
import time
import psutil  # One option for measuring process memory; any equivalent works

def get_memory_usage():
    """Return current process memory (RSS) in MB."""
    return psutil.Process().memory_info().rss / 1024 / 1024

def benchmark_operation(func, *args, **kwargs):
    """Benchmark any git-pandas operation."""
    start_time = time.time()
    start_memory = get_memory_usage()
    result = func(*args, **kwargs)
    end_time = time.time()
    end_memory = get_memory_usage()
    print(f"Execution time: {end_time - start_time:.2f}s")
    print(f"Memory delta: {end_memory - start_memory:.1f}MB")
    print(f"Result size: {len(result)} rows")
    return result
# Example usage
commits = benchmark_operation(
    repo.commit_history,
    limit=1000,
    include_globs=['*.py']
)
Performance Anti-Patterns
Avoid these common performance issues:
❌ No Caching
# Slow: No cache means repeated expensive Git operations
repo = Repository('/path/to/repo') # No cache_backend
for branch in ['main', 'develop', 'feature']:
    commits = repo.commit_history(branch=branch)  # Repeated work
✅ With Caching
# Fast: Cache reuses expensive operations
cache = DiskCache('/tmp/analysis_cache.gz', max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)
for branch in ['main', 'develop', 'feature']:
    commits = repo.commit_history(branch=branch)  # Cached after first call
❌ No Filtering
# Slow: Processes all files including irrelevant ones
blame = repo.blame() # Includes build files, logs, etc.
✅ With Filtering
# Fast: Only analyzes relevant source files
blame = repo.blame(
    include_globs=['*.py', '*.js', '*.html'],
    ignore_globs=['*/build/*', '*/logs/*', '*.pyc']
)
❌ Unlimited Analysis
# Slow: Processes entire repository history
commits = repo.commit_history() # Could be millions of commits
✅ Limited Analysis
# Fast: Focuses on recent, relevant commits
commits = repo.commit_history(
    limit=1000,  # Last 1000 commits
    days=90      # Last 90 days
)
❌ Memory Leaks
# Memory issues: Large DataFrames accumulating
results = []
for i in range(100):
    commits = repo.commit_history(limit=10000)
    results.append(commits)  # Accumulating large DataFrames
✅ Memory Management
# Memory efficient: Process and aggregate
total_commits = 0
for i in range(100):
    commits = repo.commit_history(limit=1000)  # Smaller chunks
    total_commits += len(commits)
    # Don't accumulate DataFrames
print(f"Total commits processed: {total_commits}")
Best Practices Summary
- Always use caching for any analysis beyond one-off queries
- Choose the right cache backend for your use case:
  - EphemeralCache: Development, interactive analysis
  - DiskCache: Regular workflows, CI/CD
  - RedisDFCache: Team environments, production
- Warm your cache before intensive analysis sessions
- Use glob patterns to filter relevant files only
- Set reasonable limits on commit history and time windows
- Monitor cache performance and invalidate when needed
- Profile memory usage for large repositories
- Process in chunks when dealing with very large datasets
- Use parallelization with ProjectDirectory for multiple repositories
- Avoid anti-patterns that lead to repeated expensive operations
For more detailed examples, see the performance examples in the examples/ directory:
- examples/cache_management.py - Cache management and monitoring
- examples/remote_fetch_and_cache_warming.py - Cache warming strategies
- examples/parallel_blame.py - Parallel processing examples