Caching

Git-Pandas supports pluggable cache backends to optimize performance for expensive, repetitive operations. This is particularly useful for large repositories or when running multiple analyses.

Overview

The caching system provides:

* In-memory caching for temporary results
* Disk-based caching for persistent storage across sessions
* Redis-based caching for distributed storage
* Cache management and invalidation methods
* Decorator-based caching for expensive operations
* Cache timestamp tracking - know when cache entries were populated
* Cache statistics and monitoring - track cache performance and usage

Available Cache Backends

In-Memory Cache (EphemeralCache)

The default in-memory cache is ephemeral and will be cleared when the process ends:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create an in-memory cache with default settings
cache = EphemeralCache()

# Or customize the cache size
cache = EphemeralCache(max_keys=500)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Disk Cache (DiskCache)

For persistent caching that survives between sessions:

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create a disk cache
cache = DiskCache(filepath='/path/to/cache.gz', max_keys=1000)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Redis Cache (RedisDFCache)

For distributed caching that can be shared across processes and sessions, use Redis:

from gitpandas import Repository
from gitpandas.cache import RedisDFCache

# Create a Redis cache with default settings
cache = RedisDFCache()

# Or customize Redis connection and cache settings
cache = RedisDFCache(
    host='localhost',
    port=6379,
    db=12,
    max_keys=1000,
    ttl=3600  # Cache entries expire after 1 hour
)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Cache Timestamp Information

All cache backends now track when cache entries were populated. You can access this information without any changes to the Repository or ProjectDirectory API:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create repository with cache
cache = EphemeralCache()
repo = Repository('/path/to/repo', cache_backend=cache)

# Populate cache with some operations
commit_history = repo.commit_history(limit=10)
file_list = repo.list_files()

# Check what's in the cache and when it was cached
cached_keys = cache.list_cached_keys()
for entry in cached_keys:
    print(f"Key: {entry['key']}")
    print(f"Cached at: {entry['cached_at']}")
    print(f"Age: {entry['age_seconds']:.1f} seconds")

# Get specific cache information
key = "commit_history_main_10_None_None_None_None"
info = cache.get_cache_info(key)
if info:
    print(f"Cache entry age: {info['age_minutes']:.2f} minutes")

Cache Information Methods

All cache backends support these methods for accessing timestamp information:

list_cached_keys() - Returns list of all cached keys with metadata
get_cache_info(key) - Returns detailed information about a specific cache entry

The returned information includes:

cached_at - UTC timestamp when the entry was cached
age_seconds - Age of the cache entry in seconds
age_minutes - Age of the cache entry in minutes
age_hours - Age of the cache entry in hours
cache_key - The original cache key
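
The age fields make it easy to spot entries that are older than you are comfortable with. The following is a minimal sketch, not part of Git-Pandas: `stale_keys` is a hypothetical helper, and it assumes each entry is a dict with the `key` and `age_seconds` fields used in the example above. The sample entries are hand-written stand-ins for the output of `list_cached_keys()`, and the second key name is made up for illustration.

```python
from typing import Dict, List


def stale_keys(entries: List[Dict], max_age_seconds: float) -> List[str]:
    """Return the keys of all entries older than max_age_seconds.

    Assumption: `entries` has the shape shown for list_cached_keys(),
    i.e. dicts carrying at least 'key' and 'age_seconds'.
    """
    return [e["key"] for e in entries if e["age_seconds"] > max_age_seconds]


# Hand-written sample data standing in for cache.list_cached_keys()
sample_entries = [
    {"key": "commit_history_main_10_None_None_None_None", "age_seconds": 42.0},
    {"key": "list_files_main", "age_seconds": 7200.0},  # made-up key name
]

# Keys cached more than one hour ago
old = stale_keys(sample_entries, max_age_seconds=3600)
print(old)
```

With a real cache you would pass `cache.list_cached_keys()` instead of `sample_entries`.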

Using the Cache Decorator

The @multicache decorator can be used to cache method results:

from gitpandas.cache import multicache

@multicache(
    key_prefix="method_name",
    key_list=["param1", "param2"],
    skip_if=lambda x: x.get("param1") is None
)
def expensive_method(self, param1, param2):
    # Method implementation
    pass
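
To make the roles of key_prefix, key_list and skip_if concrete, here is a simplified stand-in for a decorator of this kind. It is not the Git-Pandas implementation: `toy_multicache`, the `Demo` class and its `_cache` dict are all hypothetical, and the `||` separator is borrowed from the key comments in the shared cache example further down, so the real key layout may differ.

```python
import functools
import inspect


def toy_multicache(key_prefix, key_list, skip_if=None):
    """Simplified stand-in for a key-building cache decorator (not gitpandas code)."""

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # Resolve positional and keyword arguments to a name -> value mapping
            bound = sig.bind(self, *args, **kwargs)
            bound.apply_defaults()
            params = dict(bound.arguments)
            params.pop("self", None)

            # Bypass the cache entirely when skip_if returns True
            if skip_if is not None and skip_if(params):
                return func(self, *args, **kwargs)

            # Build the key from the prefix and the selected parameter values
            key = "||".join([key_prefix] + [str(params.get(n)) for n in key_list])
            if key not in self._cache:
                self._cache[key] = func(self, *args, **kwargs)
            return self._cache[key]

        return wrapper

    return decorator


class Demo:
    def __init__(self):
        self._cache = {}  # plain dict used as the cache store
        self.calls = 0    # counts real (uncached) executions

    @toy_multicache("add", ["a", "b"], skip_if=lambda p: p.get("a") is None)
    def add(self, a, b):
        self.calls += 1
        return (a or 0) + b


demo = Demo()
demo.add(1, 2)     # computed and stored
demo.add(1, 2)     # served from the cache
demo.add(None, 2)  # skip_if matches: computed but never stored
print(demo.calls, list(demo._cache))
```

Only the parameters named in key_list contribute to the key, so any argument that changes the result should be listed there.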

Configuration

Cache backends can be configured with various parameters:

EphemeralCache:

* max_keys: Maximum number of keys to store in memory (default: 1000)

DiskCache:

* filepath: Path to the cache file (required)
* max_keys: Maximum number of keys to store (default: 1000)

RedisDFCache:

* host: Redis host (default: 'localhost')
* port: Redis port (default: 6379)
* db: Redis database number (default: 12)
* max_keys: Maximum number of keys to store (default: 1000)
* ttl: Time-to-live in seconds for cache entries (default: None, no expiration)
* Additional keyword arguments are passed to redis.StrictRedis

Backward Compatibility

The cache timestamp functionality is fully backward compatible:

Existing cache files will continue to work
Old cache entries without timestamps will be automatically converted
No changes to Repository or ProjectDirectory APIs
All existing code continues to work unchanged

Best Practices

Shared Cache Usage

Warning

Recommendation: Use Separate Cache Instances

While it’s technically possible to share the same cache object across multiple Repository instances, we strongly recommend using a separate cache instance for each repository.

Recommended Approach - Separate Caches:

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create separate cache instances for each repository
cache1 = DiskCache(filepath='repo1_cache.gz')
cache2 = DiskCache(filepath='repo2_cache.gz')

repo1 = Repository('/path/to/repo1', cache_backend=cache1)
repo2 = Repository('/path/to/repo2', cache_backend=cache2)

Benefits of Separate Caches:

Complete Isolation: No risk of cache eviction conflicts between repositories
Predictable Memory Usage: Each repository has its own memory budget
Easier Debugging: Cache issues are isolated to specific repositories
Better Performance: No lock contention in multi-threaded scenarios
Clear Cache Management: You can clear or manage each repository’s cache independently

If You Must Share Caches:

If you need to share a cache object across multiple repositories (e.g., because of memory constraints), the system is designed to handle this safely:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Shared cache (not recommended but supported)
shared_cache = EphemeralCache(max_keys=1000)

repo1 = Repository('/path/to/repo1', cache_backend=shared_cache)
repo2 = Repository('/path/to/repo2', cache_backend=shared_cache)

# Each repository gets separate cache entries
files1 = repo1.list_files()  # Creates cache key: list_files||repo1||None
files2 = repo2.list_files()  # Creates cache key: list_files||repo2||None

Shared Cache Considerations:

Repository names are included in cache keys to prevent collisions
Cache eviction affects all repositories sharing the cache
Memory usage is shared across all repositories
Very active repositories may evict cache entries from less active ones
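
The eviction effect can be shown with a toy model. This is a sketch under stated assumptions, not Git-Pandas code: `ToyBoundedCache` is a hypothetical class that drops its oldest entry once max_keys is exceeded, and the actual eviction policy of the real backends may differ. The key strings mimic the format in the comments above.

```python
from collections import OrderedDict


class ToyBoundedCache:
    """Toy cache that drops its oldest entry once max_keys is exceeded."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._store = OrderedDict()

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)  # a re-set key counts as newest
        while len(self._store) > self.max_keys:
            self._store.popitem(last=False)  # evict the oldest key

    def keys(self):
        return list(self._store)


shared = ToyBoundedCache(max_keys=3)

# repo1 caches a single result
shared.set("list_files||repo1||None", "files of repo1")

# repo2 is far more active and fills the whole shared budget
for i in range(3):
    shared.set(f"commit_history||repo2||{i}", f"history {i}")

# repo1's entry is gone even though repo1 itself did nothing
print(shared.keys())
```

With separate caches, repo1's entry would have survived because repo2's writes would count against its own max_keys budget.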

Cache Size Planning

When planning cache sizes, consider:

Repository Size: Larger repositories generate more cache entries
Operation Types: Some operations (like cumulative_blame) create many cache entries
Memory Constraints: Balance cache size with available system memory
Analysis Patterns: Frequently repeated analyses benefit from larger caches

Recommended Cache Sizes:

# Small repositories (< 1000 commits)
cache = EphemeralCache(max_keys=100)

# Medium repositories (1000-10000 commits)
cache = EphemeralCache(max_keys=500)

# Large repositories (> 10000 commits)
cache = EphemeralCache(max_keys=1000)

# For disk/Redis caches, you can use larger sizes
cache = DiskCache(filepath='cache.gz', max_keys=5000)
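
If you create caches programmatically, the tiers above can be wrapped in a small helper. This is only a sketch: `suggest_max_keys` is a hypothetical function, not part of Git-Pandas, and treating exactly 10000 commits as "medium" is an assumption, since the tiers above leave the boundaries open.

```python
def suggest_max_keys(commit_count: int) -> int:
    """Map a commit count to the max_keys tiers recommended above."""
    if commit_count < 0:
        raise ValueError("commit_count must be non-negative")
    if commit_count < 1000:
        return 100   # small repositories
    if commit_count <= 10000:
        return 500   # medium repositories
    return 1000      # large repositories


for commits in (250, 4000, 50000):
    print(commits, suggest_max_keys(commits))
```

The result can be passed straight through, for example `EphemeralCache(max_keys=suggest_max_keys(n))`, where `n` is a commit count you have obtained separately.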