Caching

Git-Pandas supports pluggable cache backends to optimize performance for expensive, repetitive operations. This is particularly useful for large repositories or when running multiple analyses.

Overview

The caching system provides:

  • In-memory caching for temporary results

  • Disk-based caching for persistent storage across sessions

  • Redis-based caching for distributed storage

  • Cache management and invalidation methods

  • Decorator-based caching for expensive operations

  • Cache timestamp tracking - know when cache entries were populated

  • Cache statistics and monitoring - track cache performance and usage

Available Cache Backends

In-Memory Cache (EphemeralCache)

The default in-memory cache is ephemeral and will be cleared when the process ends:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create an in-memory cache with default settings
cache = EphemeralCache()

# Or customize the cache size
cache = EphemeralCache(max_keys=500)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Disk Cache (DiskCache)

For persistent caching that survives between sessions:

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create a disk cache
cache = DiskCache(filepath='/path/to/cache.gz', max_keys=1000)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Redis Cache (RedisDFCache)

For distributed caching that can be shared across processes or machines, use Redis:

from gitpandas import Repository
from gitpandas.cache import RedisDFCache

# Create a Redis cache with default settings
cache = RedisDFCache()

# Or customize Redis connection and cache settings
cache = RedisDFCache(
    host='localhost',
    port=6379,
    db=12,
    max_keys=1000,
    ttl=3600  # Cache entries expire after 1 hour
)

# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)

Cache Timestamp Information

All cache backends now track when cache entries were populated. You can access this information without any changes to the Repository or ProjectDirectory API:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create repository with cache
cache = EphemeralCache()
repo = Repository('/path/to/repo', cache_backend=cache)

# Populate cache with some operations
commit_history = repo.commit_history(limit=10)
file_list = repo.list_files()

# Check what's in the cache and when it was cached
cached_keys = cache.list_cached_keys()
for entry in cached_keys:
    print(f"Key: {entry['key']}")
    print(f"Cached at: {entry['cached_at']}")
    print(f"Age: {entry['age_seconds']:.1f} seconds")

# Get specific cache information
key = "commit_history_main_10_None_None_None_None"
info = cache.get_cache_info(key)
if info:
    print(f"Cache entry age: {info['age_minutes']:.2f} minutes")

Cache Information Methods

All cache backends support these methods for accessing timestamp information:

  • list_cached_keys() - Returns list of all cached keys with metadata

  • get_cache_info(key) - Returns detailed information about a specific cache entry

The returned information includes:

  • cached_at - UTC timestamp when the entry was cached

  • age_seconds - Age of the cache entry in seconds

  • age_minutes - Age of the cache entry in minutes

  • age_hours - Age of the cache entry in hours

  • cache_key - The original cache key
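The metadata fields above lend themselves to simple staleness checks. The sketch below filters entries shaped like the dictionaries printed in the earlier list_cached_keys() example (with key and age_seconds fields). The helper name stale_keys and the sample data are hypothetical and not part of Git-Pandas.

```python
def stale_keys(entries, max_age_seconds):
    """Return the keys of entries older than max_age_seconds."""
    return [e["key"] for e in entries if e["age_seconds"] > max_age_seconds]


# Hypothetical sample shaped like list_cached_keys() output
sample = [
    {"key": "commit_history||repo1||10", "age_seconds": 30.0},
    {"key": "list_files||repo1||None", "age_seconds": 7200.0},
]

# Keys cached more than one hour ago
old = stale_keys(sample, max_age_seconds=3600)
print(old)
```

A list produced this way could then be passed to invalidate_cache(keys=...) to drop only the stale entries.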

Using the Cache Decorator

The @multicache decorator can be used to cache method results:

from gitpandas.cache import multicache

@multicache(
    key_prefix="method_name",
    key_list=["param1", "param2"],
    skip_if=lambda x: x.get("param1") is None
)
def expensive_method(self, param1, param2):
    # Method implementation
    pass
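To make the roles of key_prefix, key_list, and skip_if more concrete, here is a simplified stand-in for this kind of decorator. It is not the Git-Pandas implementation: simple_multicache, the Analyzer class, the plain dict used as a cache, and the "||" key separator are all assumptions made for illustration.

```python
import functools


def simple_multicache(key_prefix, key_list, skip_if=None):
    """Toy stand-in for a keyed method cache (not the Git-Pandas code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, **kwargs):
            # Skip caching entirely (no read, no write) when the predicate matches
            if skip_if is not None and skip_if(kwargs):
                return func(self, **kwargs)
            # Build the key from the prefix and the selected keyword arguments
            parts = [key_prefix] + [str(kwargs.get(name)) for name in key_list]
            key = "||".join(parts)
            if key not in self.cache:
                self.cache[key] = func(self, **kwargs)
            return self.cache[key]
        return wrapper
    return decorator


class Analyzer:
    def __init__(self):
        self.cache = {}  # plain dict used as a hypothetical cache store
        self.calls = 0   # counts how often the real method body runs

    @simple_multicache("expensive", ["param1", "param2"],
                       skip_if=lambda kw: kw.get("param1") is None)
    def expensive_method(self, param1=None, param2=None):
        self.calls += 1
        return (param1, param2)


a = Analyzer()
a.expensive_method(param1=1, param2=2)     # computed and cached
a.expensive_method(param1=1, param2=2)     # served from the cache
a.expensive_method(param1=None, param2=2)  # skip_if matches: not cached
print(a.calls, sorted(a.cache))
```

Only the arguments named in key_list contribute to the key, so two calls that differ in some other argument would share a cache entry in this sketch.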

Configuration

Cache backends can be configured with various parameters:

EphemeralCache:

  • max_keys: Maximum number of keys to store in memory (default: 1000)

DiskCache:

  • filepath: Path to the cache file (required)

  • max_keys: Maximum number of keys to store (default: 1000)

RedisDFCache:

  • host: Redis host (default: 'localhost')

  • port: Redis port (default: 6379)

  • db: Redis database number (default: 12)

  • max_keys: Maximum number of keys to store (default: 1000)

  • ttl: Time-to-live in seconds for cache entries (default: None, no expiration)

  • Additional keyword arguments are passed to redis.StrictRedis

Backward Compatibility

The cache timestamp functionality is fully backward compatible:

  • Existing cache files will continue to work

  • Old cache entries without timestamps will be automatically converted

  • No changes to Repository or ProjectDirectory APIs

  • All existing code continues to work unchanged

Best Practices

Shared Cache Usage

Warning

Recommendation: Use Separate Cache Instances

While it is technically possible to share one cache object across multiple Repository instances, we strongly recommend using a separate cache instance for each repository.

Recommended Approach - Separate Caches:

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create separate cache instances for each repository
cache1 = DiskCache(filepath='repo1_cache.gz')
cache2 = DiskCache(filepath='repo2_cache.gz')

repo1 = Repository('/path/to/repo1', cache_backend=cache1)
repo2 = Repository('/path/to/repo2', cache_backend=cache2)

Benefits of Separate Caches:

  • Complete Isolation: No risk of cache eviction conflicts between repositories

  • Predictable Memory Usage: Each repository has its own memory budget

  • Easier Debugging: Cache issues are isolated to specific repositories

  • Better Performance: No lock contention in multi-threaded scenarios

  • Clear Cache Management: You can clear or manage each repository’s cache independently

If You Must Share Caches:

If you need to share a cache object across multiple repositories (e.g., because of memory constraints), the system is designed to handle this safely:

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Shared cache (not recommended but supported)
shared_cache = EphemeralCache(max_keys=1000)

repo1 = Repository('/path/to/repo1', cache_backend=shared_cache)
repo2 = Repository('/path/to/repo2', cache_backend=shared_cache)

# Each repository gets separate cache entries
files1 = repo1.list_files()  # Creates cache key: list_files||repo1||None
files2 = repo2.list_files()  # Creates cache key: list_files||repo2||None

Shared Cache Considerations:

  • Repository names are included in cache keys to prevent collisions

  • Cache eviction affects all repositories sharing the cache

  • Memory usage is shared across all repositories

  • Very active repositories may evict cache entries from less active ones
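The last point can be illustrated with a small, self-contained model of LRU eviction. TinyLRU is a hypothetical class written for this sketch and is not the Git-Pandas cache; the key strings only imitate the format shown in the comments above.

```python
from collections import OrderedDict


class TinyLRU:
    """Minimal LRU store used only to illustrate shared eviction."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict least recently used entries once over capacity
        while len(self._data) > self.max_keys:
            self._data.popitem(last=False)

    def exists(self, key):
        return key in self._data


shared = TinyLRU(max_keys=3)

# A quiet repository caches one result...
shared.set("list_files||repo2||None", "quiet repo result")

# ...then a busy repository fills the shared budget
for i in range(3):
    shared.set(f"commit_history||repo1||{i}", "busy repo result")

print(shared.exists("list_files||repo2||None"))
```

In this model the quiet repository's entry is gone after the busy repository's third write, which is the cross-repository eviction that separate cache instances avoid.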

Cache Size Planning

When planning cache sizes, consider:

  • Repository Size: Larger repositories generate more cache entries

  • Operation Types: Some operations (like cumulative_blame) create many cache entries

  • Memory Constraints: Balance cache size with available system memory

  • Analysis Patterns: Frequently repeated analyses benefit from larger caches

Recommended Cache Sizes:

# Small repositories (< 1000 commits)
cache = EphemeralCache(max_keys=100)

# Medium repositories (1000-10000 commits)
cache = EphemeralCache(max_keys=500)

# Large repositories (> 10000 commits)
cache = EphemeralCache(max_keys=1000)

# For disk/Redis caches, you can use larger sizes
cache = DiskCache(filepath='cache.gz', max_keys=5000)
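The thresholds above can be wrapped in a small helper. This is only a sketch of the guidance in this section: suggested_max_keys is a hypothetical function, and the treatment of the exact boundary values (1000 and 10000 commits) is an assumption.

```python
def suggested_max_keys(commit_count):
    """Map a commit count to the in-memory cache size recommended above."""
    if commit_count < 1000:
        return 100       # small repositories
    if commit_count <= 10000:
        return 500       # medium repositories
    return 1000          # large repositories


for commits in (250, 4000, 50000):
    print(commits, suggested_max_keys(commits))
```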

API Reference

class gitpandas.cache.EphemeralCache(max_keys=1000)

Bases: object

A simple in-memory cache.

__init__(max_keys=1000)
evict(n=1)
set(k, v)
get(k)
_get_entry(k)

Internal method that returns the CacheEntry object.

exists(k)
get_cache_info(k)

Get cache entry metadata for a key.

list_cached_keys()

List all cached keys with their metadata.

invalidate_cache(keys=None, pattern=None)

Invalidate specific cache entries or all entries.

Parameters:
  • keys (Optional[List[str]]) – List of specific keys to invalidate

  • pattern (Optional[str]) – Pattern to match keys (supports * wildcard)

Note

If both keys and pattern are None, all cache entries are invalidated.
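The selection rules for keys and pattern can be modeled as follows. This is a sketch, not the library's code: select_keys is a hypothetical helper, and matching with fnmatchcase is an assumption (the documentation only states that * is supported, whereas fnmatch also interprets ? and [...]).

```python
from fnmatch import fnmatchcase


def select_keys(all_keys, keys=None, pattern=None):
    """Return the cached keys that an invalidation request would target."""
    if keys is None and pattern is None:
        return list(all_keys)  # nothing specified: everything is selected
    selected = set(keys or [])
    if pattern is not None:
        selected.update(k for k in all_keys if fnmatchcase(k, pattern))
    # Preserve the original ordering and ignore keys that are not cached
    return [k for k in all_keys if k in selected]


cached = [
    "commit_history||repo1||10",
    "list_files||repo1||None",
    "list_files||repo2||None",
]
print(select_keys(cached, pattern="list_files*"))
print(select_keys(cached, keys=["commit_history||repo1||10"]))
print(select_keys(cached))
```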

get_cache_stats()

Get comprehensive cache statistics.

Returns:

Cache statistics including size, hit rates, and age information

Return type:

dict

save()

Empty save method for compatibility with DiskCache.

class gitpandas.cache.DiskCache(filepath, max_keys=1000)

Bases: EphemeralCache

An in-memory cache that can be persisted to disk using pickle.

Inherits LRU eviction logic from EphemeralCache. Thread-safe for concurrent access.

__init__(filepath, max_keys=1000)

Initializes the cache. Tries to load from the specified filepath if it exists.

Parameters:
  • filepath – Path to the file for persisting the cache.

  • max_keys – Maximum number of keys to keep in the cache (LRU).

set(k, v)

Thread-safe set operation that prevents nested save calls.

get(k)

Thread-safe get operation with disk loading capability.

_get_entry(k)

Internal method that returns the CacheEntry object.

exists(k)

Thread-safe exists check.

evict(n=1)

Thread-safe eviction.

load()

Loads the cache state (_cache dictionary and _key_list) from the specified filepath using pickle. Handles file not found and deserialization errors. Thread-safe operation.

save()

Saves the current cache state (_cache dictionary and _key_list) to the specified filepath using pickle. Creates parent directories if needed. Thread-safe operation.
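The load and save behavior described above (pickle serialization, parent directory creation, tolerant loading) can be sketched with the standard library. The function names save_state and load_state are hypothetical, and the gzip compression is an assumption suggested only by the .gz file names used in the examples; the actual on-disk format may differ.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def save_state(filepath, state):
    """Pickle state to filepath, creating parent directories if needed."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as fh:
        pickle.dump(state, fh)


def load_state(filepath):
    """Return the unpickled state, or None if the file is missing or unreadable."""
    try:
        with gzip.open(filepath, "rb") as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        # OSError covers a missing file and a file that is not valid gzip
        return None


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "cache.gz"
    save_state(target, {"_cache": {"k": "v"}, "_key_list": ["k"]})
    restored = load_state(target)

print(restored)
```

Because pickle can execute arbitrary code during loading, cache files of this kind should only be read from trusted locations.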

get_cache_info(k)

Get cache entry metadata for a key.

get_cache_stats()

Get comprehensive cache statistics.

Returns:

Cache statistics including size, hit rates, and age information

Return type:

dict

invalidate_cache(keys=None, pattern=None)

Invalidate specific cache entries or all entries.

Parameters:
  • keys (Optional[List[str]]) – List of specific keys to invalidate

  • pattern (Optional[str]) – Pattern to match keys (supports * wildcard)

Note

If both keys and pattern are None, all cache entries are invalidated.

list_cached_keys()

List all cached keys with their metadata.

class gitpandas.cache.RedisDFCache(host='localhost', port=6379, db=12, max_keys=1000, ttl=None, **kwargs)

Bases: object

A Redis-based cache, using redis-py under the hood.

Parameters:
  • host – default localhost

  • port – default 6379

  • db – the database to use, default 12

  • max_keys – the max number of keys to cache, default 1000

  • ttl – time to live for any cached results, default None

  • kwargs – additional options available to redis.StrictRedis

__init__(host='localhost', port=6379, db=12, max_keys=1000, ttl=None, **kwargs)
evict(n=1)
set(orik, v)
get(orik)
_get_entry(orik)

Internal method that returns the CacheEntry object.

exists(k)
sync()

Syncs the key list with what is in Redis. Returns None.

get_cache_info(orik)

Get cache entry metadata for a key.

list_cached_keys()

List all cached keys with their metadata.

invalidate_cache(keys=None, pattern=None)

Invalidate specific cache entries or all entries.

Parameters:
  • keys (Optional[List[str]]) – List of specific keys to invalidate (without prefix)

  • pattern (Optional[str]) – Pattern to match keys (supports * wildcard, without prefix)

Note

If both keys and pattern are None, all cache entries are invalidated.

get_cache_stats()

Get comprehensive cache statistics.

Returns:

Cache statistics including size, hit rates, and age information

Return type:

dict

purge()

gitpandas.cache.multicache(key_prefix, key_list, skip_if=None)

Decorator to cache the results of a method call.

Parameters:
  • key_prefix (str) – Prefix for the cache key.

  • key_list (list[str]) – List of argument names (from kwargs) to include in the cache key.

  • skip_if (callable, optional) – A function that takes kwargs and returns True if caching should be skipped entirely (no read, no write). Defaults to None.

The decorated method can accept an optional force_refresh=True argument to bypass the cache read but still update the cache with the new result. This force_refresh state propagates to nested calls on the same object instance.
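The force_refresh behavior (bypass the cache read, still write the fresh result) can be modeled with a toy decorator. This is not the Git-Pandas implementation: cached_with_refresh and the Counter class are hypothetical, and the propagation of force_refresh to nested calls is not modeled here.

```python
import functools


def cached_with_refresh(func):
    """Toy cache decorator that honors force_refresh (not the Git-Pandas code)."""
    @functools.wraps(func)
    def wrapper(self, *args, force_refresh=False, **kwargs):
        key = (func.__name__, args, tuple(sorted(kwargs.items())))
        # Read from the cache unless a refresh was requested
        if not force_refresh and key in self.cache:
            return self.cache[key]
        result = func(self, *args, **kwargs)
        self.cache[key] = result  # written even when the read was bypassed
        return result
    return wrapper


class Counter:
    def __init__(self):
        self.cache = {}  # plain dict used as a hypothetical cache store
        self.value = 0

    @cached_with_refresh
    def next_value(self):
        self.value += 1
        return self.value


c = Counter()
first = c.next_value()                    # computed
second = c.next_value()                   # served from the cache
third = c.next_value(force_refresh=True)  # recomputed, cache updated
fourth = c.next_value()                   # served from the updated cache
print(first, second, third, fourth)
```

The fourth call returns the refreshed value rather than the original one, which distinguishes force_refresh from skip_if: a skipped call would leave the cache untouched.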

class gitpandas.cache.CacheEntry(data, cache_key=None)

Bases: object

Wrapper for cached values that includes metadata.

__init__(data, cache_key=None)
to_dict()

Convert to dictionary for serialization.

classmethod from_dict(d)

Create CacheEntry from dictionary.

age_seconds()

Return age of cache entry in seconds.

age_minutes()

Return age of cache entry in minutes.

age_hours()

Return age of cache entry in hours.
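As an illustration of what such a metadata wrapper might look like, the sketch below stores a UTC timestamp next to the data and derives the age values from it. EntrySketch is a hypothetical class: its attribute names, the extra cached_at constructor argument, and the dictionary layout are assumptions, not the actual CacheEntry internals.

```python
from datetime import datetime, timezone


class EntrySketch:
    """Hypothetical metadata wrapper, loosely modeled on the description above."""

    def __init__(self, data, cache_key=None, cached_at=None):
        self.data = data
        self.cache_key = cache_key
        # Record the creation time in UTC unless one is supplied
        self.cached_at = cached_at or datetime.now(timezone.utc)

    def to_dict(self):
        return {
            "data": self.data,
            "cache_key": self.cache_key,
            "cached_at": self.cached_at.isoformat(),
        }

    @classmethod
    def from_dict(cls, d):
        return cls(d["data"], d.get("cache_key"),
                   datetime.fromisoformat(d["cached_at"]))

    def age_seconds(self):
        return (datetime.now(timezone.utc) - self.cached_at).total_seconds()

    def age_minutes(self):
        return self.age_seconds() / 60

    def age_hours(self):
        return self.age_seconds() / 3600


entry = EntrySketch([1, 2, 3], cache_key="list_files||repo1||None")
copy = EntrySketch.from_dict(entry.to_dict())
print(copy.cache_key, copy.cached_at == entry.cached_at)
```

Serializing the timestamp as an ISO 8601 string keeps the dictionary form independent of the in-memory datetime object, so the age of an entry survives a round trip through storage.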