Caching¶
Git-Pandas supports pluggable cache backends to optimize performance for expensive, repetitive operations. This is particularly useful for large repositories or when running multiple analyses.
Overview¶
The caching system provides:
* In-memory caching for temporary results
* Disk-based caching for persistent storage across sessions
* Redis-based caching for distributed storage
* Cache management and invalidation methods
* Decorator-based caching for expensive operations
* Cache timestamp tracking - know when cache entries were populated
* Cache statistics and monitoring - track cache performance and usage
Available Cache Backends¶
In-Memory Cache (EphemeralCache)¶
The default in-memory cache is ephemeral and will be cleared when the process ends:
from gitpandas import Repository
from gitpandas.cache import EphemeralCache
# Create an in-memory cache with default settings
cache = EphemeralCache()
# Or customize the cache size
cache = EphemeralCache(max_keys=500)
# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)
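To make the effect of max_keys concrete, here is a minimal sketch of least-recently-used (LRU) eviction, the policy the API reference attributes to EphemeralCache. TinyLRUCache and its set/get methods are hypothetical names for illustration only; this is not the gitpandas implementation.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Hypothetical bounded in-memory cache; illustration only, not gitpandas code."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys
        self._store = OrderedDict()

    def set(self, key, value):
        # Re-inserting an existing key moves it to the most-recent position.
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        # Evict least recently used entries until the limit is respected.
        while len(self._store) > self.max_keys:
            self._store.popitem(last=False)

    def get(self, key):
        # A read counts as a use, so the key becomes the most recent one.
        value = self._store[key]
        self._store.move_to_end(key)
        return value


lru = TinyLRUCache(max_keys=2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")     # "a" is now the most recently used key
lru.set("c", 3)  # exceeds max_keys, so the oldest key ("b") is evicted
remaining = list(lru._store)
```

With a small max_keys, results that are requested rarely are the first to be dropped, which is why frequently repeated analyses benefit from a larger limit.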
Disk Cache (DiskCache)¶
For persistent caching that survives between sessions:
from gitpandas import Repository
from gitpandas.cache import DiskCache
# Create a disk cache
cache = DiskCache(filepath='/path/to/cache.gz', max_keys=1000)
# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)
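The API reference states that DiskCache persists its state with pickle, creates parent directories on save, and tolerates a missing or unreadable file on load. The sketch below illustrates that round trip using only the standard library. The helper names save_state and load_state are hypothetical, and the use of gzip compression is an assumption suggested by the .gz file extension, not something the source confirms.

```python
import gzip
import os
import pickle
import tempfile


def save_state(state, filepath):
    """Write a picklable object to a gzip-compressed file (assumed format)."""
    # Create parent directories if they do not exist yet.
    os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
    with gzip.open(filepath, "wb") as fh:
        pickle.dump(state, fh)


def load_state(filepath):
    """Return the stored object, or an empty dict if nothing usable is on disk."""
    try:
        with gzip.open(filepath, "rb") as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing file, bad gzip data, or a corrupt pickle: start with an empty cache.
        return {}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nested", "cache.gz")
    missing = load_state(path)                       # no file yet
    save_state({"commit_history": [1, 2, 3]}, path)  # creates "nested/"
    restored = load_state(path)
```

Because pickle can execute arbitrary code while loading, a cache file should only be read from a location you trust.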
Redis Cache (RedisDFCache)¶
For distributed caching that can be shared across processes or machines, use Redis:
from gitpandas import Repository
from gitpandas.cache import RedisDFCache
# Create a Redis cache with default settings
cache = RedisDFCache()
# Or customize Redis connection and cache settings
cache = RedisDFCache(
    host='localhost',
    port=6379,
    db=12,
    max_keys=1000,
    ttl=3600,  # Cache entries expire after 1 hour
)
# Use the cache with a repository
repo = Repository('/path/to/repo', cache_backend=cache)
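The ttl setting means an entry stops being served once it is older than the given number of seconds. In Redis the server handles expiry itself; the sketch below only mimics the observable behaviour in plain Python so it can run without a Redis server. TinyTTLStore and its injectable clock are hypothetical and are not part of gitpandas or redis-py.

```python
import time


class TinyTTLStore:
    """Hypothetical in-process store that imitates time-to-live expiry."""

    def __init__(self, ttl=None, clock=time.monotonic):
        self.ttl = ttl          # seconds; None means entries never expire
        self._clock = clock     # injectable so the example can control time
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, stored_at = item
        if self.ttl is not None and self._clock() - stored_at >= self.ttl:
            # The entry is too old: drop it and report a miss.
            del self._store[key]
            return default
        return value


now = [0.0]  # fake clock, advanced manually below
store = TinyTTLStore(ttl=3600, clock=lambda: now[0])
store.set("blame", "result")
fresh = store.get("blame")    # still within the hour
now[0] = 3601.0
expired = store.get("blame")  # older than ttl, treated as a miss
```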
Cache Timestamp Information¶
All cache backends now track when cache entries were populated. You can access this information without any changes to the Repository or ProjectDirectory API:
from gitpandas import Repository
from gitpandas.cache import EphemeralCache
# Create repository with cache
cache = EphemeralCache()
repo = Repository('/path/to/repo', cache_backend=cache)
# Populate cache with some operations
commit_history = repo.commit_history(limit=10)
file_list = repo.list_files()
# Check what's in the cache and when it was cached
cached_keys = cache.list_cached_keys()
for entry in cached_keys:
    print(f"Key: {entry['key']}")
    print(f"Cached at: {entry['cached_at']}")
    print(f"Age: {entry['age_seconds']:.1f} seconds")
# Get specific cache information
key = "commit_history_main_10_None_None_None_None"
info = cache.get_cache_info(key)
if info:
    print(f"Cache entry age: {info['age_minutes']:.2f} minutes")
Cache Information Methods¶
All cache backends support these methods for accessing timestamp information:
* list_cached_keys() - Returns a list of all cached keys with metadata
* get_cache_info(key) - Returns detailed information about a specific cache entry
The returned information includes:
* cached_at - UTC timestamp when the entry was cached
* age_seconds - Age of the cache entry in seconds
* age_minutes - Age of the cache entry in minutes
* age_hours - Age of the cache entry in hours
* cache_key - The original cache key
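The age fields are all derived from the same cached_at timestamp. As a rough illustration of how they relate, the sketch below builds a dictionary with the field names listed above from a UTC timestamp. The describe_entry helper is hypothetical and is not how gitpandas computes these values internally.

```python
from datetime import datetime, timedelta, timezone


def describe_entry(cache_key, cached_at, now=None):
    """Build a metadata dict with the documented field names (illustrative only)."""
    now = now or datetime.now(timezone.utc)
    age_seconds = (now - cached_at).total_seconds()
    return {
        "cache_key": cache_key,
        "cached_at": cached_at,
        "age_seconds": age_seconds,
        "age_minutes": age_seconds / 60,
        "age_hours": age_seconds / 3600,
    }


cached_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
info = describe_entry(
    "commit_history_main_10",
    cached_at,
    now=cached_at + timedelta(minutes=90),
)
# info["age_minutes"] is 90.0 and info["age_hours"] is 1.5
```

A check such as `info["age_hours"] > 24` is one simple way to decide whether a cached result is stale enough to recompute.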
Using the Cache Decorator¶
The @multicache decorator can be used to cache method results:
from gitpandas.cache import multicache
@multicache(
key_prefix="method_name",
key_list=["param1", "param2"],
skip_if=lambda x: x.get("param1") is None
)
def expensive_method(self, param1, param2):
    # Method implementation
    pass
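To show how key_prefix, key_list, and skip_if could interact, here is a simplified stand-in decorator. It assumes the cache key is the prefix joined with the string form of each listed argument, and that skip_if receives a dictionary of the call's keyword arguments; both are assumptions for illustration. sketch_cache and Analyzer are hypothetical names, and the real @multicache may differ in every detail.

```python
import functools


def sketch_cache(key_prefix, key_list, skip_if=None):
    """Simplified, hypothetical analogue of a key-based caching decorator."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, **kwargs):
            # Bypass caching entirely when the skip predicate matches.
            if skip_if is not None and skip_if(kwargs):
                return func(self, **kwargs)
            # Assumed key layout: prefix followed by each listed argument value.
            parts = [key_prefix] + [str(kwargs.get(name)) for name in key_list]
            key = "_".join(parts)
            if key not in self.cache:
                self.cache[key] = func(self, **kwargs)
            return self.cache[key]

        return wrapper

    return decorator


class Analyzer:
    def __init__(self):
        self.cache = {}  # a plain dict stands in for a cache backend
        self.calls = 0

    @sketch_cache("history", ["branch", "limit"],
                  skip_if=lambda kw: kw.get("branch") is None)
    def history(self, branch=None, limit=None):
        self.calls += 1
        return f"{branch}:{limit}"


a = Analyzer()
a.history(branch="main", limit=10)  # computed and stored
a.history(branch="main", limit=10)  # served from a.cache
a.history(limit=10)                 # branch is None, so caching is skipped
```

The sketch accepts keyword arguments only, which keeps key construction simple; handling positional arguments would require inspecting the function signature.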
Configuration¶
Cache backends can be configured with various parameters:
EphemeralCache:
* max_keys: Maximum number of keys to store in memory (default: 1000)
DiskCache:
* filepath: Path to the cache file (required)
* max_keys: Maximum number of keys to store (default: 1000)
RedisDFCache:
* host: Redis host (default: 'localhost')
* port: Redis port (default: 6379)
* db: Redis database number (default: 12)
* max_keys: Maximum number of keys to store (default: 1000)
* ttl: Time-to-live in seconds for cache entries (default: None, no expiration)
* Additional keyword arguments are passed to redis.StrictRedis
Backward Compatibility¶
The cache timestamp functionality is fully backward compatible:
* Existing cache files will continue to work
* Old cache entries without timestamps will be automatically converted
* No changes to Repository or ProjectDirectory APIs
* All existing code continues to work unchanged
Best Practices¶
Cache Size Planning¶
When planning cache sizes, consider:
* Repository Size: Larger repositories generate more cache entries
* Operation Types: Some operations (like cumulative_blame) create many cache entries
* Memory Constraints: Balance cache size with available system memory
* Analysis Patterns: Frequently repeated analyses benefit from larger caches
Recommended Cache Sizes:
# Small repositories (< 1000 commits)
cache = EphemeralCache(max_keys=100)
# Medium repositories (1000-10000 commits)
cache = EphemeralCache(max_keys=500)
# Large repositories (> 10000 commits)
cache = EphemeralCache(max_keys=1000)
# For disk/Redis caches, you can use larger sizes
cache = DiskCache(filepath='cache.gz', max_keys=5000)
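If the commit count is known up front, the recommendations above can be folded into a small helper. This is only a convenience sketch: suggest_max_keys is a hypothetical function, and the thresholds are copied from the recommended sizes rather than derived from any measurement.

```python
def suggest_max_keys(commit_count):
    """Map a commit count to the recommended in-memory max_keys value."""
    if commit_count < 1000:
        return 100   # small repositories
    if commit_count <= 10000:
        return 500   # medium repositories
    return 1000      # large repositories


small = suggest_max_keys(250)
medium = suggest_max_keys(4000)
large = suggest_max_keys(50000)
```

The result can then be passed as max_keys when constructing a cache backend.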
API Reference¶
- class gitpandas.cache.EphemeralCache(max_keys=1000)[source]¶
Bases: object
A simple in-memory cache.
- invalidate_cache(keys=None, pattern=None)[source]¶
Invalidate specific cache entries or all entries.
- Parameters:
Note
If both keys and pattern are None, all cache entries are invalidated.
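The three invalidation modes (explicit keys, a pattern, or everything) can be sketched on a plain dictionary. The pattern syntax is not specified here, so shell-style wildcards via fnmatch are an assumption, and the invalidate function is a hypothetical illustration rather than the library's code.

```python
from fnmatch import fnmatch


def invalidate(store, keys=None, pattern=None):
    """Remove entries from a dict by explicit key, by wildcard pattern, or all."""
    if keys is None and pattern is None:
        # No selector given: clear every entry.
        store.clear()
        return
    doomed = set(keys or [])
    if pattern is not None:
        # Assumed shell-style matching, e.g. "commit_history*".
        doomed.update(k for k in store if fnmatch(k, pattern))
    for k in doomed:
        store.pop(k, None)  # ignore keys that are not present


store = {
    "commit_history_main": 1,
    "commit_history_dev": 2,
    "blame_main": 3,
    "list_files": 4,
}
invalidate(store, pattern="commit_history*")
after_pattern = sorted(store)
invalidate(store, keys=["blame_main"])
after_keys = sorted(store)
invalidate(store)
after_all = dict(store)
```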
- class gitpandas.cache.DiskCache(filepath, max_keys=1000)[source]¶
Bases: EphemeralCache
An in-memory cache that can be persisted to disk using pickle.
Inherits LRU eviction logic from EphemeralCache. Thread-safe for concurrent access.
- __init__(filepath, max_keys=1000)[source]¶
Initializes the cache. Tries to load from the specified filepath if it exists.
- Parameters:
filepath – Path to the file for persisting the cache.
max_keys – Maximum number of keys to keep in the cache (LRU).
- load()[source]¶
Loads the cache state (_cache dictionary and _key_list) from the specified filepath using pickle. Handles file not found and deserialization errors. Thread-safe operation.
- save()[source]¶
Saves the current cache state (_cache dictionary and _key_list) to the specified filepath using pickle. Creates parent directories if needed. Thread-safe operation.
- get_cache_info(k)¶
Get cache entry metadata for a key.
- get_cache_stats()¶
Get comprehensive cache statistics.
- Returns:
Cache statistics including size, hit rates, and age information
- Return type:
- invalidate_cache(keys=None, pattern=None)¶
Invalidate specific cache entries or all entries.
- Parameters:
Note
If both keys and pattern are None, all cache entries are invalidated.
- list_cached_keys()¶
List all cached keys with their metadata.
- class gitpandas.cache.RedisDFCache(host='localhost', port=6379, db=12, max_keys=1000, ttl=None, **kwargs)[source]¶
Bases: object
A Redis-based cache, using redis-py under the hood.
- Parameters:
host – default localhost
port – default 6379
db – the database to use, default 12
max_keys – the max number of keys to cache, default 1000
ttl – time to live for any cached results, default None
kwargs – additional options available to redis.StrictRedis
- invalidate_cache(keys=None, pattern=None)[source]¶
Invalidate specific cache entries or all entries.
- Parameters:
Note
If both keys and pattern are None, all cache entries are invalidated.
- gitpandas.cache.multicache(key_prefix, key_list, skip_if=None)[source]¶
Decorator to cache the results of a method call.
- Parameters:
The decorated method can accept an optional force_refresh=True argument to bypass the cache read but still update the cache with the new result. This force_refresh state propagates to nested calls on the same object instance.
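The force_refresh behaviour described above (skip the cache read, but still write the fresh result back) can be illustrated with a small stand-in decorator. The names refreshable and Counter are hypothetical, and the sketch does not model propagation of force_refresh to nested calls.

```python
import functools


def refreshable(func):
    """Hypothetical caching wrapper that honours a force_refresh keyword."""

    @functools.wraps(func)
    def wrapper(self, *args, force_refresh=False, **kwargs):
        key = (func.__name__, args, tuple(sorted(kwargs.items())))
        # Only read from the cache when a refresh was not requested.
        if not force_refresh and key in self.cache:
            return self.cache[key]
        result = func(self, *args, **kwargs)
        # Always store the new result, including on a forced refresh.
        self.cache[key] = result
        return result

    return wrapper


class Counter:
    def __init__(self):
        self.cache = {}
        self.value = 0

    @refreshable
    def read(self):
        self.value += 1
        return self.value


c = Counter()
first = c.read()                    # computed
second = c.read()                   # served from the cache
third = c.read(force_refresh=True)  # recomputed and written back
fourth = c.read()                   # sees the refreshed value
```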