Remote Operations and Cache Warming

Git-pandas provides safe, efficient methods for working with remote repositories and for optimizing performance through cache warming. These features let you keep repositories current without touching the working directory, and speed up subsequent analysis through intelligent caching.

Safe Remote Fetch

The safe_fetch_remote method allows you to safely fetch changes from remote repositories without modifying your working directory or current branch.

Repository.safe_fetch_remote()

Repository.safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)[source]

Safely fetch changes from a remote repository.

Fetches the latest changes from a remote repository without modifying the working directory. This is a read-only operation that only updates remote-tracking branches.

Parameters:
  • remote_name (str, optional) – Name of the remote to fetch from. Defaults to 'origin'.

  • prune (bool, optional) – Remove remote-tracking branches that no longer exist on remote. Defaults to False.

  • dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.

Returns:

Fetch results with keys:
  • success (bool): Whether the fetch was successful

  • message (str): Status message or error description

  • remote_exists (bool): Whether the specified remote exists

  • changes_available (bool): Whether new changes were fetched

  • error (Optional[str]): Error message if fetch failed

Return type:

dict

Note

This method is safe as it only fetches remote changes and never modifies the working directory or current branch. It will not perform any merges, rebases, or checkouts.

Basic Usage

from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create repository with caching
cache = EphemeralCache(max_keys=100)
repo = Repository('/path/to/repo', cache_backend=cache)

# Perform a dry run to see what would be fetched
dry_result = repo.safe_fetch_remote(dry_run=True)
print(f"Dry run: {dry_result['message']}")

# Safely fetch changes
if dry_result['remote_exists']:
    result = repo.safe_fetch_remote()
    if result['success']:
        print(f"Fetch completed: {result['message']}")
        if result['changes_available']:
            print("New changes are available!")
    else:
        print(f"Fetch failed: {result['error']}")

Advanced Options

# Fetch from a specific remote
result = repo.safe_fetch_remote(remote_name='upstream')

# Fetch and prune deleted remote branches
result = repo.safe_fetch_remote(prune=True)

# Perform dry run to preview without fetching
result = repo.safe_fetch_remote(dry_run=True)

Safety Features

  • Read-only operation: Never modifies working directory or current branch

  • Error handling: Gracefully handles network errors and missing remotes

  • Validation: Checks for remote existence before attempting fetch

  • Dry run support: Preview operations without making changes
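The result dictionary documented above lends itself to a small status helper. The function below is a hypothetical sketch, not part of the git-pandas API; it assumes only the documented keys (success, remote_exists, changes_available, error, message):

```python
def describe_fetch_result(result):
    """Summarize a safe_fetch_remote() result dict as a one-line status."""
    if not result.get("remote_exists", False):
        return f"no remote configured: {result.get('message', '')}"
    if not result.get("success", False):
        return f"fetch failed: {result.get('error') or 'unknown error'}"
    if result.get("changes_available", False):
        return "fetch succeeded: new changes available"
    return "fetch succeeded: already up to date"

# A sample dict shaped like the documented return value
sample = {"success": True, "remote_exists": True,
          "changes_available": False, "message": "Fetched from origin"}
print(describe_fetch_result(sample))  # fetch succeeded: already up to date
```

Checking remote_exists before success mirrors the validation order the method itself performs.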

Cache Warming

Cache warming pre-populates the cache with commonly used data to improve performance of subsequent analysis operations.

Repository.warm_cache()

Repository.warm_cache(methods=None, **kwargs)[source]

Pre-populate cache with commonly used data.

Executes a set of commonly used repository analysis methods to populate the cache, improving performance for subsequent calls. Only methods that support caching will be executed.

Parameters:
  • methods (Optional[List[str]]) – List of method names to pre-warm. If None, uses a default set of commonly used methods. Available methods:
      • 'commit_history': Load commit history
      • 'branches': Load branch information
      • 'tags': Load tag information
      • 'blame': Load blame information
      • 'file_detail': Load file details
      • 'list_files': Load file listing
      • 'file_change_rates': Load file change statistics

  • **kwargs – Additional keyword arguments to pass to the methods. Common arguments include:
      • branch: Branch to analyze (default: the repository's default branch)
      • limit: Limit the number of commits to analyze
      • ignore_globs: List of glob patterns to ignore
      • include_globs: List of glob patterns to include

Returns:

Results of cache warming operations with keys:
  • success (bool): Whether cache warming was successful

  • methods_executed (List[str]): List of methods that were executed

  • methods_failed (List[str]): List of methods that failed

  • cache_entries_created (int): Number of cache entries created

  • execution_time (float): Total execution time in seconds

  • errors (List[str]): List of error messages for failed methods

Return type:

dict

Note

This method will only execute methods if a cache backend is configured. If no cache backend is available, it will return immediately with a success status but no methods executed.
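Because a missing cache backend yields a "successful" result that did no work, callers may want to distinguish the two cases. The helper below is a hypothetical sketch that relies only on the documented result keys:

```python
def warming_had_effect(result):
    """Return True if warm_cache() actually executed any methods.

    A success with an empty methods_executed list typically means no cache
    backend was configured (see the note above).
    """
    return result.get("success", False) and bool(result.get("methods_executed"))

# Shape returned when no cache backend is configured, per the note above
no_backend = {"success": True, "methods_executed": [], "methods_failed": [],
              "cache_entries_created": 0}
print(warming_had_effect(no_backend))  # False
```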

Basic Usage

from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create repository with persistent cache
cache = DiskCache('/tmp/my_cache.gz', max_keys=200)
repo = Repository('/path/to/repo', cache_backend=cache)

# Warm cache with default methods
result = repo.warm_cache()
print(f"Cache warming completed in {result['execution_time']:.2f} seconds")
print(f"Created {result['cache_entries_created']} cache entries")
print(f"Methods executed: {result['methods_executed']}")

Custom Cache Warming

# Warm specific methods with custom parameters
result = repo.warm_cache(
    methods=['commit_history', 'blame', 'file_detail'],
    limit=100,
    branch='main',
    ignore_globs=['*.log', '*.tmp']
)

# Check results
if result['success']:
    print(f"Successfully warmed {len(result['methods_executed'])} methods")
else:
    print(f"Errors occurred: {result['errors']}")

Available Methods

The following methods can be warmed:

  • commit_history: Load commit history

  • branches: Load branch information

  • tags: Load tag information

  • blame: Load blame information

  • file_detail: Load file details

  • list_files: Load file listing

  • file_change_rates: Load file change statistics

Performance Benefits

Cache warming can significantly improve performance:

import time

# Test cold performance
start = time.time()
history_cold = repo.commit_history(limit=100)
cold_time = time.time() - start

# Warm the cache
repo.warm_cache(methods=['commit_history'], limit=100)

# Test warm performance
start = time.time()
history_warm = repo.commit_history(limit=100)
warm_time = time.time() - start

speedup = cold_time / warm_time if warm_time > 0 else float('inf')
print(f"Cache warming provided {speedup:.1f}x speedup!")
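For short intervals like a fully cached call, time.perf_counter() gives more reliable measurements than time.time(). A small, hypothetical timing harness along these lines keeps the comparison tidy:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical use against a repository:
#   history, cold_time = timed(repo.commit_history, limit=100)
value, elapsed = timed(sum, [1, 2, 3])
print(value, elapsed >= 0.0)
```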

Bulk Operations

For projects with multiple repositories, bulk operations allow you to efficiently fetch and warm caches across all repositories.

ProjectDirectory.bulk_fetch_and_warm()

ProjectDirectory.bulk_fetch_and_warm(fetch_remote=False, warm_cache=False, parallel=True, remote_name='origin', prune=False, dry_run=False, cache_methods=None, **kwargs)[source]

Safely fetch remote changes and pre-warm cache for all repositories.

Performs bulk operations across all repositories in the project directory, optionally fetching from remote repositories and pre-warming caches to improve subsequent analysis performance.

Parameters:
  • fetch_remote (bool, optional) – Whether to fetch from remote repositories. Defaults to False.

  • warm_cache (bool, optional) – Whether to pre-warm repository caches. Defaults to False.

  • parallel (bool, optional) – Use parallel processing when available (joblib). Defaults to True.

  • remote_name (str, optional) – Name of remote to fetch from. Defaults to 'origin'.

  • prune (bool, optional) – Remove remote-tracking branches that no longer exist. Defaults to False.

  • dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.

  • cache_methods (Optional[List[str]]) – List of methods to use for cache warming. If None, uses default methods. See Repository.warm_cache for available methods.

  • **kwargs – Additional keyword arguments to pass to cache warming methods.

Returns:

Results with keys:
  • success (bool): Whether the overall operation was successful

  • repositories_processed (int): Number of repositories processed

  • fetch_results (dict): Per-repository fetch results (if fetch_remote=True)

  • cache_results (dict): Per-repository cache warming results (if warm_cache=True)

  • execution_time (float): Total execution time in seconds

  • summary (dict): Summary statistics of the operation

Return type:

dict

Note

This method safely handles errors at the repository level, ensuring that failures in one repository don’t affect processing of others. All operations are read-only and will not modify working directories or current branches.

Basic Usage

from gitpandas import ProjectDirectory
from gitpandas.cache import DiskCache

# Create project directory with shared cache
cache = DiskCache('/tmp/project_cache.gz', max_keys=500)
project = ProjectDirectory('/path/to/repos', cache_backend=cache)

# Perform bulk operations
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=True
)

print(f"Processed {result['repositories_processed']} repositories")
print(f"Fetch summary: {result['summary']['fetch_successful']} successful")
print(f"Cache summary: {result['summary']['cache_successful']} successful")

Advanced Bulk Operations

# Customize bulk operations
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=True,
    remote_name='upstream',
    prune=True,
    dry_run=False,
    cache_methods=['commit_history', 'blame'],
    limit=200,
    ignore_globs=['*.log']
)

# Check individual repository results
for repo_name, fetch_result in result['fetch_results'].items():
    if not fetch_result['success']:
        print(f"Fetch failed for {repo_name}: {fetch_result['error']}")

for repo_name, cache_result in result['cache_results'].items():
    print(f"{repo_name}: {cache_result['cache_entries_created']} cache entries")

Parallel Processing

Bulk operations support parallel processing when joblib is available:

# Enable parallel processing (default when joblib available)
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=True  # Uses all available CPU cores
)

# Disable parallel processing for sequential execution
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=False
)

Best Practices

Remote Operations

  1. Regular Fetching: Use safe_fetch_remote regularly to keep repositories current

  2. Dry Run First: Use dry runs to preview fetch operations

  3. Error Handling: Always check return values for errors

  4. Remote Validation: Verify remotes exist before fetching

Cache Warming

  1. Persistent Caching: Use DiskCache for long-term cache persistence

  2. Appropriate Cache Size: Set a reasonable max_keys based on your usage

  3. Selective Warming: Only warm methods you actually use

  4. Regular Warming: Re-warm caches when data becomes stale

Bulk Operations

  1. Shared Caches: Use shared cache backends across repositories

  2. Parallel Processing: Enable parallel processing for multiple repositories

  3. Custom Parameters: Tailor operations to your specific needs

  4. Error Isolation: Handle errors at the repository level

Error Handling

All remote operations and cache warming methods provide comprehensive error information:

# Safe fetch error handling
result = repo.safe_fetch_remote()
if not result['success']:
    if result['remote_exists']:
        print(f"Fetch failed: {result['error']}")
    else:
        print(f"No remote configured: {result['message']}")

# Cache warming error handling
result = repo.warm_cache()
if not result['success']:
    print(f"Failed methods: {result['methods_failed']}")
    for error in result['errors']:
        print(f"Error: {error}")

# Bulk operation error handling
result = project.bulk_fetch_and_warm(fetch_remote=True, warm_cache=True)
for repo_name, repo_result in result['fetch_results'].items():
    if not repo_result['success']:
        print(f"Repository {repo_name} failed: {repo_result.get('error', 'Unknown error')}")

Examples

Complete examples demonstrating these features can be found in the examples/ directory:

  • examples/remote_fetch_and_cache_warming.py: Comprehensive demonstration of all features

  • examples/cache_timestamps.py: Cache timestamp and metadata examples

Return Value Reference

The safe_fetch_remote method returns a dictionary with these keys:

  • success (bool): Whether the fetch was successful

  • message (str): Status message or description

  • remote_exists (bool): Whether the specified remote exists

  • changes_available (bool): Whether new changes were fetched

  • error (str or None): Error message if fetch failed

The warm_cache method returns a dictionary with these keys:

  • success (bool): Whether cache warming was successful

  • methods_executed (list): List of methods that were executed

  • methods_failed (list): List of methods that failed

  • cache_entries_created (int): Number of cache entries created

  • execution_time (float): Total execution time in seconds

  • errors (list): List of error messages for failed methods

The bulk_fetch_and_warm method returns a dictionary with these keys:

  • success (bool): Whether the overall operation was successful

  • repositories_processed (int): Number of repositories processed

  • fetch_results (dict): Per-repository fetch results

  • cache_results (dict): Per-repository cache warming results

  • execution_time (float): Total execution time in seconds

  • summary (dict): Summary statistics including:

    • fetch_successful (int): Number of successful fetches

    • fetch_failed (int): Number of failed fetches

    • cache_successful (int): Number of successful cache warming operations

    • cache_failed (int): Number of failed cache warming operations

    • repositories_with_remotes (int): Number of repositories with remotes

    • total_cache_entries_created (int): Total cache entries created across all repositories
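The per-repository structure above is convenient to post-process. The helper below is a hypothetical sketch that assumes only the documented bulk_fetch_and_warm() result shape, where each per-repository dict carries its own success flag:

```python
def failed_repositories(bulk_result):
    """Collect repository names whose fetch or cache warming failed, by phase."""
    failures = {"fetch": [], "cache": []}
    for name, res in bulk_result.get("fetch_results", {}).items():
        if not res.get("success", False):
            failures["fetch"].append(name)
    for name, res in bulk_result.get("cache_results", {}).items():
        if not res.get("success", False):
            failures["cache"].append(name)
    return failures

# A sample result shaped like the documented return value
sample = {
    "fetch_results": {"repo_a": {"success": True}, "repo_b": {"success": False}},
    "cache_results": {"repo_a": {"success": True}},
}
print(failed_repositories(sample))  # {'fetch': ['repo_b'], 'cache': []}
```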