Repository

The Repository class provides an interface for analyzing a single Git repository. It can be created from either a local path or a remote URL.

Overview

The Repository class offers methods for:

  • Commit history analysis with filtering options

  • File change tracking and blame information

  • Branch existence checking and repository status

  • Bus factor calculation and repository metrics

  • Punchcard statistics generation

Creating a Repository

You can create a Repository object in two ways:

Local Repository

Create a Repository from a local Git repository:

from gitpandas import Repository
repo = Repository(
    working_dir='/path/to/repo/',
    verbose=True,
    default_branch='main'  # Optional, will auto-detect if not specified
)

The directory must contain a .git directory. Subdirectories are not searched.

Remote Repository

Create a Repository from a remote Git repository:

from gitpandas import Repository
repo = Repository(
    working_dir='git://github.com/user/repo.git',
    verbose=True,
    default_branch='main'  # Optional, will auto-detect if not specified
)

The repository will be cloned locally into a temporary directory. This can be slow for large repositories.

Available Methods

Core Analysis

# Commit history analysis
repo.commit_history(
    branch=None,          # Branch to analyze
    limit=None,           # Maximum number of commits
    days=None,            # Limit to last N days
    ignore_globs=None,    # Files to ignore
    include_globs=None    # Files to include
)

# File change history
repo.file_change_history(
    branch=None,
    limit=None,
    days=None,
    ignore_globs=None,
    include_globs=None
)

# Blame analysis
repo.blame(
    rev="HEAD",          # Revision to analyze
    committer=True,      # Group by committer (False for author)
    by="repository",     # Group by 'repository' or 'file'
    ignore_globs=None,
    include_globs=None
)

# Bus factor analysis
repo.bus_factor(
    by="repository",     # How to group results ('repository' or 'file')
    ignore_globs=None,
    include_globs=None
)

# Commit pattern analysis
repo.punchcard(
    branch=None,
    limit=None,
    days=None,
    by=None,            # Additional grouping
    normalize=None,     # Normalize values
    ignore_globs=None,
    include_globs=None
)
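
This section does not define how bus_factor arrives at its number. As a rough illustration only, one common definition is the smallest number of contributors who together account for at least half of the blamed lines. The sketch below applies that definition to a plain dictionary. The helper name bus_factor_from_blame, the 50% threshold, and the input shape are assumptions made for this example and are not taken from gitpandas itself.

```python
def bus_factor_from_blame(loc_by_committer, threshold=0.5):
    """Smallest number of committers whose lines reach `threshold` of the total.

    `loc_by_committer` maps a committer name to the number of lines blamed
    to them. This input shape is hypothetical and chosen for the sketch.
    """
    total = sum(loc_by_committer.values())
    if total <= 0:
        return 0  # no lines at all, so nobody is critical

    covered = 0
    # Walk line counts from the largest owner to the smallest.
    ranked = sorted(loc_by_committer.values(), reverse=True)
    for count, lines in enumerate(ranked, start=1):
        covered += lines
        if covered >= threshold * total:
            return count
    return len(ranked)
```

A low result means that knowledge of the code is concentrated in very few people.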

Repository Information

# List files in repository
repo.list_files(rev="HEAD")

# Check branch existence
repo.has_branch(branch)

# Check if repository is bare
repo.is_bare()

# Check for coverage information
repo.has_coverage()
repo.coverage()

# Get specific commit content
repo.get_commit_content(
    rev,                # Revision to analyze
    ignore_globs=None,
    include_globs=None
)

Common Parameters

Most analysis methods support these filtering parameters:

  • branch: Branch to analyze (defaults to repository’s default branch)

  • limit: Maximum number of commits to analyze

  • days: Limit analysis to last N days

  • ignore_globs: List of glob patterns for files to ignore

  • include_globs: List of glob patterns for files to include

  • by: How to group results (usually ‘repository’ or ‘file’)

API Reference

class gitpandas.repository.Repository(working_dir=None, verbose=False, tmp_dir=None, cache_backend=None, labels_to_add=None, default_branch=None)[source]

Bases: object

A class for analyzing a single git repository.

This class provides functionality to analyze a git repository, whether it is a local repository or a remote repository that needs to be cloned. It offers methods for analyzing commit history, blame information, file changes, and other git metrics.

Parameters:
  • working_dir (Optional[str]) – Path to the git repository:
      - If None: uses the current working directory
      - If a local path: the path must contain a .git directory
      - If a git URL: the repository is cloned to a temporary directory

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

  • tmp_dir (Optional[str]) – Directory to clone remote repositories into. If not provided, a temporary directory is created.

  • cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache

  • labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames

  • default_branch (Optional[str]) – Name of the default branch to use. If None, the constructor tries to detect ‘main’ or ‘master’ and raises ValueError if neither exists.

Variables:
  • verbose (bool) – Whether verbose output is enabled

  • git_dir (str) – Path to the git repository

  • repo (git.Repo) – GitPython Repo instance

  • cache_backend (Optional[object]) – Cache backend being used

  • _labels_to_add (List[str]) – Labels to add to DataFrames

  • _git_repo_name (Optional[str]) – Repository name for remote repos

  • default_branch (str) – Name of the default branch

Raises:

ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists
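
The fallback rule for default_branch can be pictured as a small function over the list of branch names. The following is a minimal sketch of that rule, not the library's code: the function name detect_default_branch is hypothetical, and checking ‘main’ before ‘master’ is an assumption, since the text above does not state the order.

```python
def detect_default_branch(branch_names, requested=None):
    """Pick the branch to analyze, following the documented fallback rule."""
    if requested is not None:
        return requested  # an explicitly supplied default_branch wins

    # Assumption: 'main' is tried before 'master'.
    for candidate in ("main", "master"):
        if candidate in branch_names:
            return candidate

    raise ValueError(
        "default_branch is None and neither 'main' nor 'master' exists"
    )
```

Passing default_branch explicitly avoids the ValueError for repositories that use another name, such as a develop or trunk branch.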

Examples

>>> # Create from local repository
>>> repo = Repository('/path/to/repo')
>>> # Create from remote repository
>>> repo = Repository('git://github.com/user/repo.git')

Note

When using remote repositories, they will be cloned to temporary directories. This can be slow for large repositories.

__init__(working_dir=None, verbose=False, tmp_dir=None, cache_backend=None, labels_to_add=None, default_branch=None)[source]

Initialize a Repository instance.

Parameters:
  • working_dir (Optional[str]) – Path to the git repository:
      - If None: uses the current working directory
      - If a local path: the path must contain a .git directory
      - If a git URL: the repository is cloned to a temporary directory

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

  • tmp_dir (Optional[str]) – Directory to clone remote repositories into. If not provided, a temporary directory is created.

  • cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache

  • labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames

  • default_branch (Optional[str]) – Name of the default branch to use. If None, the constructor tries to detect ‘main’ or ‘master’ and raises ValueError if neither exists.

Raises:

ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists

__del__()[source]

Cleanup method called when the object is destroyed.

Cleans up any temporary directories created for cloned repositories.

is_bare(*args, **kwargs)
has_coverage(*args, **kwargs)
coverage(*args, **kwargs)
hours_estimate(*args, **kwargs)
commit_history(*args, **kwargs)
file_change_history(*args, **kwargs)
_process_commit_for_file_history(commit, history, ignore_globs, include_globs, skip_broken)[source]

Helper method to process a commit for file change history.

Parameters:
  • commit – The commit object to process

  • history – List to append the file change data to

  • ignore_globs – List of glob patterns for files to ignore

  • include_globs – List of glob patterns for files to include

  • skip_broken – Whether to skip errors for specific files

file_change_rates(*args, **kwargs)
blame(*args, **kwargs)
revs(*args, **kwargs)
cumulative_blame(*args, **kwargs)
parallel_cumulative_blame(*args, **kwargs)
branches(*args, **kwargs)
get_branches_by_commit(*args, **kwargs)
commits_in_tags(*args, **kwargs)
tags(*args, **kwargs)
property repo_name
_repo_name()[source]

Returns the name of the repository.

For local repositories, uses the name of the directory containing the .git folder. For remote repositories, extracts the name from the URL.

Returns:

Name of the repository, or ‘unknown_repo’ if name can’t be determined

Return type:

str

Note

This is an internal method primarily used to provide consistent repository names in DataFrame outputs.
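
To make the naming rule concrete, here is a small sketch that derives a name from either a local path or a clone URL. It is an illustration under stated assumptions: the helper guess_repo_name is hypothetical, it only handles forward-slash separators, and the real method may resolve names differently.

```python
import posixpath


def guess_repo_name(location):
    """Derive a display name from a local path or a clone URL."""
    if not location:
        return "unknown_repo"

    # Strip trailing slashes so '/path/to/repo/' and '/path/to/repo' agree.
    tail = posixpath.basename(location.rstrip("/"))

    # Clone URLs usually end in '.git'; drop that suffix.
    if tail.endswith(".git"):
        tail = tail[: -len(".git")]

    return tail or "unknown_repo"
```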

_add_labels_to_df(df)[source]

Adds configured labels to a DataFrame.

Adds the repository name and any additional configured labels to the DataFrame. This ensures consistent labeling across all DataFrame outputs.

Parameters:

df (pandas.DataFrame) – DataFrame to add labels to

Returns:

The input DataFrame with additional label columns:
  • repository (str): Repository name

  • label0..labelN: Values from labels_to_add

Return type:

pandas.DataFrame

Note

This is an internal helper method used by all public methods that return DataFrames.

__str__()[source]

Returns a human-readable string representation of the repository.

Returns:

String in format ‘git repository: {name} at: {path}’

Return type:

str

get_commit_content(*args, **kwargs)
get_file_content(*args, **kwargs)
list_files(*args, **kwargs)
__repr__()[source]

Returns a unique string representation of the repository.

Returns:

The absolute path to the repository

Return type:

str

bus_factor(*args, **kwargs)
file_owner(*args, **kwargs)
_get_last_edit_date(file_path, rev='HEAD')[source]

Get the last edit date for a file at a given revision.

Parameters:
  • file_path (str) – Path to the file

  • rev (str) – Revision to check

Returns:

Last edit date for the file

Return type:

datetime

punchcard(*args, **kwargs)
has_branch(*args, **kwargs)
file_detail(*args, **kwargs)
time_between_revs(rev1, rev2)[source]

Calculates the time difference in days between two revisions.

Parameters:
  • rev1 (str) – The first revision (commit hash or tag).

  • rev2 (str) – The second revision (commit hash or tag).

Returns:

The absolute time difference in days between the two revisions.

Return type:

float

Note

The result is always non-negative (absolute value).
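
The arithmetic behind a non-negative difference in days can be sketched with the standard library. The helper days_between below is hypothetical and takes two datetime objects directly; which commit timestamp the library reads for each revision is not stated here, so that part is left out.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_between(first, second):
    """Absolute difference between two datetimes, in fractional days."""
    return abs((second - first).total_seconds()) / SECONDS_PER_DAY


# Example: two timestamps 36 hours apart give 1.5 days in either order.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
gap = days_between(end, start)
```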

diff_stats_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)[source]

Computes diff statistics between two revisions.

Calculates the total insertions, deletions, net line change, and number of files changed between two arbitrary revisions (commits or tags). Optionally filters files using glob patterns.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

A dictionary with keys:
  • ’insertions’ (int): Total lines inserted.

  • ’deletions’ (int): Total lines deleted.

  • ’net’ (int): Net lines changed (insertions - deletions).

  • ’files_changed’ (int): Number of files changed.

  • ’files’ (List[str]): List of changed file paths.

Return type:

dict

Note

Binary files or files that cannot be parsed are skipped. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
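
The filter rule and the shape of the returned dictionary can be illustrated without Git at all. The sketch below is not the library's implementation: keep_file and summarize_diff are hypothetical helpers, matching with fnmatchcase is an assumption about glob semantics, and the per-file counts are supplied by hand.

```python
from fnmatch import fnmatchcase


def keep_file(path, ignore_globs=None, include_globs=None):
    """True if path matches an include glob (when given) and no ignore glob."""
    if include_globs and not any(fnmatchcase(path, g) for g in include_globs):
        return False
    if ignore_globs and any(fnmatchcase(path, g) for g in ignore_globs):
        return False
    return True


def summarize_diff(per_file, ignore_globs=None, include_globs=None):
    """Aggregate {path: (insertions, deletions)} into the documented keys."""
    # Sorting is a choice made for this sketch so the output is stable.
    files = sorted(
        p for p in per_file if keep_file(p, ignore_globs, include_globs)
    )
    insertions = sum(per_file[p][0] for p in files)
    deletions = sum(per_file[p][1] for p in files)
    return {
        "insertions": insertions,
        "deletions": deletions,
        "net": insertions - deletions,
        "files_changed": len(files),
        "files": files,
    }
```

Note that with fnmatch-style matching, * also matches path separators, so a pattern such as *.py selects Python files in any directory.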

committers_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)[source]

Finds unique committers and authors between two revisions.

Iterates through all commits between two revisions (exclusive of rev1, inclusive of rev2) and returns the unique committers and authors who contributed, filtered by file globs if provided.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

A dictionary with keys:
  • ’committers’ (List[str]): Sorted list of unique committer names.

  • ’authors’ (List[str]): Sorted list of unique author names.

Return type:

dict

Note

Only commits that touch files matching the glob filters are considered. The range is interpreted as Git does: rev1..rev2 means commits reachable from rev2 but not rev1.
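
The range rule (reachable from rev2 but not from rev1) is a set difference over the commit graph. The following sketch shows it on a hand-written graph; the function names and the parents and people dictionaries are hypothetical stand-ins for real commit data, and glob filtering is omitted for brevity.

```python
def commits_in_range(parents, rev1, rev2):
    """Commits reachable from rev2 but not from rev1 (Git's rev1..rev2)."""

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            commit = stack.pop()
            if commit not in seen:
                seen.add(commit)
                stack.extend(parents.get(commit, ()))
        return seen

    return reachable(rev2) - reachable(rev1)


def people_in_range(parents, people, rev1, rev2):
    """Sorted unique committers and authors for the commits in rev1..rev2."""
    commits = commits_in_range(parents, rev1, rev2)
    return {
        "committers": sorted({people[c]["committer"] for c in commits}),
        "authors": sorted({people[c]["author"] for c in commits}),
    }
```

On a linear history a -> b -> c -> d, the range b..d contains c and d: rev1 itself is excluded and rev2 is included.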

files_changed_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)[source]

Lists files changed between two revisions.

Returns a sorted list of all files changed between two arbitrary revisions (commits or tags), optionally filtered by glob patterns.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

Sorted list of file paths changed between the two revisions.

Return type:

List[str]

Note

If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
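
The between-revision methods documented above share the same rev1, rev2, and glob arguments, so they combine naturally into one report. The sketch below relies only on the signatures and return keys listed in this reference; the wrapper release_report and the keys of its result are hypothetical.

```python
def release_report(repo, rev1, rev2, ignore_globs=None, include_globs=None):
    """Collect the between-revision metrics for one pair of revisions.

    `repo` is expected to be a gitpandas Repository, or any object that
    exposes the same four methods.
    """
    filters = {"ignore_globs": ignore_globs, "include_globs": include_globs}
    stats = repo.diff_stats_between_revs(rev1, rev2, **filters)
    people = repo.committers_between_revs(rev1, rev2, **filters)
    return {
        "days": repo.time_between_revs(rev1, rev2),
        "net_lines": stats["net"],
        "files": repo.files_changed_between_revs(rev1, rev2, **filters),
        "committers": people["committers"],
        "authors": people["authors"],
    }
```

With a real repository this might be called as release_report(Repository('/path/to/repo'), 'v1.0.0', 'v1.1.0', include_globs=['*.py']), where the tag names are placeholders.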

release_tag_summary(*args, **kwargs)
safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)[source]

Safely fetch changes from remote repository.

Fetches the latest changes from a remote repository without modifying the working directory. This is a read-only operation that only updates remote-tracking branches.

Parameters:
  • remote_name (str, optional) – Name of remote to fetch from. Defaults to ‘origin’.

  • prune (bool, optional) – Remove remote-tracking branches that no longer exist on remote. Defaults to False.

  • dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.

Returns:

Fetch results with keys:
  • success (bool): Whether the fetch was successful

  • message (str): Status message or error description

  • remote_exists (bool): Whether the specified remote exists

  • changes_available (bool): Whether new changes were fetched

  • error (Optional[str]): Error message if fetch failed

Return type:

dict

Note

This method is safe as it only fetches remote changes and never modifies the working directory or current branch. It will not perform any merges, rebases, or checkouts.
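
Because the result is a plain dictionary, callers usually branch on its keys. The sketch below shows one way to turn the documented keys into a status line; the helper describe_fetch and its messages are hypothetical, and it makes no assumption about how dry_run affects the values.

```python
def describe_fetch(result):
    """Summarize a safe_fetch_remote result dict as a short status string."""
    if not result.get("remote_exists", False):
        return "remote not found"
    if not result.get("success", False):
        # Prefer the specific error, fall back to the general message.
        detail = result.get("error") or result.get("message") or "unknown"
        return "fetch failed: " + str(detail)
    if result.get("changes_available", False):
        return "fetched new changes"
    return "already up to date"
```

A typical call would be describe_fetch(repo.safe_fetch_remote(remote_name='origin', prune=True)).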

warm_cache(methods=None, **kwargs)[source]

Pre-populate cache with commonly used data.

Executes a set of commonly used repository analysis methods to populate the cache, improving performance for subsequent calls. Only methods that support caching will be executed.

Parameters:
  • methods (Optional[List[str]]) – List of method names to pre-warm. If None, uses a default set of commonly used methods. Available methods:
      - ‘commit_history’: Load commit history
      - ‘branches’: Load branch information
      - ‘tags’: Load tag information
      - ‘blame’: Load blame information
      - ‘file_detail’: Load file details
      - ‘list_files’: Load file listing
      - ‘file_change_rates’: Load file change statistics

  • **kwargs – Additional keyword arguments to pass to the methods. Common arguments include:
      - branch: Branch to analyze (default: repository’s default branch)
      - limit: Limit number of commits to analyze
      - ignore_globs: List of glob patterns to ignore
      - include_globs: List of glob patterns to include

Returns:

Results of cache warming operations with keys:
  • success (bool): Whether cache warming was successful

  • methods_executed (List[str]): List of methods that were executed

  • methods_failed (List[str]): List of methods that failed

  • cache_entries_created (int): Number of cache entries created

  • execution_time (float): Total execution time in seconds

  • errors (List[str]): List of error messages for failed methods

Return type:

dict

Note

This method will only execute methods if a cache backend is configured. If no cache backend is available, it will return immediately with a success status but no methods executed.
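
A short wrapper shows how the call and its result dictionary might be used together. This is a sketch based only on the signature and keys documented above; the function warm_and_summarize, its default method list, and the wording of its output are hypothetical.

```python
def warm_and_summarize(repo, methods=("commit_history", "blame"), **kwargs):
    """Call repo.warm_cache and reduce the documented result to one line."""
    result = repo.warm_cache(methods=list(methods), **kwargs)
    executed = result.get("methods_executed", [])
    failed = result.get("methods_failed", [])
    if not executed and not failed:
        # Documented behaviour without a cache backend: success, nothing run.
        return "nothing warmed"
    return "warmed {}, failed {}, {} entries in {:.2f}s".format(
        len(executed),
        len(failed),
        result.get("cache_entries_created", 0),
        result.get("execution_time", 0.0),
    )
```

Extra keyword arguments such as limit=100 are forwarded unchanged to warm_cache.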

invalidate_cache(keys=None, pattern=None)[source]

Invalidate specific cache entries or all cache entries for this repository.

Parameters:
  • keys (Optional[List[str]]) – List of specific cache keys to invalidate

  • pattern (Optional[str]) – Pattern to match cache keys (supports * wildcard)

Returns:

Number of cache entries invalidated

Return type:

int

Note

If both keys and pattern are None, all cache entries for this repository are invalidated. Cache keys are automatically prefixed with repository name.
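
The selection rule can be sketched over a plain set of key strings. Everything concrete here is an assumption made for illustration: the helper select_keys, the "repo:method" key format, the use of fnmatchcase for the * wildcard, and treating keys and pattern as a union when both are given.

```python
from fnmatch import fnmatchcase


def select_keys(all_keys, keys=None, pattern=None):
    """Return the cache keys that an invalidation call would target."""
    if keys is None and pattern is None:
        return sorted(all_keys)  # no filter: everything for this repository

    selected = set()
    if keys is not None:
        # Explicit keys only count if they are actually cached.
        selected.update(k for k in keys if k in all_keys)
    if pattern is not None:
        # fnmatchcase also understands '?' and '[...]', beyond plain '*'.
        selected.update(k for k in all_keys if fnmatchcase(k, pattern))
    return sorted(selected)
```

The length of the returned list corresponds to the integer count that the method reports.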

get_cache_stats()[source]

Get cache statistics for this repository.

Returns:

Cache statistics including repository-specific and global cache information

Return type:

dict

class gitpandas.repository.GitFlowRepository[source]

Bases: Repository

A special case of Repository for projects that follow the git-flow branching model, so the branching scheme is known in advance.

__del__()

Cleanup method called when the object is destroyed.

Cleans up any temporary directories created for cloned repositories.

__repr__()

Returns a unique string representation of the repository.

Returns:

The absolute path to the repository

Return type:

str

__str__()

Returns a human-readable string representation of the repository.

Returns:

String in format ‘git repository: {name} at: {path}’

Return type:

str

_add_labels_to_df(df)

Adds configured labels to a DataFrame.

Adds the repository name and any additional configured labels to the DataFrame. This ensures consistent labeling across all DataFrame outputs.

Parameters:

df (pandas.DataFrame) – DataFrame to add labels to

Returns:

The input DataFrame with additional label columns:
  • repository (str): Repository name

  • label0..labelN: Values from labels_to_add

Return type:

pandas.DataFrame

Note

This is an internal helper method used by all public methods that return DataFrames.

_get_last_edit_date(file_path, rev='HEAD')

Get the last edit date for a file at a given revision.

Parameters:
  • file_path (str) – Path to the file

  • rev (str) – Revision to check

Returns:

Last edit date for the file

Return type:

datetime

_process_commit_for_file_history(commit, history, ignore_globs, include_globs, skip_broken)

Helper method to process a commit for file change history.

Parameters:
  • commit – The commit object to process

  • history – List to append the file change data to

  • ignore_globs – List of glob patterns for files to ignore

  • include_globs – List of glob patterns for files to include

  • skip_broken – Whether to skip errors for specific files

_repo_name()

Returns the name of the repository.

For local repositories, uses the name of the directory containing the .git folder. For remote repositories, extracts the name from the URL.

Returns:

Name of the repository, or ‘unknown_repo’ if name can’t be determined

Return type:

str

Note

This is an internal method primarily used to provide consistent repository names in DataFrame outputs.

blame(*args, **kwargs)
branches(*args, **kwargs)
bus_factor(*args, **kwargs)
commit_history(*args, **kwargs)
commits_in_tags(*args, **kwargs)
committers_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)

Finds unique committers and authors between two revisions.

Iterates through all commits between two revisions (exclusive of rev1, inclusive of rev2) and returns the unique committers and authors who contributed, filtered by file globs if provided.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

A dictionary with keys:
  • ’committers’ (List[str]): Sorted list of unique committer names.

  • ’authors’ (List[str]): Sorted list of unique author names.

Return type:

dict

Note

Only commits that touch files matching the glob filters are considered. The range is interpreted as Git does: rev1..rev2 means commits reachable from rev2 but not rev1.

coverage(*args, **kwargs)
cumulative_blame(*args, **kwargs)
diff_stats_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)

Computes diff statistics between two revisions.

Calculates the total insertions, deletions, net line change, and number of files changed between two arbitrary revisions (commits or tags). Optionally filters files using glob patterns.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

A dictionary with keys:
  • ’insertions’ (int): Total lines inserted.

  • ’deletions’ (int): Total lines deleted.

  • ’net’ (int): Net lines changed (insertions - deletions).

  • ’files_changed’ (int): Number of files changed.

  • ’files’ (List[str]): List of changed file paths.

Return type:

dict

Note

Binary files or files that cannot be parsed are skipped. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.

file_change_history(*args, **kwargs)
file_change_rates(*args, **kwargs)
file_detail(*args, **kwargs)
file_owner(*args, **kwargs)
files_changed_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)

Lists files changed between two revisions.

Returns a sorted list of all files changed between two arbitrary revisions (commits or tags), optionally filtered by glob patterns.

Parameters:
  • rev1 (str) – The base revision (commit hash or tag).

  • rev2 (str) – The target revision (commit hash or tag).

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore.

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include.

Returns:

Sorted list of file paths changed between the two revisions.

Return type:

List[str]

Note

If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.

get_branches_by_commit(*args, **kwargs)
get_cache_stats()

Get cache statistics for this repository.

Returns:

Cache statistics including repository-specific and global cache information

Return type:

dict

get_commit_content(*args, **kwargs)
get_file_content(*args, **kwargs)
has_branch(*args, **kwargs)
has_coverage(*args, **kwargs)
hours_estimate(*args, **kwargs)
invalidate_cache(keys=None, pattern=None)

Invalidate specific cache entries or all cache entries for this repository.

Parameters:
  • keys (Optional[List[str]]) – List of specific cache keys to invalidate

  • pattern (Optional[str]) – Pattern to match cache keys (supports * wildcard)

Returns:

Number of cache entries invalidated

Return type:

int

Note

If both keys and pattern are None, all cache entries for this repository are invalidated. Cache keys are automatically prefixed with repository name.

is_bare(*args, **kwargs)
list_files(*args, **kwargs)
parallel_cumulative_blame(*args, **kwargs)
punchcard(*args, **kwargs)
release_tag_summary(*args, **kwargs)
property repo_name
revs(*args, **kwargs)
safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)

Safely fetch changes from remote repository.

Fetches the latest changes from a remote repository without modifying the working directory. This is a read-only operation that only updates remote-tracking branches.

Parameters:
  • remote_name (str, optional) – Name of remote to fetch from. Defaults to ‘origin’.

  • prune (bool, optional) – Remove remote-tracking branches that no longer exist on remote. Defaults to False.

  • dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.

Returns:

Fetch results with keys:
  • success (bool): Whether the fetch was successful

  • message (str): Status message or error description

  • remote_exists (bool): Whether the specified remote exists

  • changes_available (bool): Whether new changes were fetched

  • error (Optional[str]): Error message if fetch failed

Return type:

dict

Note

This method is safe as it only fetches remote changes and never modifies the working directory or current branch. It will not perform any merges, rebases, or checkouts.

tags(*args, **kwargs)
time_between_revs(rev1, rev2)

Calculates the time difference in days between two revisions.

Parameters:
  • rev1 (str) – The first revision (commit hash or tag).

  • rev2 (str) – The second revision (commit hash or tag).

Returns:

The absolute time difference in days between the two revisions.

Return type:

float

Note

The result is always non-negative (absolute value).

warm_cache(methods=None, **kwargs)

Pre-populate cache with commonly used data.

Executes a set of commonly used repository analysis methods to populate the cache, improving performance for subsequent calls. Only methods that support caching will be executed.

Parameters:
  • methods (Optional[List[str]]) – List of method names to pre-warm. If None, uses a default set of commonly used methods. Available methods:
      - ‘commit_history’: Load commit history
      - ‘branches’: Load branch information
      - ‘tags’: Load tag information
      - ‘blame’: Load blame information
      - ‘file_detail’: Load file details
      - ‘list_files’: Load file listing
      - ‘file_change_rates’: Load file change statistics

  • **kwargs – Additional keyword arguments to pass to the methods. Common arguments include:
      - branch: Branch to analyze (default: repository’s default branch)
      - limit: Limit number of commits to analyze
      - ignore_globs: List of glob patterns to ignore
      - include_globs: List of glob patterns to include

Returns:

Results of cache warming operations with keys:
  • success (bool): Whether cache warming was successful

  • methods_executed (List[str]): List of methods that were executed

  • methods_failed (List[str]): List of methods that failed

  • cache_entries_created (int): Number of cache entries created

  • execution_time (float): Total execution time in seconds

  • errors (List[str]): List of error messages for failed methods

Return type:

dict

Note

This method will only execute methods if a cache backend is configured. If no cache backend is available, it will return immediately with a success status but no methods executed.

__init__()[source]

Initialize a Repository instance.

Parameters:
  • working_dir (Optional[str]) – Path to the git repository:
      - If None: uses the current working directory
      - If a local path: the path must contain a .git directory
      - If a git URL: the repository is cloned to a temporary directory

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

  • tmp_dir (Optional[str]) – Directory to clone remote repositories into. If not provided, a temporary directory is created.

  • cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache

  • labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames

  • default_branch (Optional[str]) – Name of the default branch to use. If None, the constructor tries to detect ‘main’ or ‘master’ and raises ValueError if neither exists.

Raises:

ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists