Repository
The Repository class provides an interface for analyzing a single Git repository. It can be created from either a local or remote repository.
Overview
The Repository class offers methods for:
- Commit history analysis with filtering options
- File change tracking and blame information
- Branch existence checking and repository status
- Bus factor calculation and repository metrics
- Punchcard statistics generation
Creating a Repository
You can create a Repository object in two ways:
Local Repository
Create a Repository from a local Git repository:
from gitpandas import Repository

repo = Repository(
    working_dir='/path/to/repo/',
    verbose=True,
    default_branch='main'  # Optional, will auto-detect if not specified
)
The directory must contain a .git directory. Subdirectories are not searched.
Remote Repository
Create a Repository from a remote Git repository:
from gitpandas import Repository

repo = Repository(
    working_dir='git://github.com/user/repo.git',
    verbose=True,
    default_branch='main'  # Optional, will auto-detect if not specified
)
The repository will be cloned locally into a temporary directory. This can be slow for large repositories.
Available Methods
Core Analysis
# Commit history analysis
repo.commit_history(
    branch=None,  # Branch to analyze
    limit=None,  # Maximum number of commits
    days=None,  # Limit to last N days
    ignore_globs=None,  # Files to ignore
    include_globs=None  # Files to include
)

# File change history
repo.file_change_history(
    branch=None,
    limit=None,
    days=None,
    ignore_globs=None,
    include_globs=None
)

# Blame analysis
repo.blame(
    rev="HEAD",  # Revision to analyze
    committer=True,  # Group by committer (False for author)
    by="repository",  # Group by 'repository' or 'file'
    ignore_globs=None,
    include_globs=None
)

# Bus factor analysis
repo.bus_factor(
    by="repository",  # How to group results ('repository' or 'file')
    ignore_globs=None,
    include_globs=None
)

# Commit pattern analysis
repo.punchcard(
    branch=None,
    limit=None,
    days=None,
    by=None,  # Additional grouping
    normalize=None,  # Normalize values
    ignore_globs=None,
    include_globs=None
)
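A bus factor is commonly understood as the smallest number of contributors who together own a large share of the code. The sketch below illustrates that idea on a plain dictionary of line counts, such as one might derive from blame output. The function name `bus_factor_from_loc`, the 50% threshold, and the input shape are assumptions made for illustration and are not taken from gitpandas itself.

```python
def bus_factor_from_loc(loc_by_committer, threshold=0.5):
    """Smallest number of committers whose lines cover `threshold` of the total.

    Hypothetical helper: the 50% threshold and the dict input are assumptions.
    """
    total = sum(loc_by_committer.values())
    if total == 0:
        return 0
    covered = 0
    # Walk committers from the largest owner to the smallest.
    ranked = sorted(loc_by_committer.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(loc_by_committer)


ownership = {"alice": 600, "bob": 300, "carol": 100}
print(bus_factor_from_loc(ownership))
```

A low value suggests that knowledge of the code is concentrated in very few people.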
Repository Information
# List files in repository
repo.list_files(rev="HEAD")
# Check branch existence
repo.has_branch(branch)
# Check if repository is bare
repo.is_bare()
# Check for coverage information
repo.has_coverage()
repo.coverage()
# Get specific commit content
repo.get_commit_content(
    rev,  # Revision to analyze
    ignore_globs=None,
    include_globs=None
)
Common Parameters
Most analysis methods support these filtering parameters:
- branch: Branch to analyze (defaults to repository’s default branch)
- limit: Maximum number of commits to analyze
- days: Limit analysis to last N days
- ignore_globs: List of glob patterns for files to ignore
- include_globs: List of glob patterns for files to include
- by: How to group results (usually ‘repository’ or ‘file’)
API Reference
- class gitpandas.repository.Repository(working_dir=None, verbose=False, tmp_dir=None, cache_backend=None, labels_to_add=None, default_branch=None)
Bases: object
A class for analyzing a single git repository.
This class provides functionality to analyze a git repository, whether it is a local repository or a remote repository that needs to be cloned. It offers methods for analyzing commit history, blame information, file changes, and other git metrics.
- Parameters:
working_dir (Optional[str]) – Path to the git repository:
- If None: uses the current working directory
- If a local path: the path must contain a .git directory
- If a git URL: the repository will be cloned to a temporary directory
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.
cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache
labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames
default_branch (Optional[str]) – Name of the default branch to use. If None, will try to detect ‘main’ or ‘master’, and if neither exists, will raise ValueError.
- Variables:
verbose (bool) – Whether verbose output is enabled
git_dir (str) – Path to the git repository
repo (git.Repo) – GitPython Repo instance
cache_backend (Optional[object]) – Cache backend being used
_labels_to_add (List[str]) – Labels to add to DataFrames
_git_repo_name (Optional[str]) – Repository name for remote repos
default_branch (str) – Name of the default branch
- Raises:
ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists
Examples
>>> # Create from local repository
>>> repo = Repository('/path/to/repo')
>>> # Create from remote repository
>>> repo = Repository('git://github.com/user/repo.git')
Note
When using remote repositories, they will be cloned to temporary directories. This can be slow for large repositories.
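The default-branch rule described above (use the explicit value if given, otherwise look for ‘main’ or ‘master’, otherwise raise ValueError) can be sketched as a small standalone function. The name `detect_default_branch` and the choice to check ‘main’ before ‘master’ are assumptions made for illustration; the actual detection code may differ.

```python
def detect_default_branch(branch_names, default_branch=None):
    """Return the branch to use, preferring an explicit choice.

    Hypothetical helper: checking 'main' before 'master' is an assumption.
    """
    if default_branch is not None:
        return default_branch
    for candidate in ("main", "master"):
        if candidate in branch_names:
            return candidate
    raise ValueError(
        "Neither 'main' nor 'master' exists; pass default_branch explicitly."
    )


print(detect_default_branch(["develop", "master"]))
```

Passing default_branch explicitly avoids the ValueError for repositories that use another naming scheme.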
- __init__(working_dir=None, verbose=False, tmp_dir=None, cache_backend=None, labels_to_add=None, default_branch=None)
Initialize a Repository instance.
- Parameters:
working_dir (Optional[str]) – Path to the git repository:
- If None: uses the current working directory
- If a local path: the path must contain a .git directory
- If a git URL: the repository will be cloned to a temporary directory
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.
cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache
labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames
default_branch (Optional[str]) – Name of the default branch to use. If None, will try to detect ‘main’ or ‘master’, and if neither exists, will raise ValueError.
- Raises:
ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists
- __del__()
Cleanup method called when the object is destroyed.
Cleans up any temporary directories created for cloned repositories.
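As a rough illustration of this cleanup pattern, the sketch below shows an object that owns a temporary directory and removes it when destroyed. The class `TempCheckout` and its `cleanup` method are hypothetical and are not part of gitpandas; they only show the general technique.

```python
import os
import shutil
import tempfile


class TempCheckout:
    """Hypothetical stand-in for an object that owns a temporary clone directory."""

    def __init__(self):
        # Directory that would hold the cloned repository.
        self.tmp_dir = tempfile.mkdtemp(prefix="repo_clone_")

    def cleanup(self):
        # Idempotent: safe to call more than once.
        if self.tmp_dir and os.path.isdir(self.tmp_dir):
            shutil.rmtree(self.tmp_dir, ignore_errors=True)
        self.tmp_dir = None

    def __del__(self):
        self.cleanup()


checkout = TempCheckout()
path = checkout.tmp_dir
checkout.cleanup()
print(os.path.exists(path))
```

Python does not guarantee when `__del__` runs, so code that needs the disk space back promptly may prefer an explicit cleanup call like the one shown.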
- is_bare(*args, **kwargs)
- has_coverage(*args, **kwargs)
- coverage(*args, **kwargs)
- hours_estimate(*args, **kwargs)
- commit_history(*args, **kwargs)
- file_change_history(*args, **kwargs)
- _process_commit_for_file_history(commit, history, ignore_globs, include_globs, skip_broken)
Helper method to process a commit for file change history.
- Parameters:
commit – The commit object to process
history – List to append the file change data to
ignore_globs – List of glob patterns for files to ignore
include_globs – List of glob patterns for files to include
skip_broken – Whether to skip errors for specific files
- file_change_rates(*args, **kwargs)
- blame(*args, **kwargs)
- revs(*args, **kwargs)
- cumulative_blame(*args, **kwargs)
- parallel_cumulative_blame(*args, **kwargs)
- branches(*args, **kwargs)
- get_branches_by_commit(*args, **kwargs)
- commits_in_tags(*args, **kwargs)
- tags(*args, **kwargs)
- property repo_name
- _repo_name()
Returns the name of the repository.
For local repositories, uses the name of the directory containing the .git folder. For remote repositories, extracts the name from the URL.
- Returns:
Name of the repository, or ‘unknown_repo’ if name can’t be determined
- Return type:
str
Note
This is an internal method primarily used to provide consistent repository names in DataFrame outputs.
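One way to picture the naming rule is a helper that takes the last path component of either a local path or a URL and falls back to ‘unknown_repo’. The function `guess_repo_name` and its parsing rules are assumptions made for illustration, not the library's implementation.

```python
def guess_repo_name(location):
    """Derive a repository name from a local path or a git URL.

    Hypothetical helper; the parsing rules are assumptions.
    """
    if not location:
        return "unknown_repo"
    cleaned = location.rstrip("/\\")
    if cleaned.endswith(".git"):
        # Drop a trailing '.git' (URL suffix or the .git folder itself).
        cleaned = cleaned[: -len(".git")].rstrip("/\\")
    # Take the last component of either a URL or a filesystem path.
    name = cleaned.replace("\\", "/").split("/")[-1]
    # scp-style URLs such as git@host:repo.git keep the name after the colon.
    name = name.split(":")[-1]
    return name or "unknown_repo"


print(guess_repo_name("git://github.com/user/repo.git"))
print(guess_repo_name("/path/to/repo/"))
```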
- _add_labels_to_df(df)
Adds configured labels to a DataFrame.
Adds the repository name and any additional configured labels to the DataFrame. This ensures consistent labeling across all DataFrame outputs.
- Parameters:
df (pandas.DataFrame) – DataFrame to add labels to
- Returns:
- The input DataFrame with additional label columns:
repository (str): Repository name
label0..labelN: Values from labels_to_add
- Return type:
pandas.DataFrame
Note
This is an internal helper method used by all public methods that return DataFrames.
- __str__()
Returns a human-readable string representation of the repository.
- Returns:
String in format ‘git repository: {name} at: {path}’
- Return type:
str
- get_commit_content(*args, **kwargs)
- get_file_content(*args, **kwargs)
- list_files(*args, **kwargs)
- __repr__()
Returns a unique string representation of the repository.
- Returns:
The absolute path to the repository
- Return type:
str
- bus_factor(*args, **kwargs)
- file_owner(*args, **kwargs)
- _get_last_edit_date(file_path, rev='HEAD')
Get the last edit date for a file at a given revision.
- punchcard(*args, **kwargs)
- has_branch(*args, **kwargs)
- file_detail(*args, **kwargs)
- time_between_revs(rev1, rev2)
Calculates the time difference in days between two revisions.
- Parameters:
- Returns:
The absolute time difference in days between the two revisions.
- Return type:
Note
The result is always non-negative (absolute value).
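The arithmetic behind an absolute day difference can be sketched as follows. The helper `days_between` and its use of Unix timestamps as inputs are assumptions made for illustration; the method itself takes revisions and resolves their dates internally.

```python
from datetime import datetime, timezone


def days_between(ts1, ts2):
    """Absolute difference in days between two commit timestamps (Unix seconds).

    Hypothetical helper: taking Unix timestamps as input is an assumption.
    """
    d1 = datetime.fromtimestamp(ts1, tz=timezone.utc)
    d2 = datetime.fromtimestamp(ts2, tz=timezone.utc)
    # abs() makes the result independent of argument order.
    return abs((d2 - d1).total_seconds()) / 86400.0


print(days_between(0, 86400))
```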
- diff_stats_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Computes diff statistics between two revisions.
Calculates the total insertions, deletions, net line change, and number of files changed between two arbitrary revisions (commits or tags). Optionally filters files using glob patterns.
- Parameters:
- Returns:
- A dictionary with keys:
‘insertions’ (int): Total lines inserted.
‘deletions’ (int): Total lines deleted.
‘net’ (int): Net lines changed (insertions - deletions).
‘files_changed’ (int): Number of files changed.
‘files’ (List[str]): List of changed file paths.
- Return type:
dict
Note
Binary files or files that cannot be parsed are skipped. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
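To make the shape of the returned dictionary concrete, the sketch below aggregates per-file counts into the documented keys. The helper `summarize_diff`, its input format, the use of None to mark skipped files, and the sorted order of ‘files’ are all assumptions made for illustration.

```python
def summarize_diff(per_file_stats):
    """Aggregate per-file (insertions, deletions) pairs into a summary dict.

    Hypothetical helper: `per_file_stats` maps file path -> (insertions,
    deletions), with None marking a binary or unparseable file.
    """
    insertions = deletions = 0
    files = []
    for path, stats in per_file_stats.items():
        if stats is None:
            continue  # skipped, like binary files in the note above
        added, removed = stats
        insertions += added
        deletions += removed
        files.append(path)
    return {
        "insertions": insertions,
        "deletions": deletions,
        "net": insertions - deletions,
        "files_changed": len(files),
        "files": sorted(files),  # sorting is an assumption
    }


print(summarize_diff({"a.py": (10, 2), "logo.png": None, "c.py": (1, 5)}))
```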
- committers_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Finds unique committers and authors between two revisions.
Iterates through all commits between two revisions (exclusive of rev1, inclusive of rev2) and returns the unique committers and authors who contributed, filtered by file globs if provided.
- Parameters:
- Returns:
- A dictionary with keys:
‘committers’ (List[str]): Sorted list of unique committer names.
‘authors’ (List[str]): Sorted list of unique author names.
- Return type:
dict
Note
Only commits that touch files matching the glob filters are considered. The range is interpreted as Git does: rev1..rev2 means commits reachable from rev2 but not rev1.
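The rev1..rev2 range semantics can be illustrated on a toy commit graph: collect everything reachable from rev2, then subtract everything reachable from rev1. The functions and the dictionary-based graph below are a minimal sketch for illustration and do not use Git or gitpandas.

```python
def reachable(commit, parents):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def commits_in_range(rev1, rev2, parents):
    """Commits in rev1..rev2: reachable from rev2 but not from rev1."""
    return reachable(rev2, parents) - reachable(rev1, parents)


# Toy linear history A <- B <- C <- D (each commit maps to its parents).
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
print(sorted(commits_in_range("B", "D", history)))
```

Swapping the arguments on a linear history yields an empty set, which is why the order of rev1 and rev2 matters here.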
- files_changed_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Lists files changed between two revisions.
Returns a sorted list of all files changed between two arbitrary revisions (commits or tags), optionally filtered by glob patterns.
- Parameters:
- Returns:
Sorted list of file paths changed between the two revisions.
- Return type:
List[str]
Note
If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
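The combined filter rule (match an include pattern and match no ignore pattern) can be sketched with the standard library. The helper `keep_file` and the use of `fnmatch`-style matching are assumptions made for illustration; the library's own matching rules may differ.

```python
from fnmatch import fnmatch


def keep_file(path, ignore_globs=None, include_globs=None):
    """True if `path` passes the include filter and no ignore pattern matches.

    Hypothetical helper; fnmatch-style matching is an assumption.
    """
    if include_globs and not any(fnmatch(path, g) for g in include_globs):
        return False
    if ignore_globs and any(fnmatch(path, g) for g in ignore_globs):
        return False
    return True


files = ["src/app.py", "src/app_test.py", "README.md"]
kept = sorted(
    f for f in files
    if keep_file(f, ignore_globs=["*_test.py"], include_globs=["*.py"])
)
print(kept)
```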
- release_tag_summary(*args, **kwargs)
- safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)
Safely fetch changes from remote repository.
Fetches the latest changes from a remote repository without modifying the working directory. This is a read-only operation that only updates remote-tracking branches.
- Parameters:
- Returns:
- Fetch results with keys:
success (bool): Whether the fetch was successful
message (str): Status message or error description
remote_exists (bool): Whether the specified remote exists
changes_available (bool): Whether new changes were fetched
error (Optional[str]): Error message if fetch failed
- Return type:
dict
Note
This method is safe as it only fetches remote changes and never modifies the working directory or current branch. It will not perform any merges, rebases, or checkouts.
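Because the result is a dictionary rather than an exception, callers typically branch on its keys. The helper below is a hypothetical sketch of that handling; it reads only the keys listed above and is exercised with a hand-built dictionary rather than a real fetch.

```python
def describe_fetch(result):
    """Turn a fetch-result dictionary into a one-line status string.

    Hypothetical helper; only the documented keys are read.
    """
    if not result.get("remote_exists", False):
        return "remote not found"
    if not result.get("success", False):
        return "fetch failed: " + str(result.get("error") or result.get("message"))
    if result.get("changes_available", False):
        return "fetched new changes"
    return "already up to date"


sample = {
    "success": True,
    "message": "ok",
    "remote_exists": True,
    "changes_available": False,
    "error": None,
}
print(describe_fetch(sample))
```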
- warm_cache(methods=None, **kwargs)
Pre-populate cache with commonly used data.
Executes a set of commonly used repository analysis methods to populate the cache, improving performance for subsequent calls. Only methods that support caching will be executed.
- Parameters:
methods (Optional[List[str]]) – List of method names to pre-warm. If None, uses a default set of commonly used methods. Available methods:
- ‘commit_history’: Load commit history
- ‘branches’: Load branch information
- ‘tags’: Load tag information
- ‘blame’: Load blame information
- ‘file_detail’: Load file details
- ‘list_files’: Load file listing
- ‘file_change_rates’: Load file change statistics
**kwargs – Additional keyword arguments to pass to the methods. Common arguments include:
- branch: Branch to analyze (default: repository’s default branch)
- limit: Limit number of commits to analyze
- ignore_globs: List of glob patterns to ignore
- include_globs: List of glob patterns to include
- Returns:
- Results of cache warming operations with keys:
success (bool): Whether cache warming was successful
methods_executed (List[str]): List of methods that were executed
methods_failed (List[str]): List of methods that failed
cache_entries_created (int): Number of cache entries created
execution_time (float): Total execution time in seconds
errors (List[str]): List of error messages for failed methods
- Return type:
dict
Note
This method will only execute methods if a cache backend is configured. If no cache backend is available, it will return immediately with a success status but no methods executed.
- invalidate_cache(keys=None, pattern=None)
Invalidate specific cache entries or all cache entries for this repository.
- Parameters:
- Returns:
Number of cache entries invalidated
- Return type:
int
Note
If both keys and pattern are None, all cache entries for this repository are invalidated. Cache keys are automatically prefixed with repository name.
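The behaviour described in the note can be pictured with a plain dictionary standing in for the cache. Everything in this sketch is an assumption made for illustration: the `invalidate` function, the `<repo_name>:<key>` prefix format, and the treatment of `pattern` as an fnmatch-style glob are not taken from the library.

```python
from fnmatch import fnmatch


def invalidate(cache, repo_name, keys=None, pattern=None):
    """Remove this repository's entries from `cache`; return how many were removed.

    Hypothetical helper: `cache` is a plain dict, keys look like
    '<repo_name>:<key>', and `pattern` is treated as an fnmatch-style glob.
    """
    prefix = repo_name + ":"
    doomed = []
    for full_key in cache:
        if not full_key.startswith(prefix):
            continue  # belongs to another repository
        short_key = full_key[len(prefix):]
        if keys is None and pattern is None:
            doomed.append(full_key)  # no filter: drop everything for this repo
        elif keys is not None and short_key in keys:
            doomed.append(full_key)
        elif pattern is not None and fnmatch(short_key, pattern):
            doomed.append(full_key)
    for full_key in doomed:
        del cache[full_key]
    return len(doomed)


cache = {"repo:blame": 1, "repo:commit_history_main": 2, "other:blame": 3}
print(invalidate(cache, "repo", pattern="commit_*"))
```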
- class gitpandas.repository.GitFlowRepository
Bases: Repository
A special case where git flow is followed, so we know something about the branching scheme.
- __del__()
Cleanup method called when the object is destroyed.
Cleans up any temporary directories created for cloned repositories.
- __repr__()
Returns a unique string representation of the repository.
- Returns:
The absolute path to the repository
- Return type:
str
- __str__()
Returns a human-readable string representation of the repository.
- Returns:
String in format ‘git repository: {name} at: {path}’
- Return type:
str
- _add_labels_to_df(df)
Adds configured labels to a DataFrame.
Adds the repository name and any additional configured labels to the DataFrame. This ensures consistent labeling across all DataFrame outputs.
- Parameters:
df (pandas.DataFrame) – DataFrame to add labels to
- Returns:
- The input DataFrame with additional label columns:
repository (str): Repository name
label0..labelN: Values from labels_to_add
- Return type:
pandas.DataFrame
Note
This is an internal helper method used by all public methods that return DataFrames.
- _get_last_edit_date(file_path, rev='HEAD')
Get the last edit date for a file at a given revision.
- _process_commit_for_file_history(commit, history, ignore_globs, include_globs, skip_broken)
Helper method to process a commit for file change history.
- Parameters:
commit – The commit object to process
history – List to append the file change data to
ignore_globs – List of glob patterns for files to ignore
include_globs – List of glob patterns for files to include
skip_broken – Whether to skip errors for specific files
- _repo_name()
Returns the name of the repository.
For local repositories, uses the name of the directory containing the .git folder. For remote repositories, extracts the name from the URL.
- Returns:
Name of the repository, or ‘unknown_repo’ if name can’t be determined
- Return type:
str
Note
This is an internal method primarily used to provide consistent repository names in DataFrame outputs.
- blame(*args, **kwargs)
- branches(*args, **kwargs)
- bus_factor(*args, **kwargs)
- commit_history(*args, **kwargs)
- commits_in_tags(*args, **kwargs)
- committers_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Finds unique committers and authors between two revisions.
Iterates through all commits between two revisions (exclusive of rev1, inclusive of rev2) and returns the unique committers and authors who contributed, filtered by file globs if provided.
- Parameters:
- Returns:
- A dictionary with keys:
‘committers’ (List[str]): Sorted list of unique committer names.
‘authors’ (List[str]): Sorted list of unique author names.
- Return type:
dict
Note
Only commits that touch files matching the glob filters are considered. The range is interpreted as Git does: rev1..rev2 means commits reachable from rev2 but not rev1.
- coverage(*args, **kwargs)
- cumulative_blame(*args, **kwargs)
- diff_stats_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Computes diff statistics between two revisions.
Calculates the total insertions, deletions, net line change, and number of files changed between two arbitrary revisions (commits or tags). Optionally filters files using glob patterns.
- Parameters:
- Returns:
- A dictionary with keys:
‘insertions’ (int): Total lines inserted.
‘deletions’ (int): Total lines deleted.
‘net’ (int): Net lines changed (insertions - deletions).
‘files_changed’ (int): Number of files changed.
‘files’ (List[str]): List of changed file paths.
- Return type:
dict
Note
Binary files or files that cannot be parsed are skipped. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
- file_change_history(*args, **kwargs)
- file_change_rates(*args, **kwargs)
- file_detail(*args, **kwargs)
- file_owner(*args, **kwargs)
- files_changed_between_revs(rev1, rev2, ignore_globs=None, include_globs=None)
Lists files changed between two revisions.
Returns a sorted list of all files changed between two arbitrary revisions (commits or tags), optionally filtered by glob patterns.
- Parameters:
- Returns:
Sorted list of file paths changed between the two revisions.
- Return type:
List[str]
Note
If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
- get_branches_by_commit(*args, **kwargs)
- get_cache_stats()
Get cache statistics for this repository.
- Returns:
Cache statistics including repository-specific and global cache information
- Return type:
- get_commit_content(*args, **kwargs)
- get_file_content(*args, **kwargs)
- has_branch(*args, **kwargs)
- has_coverage(*args, **kwargs)
- hours_estimate(*args, **kwargs)
- invalidate_cache(keys=None, pattern=None)
Invalidate specific cache entries or all cache entries for this repository.
- Parameters:
- Returns:
Number of cache entries invalidated
- Return type:
int
Note
If both keys and pattern are None, all cache entries for this repository are invalidated. Cache keys are automatically prefixed with repository name.
- is_bare(*args, **kwargs)
- list_files(*args, **kwargs)
- parallel_cumulative_blame(*args, **kwargs)
- punchcard(*args, **kwargs)
- release_tag_summary(*args, **kwargs)
- property repo_name
- revs(*args, **kwargs)
- safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)
Safely fetch changes from remote repository.
Fetches the latest changes from a remote repository without modifying the working directory. This is a read-only operation that only updates remote-tracking branches.
- Parameters:
- Returns:
- Fetch results with keys:
success (bool): Whether the fetch was successful
message (str): Status message or error description
remote_exists (bool): Whether the specified remote exists
changes_available (bool): Whether new changes were fetched
error (Optional[str]): Error message if fetch failed
- Return type:
dict
Note
This method is safe as it only fetches remote changes and never modifies the working directory or current branch. It will not perform any merges, rebases, or checkouts.
- tags(*args, **kwargs)
- time_between_revs(rev1, rev2)
Calculates the time difference in days between two revisions.
- Parameters:
- Returns:
The absolute time difference in days between the two revisions.
- Return type:
Note
The result is always non-negative (absolute value).
- warm_cache(methods=None, **kwargs)
Pre-populate cache with commonly used data.
Executes a set of commonly used repository analysis methods to populate the cache, improving performance for subsequent calls. Only methods that support caching will be executed.
- Parameters:
methods (Optional[List[str]]) – List of method names to pre-warm. If None, uses a default set of commonly used methods. Available methods:
- ‘commit_history’: Load commit history
- ‘branches’: Load branch information
- ‘tags’: Load tag information
- ‘blame’: Load blame information
- ‘file_detail’: Load file details
- ‘list_files’: Load file listing
- ‘file_change_rates’: Load file change statistics
**kwargs – Additional keyword arguments to pass to the methods. Common arguments include:
- branch: Branch to analyze (default: repository’s default branch)
- limit: Limit number of commits to analyze
- ignore_globs: List of glob patterns to ignore
- include_globs: List of glob patterns to include
- Returns:
- Results of cache warming operations with keys:
success (bool): Whether cache warming was successful
methods_executed (List[str]): List of methods that were executed
methods_failed (List[str]): List of methods that failed
cache_entries_created (int): Number of cache entries created
execution_time (float): Total execution time in seconds
errors (List[str]): List of error messages for failed methods
- Return type:
dict
Note
This method will only execute methods if a cache backend is configured. If no cache backend is available, it will return immediately with a success status but no methods executed.
- __init__()
Initialize a Repository instance.
- Parameters:
working_dir (Optional[str]) – Path to the git repository:
- If None: uses the current working directory
- If a local path: the path must contain a .git directory
- If a git URL: the repository will be cloned to a temporary directory
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.
cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache
labels_to_add (Optional[List[str]]) – Extra labels to add to output DataFrames
default_branch (Optional[str]) – Name of the default branch to use. If None, will try to detect ‘main’ or ‘master’, and if neither exists, will raise ValueError.
- Raises:
ValueError – If default_branch is None and neither ‘main’ nor ‘master’ branch exists