Project Directory
The ProjectDirectory class enables analysis across multiple Git repositories. It can aggregate metrics and insights from multiple repositories into a single output.
Overview
The ProjectDirectory class provides:
Analysis across multiple repositories
Aggregated metrics and statistics
Project-level insights
Multi-repository bus factor analysis
Consolidated commit history and blame information
Creating a ProjectDirectory
You can create a ProjectDirectory object in three ways:
Directory of Repositories
Create a ProjectDirectory from a directory containing multiple repositories:
from gitpandas import ProjectDirectory
project = ProjectDirectory(
working_dir='/path/to/dir/',
ignore_repos=['repo_to_ignore'],
verbose=True,
default_branch='main' # Optional, will auto-detect if not specified
)
The ignore_repos parameter can be a list of repository names to exclude. This method uses os.walk to search for .git directories recursively.
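The recursive search described above can be sketched with the standard library alone. This is a minimal illustration, not gitpandas' actual implementation: the helper name `find_git_repos` is hypothetical, and matching `ignore_repos` against the directory's base name is an assumption.

```python
import os


def find_git_repos(root, ignore_repos=None):
    """Return sorted paths under `root` that contain a .git directory.

    Hypothetical helper: ignore_repos is matched against the directory's
    base name, which is an assumption about how names are compared.
    """
    ignored = set(ignore_repos or [])
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if ".git" in dirnames:
            name = os.path.basename(os.path.normpath(dirpath))
            if name not in ignored:
                found.append(dirpath)
            # Do not descend into the .git directory itself.
            dirnames.remove(".git")
    return sorted(found)


if __name__ == "__main__":
    # List repositories below the current directory.
    for path in find_git_repos(".", ignore_repos=["repo_to_ignore"]):
        print(path)
```

Removing `.git` from `dirnames` in place keeps `os.walk` from descending into Git's internal object store, which would otherwise dominate the traversal time.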
Explicit Local Repositories
Create a ProjectDirectory from a list of local repositories:
from gitpandas import ProjectDirectory
project = ProjectDirectory(
working_dir=['/path/to/repo1/', '/path/to/repo2/'],
ignore_repos=['repo_to_ignore'],
verbose=True,
default_branch='main' # Optional, will auto-detect if not specified
)
Explicit Remote Repositories
Create a ProjectDirectory from a list of remote repositories:
from gitpandas import ProjectDirectory
project = ProjectDirectory(
working_dir=['git://github.com/user/repo1.git', 'git://github.com/user/repo2.git'],
ignore_repos=['repo_to_ignore'],
verbose=True,
default_branch='main' # Optional, will auto-detect if not specified
)
Available Methods
Core Analysis
# Commit history across repositories
project.commit_history(
branch=None, # Branch to analyze
limit=None, # Maximum number of commits
days=None, # Limit to last N days
ignore_globs=None, # Files to ignore
include_globs=None # Files to include
)
# File change history across repositories
project.file_change_history(
branch=None,
limit=None,
days=None,
ignore_globs=None,
include_globs=None
)
# Blame analysis across repositories
project.blame(
committer=True, # Group by committer (False for author)
by="repository", # Group by 'repository' or 'file'
ignore_globs=None,
include_globs=None
)
# Bus factor analysis across repositories
project.bus_factor(
by="repository", # How to group results ('projectd', 'repository', or 'file')
ignore_globs=None,
include_globs=None
)
Common Parameters
Most analysis methods support these filtering parameters:
branch: Branch to analyze (defaults to repository’s default branch)
limit: Maximum number of commits to analyze
days: Limit analysis to last N days
ignore_globs: List of glob patterns for files to ignore
include_globs: List of glob patterns for files to include
by: How to group results (usually ‘repository’ or ‘file’)
API Reference
- class gitpandas.project.ProjectDirectory(working_dir=None, ignore_repos=None, verbose=True, tmp_dir=None, cache_backend=None, default_branch='main')
Bases: object
A class for analyzing multiple git repositories in a directory or from explicit paths.
This class provides functionality to analyze multiple git repositories together, whether they are local repositories in a directory, explicitly specified local repositories, or remote repositories that need to be cloned. It offers methods for analyzing commit history, blame information, file changes, and other git metrics across all repositories.
- Parameters:
working_dir (Union[str, List[str], None]) – The source of repositories to analyze:
- If None: uses the current working directory to find repositories
- If str: path to a directory containing git repositories
- If List[str]: list of paths to git repositories or Repository instances
ignore_repos (Optional[List[str]]) – List of repository names to ignore
verbose (bool, optional) – Whether to print verbose output. Defaults to True.
tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.
cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache
default_branch (str, optional) – Name of the default branch to use. Defaults to ‘main’.
- Variables:
repo_dirs (Union[set, list]) – Set of repository directories or list of Repository instances
repos (List[Repository]) – List of Repository objects being analyzed
Examples
>>> # Create from directory containing repos
>>> pd = ProjectDirectory(working_dir='/path/to/repos')
>>> # Create from explicit local repos
>>> pd = ProjectDirectory(working_dir=['/path/to/repo1', '/path/to/repo2'])
>>> # Create from remote repos
>>> pd = ProjectDirectory(working_dir=['git://github.com/user/repo.git'])
Note
When using remote repositories, they will be cloned to temporary directories. This can be slow for large repositories.
Methods
- __init__(working_dir=None, ignore_repos=None, verbose=True, tmp_dir=None, cache_backend=None, default_branch='main')
Initialize a ProjectDirectory instance.
- Parameters:
working_dir (Union[str, List[str], None]) – The source of repositories to analyze:
- If None: uses the current working directory to find repositories
- If str: path to a directory containing git repositories
- If List[str]: list of paths to git repositories or Repository instances
ignore_repos (Optional[List[str]]) – List of repository names to ignore
verbose (bool, optional) – Whether to print verbose output. Defaults to True.
tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.
cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache
default_branch (str, optional) – Name of the default branch to use. Defaults to ‘main’.
- repo_name()
Returns a DataFrame containing the names of all repositories in the project.
- Returns:
- A DataFrame with a single column:
repository (str): Name of each repository
- Return type:
DataFrame
- is_bare()
Returns a DataFrame of repo names and whether or not they are bare.
- Returns:
DataFrame
- has_coverage()
Returns a DataFrame of repo names and whether or not they have a .coverage file that can be parsed.
- Returns:
DataFrame
- coverage()
Returns a DataFrame with coverage information (if available) for each repo in the project.
If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:
repository
filename
lines_covered
total_lines
coverage
If it can’t be found or parsed, an empty DataFrame of that form will be returned.
- Returns:
DataFrame
- file_change_rates(branch=None, limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)
Returns a DataFrame containing basic aggregations of the file change history data and, optionally, test coverage data from a coverage.py .coverage file. The aim is to identify files in the project with abnormal edit rates, or with a high rate of change without growth in the file's size. A file with a high change rate and poor test coverage is a strong candidate for more tests.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
limit (Optional[int]) – Maximum number of commits to return, None for no limit
coverage (bool, optional) – Whether to include coverage data. Defaults to False.
days (Optional[int]) – Number of days to return if limit is None
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with file change statistics and optionally coverage data
- Return type:
DataFrame
- hours_estimate(branch=None, grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, by=None, ignore_globs=None, include_globs=None)
Returns a DataFrame containing the estimated hours spent by each committer/author.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
grouping_window (float, optional) – Hours threshold for considering commits part of same session. Defaults to 0.5.
single_commit_hours (float, optional) – Hours to assign to single commits. Defaults to 0.5.
limit (Optional[int]) – Maximum number of commits to analyze
days (Optional[int]) – If provided, only analyze commits from last N days
committer (bool, optional) – If True use committer, if False use author. Defaults to True.
by (Optional[str]) – How to group results. One of None, ‘committer’, ‘author’
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with hours estimates
- Return type:
DataFrame
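The roles of grouping_window and single_commit_hours can be illustrated with a small session-based estimator. This is a sketch of one plausible reading of those two parameters, not the library's confirmed algorithm; the function name `estimate_hours` and the rule of charging single_commit_hours at the start of each session are assumptions.

```python
from datetime import datetime


def estimate_hours(commit_times, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from commit timestamps (hypothetical sketch).

    Gaps shorter than grouping_window hours count as time worked within a
    session. Each new session (including the first) is charged a flat
    single_commit_hours, which is an assumption of this sketch.
    """
    times = sorted(commit_times)
    if not times:
        return 0.0
    total = single_commit_hours  # charge for the first commit
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds() / 3600.0
        if gap < grouping_window:
            total += gap  # same session: count the elapsed time
        else:
            total += single_commit_hours  # new session: flat charge
    return total


if __name__ == "__main__":
    day = [
        datetime(2024, 1, 1, 9, 0),
        datetime(2024, 1, 1, 9, 10),
        datetime(2024, 1, 1, 9, 25),
        datetime(2024, 1, 1, 14, 0),
    ]
    print(estimate_hours(day))
```

A larger grouping_window merges more commits into one session, so the estimate tracks elapsed time between commits more closely. A smaller one leans on the flat per-session charge.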
- commit_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
Returns a DataFrame containing the commit history for all repositories.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
limit (Optional[int]) – Maximum number of commits to return
days (Optional[int]) – If provided, only return commits from last N days
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with commit history
- Return type:
DataFrame
- file_change_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
Returns a DataFrame containing the file change history for all repositories.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
limit (Optional[int]) – Maximum number of commits to analyze
days (Optional[int]) – If provided, only analyze commits from last N days
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with file change history
- Return type:
DataFrame
- blame(committer=True, by='repository', ignore_globs=None, include_globs=None)
Analyzes blame information across all repositories.
Retrieves blame information from the current HEAD of each repository and aggregates it based on the specified grouping. Can group results by committer/author and either repository or file.
- Parameters:
committer (bool, optional) – If True, group by committer name. If False, group by author name. Defaults to True.
by (str, optional) – How to group the results. One of: - ‘repository’: Group by repository (default) - ‘file’: Group by individual file
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
- A DataFrame with columns depending on the ‘by’ parameter:
- If by=’repository’:
committer/author (str): Name of the committer/author
loc (int): Lines of code attributed to that person
- If by=’file’:
committer/author (str): Name of the committer/author
file (str): File path
loc (int): Lines of code attributed to that person in that file
- Return type:
DataFrame
Note
Results are sorted by lines of code in descending order. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
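The include/ignore rule in the note can be expressed as a small predicate. This is a minimal sketch that uses the standard library's `fnmatch` for pattern matching; the helper name `keep_file` is hypothetical, and gitpandas' own glob matching may differ in detail (for example in how `*` treats path separators).

```python
from fnmatch import fnmatchcase


def keep_file(path, ignore_globs=None, include_globs=None):
    """Return True if `path` passes the include and ignore filters.

    A file is kept when it matches at least one include pattern (if any
    are given) and matches none of the ignore patterns.
    """
    if include_globs and not any(fnmatchcase(path, g) for g in include_globs):
        return False
    if ignore_globs and any(fnmatchcase(path, g) for g in ignore_globs):
        return False
    return True


if __name__ == "__main__":
    files = ["src/app.py", "tests/test_app.py", "README.md"]
    kept = [
        f for f in files
        if keep_file(f, ignore_globs=["tests/*"], include_globs=["*.py"])
    ]
    print(kept)
```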
- file_detail(rev='HEAD', committer=True, ignore_globs=None, include_globs=None)
Provides detailed information about all files in the repositories.
Analyzes each file in the repositories at the specified revision, gathering information about size, ownership, and last modification.
- Parameters:
rev (str, optional) – Revision to analyze. Defaults to ‘HEAD’.
committer (bool, optional) – If True, use committer info. If False, use author. Defaults to True.
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
- A DataFrame indexed by (file, repository) with columns:
committer/author (str): Name of primary committer/author
last_change (datetime): When file was last modified
loc (int): Lines of code in file
extension (str): File extension
directory (str): Directory containing file
filename (str): Name of file without path
pct_blame (float): Percentage of file attributed to primary committer/author
- Return type:
DataFrame
Note
The primary committer/author is the person responsible for the most lines in the current version of the file.
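The notion of a primary committer/author and its pct_blame share can be sketched from per-line ownership data. The helper name `primary_owner` is hypothetical, and expressing the share on a 0-100 scale is an assumption; the library may report it differently.

```python
from collections import Counter


def primary_owner(line_owners):
    """Return (owner, pct_blame) for a file given one owner name per line.

    The owner is the person attributed the most lines. pct_blame is
    returned on a 0-100 scale here, which is an assumption of this sketch.
    """
    counts = Counter(line_owners)
    if not counts:
        return None, 0.0
    owner, loc = counts.most_common(1)[0]
    return owner, 100.0 * loc / len(line_owners)


if __name__ == "__main__":
    # Four lines in the file: three attributed to alice, one to bob.
    print(primary_owner(["alice", "alice", "bob", "alice"]))
```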
- branches()
Returns information about all branches across repositories.
Retrieves a list of all branches (both local and remote) from each repository in the project directory.
- Returns:
- A DataFrame with columns:
repository (str): Repository name
local (bool): Whether the branch is local
branch (str): Name of the branch
- Return type:
DataFrame
- revs(branch=None, limit=None, skip=None, num_datapoints=None)
Returns a DataFrame containing revision information for all repositories.
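- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
limit (Optional[int]) – Maximum number of revisions to return
skip (Optional[int]) – Number of revisions to skip between samples
num_datapoints (Optional[int]) – If provided, evenly sample this many revisions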
- Returns:
DataFrame with revision information
- Return type:
DataFrame
- cumulative_blame(branch=None, by='committer', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)
Returns a DataFrame containing cumulative blame information for all repositories.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
by (str, optional) – How to group results. Defaults to ‘committer’.
limit (Optional[int]) – Maximum number of revisions to analyze
skip (Optional[int]) – Number of revisions to skip between samples
num_datapoints (Optional[int]) – If provided, evenly sample this many revisions
committer (bool, optional) – If True use committer, if False use author. Defaults to True.
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with cumulative blame information
- Return type:
DataFrame
- commits_in_tags(**kwargs)
Analyzes each tag and traces backwards from the tag to all commits that make up that tag. This method looks at the commit for the tag, then works backwards through that commit's parents, and so on, until it hits another tag, is out of the time range, or hits the root commit. It returns a DataFrame with the branches:
- Parameters:
kwargs – kwargs to pass to Repository.commits_in_tags
- Returns:
DataFrame
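The backwards walk described for commits_in_tags can be sketched on a toy commit graph. This is an illustration only: the function name `commits_for_tag`, the dictionary-based graph, and the commit identifiers are hypothetical, and the time-range cutoff mentioned above is left out for brevity.

```python
def commits_for_tag(tag, tags, parents):
    """Collect commits reachable from `tag` before hitting another tag.

    tags maps tag name -> commit id; parents maps commit id -> list of
    parent commit ids. Both structures are hypothetical stand-ins for a
    real repository's object graph.
    """
    start = tags[tag]
    # Commits carrying a different tag act as stopping points.
    other_tagged = {sha for sha in tags.values() if sha != start}
    collected, visited = [], set()
    stack = [start]
    while stack:
        sha = stack.pop()
        if sha in visited or sha in other_tagged:
            continue
        visited.add(sha)
        collected.append(sha)
        # A root commit has no parents, so the walk ends there.
        stack.extend(parents.get(sha, []))
    return collected


if __name__ == "__main__":
    parents = {"c4": ["c3"], "c3": ["c2"], "c2": ["c1"], "c1": []}
    tags = {"v1.0": "c2", "v2.0": "c4"}
    print(commits_for_tag("v2.0", tags, parents))
    print(commits_for_tag("v1.0", tags, parents))
```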
- tags()
Returns a DataFrame of all tags in origin. The DataFrame will have the columns:
repository
tag
- Returns:
DataFrame
- repo_information()
Returns detailed metadata about each repository.
Retrieves various properties and references from each repository’s Git object model.
- Returns:
- A DataFrame with columns:
local_directory (str): Path to the repository
branches (list): List of branches
bare (bool): Whether it’s a bare repository
remotes (list): List of remote references
description (str): Repository description
references (list): List of all references
heads (list): List of branch heads
submodules (list): List of submodules
tags (list): List of tags
active_branch (str): Currently checked out branch
- Return type:
DataFrame
- bus_factor(ignore_globs=None, include_globs=None, by='projectd')
Calculates the “bus factor” for the repositories.
The bus factor is a measure of risk based on how concentrated the codebase knowledge is among contributors. It is calculated as the minimum number of contributors whose combined contributions account for at least 50% of the codebase’s lines of code.
- Parameters:
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
by (str, optional) – How to calculate the bus factor. One of: - ‘projectd’: Calculate for entire project directory (default) - ‘repository’: Calculate separately for each repository - ‘file’: Calculate separately for each file across all repositories
- Returns:
- A DataFrame with columns depending on the ‘by’ parameter:
- If by=’projectd’:
projectd (str): Always ‘projectd’
bus factor (int): Bus factor for entire project
- If by=’repository’:
repository (str): Repository name
bus factor (int): Bus factor for that repository
- If by=’file’:
file (str): File path
bus factor (int): Bus factor for that file
repository (str): Repository name
- Return type:
DataFrame
Note
A low bus factor (e.g. 1-2) indicates high risk as knowledge is concentrated among few contributors. A higher bus factor indicates knowledge is better distributed.
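The 50% definition above translates directly into a short calculation over lines of code per contributor. This is a minimal sketch of that definition; the function name `bus_factor` and the dictionary input are illustrative, and it does not reproduce how gitpandas gathers blame data.

```python
def bus_factor(loc_by_contributor, threshold=0.5):
    """Smallest number of contributors covering `threshold` of all lines.

    loc_by_contributor maps contributor name -> lines of code attributed
    to them (for example, from blame output).
    """
    total = sum(loc_by_contributor.values())
    if total == 0:
        return 0
    running = 0
    # Take the largest contributors first so the count is minimal.
    ranked = sorted(loc_by_contributor.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        running += loc
        if running >= threshold * total:
            return count
    return len(ranked)


if __name__ == "__main__":
    print(bus_factor({"alice": 600, "bob": 300, "carol": 100}))
    print(bus_factor({"alice": 40, "bob": 35, "carol": 25}))
```

In the first call a single contributor already owns at least half of the lines, so the bus factor is 1. In the second, two contributors are needed.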
- punchcard(branch=None, limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)
Returns a DataFrame containing punchcard data for all repositories.
- Parameters:
branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.
limit (Optional[int]) – Maximum number of commits to analyze
days (Optional[int]) – If provided, only analyze commits from last N days
by (Optional[str]) – How to group results. One of None, ‘committer’, ‘author’
normalize (Optional[int]) – If provided, normalize values to this maximum
ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore
include_globs (Optional[List[str]]) – List of glob patterns for files to include
- Returns:
DataFrame with punchcard data
- Return type:
DataFrame
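A punchcard is commonly a count of activity per weekday and hour of day. The sketch below assumes that meaning and counts commits per (weekday, hour) bucket, with optional scaling to a chosen maximum. The function name `punchcard_counts`, the bucket keys, and the use of commit counts as the metric are assumptions, not the library's documented output.

```python
from collections import Counter
from datetime import datetime


def punchcard_counts(commit_times, normalize=None):
    """Count commits per (weekday, hour) bucket; Monday is weekday 0.

    If normalize is given, values are scaled so the busiest bucket equals
    that number (an assumed reading of the normalize parameter).
    """
    counts = Counter((t.weekday(), t.hour) for t in commit_times)
    if normalize is not None and counts:
        peak = max(counts.values())
        return {key: value * normalize / peak for key, value in counts.items()}
    return dict(counts)


if __name__ == "__main__":
    times = [
        datetime(2024, 1, 1, 9, 5),   # Monday
        datetime(2024, 1, 1, 9, 40),  # Monday
        datetime(2024, 1, 2, 14, 0),  # Tuesday
    ]
    print(punchcard_counts(times))
    print(punchcard_counts(times, normalize=1))
```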
- __del__()
Cleanup method called when the object is destroyed.
Ensures proper cleanup of all repository objects, including temporary directories for cloned repositories.
- bulk_fetch_and_warm(fetch_remote=False, warm_cache=False, parallel=True, remote_name='origin', prune=False, dry_run=False, cache_methods=None, **kwargs)
Safely fetch remote changes and pre-warm cache for all repositories.
Performs bulk operations across all repositories in the project directory, optionally fetching from remote repositories and pre-warming caches to improve subsequent analysis performance.
- Parameters:
fetch_remote (bool, optional) – Whether to fetch from remote repositories. Defaults to False.
warm_cache (bool, optional) – Whether to pre-warm repository caches. Defaults to False.
parallel (bool, optional) – Use parallel processing when available (joblib). Defaults to True.
remote_name (str, optional) – Name of remote to fetch from. Defaults to ‘origin’.
prune (bool, optional) – Remove remote-tracking branches that no longer exist. Defaults to False.
dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.
cache_methods (Optional[List[str]]) – List of methods to use for cache warming. If None, uses default methods. See Repository.warm_cache for available methods.
**kwargs – Additional keyword arguments to pass to cache warming methods.
- Returns:
- Results with keys:
success (bool): Whether the overall operation was successful
repositories_processed (int): Number of repositories processed
fetch_results (dict): Per-repository fetch results (if fetch_remote=True)
cache_results (dict): Per-repository cache warming results (if warm_cache=True)
execution_time (float): Total execution time in seconds
summary (dict): Summary statistics of the operation
- Return type:
dict
Note
This method safely handles errors at the repository level, ensuring that failures in one repository don’t affect processing of others. All operations are read-only and will not modify working directories or current branches.
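The per-repository error isolation described in the note can be sketched as a loop that catches failures individually and reports them in a result dictionary. This is a simplified, sequential illustration. The function name `bulk_run`, the `results` key, and the rule that overall success requires every repository to succeed are assumptions, not the library's behavior.

```python
import time


def bulk_run(repos, operation):
    """Apply `operation` to each repo, isolating failures per repository.

    repos maps repository name -> repository object (any value here).
    The returned keys loosely mirror those documented above; `results`
    is a hypothetical stand-in for fetch_results / cache_results.
    """
    start = time.monotonic()
    results = {}
    for name, repo in repos.items():
        try:
            results[name] = {"success": True, "result": operation(repo)}
        except Exception as exc:
            # A failure in one repository does not stop the others.
            results[name] = {"success": False, "error": str(exc)}
    return {
        "success": all(r["success"] for r in results.values()),
        "repositories_processed": len(results),
        "results": results,
        "execution_time": time.monotonic() - start,
    }


if __name__ == "__main__":
    def fake_fetch(repo):
        if repo == "broken":
            raise RuntimeError("remote unreachable")
        return "fetched " + repo

    print(bulk_run({"a": "repo_a", "b": "broken"}, fake_fetch))
```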