Project Directory

The ProjectDirectory class enables analysis across multiple Git repositories. It can aggregate metrics and insights from multiple repositories into a single output.

Overview

The ProjectDirectory class provides:

  • Analysis across multiple repositories

  • Aggregated metrics and statistics

  • Project-level insights

  • Multi-repository bus factor analysis

  • Consolidated commit history and blame information

Creating a ProjectDirectory

You can create a ProjectDirectory object in three ways:

Directory of Repositories

Create a ProjectDirectory from a directory containing multiple repositories:

from gitpandas import ProjectDirectory
project = ProjectDirectory(
    working_dir='/path/to/dir/',
    ignore_repos=['repo_to_ignore'],
    verbose=True,
    default_branch='main'  # Optional, defaults to 'main'
)

The ignore_repos parameter can be a list of repository names to exclude. This method uses os.walk to search for .git directories recursively.
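The recursive search described above can be sketched with the standard library alone. This is an illustration of the idea, not the library's code: the function name find_git_repos is hypothetical, and matching ignore_repos against the directory's base name is an assumption.

```python
import os


def find_git_repos(root, ignore_repos=None):
    """Return sorted paths of directories under root that contain a .git directory."""
    ignored = set(ignore_repos or [])
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if ".git" in dirnames:
            # Assumption: a repository is ignored by its directory name.
            if os.path.basename(dirpath) not in ignored:
                found.append(dirpath)
            # Do not descend into the .git directory itself.
            dirnames.remove(".git")
    return sorted(found)
```

Removing ".git" from dirnames in place tells os.walk not to visit Git's internal object store, which keeps the scan fast on large repositories.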

Explicit Local Repositories

Create a ProjectDirectory from a list of local repositories:

from gitpandas import ProjectDirectory
project = ProjectDirectory(
    working_dir=['/path/to/repo1/', '/path/to/repo2/'],
    ignore_repos=['repo_to_ignore'],
    verbose=True,
    default_branch='main'  # Optional, defaults to 'main'
)

Explicit Remote Repositories

Create a ProjectDirectory from a list of remote repositories:

from gitpandas import ProjectDirectory
project = ProjectDirectory(
    working_dir=['git://github.com/user/repo1.git', 'git://github.com/user/repo2.git'],
    ignore_repos=['repo_to_ignore'],
    verbose=True,
    default_branch='main'  # Optional, defaults to 'main'
)

Available Methods

Core Analysis

# Commit history across repositories
project.commit_history(
    branch=None,         # Branch to analyze
    limit=None,          # Maximum number of commits
    days=None,           # Limit to last N days
    ignore_globs=None,   # Files to ignore
    include_globs=None   # Files to include
)

# File change history across repositories
project.file_change_history(
    branch=None,
    limit=None,
    days=None,
    ignore_globs=None,
    include_globs=None
)

# Blame analysis across repositories (taken at each repository's current HEAD)
project.blame(
    committer=True,      # Group by committer (False for author)
    by="repository",     # Group by 'repository' or 'file'
    ignore_globs=None,
    include_globs=None
)

# Bus factor analysis across repositories
project.bus_factor(
    by="repository",     # How to group results ('projectd', 'repository', or 'file')
    ignore_globs=None,
    include_globs=None
)

Common Parameters

Most analysis methods support these filtering parameters:

  • branch: Branch to analyze (defaults to repository’s default branch)

  • limit: Maximum number of commits to analyze

  • days: Limit analysis to last N days

  • ignore_globs: List of glob patterns for files to ignore

  • include_globs: List of glob patterns for files to include

  • by: How to group results (usually ‘repository’ or ‘file’)

API Reference

class gitpandas.project.ProjectDirectory(working_dir=None, ignore_repos=None, verbose=True, tmp_dir=None, cache_backend=None, default_branch='main')[source]

Bases: object

A class for analyzing multiple git repositories in a directory or from explicit paths.

This class provides functionality to analyze multiple git repositories together, whether they are local repositories in a directory, explicitly specified local repositories, or remote repositories that need to be cloned. It offers methods for analyzing commit history, blame information, file changes, and other git metrics across all repositories.

Parameters:
  • working_dir (Union[str, List[str], None]) – The source of repositories to analyze:
      - If None: Uses current working directory to find repositories
      - If str: Path to directory containing git repositories
      - If List[str]: List of paths to git repositories or Repository instances

  • ignore_repos (Optional[List[str]]) – List of repository names to ignore

  • verbose (bool, optional) – Whether to print verbose output. Defaults to True.

  • tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.

  • cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache

  • default_branch (str, optional) – Name of the default branch to use. Defaults to ‘main’.

Variables:
  • repo_dirs (Union[set, list]) – Set of repository directories or list of Repository instances

  • repos (List[Repository]) – List of Repository objects being analyzed

Examples

>>> # Create from directory containing repos
>>> pd = ProjectDirectory(working_dir='/path/to/repos')
>>> # Create from explicit local repos
>>> pd = ProjectDirectory(working_dir=['/path/to/repo1', '/path/to/repo2'])
>>> # Create from remote repos
>>> pd = ProjectDirectory(working_dir=['git://github.com/user/repo.git'])

Note

When using remote repositories, they will be cloned to temporary directories. This can be slow for large repositories.

Methods

__init__(working_dir=None, ignore_repos=None, verbose=True, tmp_dir=None, cache_backend=None, default_branch='main')[source]

Initialize a ProjectDirectory instance.

Parameters:
  • working_dir (Union[str, List[str], None]) – The source of repositories to analyze:
      - If None: Uses current working directory to find repositories
      - If str: Path to directory containing git repositories
      - If List[str]: List of paths to git repositories or Repository instances

  • ignore_repos (Optional[List[str]]) – List of repository names to ignore

  • verbose (bool, optional) – Whether to print verbose output. Defaults to True.

  • tmp_dir (Optional[str]) – Directory to clone remote repositories into. Created if not provided.

  • cache_backend (Optional[object]) – Cache backend instance from gitpandas.cache

  • default_branch (str, optional) – Name of the default branch to use. Defaults to ‘main’.

repo_name()[source]

Returns a DataFrame containing the names of all repositories in the project.

Returns:

A DataFrame with a single column:
  • repository (str): Name of each repository

Return type:

pandas.DataFrame

is_bare()[source]

Returns a DataFrame of repo names and whether or not they are bare.

Returns:

DataFrame

has_coverage()[source]

Returns a DataFrame of repo names and whether or not they have a .coverage file that can be parsed.

Returns:

DataFrame

coverage()[source]

Returns a DataFrame with coverage information (if available) for each repo in the project.

If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:

  • repository

  • filename

  • lines_covered

  • total_lines

  • coverage

If it can’t be found or parsed, an empty DataFrame of that form will be returned.

Returns:

DataFrame

file_change_rates(branch=None, limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing basic aggregations of the file change history data and, optionally, test coverage data from a coverage.py .coverage file. The aim is to identify files in the project with abnormal edit rates, or with a high rate of change that does not grow the file's size. A file with a high change rate and poor test coverage is a strong candidate for more tests.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • limit (Optional[int]) – Maximum number of commits to return, None for no limit

  • coverage (bool, optional) – Whether to include coverage data. Defaults to False.

  • days (Optional[int]) – Number of days to return if limit is None

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with file change statistics and optionally coverage data

Return type:

DataFrame

hours_estimate(branch=None, grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, by=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing the estimated hours spent by each committer/author.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • grouping_window (float, optional) – Hours threshold for considering commits part of same session. Defaults to 0.5.

  • single_commit_hours (float, optional) – Hours to assign to single commits. Defaults to 0.5.

  • limit (Optional[int]) – Maximum number of commits to analyze

  • days (Optional[int]) – If provided, only analyze commits from last N days

  • committer (bool, optional) – If True use committer, if False use author. Defaults to True.

  • by (Optional[str]) – How to group results. One of None, ‘committer’, ‘author’

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with hours estimates

Return type:

DataFrame
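To make grouping_window and single_commit_hours concrete, here is a minimal sketch of one common session-based heuristic for a single person's commits. It is an assumed reading of the parameters, not the library's implementation; the function name estimate_hours is hypothetical.

```python
from datetime import datetime


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from one person's commit timestamps."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0.0
    # The first commit opens a session and receives the flat allowance.
    total = single_commit_hours
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).total_seconds() / 3600.0
        if gap <= grouping_window:
            # Same session: count the elapsed time between the two commits.
            total += gap
        else:
            # New session: the work before this commit is unknown,
            # so charge the flat allowance instead.
            total += single_commit_hours
    return total


commits = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 15),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 14, 0),
]
hours = estimate_hours(commits)  # 0.5 + 0.25 + 0.25 + 0.5
```

Under this reading, a larger grouping_window merges more commits into one session, while single_commit_hours sets how much unseen work is credited at the start of each session.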

commit_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing the commit history for all repositories.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • limit (Optional[int]) – Maximum number of commits to return

  • days (Optional[int]) – If provided, only return commits from last N days

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with commit history

Return type:

DataFrame

file_change_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing the file change history for all repositories.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • limit (Optional[int]) – Maximum number of commits to analyze

  • days (Optional[int]) – If provided, only analyze commits from last N days

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with file change history

Return type:

DataFrame

blame(committer=True, by='repository', ignore_globs=None, include_globs=None)[source]

Analyzes blame information across all repositories.

Retrieves blame information from the current HEAD of each repository and aggregates it based on the specified grouping. Can group results by committer/author and either repository or file.

Parameters:
  • committer (bool, optional) – If True, group by committer name. If False, group by author name. Defaults to True.

  • by (str, optional) – How to group the results. One of:
      - ‘repository’: Group by repository (default)
      - ‘file’: Group by individual file

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

A DataFrame with columns depending on the ‘by’ parameter:
If by=’repository’:
  • committer/author (str): Name of the committer/author

  • loc (int): Lines of code attributed to that person

If by=’file’:
  • committer/author (str): Name of the committer/author

  • file (str): File path

  • loc (int): Lines of code attributed to that person in that file

Return type:

pandas.DataFrame

Note

Results are sorted by lines of code in descending order. If both ignore_globs and include_globs are provided, files must match an include pattern and not match any ignore patterns to be included.
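The combined include/ignore rule in the note can be expressed as a small predicate. This sketch uses the standard-library fnmatch module for matching; whether the library matches patterns the same way is an assumption, and keep_file is a hypothetical name.

```python
from fnmatch import fnmatchcase


def keep_file(path, ignore_globs=None, include_globs=None):
    """Return True if path passes the include and ignore filters."""
    # With include patterns present, the file must match at least one.
    if include_globs and not any(fnmatchcase(path, g) for g in include_globs):
        return False
    # The file must not match any ignore pattern.
    if ignore_globs and any(fnmatchcase(path, g) for g in ignore_globs):
        return False
    return True
```

For example, with include_globs=['*.py'] and ignore_globs=['tests/*'], the path 'src/app.py' is kept while 'tests/test_app.py' is dropped.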

file_detail(rev='HEAD', committer=True, ignore_globs=None, include_globs=None)[source]

Provides detailed information about all files in the repositories.

Analyzes each file in the repositories at the specified revision, gathering information about size, ownership, and last modification.

Parameters:
  • rev (str, optional) – Revision to analyze. Defaults to ‘HEAD’.

  • committer (bool, optional) – If True, use committer info. If False, use author. Defaults to True.

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

A DataFrame indexed by (file, repository) with columns:
  • committer/author (str): Name of primary committer/author

  • last_change (datetime): When file was last modified

  • loc (int): Lines of code in file

  • extension (str): File extension

  • directory (str): Directory containing file

  • filename (str): Name of file without path

  • pct_blame (float): Percentage of file attributed to primary committer/author

Return type:

pandas.DataFrame

Note

The primary committer/author is the person responsible for the most lines in the current version of the file.
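The definition of the primary committer/author can be illustrated with plain Python. This is a sketch under assumptions: primary_owner is a hypothetical helper, and returning the share as a fraction between 0 and 1 is a guess at how pct_blame is scaled.

```python
def primary_owner(loc_by_person):
    """Return (name, share) for the person owning the most lines of a file."""
    total = sum(loc_by_person.values())
    if total == 0:
        return None, 0.0
    # The primary owner is whoever is blamed for the most lines.
    owner = max(loc_by_person, key=loc_by_person.get)
    return owner, loc_by_person[owner] / total


owner, share = primary_owner({"ann": 75, "bob": 25})
```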

branches()[source]

Returns information about all branches across repositories.

Retrieves a list of all branches (both local and remote) from each repository in the project directory.

Returns:

A DataFrame with columns:
  • repository (str): Repository name

  • local (bool): Whether the branch is local

  • branch (str): Name of the branch

Return type:

pandas.DataFrame

revs(branch=None, limit=None, skip=None, num_datapoints=None)[source]

Returns a DataFrame containing revision information for all repositories.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • limit (Optional[int]) – Maximum number of revisions to return

  • skip (Optional[int]) – Number of revisions to skip between samples

  • num_datapoints (Optional[int]) – If provided, evenly sample this many revisions

Returns:

DataFrame with revision information

Return type:

DataFrame
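The difference between skip and num_datapoints is easiest to see on a plain list. The sketch below is an assumed interpretation (skip as a stride, num_datapoints as evenly spaced indices including both ends); sample_revisions is a hypothetical name and the library may sample differently.

```python
def sample_revisions(revs, skip=None, num_datapoints=None):
    """Thin out a list of revisions by stride or by a target sample count."""
    revs = list(revs)
    if num_datapoints is not None:
        if num_datapoints >= len(revs):
            return revs
        if num_datapoints <= 1:
            return revs[:1]
        # Evenly spaced indices from the first to the last revision.
        last = len(revs) - 1
        return [revs[i * last // (num_datapoints - 1)] for i in range(num_datapoints)]
    if skip:
        # Assumption: skip acts as a stride over the revision list.
        return revs[::skip]
    return revs
```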

cumulative_blame(branch=None, by='committer', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing cumulative blame information for all repositories.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • by (str, optional) – How to group results. Defaults to ‘committer’.

  • limit (Optional[int]) – Maximum number of revisions to analyze

  • skip (Optional[int]) – Number of revisions to skip between samples

  • num_datapoints (Optional[int]) – If provided, evenly sample this many revisions

  • committer (bool, optional) – If True use committer, if False use author. Defaults to True.

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with cumulative blame information

Return type:

DataFrame

commits_in_tags(**kwargs)[source]

Analyzes each tag and traces backwards from the tag to all commits that make up that tag. This method looks at the commit for the tag, then works backwards to that commit's parents, and so on, until it hits another tag, leaves the time range, or reaches the root commit. The result is returned as a DataFrame.

Parameters:

kwargs – kwargs to pass to Repository.commits_in_tags

Returns:

DataFrame
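The backwards walk can be sketched on a toy commit graph. This is an illustration only: commits_for_tag is a hypothetical function, the graph is a dict mapping each commit to its parents, and the time-range cutoff mentioned above is left out for brevity.

```python
def commits_for_tag(tag_commit, parents, tagged_commits):
    """Collect commits reachable from tag_commit, stopping at other tagged commits."""
    collected = []
    visited = set()
    stack = [tag_commit]
    while stack:
        sha = stack.pop()
        if sha in visited:
            continue
        # Another tag marks the boundary of the previous release.
        if sha != tag_commit and sha in tagged_commits:
            continue
        visited.add(sha)
        collected.append(sha)
        # The root commit has no parents, which ends the walk naturally.
        stack.extend(parents.get(sha, []))
    return collected


# Linear history a <- b <- c <- d, with tags on b and d.
parents = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
tagged = {"b", "d"}
```

Here commits_for_tag("d", parents, tagged) yields the commits introduced since the tag on b, and commits_for_tag("b", parents, tagged) walks all the way to the root.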

tags()[source]

Returns a DataFrame of all tags in origin. The DataFrame will have the columns:

  • repository

  • tag

Returns:

DataFrame

repo_information()[source]

Returns detailed metadata about each repository.

Retrieves various properties and references from each repository’s Git object model.

Returns:

A DataFrame with columns:
  • local_directory (str): Path to the repository

  • branches (list): List of branches

  • bare (bool): Whether it’s a bare repository

  • remotes (list): List of remote references

  • description (str): Repository description

  • references (list): List of all references

  • heads (list): List of branch heads

  • submodules (list): List of submodules

  • tags (list): List of tags

  • active_branch (str): Currently checked out branch

Return type:

pandas.DataFrame

bus_factor(ignore_globs=None, include_globs=None, by='projectd')[source]

Calculates the “bus factor” for the repositories.

The bus factor is a measure of risk based on how concentrated the codebase knowledge is among contributors. It is calculated as the minimum number of contributors whose combined contributions account for at least 50% of the codebase’s lines of code.

Parameters:
  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

  • by (str, optional) – How to calculate the bus factor. One of:
      - ‘projectd’: Calculate for entire project directory (default)
      - ‘repository’: Calculate separately for each repository
      - ‘file’: Calculate separately for each file across all repositories

Returns:

A DataFrame with columns depending on the ‘by’ parameter:
If by=’projectd’:
  • projectd (str): Always ‘projectd’

  • bus factor (int): Bus factor for entire project

If by=’repository’:
  • repository (str): Repository name

  • bus factor (int): Bus factor for that repository

If by=’file’:
  • file (str): File path

  • bus factor (int): Bus factor for that file

  • repository (str): Repository name

Return type:

pandas.DataFrame

Note

A low bus factor (e.g. 1-2) indicates high risk as knowledge is concentrated among few contributors. A higher bus factor indicates knowledge is better distributed.
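The 50% rule described above amounts to a short greedy calculation over lines of code per contributor. The following is a minimal sketch of that definition, not the library's code; the function name and the threshold parameter are illustrative.

```python
def bus_factor(loc_by_contributor, threshold=0.5):
    """Smallest number of contributors covering at least `threshold` of all lines."""
    total = sum(loc_by_contributor.values())
    if total == 0:
        return 0
    covered = 0
    # Take the largest contributors first so the count is minimal.
    ranked = sorted(loc_by_contributor.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(ranked)


concentrated = bus_factor({"ann": 600, "bob": 300, "cy": 100})
spread = bus_factor({"a": 30, "b": 30, "c": 20, "d": 20})
```

In the first case one person owns 60% of the lines, so the bus factor is 1. In the second, no single contributor reaches half, and two are needed.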

punchcard(branch=None, limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame containing punchcard data for all repositories.

Parameters:
  • branch (Optional[str]) – Branch to analyze. Defaults to default_branch if None.

  • limit (Optional[int]) – Maximum number of commits to analyze

  • days (Optional[int]) – If provided, only analyze commits from last N days

  • by (Optional[str]) – How to group results. One of None, ‘committer’, ‘author’

  • normalize (Optional[int]) – If provided, normalize values to this maximum

  • ignore_globs (Optional[List[str]]) – List of glob patterns for files to ignore

  • include_globs (Optional[List[str]]) – List of glob patterns for files to include

Returns:

DataFrame with punchcard data

Return type:

DataFrame
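A punchcard buckets activity by day of week and hour of day. The sketch below counts commits only and then rescales them, which is a simplification: the actual columns returned by punchcard are not listed here, and both function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def punchcard_counts(commit_times):
    """Count commits per (day_of_week, hour_of_day); Monday is day 0."""
    return Counter((t.weekday(), t.hour) for t in commit_times)


def normalize_counts(counts, maximum):
    """Rescale counts so that the busiest bucket equals `maximum`."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return dict(counts)
    return {bucket: value * maximum / peak for bucket, value in counts.items()}


times = [
    datetime(2024, 1, 1, 9, 5),   # Monday
    datetime(2024, 1, 1, 9, 50),  # Monday, same hour
    datetime(2024, 1, 2, 14, 0),  # Tuesday
]
counts = punchcard_counts(times)
scaled = normalize_counts(counts, 10)
```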

__del__()[source]

Cleanup method called when the object is destroyed.

Ensures proper cleanup of all repository objects, including temporary directories for cloned repositories.

_is_valid_git_repo(path)[source]

Helper method to check if a path is a valid git repository.

Parameters:

path (str) – Path to check

Returns:

True if path is a valid git repository, False otherwise

Return type:

bool

_get_repo_name_from_path(path)[source]

Helper method to get repository name from path.

Parameters:

path (str) – Path to repository

Returns:

Repository name (last component of path)

Return type:

str
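Taking the last component of a path has one common pitfall: a trailing slash. A minimal standard-library sketch follows; repo_name_from_path is a hypothetical stand-in, not the helper's actual code.

```python
import os


def repo_name_from_path(path):
    """Return the last path component, ignoring any trailing separator."""
    # normpath strips the trailing slash so basename does not return ''.
    return os.path.basename(os.path.normpath(path))
```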

bulk_fetch_and_warm(fetch_remote=False, warm_cache=False, parallel=True, remote_name='origin', prune=False, dry_run=False, cache_methods=None, **kwargs)[source]

Safely fetch remote changes and pre-warm cache for all repositories.

Performs bulk operations across all repositories in the project directory, optionally fetching from remote repositories and pre-warming caches to improve subsequent analysis performance.

Parameters:
  • fetch_remote (bool, optional) – Whether to fetch from remote repositories. Defaults to False.

  • warm_cache (bool, optional) – Whether to pre-warm repository caches. Defaults to False.

  • parallel (bool, optional) – Use parallel processing when available (joblib). Defaults to True.

  • remote_name (str, optional) – Name of remote to fetch from. Defaults to ‘origin’.

  • prune (bool, optional) – Remove remote-tracking branches that no longer exist. Defaults to False.

  • dry_run (bool, optional) – Show what would be fetched without actually fetching. Defaults to False.

  • cache_methods (Optional[List[str]]) – List of methods to use for cache warming. If None, uses default methods. See Repository.warm_cache for available methods.

  • **kwargs – Additional keyword arguments to pass to cache warming methods.

Returns:

Results with keys:
  • success (bool): Whether the overall operation was successful

  • repositories_processed (int): Number of repositories processed

  • fetch_results (dict): Per-repository fetch results (if fetch_remote=True)

  • cache_results (dict): Per-repository cache warming results (if warm_cache=True)

  • execution_time (float): Total execution time in seconds

  • summary (dict): Summary statistics of the operation

Return type:

dict

Note

This method safely handles errors at the repository level, ensuring that failures in one repository don’t affect processing of others. All operations are read-only and will not modify working directories or current branches.
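Since the return value is a plain dict, it can be turned into a short report without touching the repositories again. The helper below is a hypothetical sketch that relies only on the keys documented above; it assumes fetch_results and cache_results are keyed by repository name and makes no assumption about the per-repository values.

```python
def describe_bulk_result(result):
    """Build a few human-readable lines from a bulk_fetch_and_warm result dict."""
    lines = [
        f"success: {result.get('success')}",
        f"repositories processed: {result.get('repositories_processed', 0)}",
        f"elapsed: {result.get('execution_time', 0.0):.2f}s",
    ]
    for section in ("fetch_results", "cache_results"):
        # A section may be absent or empty when its flag was not enabled.
        for repo_name in sorted(result.get(section) or {}):
            lines.append(f"{section}: {repo_name}")
    return lines
```

A typical call would pass the dict returned by project.bulk_fetch_and_warm(fetch_remote=True, warm_cache=True) and print each line.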

invalidate_cache(keys=None, pattern=None, repositories=None)[source]

Invalidate cache entries across multiple repositories.

Parameters:
  • keys (Optional[List[str]]) – List of specific cache keys to invalidate

  • pattern (Optional[str]) – Pattern to match cache keys (supports * wildcard)

  • repositories (Optional[List[str]]) – List of repository names to target. If None, all repositories are targeted.

Returns:

Results with total invalidated and per-repository breakdown

Return type:

dict
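The effect of a * wildcard pattern can be previewed with the standard-library fnmatch module. This is a sketch under assumptions: the cache key names are invented for illustration, and the library's own matching may differ from fnmatch (which also gives ? and [] special meaning).

```python
from fnmatch import fnmatchcase


def matching_keys(cache_keys, pattern):
    """Return the cache keys that match a wildcard pattern."""
    return [key for key in cache_keys if fnmatchcase(key, pattern)]


# Hypothetical key names, used only to show how the wildcard behaves.
keys = ["commit_history_main", "blame_HEAD", "commit_history_dev"]
selected = matching_keys(keys, "commit_history*")
```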

get_cache_stats()[source]

Get comprehensive cache statistics across all repositories.

Returns:

Aggregated cache statistics and per-repository breakdown

Return type:

dict