Use Cases and Examples
Git-Pandas provides a powerful interface for analyzing Git repositories using pandas DataFrames. This guide demonstrates common use cases and provides practical examples.
Basic Repository Analysis
Repository Attributes
Get basic information about a repository:
from gitpandas import Repository
repo = Repository('/path/to/repo')
# Get repository name
print(repo.repo_name)
# Check if repository is bare
print(repo.is_bare())
# Get all tags
print(repo.tags())
# Get all branches
print(repo.branches())
# Get all revisions
print(repo.revs())
# Get blame information
print(repo.blame(include_globs=['*.py']))
Commit History Analysis
Analyze commit patterns and history:
# Get commit history
commits_df = repo.commit_history()
# Get file change history
changes_df = repo.file_change_history()
# Filter by file extension
python_changes = repo.file_change_history(include_globs=['*.py'])
# Filter by directory
src_changes = repo.file_change_history(include_globs=['src/*'])
# Get commits in tags
tag_commits = repo.commits_in_tags()
Project-Level Analysis
Multiple Repository Analysis
Analyze multiple repositories simultaneously:
from gitpandas import ProjectDirectory
# Create project from multiple repositories
project = ProjectDirectory([
'git://github.com/user/repo1.git',
'git://github.com/user/repo2.git'
])
# Get repository information
print(project.repo_information())
# Calculate bus factor
print(project.bus_factor())
# Get file change history
print(project.file_change_history())
# Get blame information
print(project.blame())
Advanced Analysis
Cumulative Blame Analysis
Track code ownership over time:
# Get cumulative blame
blame_df = repo.cumulative_blame()
# Plot cumulative blame using pandas plotting
import matplotlib.pyplot as plt
blame_df.plot(x='date', y='loc', title='Cumulative Blame Over Time')
plt.show()
Bus Factor Analysis
Analyze project sustainability:
# Calculate bus factor for repository
bus_factor = repo.bus_factor()
# Get detailed blame information
blame_df = repo.blame(by='file') # Get file-level blame details
# Analyze ownership patterns
ownership_patterns = repo.blame(committer=True, by='repository')
Performance Optimization
Using Caching
Optimize performance with caching:
from gitpandas import Repository
from gitpandas.cache import EphemeralCache, RedisDFCache
# Use in-memory caching
cache = EphemeralCache()
repo = Repository('/path/to/repo', cache_backend=cache)
# Or use Redis for persistent caching
redis_cache = RedisDFCache(
host='localhost',
port=6379,
db=12,
ttl=3600 # Cache entries expire after 1 hour
)
repo = Repository('/path/to/repo', cache_backend=redis_cache)
Visualization Examples
Commit Analysis
Visualize commit patterns:
# Get commit history
commit_df = repo.commit_history()
# Plot commits over time using pandas
commit_df.resample('D').size().plot(
kind='bar',
title='Commits per Day'
)
plt.show()
File Change Analysis
Visualize file changes:
# Get file change history
changes_df = repo.file_change_history()
# Plot changes over time using pandas
changes_df.groupby('filename')['insertions'].sum().plot(
kind='bar',
title='Lines Added by File'
)
plt.show()
Best Practices
Use caching for expensive operations like blame analysis
Filter data early using include_globs/ignore_globs
Leverage pandas operations for analysis
Consider memory usage with large repositories
Use appropriate branch names (main/master)
Handle repository cleanup properly when using remote repositories
For more examples and detailed API documentation, see the Repository and Project Directory pages.