Usage¶
Installation¶
Install from PyPI:
pip install repo-people
Or with Poetry:
poetry add repo-people
Quick Start¶
The simplest end-to-end call collects users from all nine role categories, fetches their full GitHub profiles, and returns a dictionary keyed by username:
from repo_people import RepoPeople
rp = RepoPeople("octocat", "Hello-World", token="ghp_...")
user_data = rp.get_users()
# {'octocat': {'login': 'octocat', 'followers': 9001, ...}, ...}
Authentication¶
A GitHub personal-access token is strongly recommended. Without one, the GitHub API rate-limit is only 60 requests per hour. With a token it rises to 5 000 requests per hour.
rp = RepoPeople("owner", "repo", token="ghp_YOUR_TOKEN_HERE")
Alternatively, export the token as an environment variable and pass it in:
import os
from repo_people import RepoPeople
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
Tip — store your token in a .env file and load it with python-dotenv:
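import os
from dotenv import load_dotenv
from repo_people import RepoPeople
load_dotenv()  # reads the key=value pairs in .env into the process environment
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])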
Token Validation¶
The token is validated immediately when RepoPeople is instantiated. If the
token is invalid or expired, a ConnectionError is raised right away with a
descriptive message rather than failing silently on the first API call:
try:
    rp = RepoPeople("owner", "repo", token="invalid_token")
except ConnectionError as e:
    print(e)  # GitHub connection failed — verify your token. (...)
Input Validation¶
The owner and repo parameters are validated at construction time.
Both must contain only [A-Za-z0-9_.-] characters. Any other characters
raise a ValueError immediately:
try:
    rp = RepoPeople("owner with spaces", "repo")
except ValueError as e:
    print(e)  # Invalid owner: 'owner with spaces'. Must match [A-Za-z0-9_.-]+
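The rule above can be approximated with one anchored regular expression. This is a minimal sketch rather than the package's own code: the helper name validate_name is hypothetical, and only the character class and the error text are taken from the description above.

```python
import re

# Character class taken from the documentation above; the helper is hypothetical.
_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")


def validate_name(kind: str, value: str) -> str:
    """Return value unchanged if it is an acceptable owner/repo name, else raise."""
    # fullmatch anchors the pattern at both ends, so one bad character fails.
    if not _NAME_RE.fullmatch(value):
        raise ValueError(f"Invalid {kind}: {value!r}. Must match [A-Za-z0-9_.-]+")
    return value


print(validate_name("owner", "octocat"))
try:
    validate_name("owner", "owner with spaces")
except ValueError as exc:
    print(exc)
```

Checking before any network call means a typo in the owner or repo name costs no API requests.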
Choosing an Output Directory¶
By default, exported files are written to the current working directory. Use
outdir to specify a different location:
rp = RepoPeople("owner", "repo", token="...", outdir="/path/to/output")
Filtering by Role¶
The roles parameter accepts a list of one or more of the nine valid roles.
All nine roles are collected when roles is not specified:
# Only contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])
Available roles:
* contributors
* maintainers (CODEOWNERS + collaborators)
* stargazers
* watchers
* issue_authors
* pr_authors
* fork_owners
* commit_authors
* dependents
# Inspect the full set at runtime
print(RepoPeople.VALID_ROLES)
Role names are validated before any API calls are made. Passing an
unrecognised name raises ValueError immediately, listing every invalid name
and the full set of valid ones:
rp.get_users(roles=["typo_role"])
# ValueError: Invalid role(s): ['typo_role'].
# Valid roles are: ['commit_authors', 'contributors', ...]
A bare string is also accepted and treated as a single-item list:
user_data = rp.get_users(roles="contributors")
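The behaviour described in this section (default to every role, accept a bare string, reject unknown names up front) could be sketched as follows. The function normalise_roles is a hypothetical illustration, not part of the package; the role names are the nine listed above.

```python
# The nine role names listed in the documentation above.
VALID_ROLES = frozenset({
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
})


def normalise_roles(roles=None):
    """Hypothetical helper: turn the roles argument into a validated list."""
    if roles is None:
        return sorted(VALID_ROLES)      # nothing specified: collect every role
    if isinstance(roles, str):
        roles = [roles]                 # a bare string becomes a one-item list
    invalid = [r for r in roles if r not in VALID_ROLES]
    if invalid:
        raise ValueError(
            f"Invalid role(s): {invalid}. Valid roles are: {sorted(VALID_ROLES)}"
        )
    return list(roles)


print(normalise_roles("contributors"))
print(len(normalise_roles()))
```

Collecting all invalid names before raising lets the caller fix every typo in one pass.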
Skipping CODEOWNERS or Collaborators¶
When collecting maintainers the package looks up both the CODEOWNERS
file and the repository’s collaborator list. Either source can be disabled:
rp = RepoPeople("owner", "repo", token="...",
                skip_codeowners=True,
                skip_collaborators=True)
Limiting the Number of Results¶
limit caps the total number of user profiles fetched. Useful for quickly
testing on large repositories:
user_data = rp.get_users(limit=50)
Excluding Users¶
Pass a list of usernames as exclude to skip specific accounts:
user_data = rp.get_users(exclude=["dependabot", "github-actions[bot]"])
To automatically skip all bot accounts (those whose GitHub type field is
"Bot" or whose login matches common bot patterns):
user_data = rp.get_users(exclude_bots=True)
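A bot check along these lines might combine the two signals mentioned above. The sketch below is illustrative only: the documentation does not list the login patterns the package uses, so the regular expression here (logins ending in "[bot]" or "-bot") is an assumption, and looks_like_bot is a hypothetical name.

```python
import re

# Assumed pattern: the real list of "common bot patterns" is not documented here.
BOT_LOGIN_RE = re.compile(r"(\[bot\]|-bot)$", re.IGNORECASE)


def looks_like_bot(profile: dict) -> bool:
    """Return True if the profile's type is "Bot" or its login looks bot-like."""
    if profile.get("type") == "Bot":
        return True
    return bool(BOT_LOGIN_RE.search(profile.get("login", "")))


profiles = [
    {"login": "github-actions[bot]", "type": "Bot"},
    {"login": "renovate-bot", "type": "User"},
    {"login": "octocat", "type": "User"},
]
humans = [p["login"] for p in profiles if not looks_like_bot(p)]
print(humans)
```

Accounts that a pattern like this misses can still be removed by listing them explicitly in exclude.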
Incremental Fetching (Resume Support)¶
For large repositories the fetch can take a long time. Use
save_each_iteration=True to persist progress in batches of 10 user
profiles. If the process is interrupted, restart with resume=True to
pick up from where you left off:
# First run: progress is saved in batches of 10 profiles
user_data = rp.get_users(save_each_iteration=True, export=True)
# Restart after interruption — skips users already in the output file
user_data = rp.get_users(save_each_iteration=True, export=True, resume=True)
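To make the save-and-resume idea concrete, here is a small self-contained sketch. It is not the package's implementation: fetch_profile is a stand-in for a real API call, fetch_with_resume is a hypothetical name, and the on-disk layout (a JSON object keyed by username) is an assumption. Only the batch size of 10 comes from the text above.

```python
import json
import tempfile
from pathlib import Path

BATCH_SIZE = 10  # batch size stated in the documentation above


def fetch_profile(login: str) -> dict:
    """Stand-in for a real API call (hypothetical)."""
    return {"login": login}


def fetch_with_resume(logins, out_path: Path, resume: bool = False) -> dict:
    # Load previously saved profiles only when resuming.
    done = {}
    if resume and out_path.exists():
        done = json.loads(out_path.read_text(encoding="utf-8"))
    pending = [name for name in logins if name not in done]
    for index, login in enumerate(pending, start=1):
        done[login] = fetch_profile(login)
        # Persist after each full batch and once more at the very end.
        if index % BATCH_SIZE == 0 or index == len(pending):
            out_path.write_text(json.dumps(done, indent=2), encoding="utf-8")
    return done


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "user_details.json"
    first = fetch_with_resume([f"user{i}" for i in range(15)], path)
    second = fetch_with_resume([f"user{i}" for i in range(25)], path, resume=True)
    print(len(first), len(second))
```

In the second call only the ten usernames missing from the file are fetched; the other fifteen are read back from disk.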
Filtering the Output Fields¶
By default all 30+ fields are included for every user. Pass a list of field
names to fields to limit what appears in exports and the returned dict:
user_data = rp.get_users(
    fields=["login", "name", "location", "followers", "public_repos"]
)
A bare string is also accepted and is treated as a single-item list:
user_data = rp.get_users(fields="login")
Field names are validated against UserSnapshot
before any API calls are made. Passing an unrecognised name raises a
ValueError immediately, listing every invalid name and the full set of
valid ones:
rp.get_users(fields=["login", "typo_field"])
# ValueError: Invalid field(s): ['typo_field'].
# Valid fields are: ['account_age_days', 'avatar_url', 'bio', ...]
See the API Reference page for the complete field list.
Roles in Output Records¶
Every user dict returned by get_users always has a "roles" key listing
the role(s) the user appeared under, regardless of any fields= filter:
user_data = rp.get_users(roles=["contributors", "stargazers"], fields=["login"])
print(user_data["octocat"])
# {'login': 'octocat', 'roles': ['contributors', 'stargazers']}
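The interaction between fields= and the always-present "roles" key can be pictured with a small projection function. This is a sketch under stated assumptions: filter_record is a hypothetical helper, and the real package may apply the filter differently.

```python
def filter_record(record: dict, fields=None) -> dict:
    """Hypothetical helper: keep only the requested fields, plus 'roles'."""
    if fields is None:
        return dict(record)             # no filter: return a copy of everything
    if isinstance(fields, str):
        fields = [fields]               # a bare string becomes a one-item list
    kept = {name: record[name] for name in fields if name in record}
    kept["roles"] = record.get("roles", [])   # roles survive any filter
    return kept


full = {
    "login": "octocat",
    "followers": 9001,
    "roles": ["contributors", "stargazers"],
}
print(filter_record(full, fields="login"))
```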
Exporting Results¶
Export to JSON¶
Pass export=True to write a JSON file automatically. The file is saved to
outdir (or the current directory) as user_details.json:
user_data = rp.get_users(export=True)
To export manually after the fact:
rp.export_to_json(user_data, filename="my_output.json")
Export to CSV¶
Pass export_csv=True to write a CSV file:
user_data = rp.get_users(export_csv=True)
Or manually:
rp.export_to_csv(user_data, filename="my_output.csv")
Export to Markdown¶
Generate a Markdown table with (optionally) a subset of fields:
rp.export_to_markdown(
    user_data,
    filename="users.md",
    fields=["login", "name", "location", "followers"]
)
Both export=True and export_csv=True can be combined:
user_data = rp.get_users(export=True, export_csv=True)
Analysis Helpers¶
summarise¶
Returns aggregate statistics for the collected user data:
stats = rp.summarise(user_data, top_n=5)
# {
# 'total_users': 134,
# 'users_with_email': 42,
# 'users_with_blog': 61,
# 'top_locations': [('San Francisco', 18), ...],
# 'top_companies': [('GitHub', 9), ...],
# 'top_languages': [('Python', 54), ...],
# ...
# }
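If you need a statistic that summarise does not provide, the "top N" style of aggregate is straightforward to compute yourself from the returned dictionary. The sketch below uses collections.Counter; top_values is a hypothetical helper and the sample data is invented for illustration.

```python
from collections import Counter


def top_values(user_data: dict, field: str, top_n: int = 5):
    """Hypothetical helper: most common non-empty values of one field."""
    counts = Counter(
        profile[field] for profile in user_data.values() if profile.get(field)
    )
    return counts.most_common(top_n)


sample = {
    "alice": {"location": "Berlin"},
    "bob": {"location": "Berlin"},
    "carol": {"location": "Lagos"},
    "dave": {"location": None},     # empty values are ignored
}
print(top_values(sample, "location", top_n=2))
# [('Berlin', 2), ('Lagos', 1)]
```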
top_users¶
Returns the top n users ranked by a given field:
# Top 10 by follower count
leaders = rp.top_users(user_data, n=10, by="followers")
for u in leaders:
    print(u["login"], u["followers"])
# Top 5 by number of public repos
prolific = rp.top_users(user_data, n=5, by="public_repos")
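For rankings on your own terms, the same kind of selection can be done directly on the returned dictionary with heapq.nlargest. This is an illustrative sketch, not the package's code: top_by is a hypothetical name, and treating missing or None values as 0 is an assumption.

```python
import heapq


def top_by(user_data: dict, n: int = 10, by: str = "followers"):
    """Hypothetical helper: the n profiles with the largest value of `by`."""
    # Missing or None values count as 0 so that they rank last.
    return heapq.nlargest(n, user_data.values(), key=lambda u: u.get(by) or 0)


sample = {
    "alice": {"login": "alice", "followers": 10},
    "bob": {"login": "bob", "followers": 250},
    "carol": {"login": "carol", "followers": None},
}
print([u["login"] for u in top_by(sample, n=2)])
# ['bob', 'alice']
```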
Using the Lower-Level API¶
The two-step pipeline is available directly if you need more control:
from repo_people import RepoPeople
rp = RepoPeople("owner", "repo", token="...")
# Step 1 — collect all usernames grouped by role
all_usernames = rp.collect_all_usernames(roles=["contributors", "stargazers"])
# {'contributors': ['alice', 'bob'], 'stargazers': ['carol', ...], ...}
# Flatten to a unique set
unique = list({u for users in all_usernames.values() for u in users})  # de-duplicated list
# Step 2 — fetch full profiles
user_data = rp.get_user_details(
    unique,
    limit=100,
    exclude_bots=True,
    verbose=True,
)
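Flattening the role mapping discards which role each user came from. If you want to keep that information alongside the lower-level calls, one option is to invert the mapping first. The helper below is a hypothetical sketch; it assumes the role-to-usernames shape shown in the comment above.

```python
from collections import defaultdict


def roles_by_user(usernames_by_role: dict) -> dict:
    """Hypothetical helper: invert {role: [logins]} into {login: [roles]}."""
    inverted = defaultdict(list)
    for role, logins in usernames_by_role.items():
        for login in logins:
            if role not in inverted[login]:     # guard against duplicate logins
                inverted[login].append(role)
    return dict(inverted)


all_usernames = {
    "contributors": ["alice", "bob"],
    "stargazers": ["alice", "carol"],
}
print(roles_by_user(all_usernames))
```

The keys of the result are also the unique usernames, so the inverted mapping can replace the flattening step.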
Rate-Limit Tips¶
* Always use a token: it gives you 5 000 requests/hour vs 60 unauthenticated.
* Use limit during development to avoid exhausting the rate limit on large repos.
* Use exclude_bots=True to skip bot accounts that do not need enrichment.
* Use save_each_iteration=True on very large repos so partial progress is persisted (every 10 profiles) if the rate limit is hit mid-run.
* resume=True allows you to continue after hitting a rate limit without re-fetching profiles already collected.
* A progress line is printed automatically every 50 users and at the end of the fetch, showing the current rate-limit headroom:
[Progress: 50/134 | Rate limit: 4820/5000 remaining, resets in 42m]
Any users that fail to fetch are collected and a summary is printed at the end rather than stopping the whole run:
Skipped 2 user(s): ['ghost', 'deleted-account']
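A quick back-of-the-envelope calculation shows why the token matters on large repositories. The sketch below assumes one API request per profile, which is a simplification; the package may need more requests per user, so treat the result as a lower bound. The function name hours_needed is hypothetical.

```python
import math


def hours_needed(n_profiles: int, requests_per_profile: int = 1,
                 hourly_limit: int = 5000) -> int:
    """Number of hourly rate-limit windows the fetch spans, rounded up."""
    total_requests = n_profiles * requests_per_profile
    return math.ceil(total_requests / hourly_limit)


print(hours_needed(12_000))                    # with a token: 3
print(hours_needed(12_000, hourly_limit=60))   # unauthenticated: 200
```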
Concurrent Fetching¶
The workers parameter controls how many profiles are fetched in parallel
(default 1 = sequential). Increasing it reduces wall-clock time on repos
with many users:
# Fetch up to 8 profiles simultaneously
user_data = rp.get_users(workers=8)
Or pass it directly to the lower-level method:
user_data = rp.get_user_details(logins, workers=4)
Note
Concurrent requests still count against your rate limit. workers
reduces wall-clock time by overlapping requests, not by increasing the
total request budget.
The maximum value for workers is 32. If a higher value is passed,
it is capped to 32 and a UserWarning is emitted.
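The capping rule and the parallel fetch can be illustrated with the standard library's thread pool. This is a sketch, not the package's implementation: fetch_profile stands in for a real API call, fetch_all is a hypothetical name, and the warning text is invented. Only the limit of 32 and the UserWarning category come from the note above.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32  # upper bound stated in the documentation above


def fetch_profile(login: str) -> dict:
    """Stand-in for a real API call (hypothetical)."""
    return {"login": login}


def fetch_all(logins, workers: int = 1) -> dict:
    if workers > MAX_WORKERS:
        warnings.warn(f"workers={workers} capped to {MAX_WORKERS}", UserWarning)
        workers = MAX_WORKERS
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with logins.
        profiles = list(pool.map(fetch_profile, logins))
    return {p["login"]: p for p in profiles}


print(fetch_all(["alice", "bob"], workers=4))
```

Threads suit this workload because each worker spends most of its time waiting on the network rather than computing.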
Async Fetching¶
For very high concurrency use the async pipeline. This requires the optional
aiohttp dependency:
pip install "repo-people[async]"
Then:
import asyncio
user_data = asyncio.run(rp.get_users_async(concurrency=10))
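A concurrency limit in an async pipeline is commonly enforced with a semaphore. The sketch below shows that general pattern with the standard library only; it is not how get_users_async is known to be implemented. fetch_profile is a stand-in that does no network I/O, and fetch_all is a hypothetical name.

```python
import asyncio


async def fetch_profile(login: str) -> dict:
    """Stand-in for a real HTTP request (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"login": login}


async def fetch_all(logins, concurrency: int = 10) -> dict:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(login: str) -> dict:
        async with semaphore:  # at most `concurrency` fetches run at once
            return await fetch_profile(login)

    profiles = await asyncio.gather(*(bounded(name) for name in logins))
    return {p["login"]: p for p in profiles}


result = asyncio.run(fetch_all(["alice", "bob", "carol"], concurrency=2))
print(sorted(result))
```

As with workers, a higher concurrency value shortens wall-clock time but does not enlarge the hourly request budget.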
Exporting as JSON Lines (JSONL)¶
Pass lines=True to export_to_json() to write
one JSON object per line (JSONL / JSON Lines format). This is useful for
streaming large outputs:
path = rp.export_to_json(user_data, lines=True)
# Writes <outdir>/<prefix>user_details.jsonl
You can also specify a custom filename:
path = rp.export_to_json(user_data, filename="users.jsonl", lines=True)
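A JSONL file can be read back one record at a time without loading the whole file into memory. The sketch below writes a tiny sample file and streams it; read_jsonl is a hypothetical helper, and the assumption that each line holds one user record is based on the description above rather than on the package's exact output.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path):
    """Yield one decoded object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "users.jsonl"
    # Assumed layout: one user record per line.
    sample.write_text('{"login": "alice"}\n{"login": "bob"}\n', encoding="utf-8")
    logins = [record["login"] for record in read_jsonl(sample)]

print(logins)
# ['alice', 'bob']
```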