API Reference
RepoPeople
- class repo_people.RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
Bases: object
Collects and exports all user data for a given GitHub repository.
Gathers users across every repo role (contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors, dependents), then fetches full GitHub profile details for each unique user via the GitHub API.
Basic usage:
rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
- Parameters:
- VALID_ROLES: Set[str] = {'commit_authors', 'contributors', 'dependents', 'fork_owners', 'issue_authors', 'maintainers', 'pr_authors', 'stargazers', 'watchers'}
- collect_all_usernames(roles=None)
Fetch usernames from each repo role and return them grouped by role.
Returns a dict with keys: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents. Each value is a list of GitHub login strings.
If roles is provided, only the specified roles are collected.
- compare(other, user_data_self, user_data_other)
Compare user populations between this repo and another RepoPeople instance.
Returns a dict with three keys:
- "only_in_self": logins present in this repo but not the other.
- "only_in_other": logins present in the other repo but not this one.
- "in_both": logins that appear in both repos.
Example:
rp_a = RepoPeople("owner", "repo-a", token="ghp_...")
rp_b = RepoPeople("owner", "repo-b", token="ghp_...")
data_a = rp_a.get_users()
data_b = rp_b.get_users()
diff = rp_a.compare(rp_b, data_a, data_b)
print(diff["in_both"])
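The three keys correspond to plain set operations over the logins of each mapping. The following is a minimal sketch of that idea, not the library's code: it assumes both arguments are dicts keyed by login (as get_users() returns) and that sorted lists are an acceptable output. The helper name compare_logins is hypothetical.

```python
def compare_logins(user_data_self, user_data_other):
    """Split two login-keyed mappings into the three groups described above."""
    logins_self = set(user_data_self)
    logins_other = set(user_data_other)
    return {
        "only_in_self": sorted(logins_self - logins_other),
        "only_in_other": sorted(logins_other - logins_self),
        "in_both": sorted(logins_self & logins_other),
    }


# Tiny illustrative inputs; real records would hold full profile dicts.
data_a = {"alice": {}, "bob": {}}
data_b = {"bob": {}, "carol": {}}
diff = compare_logins(data_a, data_b)
print(diff["in_both"])
```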
- export_to_csv(user_data, filename=None)
Write flattened user data to a CSV file in outdir.
List/tuple fields are serialised as semicolon-separated strings. Returns the output path, or an empty string if user_data is empty.
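The flattening rule can be illustrated with the standard csv module. This is a sketch under stated assumptions rather than the method itself: it returns the CSV text instead of writing to outdir, and the helper names flatten_row and users_to_csv are hypothetical.

```python
import csv
import io


def flatten_row(record):
    """Join list/tuple values with ';' so each CSV cell holds a plain string."""
    return {
        key: ";".join(str(item) for item in value)
        if isinstance(value, (list, tuple))
        else value
        for key, value in record.items()
    }


def users_to_csv(user_data):
    """Return CSV text for a login-keyed mapping, or '' when it is empty."""
    if not user_data:
        return ""
    rows = [flatten_row(record) for record in user_data.values()]
    # Union of keys in first-seen order, so sparse records still fit the header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = users_to_csv(
    {"octocat": {"login": "octocat", "public_orgs": ["github", "octo-org"]}}
)
```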
- export_to_json(user_data, filename=None, lines=False)
Write user data dict to a JSON file in outdir. Returns the output path.
Parameters
- lines:
When True, writes one JSON object per line (JSON Lines / JSONL format) instead of a single pretty-printed JSON object. Useful for streaming large datasets to downstream tools. The output filename will end in .jsonl instead of .json unless filename is explicitly set.
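The difference between the two output modes can be sketched with the json module. This is illustrative only: it assumes that in JSONL mode each line carries one user record, which the description above does not spell out, and the helper name dump_users is hypothetical.

```python
import json


def dump_users(user_data, lines=False):
    """Serialise a login-keyed mapping as pretty JSON or as JSON Lines."""
    if lines:
        # Assumption: one user record per line, each line a complete JSON object.
        return "".join(json.dumps(record) + "\n" for record in user_data.values())
    return json.dumps(user_data, indent=2)


users = {"amy": {"login": "amy"}, "bob": {"login": "bob"}}
jsonl_text = dump_users(users, lines=True)
json_text = dump_users(users)
```

A JSONL file can be consumed line by line without loading the whole dataset, which is what makes it convenient for streaming.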
- export_to_markdown(user_data, filename=None, fields=None)
Write user data as a Markdown table to a file in outdir.
Defaults to a concise set of columns; pass fields to override. Returns the output path, or an empty string if user_data is empty.
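A Markdown table of this kind can be built from a list of field names. The sketch below is an assumption-laden illustration: the default column set shown is hypothetical (the real default is not listed here), and to_markdown_table returns the text rather than writing a file.

```python
def to_markdown_table(user_data, fields=("login", "name", "company")):
    """Render a login-keyed mapping as a Markdown table; '' when empty."""
    if not user_data:
        return ""
    header = "| " + " | ".join(fields) + " |"
    divider = "|" + "|".join("---" for _ in fields) + "|"
    rows = []
    for record in user_data.values():
        # Escape pipes so a value cannot break the column layout.
        cells = [str(record.get(field, "")).replace("|", "\\|") for field in fields]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


table = to_markdown_table(
    {"octocat": {"login": "octocat", "name": "The Octocat"}},
    fields=("login", "name"),
)
```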
- get_user_details(usernames, save_each_iteration=False, limit=None, exclude=None, exclude_bots=False, resume=False, verbose=True, include_social_accounts=False, workers=1)
Fetch full GitHub profile details for each username via the GitHub API.
Returns a dict keyed by login containing all available user fields (profile info, counters, orgs, computed metrics, etc.). Users that cannot be fetched are skipped with a warning.
- save_each_iteration: if True, user_details.json is updated after every 10 successful fetches (batched to reduce I/O overhead), so progress is preserved if the process is interrupted.
- limit: if set, only the first N usernames are fetched. Usernames are sorted alphabetically before any limit is applied, so results are deterministic.
- exclude: if provided, those logins are skipped.
- exclude_bots: if True, logins ending in '[bot]' or '-bot' are skipped.
- resume: if True, any logins already present in user_details.json are skipped.
- verbose: if False, per-user fetch messages are suppressed.
- include_social_accounts: if True, an extra REST call fetches each user's linked social accounts (LinkedIn, Mastodon, YouTube, npm, etc.).
- workers: number of concurrent fetches (default 1 = sequential). The maximum supported value is 32; higher values are capped with a warning.
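The filtering rules can be pictured as a small selection step that runs before any fetching. This is a sketch, not the library's implementation: the helper names are hypothetical, and the relative order of filtering and limiting is an assumption (only "sort before limit" is stated above).

```python
MAX_WORKERS = 32  # cap stated in the description above


def select_usernames(usernames, limit=None, exclude=None, exclude_bots=False,
                     already_fetched=None):
    """Apply the documented filters; the order of the steps is an assumption."""
    skip = set(exclude or ()) | set(already_fetched or ())
    selected = []
    for login in sorted(set(usernames)):  # sorted first, so limit is deterministic
        if login in skip:
            continue
        if exclude_bots and login.endswith(("[bot]", "-bot")):
            continue
        selected.append(login)
    return selected if limit is None else selected[:limit]


def clamp_workers(workers):
    """Keep the worker count within 1..MAX_WORKERS."""
    return max(1, min(workers, MAX_WORKERS))


picked = select_usernames(
    ["zed", "dependabot[bot]", "amy", "ci-bot", "bob"],
    limit=2, exclude=["amy"], exclude_bots=True,
)
```

Here already_fetched stands in for the logins found in user_details.json when resume is True.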
- async get_user_details_async(usernames, save_each_iteration=False, limit=None, exclude=None, exclude_bots=False, resume=False, verbose=True, concurrency=10)
Async version of get_user_details using aiohttp.
Fetches raw user profiles directly from the GitHub REST API (GET /users/{login}) using an asyncio.Semaphore to cap simultaneous connections. Supports the same filtering parameters as the sync path.
- Parameters:
save_each_iteration (bool) -- persist user_details.json after each fetch.
limit (int | None) -- cap the number of profiles fetched.
exclude -- logins to skip.
exclude_bots (bool) -- skip logins ending in '[bot]'.
resume (bool) -- skip logins already in user_details.json.
verbose (bool) -- print a line per fetched user.
concurrency (int) -- max simultaneous aiohttp requests.
- Returns:
A dict keyed by login with profile data dicts.
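The semaphore pattern mentioned above can be shown without any network access. In this sketch fake_fetch is a stand-in for the real GET /users/{login} request (a real version would use aiohttp), and fetch_all is a hypothetical helper name, not part of repo_people.

```python
import asyncio


async def fetch_all(logins, fetch_one, concurrency=10):
    """Run fetch_one(login) for every login, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(login):
        # Only `concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return login, await fetch_one(login)

    pairs = await asyncio.gather(*(guarded(login) for login in logins))
    return dict(pairs)


async def fake_fetch(login):
    # Stand-in for the HTTP request; yields control like real I/O would.
    await asyncio.sleep(0)
    return {"login": login}


profiles = asyncio.run(fetch_all(["alice", "bob", "carol"], fake_fetch, concurrency=2))
```

The semaphore bounds the number of in-flight requests while gather still schedules every login up front.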
- get_users(export=False, export_csv=False, save_each_iteration=False, limit=None, roles=None, exclude=None, exclude_bots=False, resume=False, verbose=True, fields=None, include_social_accounts=False, workers=1)
Full pipeline: collect all repo usernames -> fetch user details -> export.
- Steps:
1. Collect usernames from every repo role (contributors, stargazers, …).
2. Deduplicate across all roles.
3. Fetch the full GitHub profile for each unique user.
4. Optionally export to user_details.json / user_details.csv inside outdir.
- Parameters:
export (bool) -- save results to user_details.json when True.
export_csv (bool) -- save results to user_details.csv when True.
save_each_iteration (bool) -- write user_details.json after every successful fetch.
limit (int | None) -- stop after fetching this many user profiles.
roles -- only collect users from these role categories (e.g. ["contributors", "stargazers"]).
exclude -- list of logins to skip entirely.
exclude_bots (bool) -- skip logins ending in '[bot]' and profiles with is_bot=True.
resume (bool) -- load existing user_details.json and skip already-fetched users.
verbose (bool) -- print a line for each user being fetched.
fields -- if set, only these attributes are kept per user in the output (e.g. ["login", "type", "updated_at"]).
include_social_accounts (bool) -- fetch each user's linked social accounts (LinkedIn, Mastodon, YouTube, npm, …). Costs one extra API call per user.
workers (int) -- number of concurrent fetch threads.
- Return type:
UserDataView
Returns a dict keyed by GitHub login with full user profile data. Each record always includes a “roles” key listing the role(s) the user appeared under, regardless of the fields parameter.
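The deduplication step and the always-present "roles" key can be sketched as an inversion of the per-role lists followed by a field filter. This is an illustration under assumptions, not the method's code; merge_roles and keep_fields are hypothetical helper names.

```python
def merge_roles(usernames_by_role):
    """Invert {role: [logins]} into {login: [roles]}, deduplicating logins."""
    roles_by_login = {}
    for role, logins in usernames_by_role.items():
        for login in logins:
            roles = roles_by_login.setdefault(login, [])
            if role not in roles:
                roles.append(role)
    return roles_by_login


def keep_fields(record, roles, fields=None):
    """Apply a fields filter, then attach "roles" regardless of the filter."""
    if fields is None:
        kept = dict(record)
    else:
        kept = {key: record[key] for key in fields if key in record}
    kept["roles"] = list(roles)
    return kept


by_role = {"contributors": ["octocat", "alice"], "stargazers": ["octocat"]}
merged = merge_roles(by_role)
filtered = keep_fields({"login": "octocat", "bio": "hi"}, merged["octocat"], fields=["login"])
```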
- async get_users_async(export=False, export_csv=False, save_each_iteration=False, limit=None, roles=None, exclude=None, exclude_bots=False, resume=False, verbose=True, fields=None, concurrency=10)
Async version of get_users.
Collects usernames synchronously (same as get_users), then fetches all profiles concurrently via aiohttp. Accepts the same parameters as get_users, except that workers is replaced by concurrency and include_social_accounts is not in the signature.
- Parameters:
export (bool) -- save results to user_details.json.
export_csv (bool) -- save results to user_details.csv.
save_each_iteration (bool) -- persist after every fetch.
limit (int | None) -- cap the number of profiles fetched.
roles -- restrict which role categories are collected.
exclude -- logins to skip entirely.
exclude_bots (bool) -- skip bot accounts.
resume (bool) -- skip logins already in user_details.json.
verbose (bool) -- print per-user progress.
fields -- restrict which fields appear in the output dict.
concurrency (int) -- max simultaneous aiohttp connections.
- Return type:
UserDataView
Returns a dict keyed by GitHub login with profile data, including a ‘roles’ key on every record.
- print_markdown(user_data, fields=None)
Print a Markdown table of user data to stdout.
Produces the same table format as export_to_markdown() but writes to stdout instead of a file. Useful for quick inspection in a terminal or notebook. Does nothing when user_data is empty.
- summarise(user_data, top_n=5)
Print and return a summary breakdown of the fetched user data.
Covers: total users, bot vs human split, top locations, top companies, and account age distribution (by quartile). Pass top_n to control how many top locations/companies are shown.
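A breakdown of this shape can be computed with collections.Counter and statistics.quantiles. The sketch below is an assumption about how such a summary might be assembled, not the method's code: it reads the raw location and company fields (the real method may use the normalized ones), it returns a dict without printing, and the key names in the result are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def top_values(records, field, top_n):
    """Most common non-empty values of `field`, as (value, count) pairs."""
    return Counter(r[field] for r in records if r.get(field)).most_common(top_n)


def summarise_users(user_data, top_n=5):
    records = list(user_data.values())
    bots = sum(1 for r in records if r.get("is_bot"))
    ages = [r["account_age_days"] for r in records if "account_age_days" in r]
    return {
        "total": len(records),
        "bots": bots,
        "humans": len(records) - bots,
        "top_locations": top_values(records, "location", top_n),
        "top_companies": top_values(records, "company", top_n),
        # quantiles() needs at least two data points; n=4 yields three cut points.
        "age_quartiles": quantiles(ages, n=4) if len(ages) >= 2 else [],
    }


sample = {
    "amy": {"location": "Berlin", "company": "Acme", "account_age_days": 100},
    "bob": {"location": "Berlin", "company": "", "account_age_days": 400},
    "ci-bot": {"is_bot": True, "location": "", "account_age_days": 900},
}
summary = summarise_users(sample, top_n=1)
```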
Valid roles
RepoPeople.VALID_ROLES == {
"contributors",
"maintainers",
"stargazers",
"watchers",
"issue_authors",
"pr_authors",
"fork_owners",
"commit_authors",
"dependents",
}
Representation
repr(rp) returns a concise summary of the instance:
repr(rp)
# "RepoPeople(owner='alice', repo='myrepo', outdir='outputs/alice_myrepo', valid_roles=9)"
roles key in output
get_users() always adds a "roles" key to
every user record, regardless of any fields= filter. It lists the
role(s) that user appeared under:
user_data = rp.get_users()
user_data["octocat"]["roles"] # e.g. ['contributors', 'stargazers']
Note
"roles" is not a UserSnapshot field —
it is injected by get_users after profile fetching. It will therefore
not appear in the snapshot field table below.
UserSnapshot
- class repo_people.users.UserSnapshot(login: 'str', id: 'Optional[int]', node_id: 'str', type: 'str', name: 'str', company: 'str', location: 'str', email_public: 'str', email_domain: 'str', blog: 'str', blog_host: 'str', twitter: 'str', bio: 'str', avatar_url: 'str', html_url: 'str', hireable: 'bool', site_admin: 'bool', created_at: 'str', updated_at: 'str', followers: 'int', following: 'int', public_repos: 'int', public_gists: 'int', public_orgs: 'List[str]', orgs_public_count: 'int', is_bot: 'bool', last_public_event_at: 'str', has_public_email: 'bool' = False, has_blog: 'bool' = False, has_twitter: 'bool' = False, company_normalized: 'str' = '', location_normalized: 'str' = '', account_age_days: 'int' = 0, followers_following_ratio: 'float' = 0.0, repos_per_year: 'float' = 0.0, recently_active: 'bool' = False, top_languages: 'Optional[List[Tuple[str, int]]]' = None, total_public_stars_sampled: 'Optional[int]' = None, total_public_forks_sampled: 'Optional[int]' = None, ssh_keys_count: 'Optional[int]' = None, gpg_keys_count: 'Optional[int]' = None, starred_repos_sampled: 'Optional[int]' = None, social_accounts: 'Optional[Dict[str, str]]' = None, is_collaborator: 'Optional[bool]' = None, permission_on_repo: 'Optional[str]' = None)[source]¶
Bases: object
- Parameters:
login (str)
id (int | None)
node_id (str)
type (str)
name (str)
company (str)
location (str)
email_public (str)
email_domain (str)
blog (str)
blog_host (str)
twitter (str)
bio (str)
avatar_url (str)
html_url (str)
hireable (bool)
site_admin (bool)
created_at (str)
updated_at (str)
followers (int)
following (int)
public_repos (int)
public_gists (int)
orgs_public_count (int)
is_bot (bool)
last_public_event_at (str)
has_public_email (bool)
has_blog (bool)
has_twitter (bool)
company_normalized (str)
location_normalized (str)
account_age_days (int)
followers_following_ratio (float)
repos_per_year (float)
recently_active (bool)
total_public_stars_sampled (int | None)
total_public_forks_sampled (int | None)
ssh_keys_count (int | None)
gpg_keys_count (int | None)
starred_repos_sampled (int | None)
is_collaborator (bool | None)
permission_on_repo (str | None)
The following table lists every field returned in a UserSnapshot (and in
every dict entry of the user_data mapping produced by
get_users() /
get_user_details()).
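| Field | Type | Description |
|---|---|---|
| login | str | GitHub username. |
| id | Optional[int] | Numeric GitHub user ID. |
| node_id | str | Global node ID (GraphQL). |
| type | str | Account type. |
| name | str | Display name on their profile. |
| company | str | Raw company string from their profile. |
| company_normalized | str | Company with leading … |
| location | str | Raw location string from their profile. |
| location_normalized | str | Location stripped of trailing country codes. |
| email_public | str | Public e-mail address (empty string if not set). |
| email_domain | str | Domain part of email_public. |
| has_public_email | bool | |
| blog | str | Blog / website URL from their profile. |
| blog_host | str | Hostname extracted from blog. |
| has_blog | bool | |
| twitter | str | Twitter / X username from their profile. |
| has_twitter | bool | |
| bio | str | Profile bio text. |
| avatar_url | str | URL of their profile avatar image. |
| html_url | str | URL of their GitHub profile page. |
| hireable | bool | Whether they have marked themselves as hireable. |
| site_admin | bool | Whether they are a GitHub staff/site admin. |
| created_at | str | ISO-8601 timestamp of account creation. |
| updated_at | str | ISO-8601 timestamp of last profile update. |
| followers | int | Number of GitHub followers. |
| following | int | Number of accounts they follow. |
| followers_following_ratio | float | |
| public_repos | int | Number of public repositories. |
| public_gists | int | Number of public gists. |
| public_orgs | List[str] | Logins of their public organisations. |
| orgs_public_count | int | Length of public_orgs. |
| is_bot | bool | |
| last_public_event_at | str | ISO-8601 timestamp of their most recent public event. |
| account_age_days | int | Days since account creation. |
| repos_per_year | float | |
| recently_active | bool | |
| top_languages | Optional[List[Tuple[str, int]]] | Sampled (language, byte-count) pairs from their public repos. |
| total_public_stars_sampled | Optional[int] | Sum of stargazer counts across a sample of their public repos. |
| total_public_forks_sampled | Optional[int] | Sum of fork counts across a sample of their public repos. |
| ssh_keys_count | Optional[int] | Number of public SSH keys on their account. |
| gpg_keys_count | Optional[int] | Number of GPG keys on their account. |
| starred_repos_sampled | Optional[int] | Number of repos they have starred (sampled). |
| is_collaborator | Optional[bool] | Whether they have collaborator access on the queried repository. |
| permission_on_repo | Optional[str] | Their permission level on the queried repo. |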
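Several snapshot fields are derived from others, such as the domain part of the public e-mail, the hostname of the blog URL, and the number of days since account creation. The sketch below shows one plausible way to compute such values with the standard library. It is an assumption, not the library's implementation, and the function names simply mirror the field names for readability.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def email_domain(email_public):
    """Part after the last '@', lower-cased; '' when there is no '@'."""
    return email_public.rsplit("@", 1)[1].lower() if "@" in email_public else ""


def blog_host(blog):
    """Hostname of a blog URL, tolerating values entered without a scheme."""
    if not blog:
        return ""
    # urlparse only fills the host when a scheme or leading '//' is present.
    url = blog if "//" in blog else "//" + blog
    return urlparse(url).hostname or ""


def account_age_days(created_at, now=None):
    """Whole days between an ISO-8601 creation timestamp and `now` (UTC)."""
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - created).days


age = account_age_days(
    "2020-01-01T00:00:00Z", now=datetime(2020, 1, 11, tzinfo=timezone.utc)
)
```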
GitHubUserInfo
- class repo_people.users.GitHubUserInfo(gh=None, username=None, user_obj=None, token=None)
Bases: object
Wrapper around a GitHub user (PyGithub NamedUser) that exposes cached, easy-to-use accessors and a single 'snapshot()' to dump all attributes as a dataclass.
- Parameters:
- snapshot(*, include_langs=False, include_star_fork_sums=False, langs_max_repos=50, sums_max_repos=50, include_keys_counts=False, include_star_sample=False, include_social_accounts=False, recent_days=90, repo=None)
Collects all lightweight fields + optional aggregates into a dataclass.
Parameters
- include_langs:
Collect top-3 languages from the user’s repositories. Expensive — makes one API call per repository up to langs_max_repos. Off by default.
- include_star_fork_sums:
Sum stars and forks across the user’s repositories. Expensive — same cost as include_langs. Off by default.
Export Module
- repo_people.export.export_commit_authors(owner, repo, token, outdir, return_data=True, export_csv=False)
Export all unique commit authors (usernames) for a repository.
Pages through /commits and collects unique author.login values, so there is no hard cap on the number of results returned. Always returns the list of logins; the return_data parameter is kept for backwards compatibility but is ignored.
Note
export_contributors and export_commit_authors walk the same /commits endpoint and return equivalent results. They are aliases of each other.
- repo_people.export.export_contributors(owner, repo, token, outdir, return_data=True, export_csv=False)
Export all unique contributors (usernames) for a repository.
Bypasses the /contributors endpoint's hard 100-item cap by paging through /commits and collecting unique author.login values, the same commit-walk approach used by export_commit_authors. Both functions return equivalent sets of usernames and are aliases of each other.
Always returns the list of logins; return_data is kept for backwards compatibility but is ignored.
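The commit-walk approach amounts to iterating over pages of commit objects and keeping each author.login the first time it appears. The sketch below works on already-fetched pages (lists of dicts shaped like GitHub's /commits response items) so that it runs offline; unique_commit_authors is a hypothetical helper name, and the sample data is invented for illustration.

```python
def unique_commit_authors(pages):
    """Collect unique author logins from an iterable of /commits result pages."""
    seen = set()
    logins = []
    for page in pages:
        for commit in page:
            # "author" is null when the commit e-mail maps to no GitHub account.
            author = commit.get("author") or {}
            login = author.get("login")
            if login and login not in seen:
                seen.add(login)
                logins.append(login)
    return logins


pages = [
    [{"author": {"login": "alice"}}, {"author": None}],
    [{"author": {"login": "bob"}}, {"author": {"login": "alice"}}],
]
authors = unique_commit_authors(pages)
```

Because the walk continues until the pages run out, the result size is bounded only by the repository's history.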
- repo_people.export.export_dependents(owner, repo, outdir, return_data=True, export_csv=False, limit=None, sleep=1.0)
Scrape and export the list of dependent users (usernames) for a repo.
Always returns the list of logins; return_data is kept for backwards compatibility but is ignored. Uses exponential back-off on non-200 responses.
Parameters
- limit:
Maximum number of unique dependent repositories to collect before stopping. None (default) collects all pages. Pass 0 for an empty result.
- sleep:
Base sleep interval (seconds) between pages. Doubles on each failed page request up to a maximum of 60 seconds.
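The back-off rule (double on each failure, cap at 60 seconds) produces a simple delay sequence. The generator below sketches that sequence only. It does not perform requests, and whether the delay resets after a successful page is not stated above, so no reset is modelled. The name backoff_delays is hypothetical.

```python
from itertools import islice


def backoff_delays(base=1.0, max_delay=60.0):
    """Yield base, 2*base, 4*base, ... with every value capped at max_delay."""
    delay = base
    while True:
        yield min(delay, max_delay)
        delay = min(delay * 2, max_delay)


# Delays that eight consecutive failed page requests would wait through.
first_delays = list(islice(backoff_delays(1.0), 8))
```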
- repo_people.export.export_fork_owners(owner, repo, token=None, outdir=None, return_data=True, export_csv=False)
Export the owners of all forks for a repository to CSV and/or return as list.
- repo_people.export.export_issue_authors(owner, repo, token, outdir, return_data=True, export_csv=False)
- repo_people.export.export_maintainers(owner, repo, token, outdir, skip_codeowners, skip_collaborators, return_data=True, export_csv=False)
Export maintainers for a repository to CSV and/or return as list.
Collects maintainers from two sources (both can be toggled off):
- CODEOWNERS file: parses @-mentions from .github/CODEOWNERS, docs/CODEOWNERS, or CODEOWNERS.
- Collaborators API: includes users with admin, maintain, or push permissions.
Deduplicates across both sources before returning.
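Parsing @-mentions out of a CODEOWNERS file can be sketched with a regular expression. This is an illustration under assumptions, not the library's parser: it keeps team mentions such as org/team as they are, ignores e-mail owners, treats everything after '#' as a comment, and codeowner_mentions is a hypothetical helper name.

```python
import re

# "@user" or "@org/team" at line start or after whitespace, so the "@" inside
# an e-mail address is not treated as a mention.
MENTION = re.compile(r"(?<!\S)@([A-Za-z0-9][A-Za-z0-9-]*(?:/[A-Za-z0-9_.-]+)?)")


def codeowner_mentions(text):
    """Return unique mentions from CODEOWNERS text, in first-seen order."""
    owners = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        for name in MENTION.findall(line):
            if name not in owners:
                owners.append(name)
    return owners


sample = "# owners\n*.py @alice @org/backend\ndocs/ bob@example.com @alice\n"
mentions = codeowner_mentions(sample)
```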
- repo_people.export.export_pr_authors(owner, repo, token, outdir, return_data=True, export_csv=False)
- repo_people.export.export_stargazers(owner, repo, token, outdir, return_data=True, export_csv=False)
- repo_people.export.export_watchers(owner, repo, token, outdir, return_data=True, export_csv=False)
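Each function returns a list of strings (usernames) for one specific role. Most of the nine functions share the same signature:
export_<role>(owner, repo, token, outdir, return_data=True, export_csv=False)
The exceptions are export_maintainers, which also takes skip_codeowners and skip_collaborators; export_dependents, which takes no token and adds limit and sleep; and export_fork_owners, where token and outdir default to None.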
| Function | Returns |
|---|---|
| export_contributors | Usernames of repository contributors. |
| export_maintainers | CODEOWNERS + collaborator usernames. |
| export_stargazers | Usernames who have starred the repository. |
| export_watchers | Usernames watching the repository. |
| export_issue_authors | Usernames who have opened issues. |
| export_pr_authors | Usernames who have opened pull requests. |
| export_fork_owners | Usernames who have forked the repository. |
| export_commit_authors | Usernames extracted from commit history. |
| export_dependents | Usernames of repositories that depend on this one. |