API Reference

RepoPeople

class repo_people.RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)[source]

Bases: object

Collects and exports all user data for a given GitHub repository.

Gathers users across every repo role (contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors, dependents), then fetches full GitHub profile details for each unique user via the GitHub API.

Basic usage:

rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
Parameters:
  • owner (str)

  • repo (str)

  • token (str | None)

  • outdir (str | None)

  • skip_codeowners (bool)

  • skip_collaborators (bool)

VALID_ROLES: Set[str] = {'commit_authors', 'contributors', 'dependents', 'fork_owners', 'issue_authors', 'maintainers', 'pr_authors', 'stargazers', 'watchers'}
collect_all_usernames(roles=None)[source]

Fetch usernames from each repo role and return them grouped by role.

Returns a dict with keys: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents. Each value is a list of GitHub login strings.

If roles is provided, only the specified roles are collected.

Parameters:

roles (List[str] | None)

Return type:

Dict[str, List[str]]
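
As a rough illustration of how the role-grouped result can be consumed, the sketch below inverts a dict of that shape into a login -> roles mapping, which is the deduplication step the full pipeline performs. The helper name invert_roles and the sample data are hypothetical and not part of the library.

```python
from typing import Dict, List


def invert_roles(by_role: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each login to the sorted list of roles it appeared under."""
    roles_by_login: Dict[str, List[str]] = {}
    for role, logins in by_role.items():
        for login in logins:
            seen = roles_by_login.setdefault(login, [])
            if role not in seen:  # a login may be listed twice under one role
                seen.append(role)
    return {login: sorted(roles) for login, roles in sorted(roles_by_login.items())}


# Hypothetical sample shaped like the dict returned by collect_all_usernames().
sample = {
    "contributors": ["octocat", "hubot"],
    "stargazers": ["octocat", "monalisa"],
}
roles_by_login = invert_roles(sample)
print(roles_by_login["octocat"])
```

Each unique login appears once in the result, however many roles it was collected under.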

compare(other, user_data_self, user_data_other)[source]

Compare user populations between this repo and another RepoPeople instance.

Returns a dict with three keys:

  • "only_in_self" — logins present in this repo but not the other.

  • "only_in_other" — logins present in the other repo but not this one.

  • "in_both" — logins that appear in both repos.

Example:

rp_a = RepoPeople("owner", "repo-a", token="ghp_...")
rp_b = RepoPeople("owner", "repo-b", token="ghp_...")
data_a = rp_a.get_users()
data_b = rp_b.get_users()
diff = rp_a.compare(rp_b, data_a, data_b)
print(diff["in_both"])
Return type:

Dict[str, object]
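
The three keys correspond to plain set operations on the logins of the two user_data mappings. The sketch below is a minimal stand-in, assuming each value is a sorted list of logins; the actual method may return richer objects, and compare_logins is a hypothetical name.

```python
from typing import Dict, List


def compare_logins(
    user_data_self: Dict[str, dict], user_data_other: Dict[str, dict]
) -> Dict[str, List[str]]:
    """Split the logins of two user_data mappings into three groups."""
    self_logins = set(user_data_self)
    other_logins = set(user_data_other)
    return {
        "only_in_self": sorted(self_logins - other_logins),
        "only_in_other": sorted(other_logins - self_logins),
        "in_both": sorted(self_logins & other_logins),
    }


# Hypothetical inputs keyed by login, as produced by get_users().
data_a = {"octocat": {}, "hubot": {}}
data_b = {"octocat": {}, "monalisa": {}}
diff = compare_logins(data_a, data_b)
print(diff["in_both"])
```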

export_to_csv(user_data, filename=None)[source]

Write flattened user data to a CSV file in outdir.

List/tuple fields are serialised as semicolon-separated strings. Returns the output path, or an empty string if user_data is empty.

Return type:

str
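
The flattening rule can be sketched with the standard csv module. This is an illustration under assumptions, not the library's implementation: it returns the CSV text instead of writing a file in outdir, and the helper names and column ordering are hypothetical.

```python
import csv
import io
from typing import Any, Dict, List


def flatten_value(value: Any) -> Any:
    """Serialise list/tuple values as a semicolon-separated string."""
    if isinstance(value, (list, tuple)):
        return ";".join(str(item) for item in value)
    return value


def rows_to_csv(user_data: Dict[str, Dict[str, Any]]) -> str:
    """Return CSV text for user_data, or an empty string when it is empty."""
    if not user_data:
        return ""
    # Assumption: columns are the union of record keys in first-seen order.
    fieldnames: List[str] = []
    for record in user_data.values():
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in user_data.values():
        writer.writerow({key: flatten_value(record.get(key, "")) for key in fieldnames})
    return buffer.getvalue()


sample = {
    "octocat": {"login": "octocat", "public_orgs": ["github", "octo-org"], "followers": 10}
}
csv_text = rows_to_csv(sample)
print(csv_text)
```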

export_to_json(user_data, filename=None, lines=False)[source]

Write user data dict to a JSON file in outdir. Returns the output path.

Parameters:
  • lines – When True, writes one JSON object per line (JSON Lines / JSONL format) instead of a single pretty-printed JSON object. Useful for streaming large datasets to downstream tools. The output filename ends in .jsonl instead of .json unless filename is explicitly set.

Return type:

str
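
The difference between the two output modes can be sketched as follows. This is an assumption-laden illustration: it returns text rather than writing to outdir, it assumes each JSONL line holds one user record, and the default file stem user_details for the .jsonl case is a guess based on the other export defaults.

```python
import json
from typing import Dict


def dump_users(user_data: Dict[str, dict], lines: bool = False) -> str:
    """Serialise user_data as pretty-printed JSON or as JSON Lines."""
    if lines:
        # One compact JSON object per line, one line per user record.
        return "".join(
            json.dumps(record, sort_keys=True) + "\n" for record in user_data.values()
        )
    return json.dumps(user_data, indent=2, sort_keys=True)


def default_filename(lines: bool = False) -> str:
    """Hypothetical default name: the extension follows the lines flag."""
    return "user_details.jsonl" if lines else "user_details.json"


sample = {"amy": {"login": "amy"}, "bob": {"login": "bob"}}
jsonl_text = dump_users(sample, lines=True)
json_text = dump_users(sample)
print(jsonl_text)
```

JSON Lines output can be consumed one record at a time, which is why it suits streaming to downstream tools.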

export_to_markdown(user_data, filename=None, fields=None)[source]

Write user data as a Markdown table to a file in outdir.

Defaults to a concise set of columns; pass fields to override. Returns the output path, or an empty string if user_data is empty.

Return type:

str
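
A Markdown table of this kind can be assembled with plain string joins. The sketch below is illustrative only: DEFAULT_FIELDS is a hypothetical stand-in for the library's concise default column set, which this reference does not list, and the function returns the table text instead of writing a file.

```python
from typing import Dict, List, Optional

# Hypothetical default columns; the real defaults are not documented here.
DEFAULT_FIELDS = ["login", "name", "followers"]


def to_markdown(user_data: Dict[str, dict], fields: Optional[List[str]] = None) -> str:
    """Render user_data as a Markdown table; empty input yields an empty string."""
    if not user_data:
        return ""
    columns = fields or DEFAULT_FIELDS
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = []
    for record in user_data.values():
        # Escape pipes so a value cannot break the table layout.
        cells = [str(record.get(col, "")).replace("|", "\\|") for col in columns]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


table = to_markdown(
    {"octocat": {"login": "octocat", "followers": 10}},
    fields=["login", "followers"],
)
print(table)
```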

get_user_details(usernames, save_each_iteration=False, limit=None, exclude=None, exclude_bots=False, resume=False, verbose=True, include_social_accounts=False, workers=1)[source]

Fetch full GitHub profile details for each username via the GitHub API.

Returns a dict keyed by login containing all available user fields (profile info, counters, orgs, computed metrics, etc.). Users that cannot be fetched are skipped with a warning.

Optional arguments:
  • save_each_iteration: when True, user_details.json is updated after every 10 successful fetches, so progress is preserved if the process is interrupted (writes are batched to reduce I/O overhead).

  • limit: when set, only the first N usernames are fetched. Usernames are sorted alphabetically before any limit is applied, so results are deterministic.

  • exclude: when provided, those logins are skipped.

  • exclude_bots: when True, logins ending in '[bot]' or '-bot' are skipped.

  • resume: when True, any logins already present in user_details.json are skipped.

  • verbose: when False, per-user fetch messages are suppressed.

  • include_social_accounts: when True, an extra REST call fetches each user's linked social accounts (LinkedIn, Mastodon, YouTube, npm, etc.).

  • workers: the number of concurrent fetches (default 1, i.e. sequential). The maximum supported value is 32; higher values are capped with a warning.

Return type:

Dict[str, dict]
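
The filtering rules described for this method can be sketched as a pure function. This is a minimal sketch under stated assumptions: the order of operations (sort, then filter, then apply limit) is a guess beyond the documented "sorted before limit" rule, and select_usernames, cap_workers and already_fetched are hypothetical names.

```python
from typing import Iterable, List, Optional

MAX_WORKERS = 32  # documented upper bound for workers


def select_usernames(
    usernames: Iterable[str],
    limit: Optional[int] = None,
    exclude: Optional[Iterable[str]] = None,
    exclude_bots: bool = False,
    already_fetched: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the logins that would be fetched, in alphabetical order."""
    excluded = set(exclude or [])
    done = set(already_fetched or [])  # stands in for logins found by resume
    selected = []
    for login in sorted(set(usernames)):
        if login in excluded or login in done:
            continue
        if exclude_bots and (login.endswith("[bot]") or login.endswith("-bot")):
            continue
        selected.append(login)
    # Assumption: the limit applies after filtering.
    return selected if limit is None else selected[:limit]


def cap_workers(workers: int) -> int:
    """Clamp workers to the supported range (the lower bound of 1 is assumed)."""
    return max(1, min(workers, MAX_WORKERS))


picked = select_usernames(
    ["zed", "dependabot[bot]", "amy", "ci-bot", "bob"],
    limit=2,
    exclude=["bob"],
    exclude_bots=True,
)
print(picked, cap_workers(64))
```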

async get_user_details_async(usernames, save_each_iteration=False, limit=None, exclude=None, exclude_bots=False, resume=False, verbose=True, concurrency=10)[source]

Async version of get_user_details using aiohttp.

Fetches raw user profiles directly from the GitHub REST API (GET /users/{login}) using an asyncio.Semaphore to cap simultaneous connections. Supports the same filtering params as the sync path.

Parameters:
  • usernames (List[str])

  • save_each_iteration (bool) – persist user_details.json after each fetch.

  • limit (int | None) – cap the number of profiles fetched.

  • exclude (List[str] | None) – logins to skip.

  • exclude_bots (bool) – skip logins ending in '[bot]'.

  • resume (bool) – skip logins already in user_details.json.

  • verbose (bool) – print a line per fetched user.

  • concurrency (int) – max simultaneous aiohttp requests.

Return type:

Dict[str, dict]

Returns a dict keyed by login with profile data dicts.
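
The semaphore pattern mentioned above can be sketched without any network access. In this illustration fake_fetch is a hypothetical stand-in for the real aiohttp request to GET /users/{login}; only the concurrency-capping structure is the point.

```python
import asyncio
from typing import Dict, List


async def fake_fetch(login: str) -> dict:
    """Hypothetical stand-in for an HTTP request; sleeps instead of calling the API."""
    await asyncio.sleep(0.01)
    return {"login": login}


async def fetch_all(logins: List[str], concurrency: int = 10) -> Dict[str, dict]:
    """Fetch every login while holding at most `concurrency` fetches in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    results: Dict[str, dict] = {}

    async def fetch_one(login: str) -> None:
        async with semaphore:  # blocks while `concurrency` fetches are running
            results[login] = await fake_fetch(login)

    await asyncio.gather(*(fetch_one(login) for login in logins))
    return results


profiles = asyncio.run(fetch_all(["octocat", "hubot", "monalisa"], concurrency=2))
print(sorted(profiles))
```

All tasks are scheduled at once; the semaphore alone decides how many run simultaneously.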

get_users(export=False, export_csv=False, save_each_iteration=False, limit=None, roles=None, exclude=None, exclude_bots=False, resume=False, verbose=True, fields=None, include_social_accounts=False, workers=1)[source]

Full pipeline: collect all repo usernames -> fetch user details -> export.

Steps:
  1. Collect usernames from every repo role (contributors, stargazers, …).

  2. Deduplicate across all roles.

  3. Fetch the full GitHub profile for each unique user.

  4. Optionally export to user_details.json / user_details.csv inside outdir.

Parameters:
  • export (bool) – save results to user_details.json when True.

  • export_csv (bool) – save results to user_details.csv when True.

  • save_each_iteration (bool) – write user_details.json after every successful fetch.

  • limit (int | None) – stop after fetching this many user profiles.

  • roles (List[str] | None) – only collect users from these role categories (e.g. ["contributors", "stargazers"]).

  • exclude (List[str] | None) – list of logins to skip entirely.

  • exclude_bots (bool) – skip logins ending in '[bot]' and profiles with is_bot=True.

  • resume (bool) – load existing user_details.json and skip already-fetched users.

  • verbose (bool) – print a line for each user being fetched.

  • fields (List[str] | None) – if set, only these attributes are kept per user in the output (e.g. ["login", "type", "updated_at"]).

  • include_social_accounts (bool) – fetch each user's linked social accounts (LinkedIn, Mastodon, YouTube, npm, …). Costs one extra API call per user.

  • workers (int) – number of concurrent fetch threads.

Return type:

UserDataView

Returns a dict keyed by GitHub login with full user profile data. Each record always includes a “roles” key listing the role(s) the user appeared under, regardless of the fields parameter.

async get_users_async(export=False, export_csv=False, save_each_iteration=False, limit=None, roles=None, exclude=None, exclude_bots=False, resume=False, verbose=True, fields=None, concurrency=10)[source]

Async version of get_users.

Collects usernames synchronously (same as get_users), then fetches all profiles concurrently via aiohttp. Accepts the same parameters as get_users, except that workers is replaced by concurrency and include_social_accounts is not part of the async signature.

Parameters:
  • export (bool) – save results to user_details.json.

  • export_csv (bool) – save results to user_details.csv.

  • save_each_iteration (bool) – persist after every fetch.

  • limit (int | None) – cap the number of profiles fetched.

  • roles (List[str] | None) – restrict which role categories are collected.

  • exclude (List[str] | None) – logins to skip entirely.

  • exclude_bots (bool) – skip bot accounts.

  • resume (bool) – skip logins already in user_details.json.

  • verbose (bool) – print per-user progress.

  • fields (List[str] | None) – restrict which fields appear in the output dict.

  • concurrency (int) – max simultaneous aiohttp connections.

Return type:

UserDataView

Returns a dict keyed by GitHub login with profile data, including a ‘roles’ key on every record.

print_markdown(user_data, fields=None)[source]

Print a Markdown table of user data to stdout.

Produces the same table format as export_to_markdown() but writes to stdout instead of a file. Useful for quick inspection in a terminal or notebook. Does nothing when user_data is empty.

Return type:

None

summarise(user_data, top_n=5)[source]

Print and return a summary breakdown of the fetched user data.

Covers: total users, bot vs human split, top locations, top companies, and account age distribution (by quartile). Pass top_n to control how many top locations/companies are shown.

Return type:

dict

property token: str | None

GitHub personal access token (private; store via constructor only).

top_users(user_data, n=10, by='followers')[source]

Return the top N users ranked by a numeric profile field.

Common values for ‘by’: followers, public_repos, account_age_days, following, public_gists, total_public_stars_sampled. Users missing the field are ranked last.

Return type:

List[dict]
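
The ranking rule (descending by a numeric field, with users missing the field placed last) can be expressed with a two-part sort key. The function name rank_users is hypothetical and the sketch is not the library's code.

```python
from typing import Dict, List, Tuple


def rank_users(user_data: Dict[str, dict], n: int = 10, by: str = "followers") -> List[dict]:
    """Return the top n records by `by`; records without a numeric value sort last."""

    def sort_key(record: dict) -> Tuple[bool, float]:
        value = record.get(by)
        missing = not isinstance(value, (int, float))
        # False sorts before True, so records with a value come first;
        # negating the value gives descending order among them.
        return (missing, 0 if missing else -value)

    return sorted(user_data.values(), key=sort_key)[:n]


sample = {
    "amy": {"login": "amy", "followers": 5},
    "bob": {"login": "bob"},
    "cat": {"login": "cat", "followers": 50},
}
ranked = [user["login"] for user in rank_users(sample, n=3)]
print(ranked)
```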

Valid roles

RepoPeople.VALID_ROLES == {
    "contributors",
    "maintainers",
    "stargazers",
    "watchers",
    "issue_authors",
    "pr_authors",
    "fork_owners",
    "commit_authors",
    "dependents",
}

Representation

repr(rp) returns a concise summary of the instance:

repr(rp)
# "RepoPeople(owner='alice', repo='myrepo', outdir='outputs/alice_myrepo', valid_roles=9)"

roles key in output

get_users() always adds a "roles" key to every user record, regardless of any fields= filter. It lists the role(s) that user appeared under:

user_data = rp.get_users()
user_data["octocat"]["roles"]  # e.g. ['contributors', 'stargazers']

Note

"roles" is not a UserSnapshot field — it is injected by get_users after profile fetching. It will therefore not appear in the snapshot field table below.

UserSnapshot

class repo_people.users.UserSnapshot(login: 'str', id: 'Optional[int]', node_id: 'str', type: 'str', name: 'str', company: 'str', location: 'str', email_public: 'str', email_domain: 'str', blog: 'str', blog_host: 'str', twitter: 'str', bio: 'str', avatar_url: 'str', html_url: 'str', hireable: 'bool', site_admin: 'bool', created_at: 'str', updated_at: 'str', followers: 'int', following: 'int', public_repos: 'int', public_gists: 'int', public_orgs: 'List[str]', orgs_public_count: 'int', is_bot: 'bool', last_public_event_at: 'str', has_public_email: 'bool' = False, has_blog: 'bool' = False, has_twitter: 'bool' = False, company_normalized: 'str' = '', location_normalized: 'str' = '', account_age_days: 'int' = 0, followers_following_ratio: 'float' = 0.0, repos_per_year: 'float' = 0.0, recently_active: 'bool' = False, top_languages: 'Optional[List[Tuple[str, int]]]' = None, total_public_stars_sampled: 'Optional[int]' = None, total_public_forks_sampled: 'Optional[int]' = None, ssh_keys_count: 'Optional[int]' = None, gpg_keys_count: 'Optional[int]' = None, starred_repos_sampled: 'Optional[int]' = None, social_accounts: 'Optional[Dict[str, str]]' = None, is_collaborator: 'Optional[bool]' = None, permission_on_repo: 'Optional[str]' = None)[source]

Bases: object

Parameters:
  • login (str)

  • id (int | None)

  • node_id (str)

  • type (str)

  • name (str)

  • company (str)

  • location (str)

  • email_public (str)

  • email_domain (str)

  • blog (str)

  • blog_host (str)

  • twitter (str)

  • bio (str)

  • avatar_url (str)

  • html_url (str)

  • hireable (bool)

  • site_admin (bool)

  • created_at (str)

  • updated_at (str)

  • followers (int)

  • following (int)

  • public_repos (int)

  • public_gists (int)

  • public_orgs (List[str])

  • orgs_public_count (int)

  • is_bot (bool)

  • last_public_event_at (str)

  • has_public_email (bool)

  • has_blog (bool)

  • has_twitter (bool)

  • company_normalized (str)

  • location_normalized (str)

  • account_age_days (int)

  • followers_following_ratio (float)

  • repos_per_year (float)

  • recently_active (bool)

  • top_languages (List[Tuple[str, int]] | None)

  • total_public_stars_sampled (int | None)

  • total_public_forks_sampled (int | None)

  • ssh_keys_count (int | None)

  • gpg_keys_count (int | None)

  • starred_repos_sampled (int | None)

  • social_accounts (Dict[str, str] | None)

  • is_collaborator (bool | None)

  • permission_on_repo (str | None)

account_age_days: int = 0
avatar_url: str
bio: str
blog: str
blog_host: str
company: str
company_normalized: str = ''
created_at: str
email_domain: str
email_public: str
followers: int
followers_following_ratio: float = 0.0
following: int
gpg_keys_count: int | None = None
has_blog: bool = False
has_public_email: bool = False
has_twitter: bool = False
hireable: bool
html_url: str
id: int | None
is_bot: bool
is_collaborator: bool | None = None
last_public_event_at: str
location: str
location_normalized: str = ''
login: str
name: str
node_id: str
orgs_public_count: int
permission_on_repo: str | None = None
public_gists: int
public_orgs: List[str]
public_repos: int
recently_active: bool = False
repos_per_year: float = 0.0
site_admin: bool
social_accounts: Dict[str, str] | None = None
ssh_keys_count: int | None = None
starred_repos_sampled: int | None = None
top_languages: List[Tuple[str, int]] | None = None
total_public_forks_sampled: int | None = None
total_public_stars_sampled: int | None = None
twitter: str
type: str
updated_at: str

The following table lists every field returned in a UserSnapshot (and in every dict entry of the user_data mapping produced by get_users() / get_user_details()).

UserSnapshot fields

==========================  ============================  ================================================================
Field                       Type                          Description
==========================  ============================  ================================================================
login                       str                           GitHub username.
id                          int | None                    Numeric GitHub user ID.
node_id                     str                           Global node ID (GraphQL).
type                        str                           Account type, e.g. "User" or "Bot".
name                        str                           Display name on their profile.
company                     str                           Raw company string from their profile.
company_normalized          str                           Company with leading @ stripped and lowercased.
location                    str                           Raw location string from their profile.
location_normalized         str                           Location stripped of trailing country codes.
email_public                str                           Public e-mail address (empty string if not set).
email_domain                str                           Domain part of email_public, e.g. "gmail.com".
has_public_email            bool                          True when email_public is non-empty.
blog                        str                           Blog / website URL from their profile.
blog_host                   str                           Hostname extracted from blog.
has_blog                    bool                          True when blog is non-empty.
twitter                     str                           Twitter / X username from their profile.
has_twitter                 bool                          True when twitter is non-empty.
bio                         str                           Profile bio text.
avatar_url                  str                           URL of their profile avatar image.
html_url                    str                           URL of their GitHub profile page.
hireable                    bool                          Whether they have marked themselves as hireable.
site_admin                  bool                          Whether they are a GitHub staff/site admin.
created_at                  str                           ISO-8601 timestamp of account creation.
updated_at                  str                           ISO-8601 timestamp of last profile update.
followers                   int                           Number of GitHub followers.
following                   int                           Number of accounts they follow.
followers_following_ratio   float                         followers / following (0 when following is 0).
public_repos                int                           Number of public repositories.
public_gists                int                           Number of public gists.
public_orgs                 list[str]                     Logins of their public organisations.
orgs_public_count           int                           Length of public_orgs.
is_bot                      bool                          True when the account is detected as a bot.
last_public_event_at        str                           ISO-8601 timestamp of their most recent public event.
account_age_days            int                           Days since account creation.
repos_per_year              float                         public_repos / (account_age_days / 365).
recently_active             bool                          True when last_public_event_at is within the last 90 days.
top_languages               list[tuple[str, int]] | None  Sampled (language, byte-count) pairs from their public repos.
total_public_stars_sampled  int | None                    Sum of stargazer counts across a sample of their public repos.
total_public_forks_sampled  int | None                    Sum of fork counts across a sample of their public repos.
ssh_keys_count              int | None                    Number of public SSH keys on their account.
gpg_keys_count              int | None                    Number of GPG keys on their account.
starred_repos_sampled       int | None                    Number of repos they have starred (sampled).
social_accounts             dict[str, str] | None         Linked social accounts as a provider -> URL mapping.
is_collaborator             bool | None                   Whether they have collaborator access on the queried repository.
permission_on_repo          str | None                    Their permission level on the queried repo (e.g. "push").
==========================  ============================  ================================================================
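
Several of the fields above are derived from other fields. The sketch below shows one plausible way to compute a few of them from the descriptions in the table; the helper functions are hypothetical, and the edge-case choices (lowercasing the e-mail domain, returning 0.0 for a zero-day-old account, adding a scheme to bare blog URLs) are assumptions rather than documented behaviour.

```python
from urllib.parse import urlparse


def email_domain(email_public: str) -> str:
    """Domain part of the address, or an empty string when there is none."""
    return email_public.rsplit("@", 1)[1].lower() if "@" in email_public else ""


def blog_host(blog: str) -> str:
    """Hostname of the blog URL; bare hosts get a scheme so urlparse can split them."""
    if not blog:
        return ""
    url = blog if "://" in blog else "https://" + blog
    return urlparse(url).hostname or ""


def company_normalized(company: str) -> str:
    """Strip whitespace and leading '@', then lowercase."""
    return company.strip().lstrip("@").lower()


def followers_following_ratio(followers: int, following: int) -> float:
    """followers / following, defined as 0 when following is 0."""
    return followers / following if following else 0.0


def repos_per_year(public_repos: int, account_age_days: int) -> float:
    """public_repos / (account_age_days / 365); 0.0 for a zero-day account (assumed)."""
    if account_age_days <= 0:
        return 0.0
    return public_repos / (account_age_days / 365)


print(email_domain("octo@gmail.com"), blog_host("example.com/blog"))
print(company_normalized("@GitHub "), followers_following_ratio(10, 4), repos_per_year(20, 730))
```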


GitHubUserInfo

class repo_people.users.GitHubUserInfo(gh=None, username=None, user_obj=None, token=None)[source]

Bases: object

Wrapper around a GitHub user (PyGithub NamedUser) that exposes cached, easy-to-use accessors and a single ‘snapshot()’ to dump all attributes as a dataclass.

Parameters:
  • gh (Optional[Github])

  • username (Optional[str])

  • user_obj (Optional[NamedUser])

  • token (Optional[str])

property avatar_url: str
property bio: str
property blog: str
property blog_host: str
property company: str
property created_at: str
classmethod csv_headers()[source]

Return CSV headers for to_csv_row method.

Return type:

List[str]

property email_domain: str
property email_public: str
property followers: int
property following: int
gpg_keys_count(cap=50)[source]
Parameters:

cap (int)

Return type:

int

property hireable: bool
property html_url: str
property id: int | None
property is_bot: bool
property last_public_event_at: str
property location: str
property login: str
property name: str
property node_id: str
property orgs_public_count: int
property public_gists: int
property public_orgs: List[str]
property public_repos: int
repo_relationship(repo, check_permission=True)[source]
Parameters:
  • repo (Repository)

  • check_permission (bool)

Return type:

Dict[str, Union[bool, str, None]]

property site_admin: bool
snapshot(*, include_langs=False, include_star_fork_sums=False, langs_max_repos=50, sums_max_repos=50, include_keys_counts=False, include_star_sample=False, include_social_accounts=False, recent_days=90, repo=None)[source]

Collects all lightweight fields + optional aggregates into a dataclass.

Parameters:
  • include_langs (bool) – Collect the top 3 languages from the user's repositories. Expensive: makes one API call per repository, up to langs_max_repos. Off by default.

  • include_star_fork_sums (bool) – Sum stars and forks across the user's repositories. Expensive: same cost as include_langs. Off by default.

  • langs_max_repos (int)

  • sums_max_repos (int)

  • include_keys_counts (bool)

  • include_star_sample (bool)

  • include_social_accounts (bool)

  • recent_days (int)

Return type:

UserSnapshot

social_accounts()[source]

Fetch social accounts via the GitHub REST API; returns provider -> url dict.

Uses requests.get directly rather than the private PyGithub requester so the call is stable across PyGithub versions.

Return type:

Dict[str, str]

ssh_keys_count(cap=50)[source]
Parameters:

cap (int)

Return type:

int

star_fork_sums(max_repos=50)[source]
Parameters:

max_repos (int)

Return type:

Tuple[int, int]

starred_repos_sampled(cap=200)[source]
Parameters:

cap (int)

Return type:

int

to_csv_row(**snapshot_kwargs)[source]

Export user data as CSV row.

Return type:

List[str]

to_dict(**snapshot_kwargs)[source]
Return type:

Dict[str, Any]

to_json(**snapshot_kwargs)[source]

Export user data as JSON string.

Return type:

str

top_languages(max_repos=50)[source]
Parameters:

max_repos (int)

Return type:

List[Tuple[str, int]]

property twitter: str
property type: str
property updated_at: str

Export Module

repo_people.export.export_commit_authors(owner, repo, token, outdir, return_data=True, export_csv=False)[source]

Export all unique commit authors (usernames) for a repository.

Pages through /commits and collects unique author.login values, so there is no hard cap on the number of results returned. Always returns the list of logins; the return_data parameter is kept for backwards compatibility but is ignored.

Note

export_contributors and export_commit_authors walk the same /commits endpoint and return equivalent results. They are aliases of each other.

Return type:

List[str]

repo_people.export.export_contributors(owner, repo, token, outdir, return_data=True, export_csv=False)[source]

Export all unique contributors (usernames) for a repository.

Bypasses the /contributors endpoint’s hard 100-item cap by paging through /commits and collecting unique author.login values — the same commit-walk approach used by export_commit_authors. Both functions return equivalent sets of usernames and are aliases of each other.

Always returns the list of logins; return_data is kept for backwards compatibility but is ignored.

Return type:

List[str]

repo_people.export.export_dependents(owner, repo, outdir, return_data=True, export_csv=False, limit=None, sleep=1.0)[source]

Scrape and export the list of dependent users (usernames) for a repo.

Always returns the list of logins; return_data is kept for backwards compatibility but is ignored. Uses exponential back-off on non-200 responses.

Parameters:
  • limit – Maximum number of unique dependent repositories to collect before stopping. None (default) collects all pages. Pass 0 for an empty result.

  • sleep – Base sleep interval (seconds) between pages. Doubles on each failed page request, up to a maximum of 60 seconds.

Return type:

List[str]
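
The back-off rule for sleep (double on each failed page request, capped at 60 seconds) can be sketched as a small pure function. The name backoff_schedule is hypothetical; the sketch only models consecutive failures and does not say what happens to the interval after a successful request, which this reference leaves unspecified.

```python
from typing import List

MAX_SLEEP = 60.0  # documented ceiling for the sleep interval, in seconds


def backoff_schedule(base: float, failures: int) -> List[float]:
    """Sleep intervals after each of `failures` consecutive failed requests."""
    delays = []
    current = base
    for _ in range(failures):
        current = min(current * 2, MAX_SLEEP)  # double, but never exceed the cap
        delays.append(current)
    return delays


schedule = backoff_schedule(1.0, 7)
print(schedule)
```

With the default base of 1.0 second, the cap is reached on the sixth consecutive failure.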

repo_people.export.export_fork_owners(owner, repo, token=None, outdir=None, return_data=True, export_csv=False)[source]

Export the owners of all forks for a repository to CSV and/or return as list.

Return type:

List[str]

repo_people.export.export_issue_authors(owner, repo, token, outdir, return_data=True, export_csv=False)[source]
Return type:

List[str]

repo_people.export.export_maintainers(owner, repo, token, outdir, skip_codeowners, skip_collaborators, return_data=True, export_csv=False)[source]

Export maintainers for a repository to CSV and/or return as list.

Collects maintainers from two sources (both can be toggled off):
  • CODEOWNERS file: parses @-mentions from .github/CODEOWNERS, docs/CODEOWNERS, or CODEOWNERS.

  • Collaborators API: includes users with admin, maintain, or push permissions.

Deduplicates across both sources before returning.

Parameters:
  • owner (str)

  • repo (str)

  • token (str | None)

  • outdir (str)

  • skip_codeowners (bool)

  • skip_collaborators (bool)

  • return_data (bool)

  • export_csv (bool)

Return type:

List[str]

repo_people.export.export_pr_authors(owner, repo, token, outdir, return_data=True, export_csv=False)[source]
Return type:

List[str]

repo_people.export.export_stargazers(owner, repo, token, outdir, return_data=True, export_csv=False)[source]
Return type:

List[str]

repo_people.export.export_watchers(owner, repo, token, outdir, return_data=True, export_csv=False)[source]
Return type:

List[str]

repo_people.export.fetch_codeowners(owner, repo, token)[source]
Return type:

Tuple[str | None, str | None]

repo_people.export.parse_codeowners_owners(text)[source]
Parameters:

text (str)

Return type:

List[str]
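
Parsing @-mentions out of CODEOWNERS text might look like the sketch below. It is an illustration under assumptions, not the library's parser: the first token on each line is treated as the path pattern, e-mail owners are ignored, team mentions such as @org/team are kept as "org/team", and escaped spaces or '#' characters inside patterns are not handled. The name parse_owners is hypothetical.

```python
from typing import List


def parse_owners(text: str) -> List[str]:
    """Return unique @-mentioned owners from CODEOWNERS text, in first-seen order."""
    owners: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        _pattern, *mentions = line.split()  # first token is the path pattern
        for token in mentions:
            if token.startswith("@") and len(token) > 1:
                owner = token[1:]
                if owner not in owners:
                    owners.append(owner)
    return owners


codeowners_text = (
    "# Default owners\n"
    "*.py @octocat @org/python-team\n"
    "docs/ docs@example.com @octocat @hubot  # trailing comment\n"
)
owners = parse_owners(codeowners_text)
print(owners)
```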

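Each export_<role> function returns a list of strings (usernames) for one specific role. Most of the nine share the same signature:

export_<role>(owner, repo, token, outdir, return_data=True, export_csv=False) -> List[str]

The exceptions, per the signatures listed above, are export_maintainers (adds skip_codeowners and skip_collaborators), export_dependents (takes no token; adds limit and sleep), and export_fork_owners (token and outdir default to None).
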
Export functions

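======================  ================================================================
Function                Returns
======================  ================================================================
export_contributors     Usernames of repository contributors.
export_maintainers      CODEOWNERS + collaborator usernames.
export_stargazers       Usernames who have starred the repository.
export_watchers         Usernames watching the repository.
export_issue_authors    Usernames who have opened issues.
export_pr_authors       Usernames who have opened pull requests.
export_fork_owners      Usernames who have forked the repository.
export_commit_authors   Usernames extracted from commit history.
export_dependents       Usernames of the owners of repositories that depend on this one.
======================  ================================================================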