Usage
=====

Installation
------------

Install from PyPI:

.. code-block:: console

    pip install repo-people

Or with Poetry:

.. code-block:: console

    poetry add repo-people

Quick Start
-----------

The simplest end-to-end call collects users from all nine role categories,
fetches their full GitHub profiles, and returns a dictionary keyed by username:

.. code-block:: python

    from repo_people import RepoPeople

    rp = RepoPeople("octocat", "Hello-World", token="ghp_...")
    user_data = rp.get_users()
    # {'octocat': {'login': 'octocat', 'followers': 9001, ...}, ...}

Authentication
--------------

A GitHub personal-access token is strongly recommended. Without one, the GitHub
API rate limit is only 60 requests per hour. With a token it rises to 5 000
requests per hour.

.. code-block:: python

    rp = RepoPeople("owner", "repo", token="ghp_YOUR_TOKEN_HERE")

Alternatively, export the token as an environment variable and pass it in:

.. code-block:: python

    import os
    from repo_people import RepoPeople

    rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])

Tip: store your token in a ``.env`` file and load it with ``python-dotenv``:

.. code-block:: python

    import os

    from dotenv import load_dotenv
    from repo_people import RepoPeople

    load_dotenv()  # reads .env and populates os.environ
    rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])

Token Validation
----------------

The token is validated immediately when ``RepoPeople`` is instantiated. If the
token is invalid or expired, a ``ConnectionError`` is raised right away with a
descriptive message rather than failing silently on the first API call:

.. code-block:: python

    try:
        rp = RepoPeople("owner", "repo", token="invalid_token")
    except ConnectionError as e:
        print(e)
        # GitHub connection failed — verify your token. (...)

Input Validation
----------------

The ``owner`` and ``repo`` parameters are validated at construction time. Both
must contain only ``[A-Za-z0-9_.-]`` characters. Any other character raises a
``ValueError`` immediately:

.. code-block:: python

    try:
        rp = RepoPeople("owner with spaces", "repo")
    except ValueError as e:
        print(e)
        # Invalid owner: 'owner with spaces'. Must match [A-Za-z0-9_.-]+

Choosing an Output Directory
-----------------------------

By default, exported files are written to the current working directory. Use
``outdir`` to specify a different location:

.. code-block:: python

    rp = RepoPeople("owner", "repo", token="...", outdir="/path/to/output")

Filtering by Role
-----------------

The ``roles`` parameter accepts a list of one or more of the nine valid roles.
All nine roles are collected when ``roles`` is not specified:

.. code-block:: python

    # Only contributors and stargazers
    user_data = rp.get_users(roles=["contributors", "stargazers"])

Available roles:

* ``contributors``
* ``maintainers`` (CODEOWNERS + collaborators)
* ``stargazers``
* ``watchers``
* ``issue_authors``
* ``pr_authors``
* ``fork_owners``
* ``commit_authors``
* ``dependents``

.. code-block:: python

    # Inspect the full set at runtime
    print(RepoPeople.VALID_ROLES)

Role names are validated **before any API calls** are made. Passing an
unrecognised name raises ``ValueError`` immediately, listing every invalid name
and the full set of valid ones:

.. code-block:: python

    rp.get_users(roles=["typo_role"])
    # ValueError: Invalid role(s): ['typo_role'].
    # Valid roles are: ['commit_authors', 'contributors', ...]

A bare string is also accepted and treated as a single-item list:

.. code-block:: python

    user_data = rp.get_users(roles="contributors")

Skipping CODEOWNERS or Collaborators
--------------------------------------

When collecting ``maintainers``, the package looks up both the ``CODEOWNERS``
file and the repository's collaborator list. Either source can be disabled:

.. code-block:: python

    rp = RepoPeople(
        "owner",
        "repo",
        token="...",
        skip_codeowners=True,
        skip_collaborators=True,
    )

Limiting the Number of Results
-------------------------------

``limit`` caps the total number of user profiles fetched. This is useful for
quick tests on large repositories:

.. code-block:: python

    user_data = rp.get_users(limit=50)

Excluding Users
---------------

Pass a list of usernames to ``exclude`` to skip specific accounts:

.. code-block:: python

    user_data = rp.get_users(exclude=["dependabot", "github-actions[bot]"])

To automatically skip all bot accounts (those whose GitHub ``type`` field is
``"Bot"`` or whose login matches common bot patterns):

.. code-block:: python

    user_data = rp.get_users(exclude_bots=True)

Incremental Fetching (Resume Support)
--------------------------------------

For large repositories the fetch can take a long time. Use
``save_each_iteration=True`` to persist progress in batches of 10 user
profiles. If the process is interrupted, restart with ``resume=True`` to pick
up where you left off:

.. code-block:: python

    # First run: progress is saved in batches of 10 profiles
    user_data = rp.get_users(save_each_iteration=True, export=True)

    # Restart after interruption: skips users already in the output file
    user_data = rp.get_users(save_each_iteration=True, export=True, resume=True)

Filtering the Output Fields
-----------------------------

By default all 30+ fields are included for every user. Pass a list of field
names to ``fields`` to limit what appears in exports and the returned dict:

.. code-block:: python

    user_data = rp.get_users(
        fields=["login", "name", "location", "followers", "public_repos"]
    )

A bare string is also accepted and is treated as a single-item list:

.. code-block:: python

    user_data = rp.get_users(fields="login")

Field names are validated against :class:`~repo_people.users.UserSnapshot`
**before any API calls are made**. Passing an unrecognised name raises a
``ValueError`` immediately, listing every invalid name and the full set of
valid ones:

.. code-block:: python

    rp.get_users(fields=["login", "typo_field"])
    # ValueError: Invalid field(s): ['typo_field'].
    # Valid fields are: ['account_age_days', 'avatar_url', 'bio', ...]

See the :doc:`api` page for the complete field list.

Roles in Output Records
-----------------------

Every user dict returned by ``get_users`` has a ``"roles"`` key listing the
role(s) the user appeared under, regardless of any ``fields=`` filter:

.. code-block:: python

    user_data = rp.get_users(roles=["contributors", "stargazers"], fields=["login"])
    print(user_data["octocat"])
    # {'login': 'octocat', 'roles': ['contributors', 'stargazers']}

Exporting Results
-----------------

Export to JSON
~~~~~~~~~~~~~~

Pass ``export=True`` to write a JSON file automatically. The file is saved to
``outdir`` (or the current directory) as ``user_details.json``:

.. code-block:: python

    user_data = rp.get_users(export=True)

To export manually after the fact:

.. code-block:: python

    rp.export_to_json(user_data, filename="my_output.json")

Export to CSV
~~~~~~~~~~~~~

Pass ``export_csv=True`` to write a CSV file:

.. code-block:: python

    user_data = rp.get_users(export_csv=True)

Or manually:

.. code-block:: python

    rp.export_to_csv(user_data, filename="my_output.csv")

Export to Markdown
~~~~~~~~~~~~~~~~~~

Generate a Markdown table, optionally restricted to a subset of fields:

.. code-block:: python

    rp.export_to_markdown(
        user_data,
        filename="users.md",
        fields=["login", "name", "location", "followers"]
    )

``export=True`` and ``export_csv=True`` can be combined in a single call:

.. code-block:: python

    user_data = rp.get_users(export=True, export_csv=True)

Analysis Helpers
----------------

summarise
~~~~~~~~~

Returns aggregate statistics for the collected user data:

.. code-block:: python

    stats = rp.summarise(user_data, top_n=5)
    # {
    #     'total_users': 134,
    #     'users_with_email': 42,
    #     'users_with_blog': 61,
    #     'top_locations': [('San Francisco', 18), ...],
    #     'top_companies': [('GitHub', 9), ...],
    #     'top_languages': [('Python', 54), ...],
    #     ...
    # }

top_users
~~~~~~~~~

Returns the top *n* users ranked by a given field:

.. code-block:: python

    # Top 10 by follower count
    leaders = rp.top_users(user_data, n=10, by="followers")
    for u in leaders:
        print(u["login"], u["followers"])

    # Top 5 by number of public repos
    prolific = rp.top_users(user_data, n=5, by="public_repos")

Using the Lower-Level API
--------------------------

The two-step pipeline is available directly if you need more control:

.. code-block:: python

    from repo_people import RepoPeople

    rp = RepoPeople("owner", "repo", token="...")

    # Step 1: collect all usernames grouped by role
    all_usernames = rp.collect_all_usernames(roles=["contributors", "stargazers"])
    # {'contributors': ['alice', 'bob'], 'stargazers': ['carol', ...], ...}

    # Flatten to a unique set
    unique = list({u for users in all_usernames.values() for u in users})

    # Step 2: fetch full profiles
    user_data = rp.get_user_details(
        unique,
        limit=100,
        exclude_bots=True,
        verbose=True,
    )

Rate-Limit Tips
---------------

* Always use a token: it gives you 5 000 requests/hour vs 60 unauthenticated.
* Use ``limit`` during development to avoid exhausting the rate limit on large
  repos.
* Use ``exclude_bots=True`` to skip bot accounts that do not need enrichment.
* Use ``save_each_iteration=True`` on very large repos so partial progress is
  persisted (every 10 profiles) if the rate limit is hit mid-run.
* ``resume=True`` allows you to continue after hitting a rate limit without
  re-fetching profiles already collected.
* A progress line is printed automatically every 50 users and at the end of
  the fetch, showing the current rate-limit headroom::

      [Progress: 50/134 | Rate limit: 4820/5000 remaining, resets in 42m]

* Any users that fail to fetch are collected and a summary is printed at the
  end rather than stopping the whole run::

      Skipped 2 user(s): ['ghost', 'deleted-account']

Concurrent Fetching
-------------------

The ``workers`` parameter controls how many profiles are fetched in parallel
(default ``1`` = sequential). Increasing it reduces wall-clock time on repos
with many users:

.. code-block:: python

    # Fetch up to 8 profiles simultaneously
    user_data = rp.get_users(workers=8)

Or pass it directly to the lower-level method:

.. code-block:: python

    # logins is a list of usernames, e.g. from collect_all_usernames()
    user_data = rp.get_user_details(logins, workers=4)

.. note::

    Concurrent requests still count against your rate limit. ``workers``
    reduces wall-clock time by overlapping requests, not by increasing the
    total request budget.

The maximum value for ``workers`` is **32**. If a higher value is passed, it is
capped to 32 and a :class:`UserWarning` is emitted.

Async Fetching
--------------

For very high concurrency use the async pipeline. This requires the optional
``aiohttp`` dependency:

.. code-block:: console

    pip install "repo-people[async]"

Then:

.. code-block:: python

    import asyncio

    user_data = asyncio.run(rp.get_users_async(concurrency=10))

Exporting as JSON Lines (JSONL)
-------------------------------

Pass ``lines=True`` to :meth:`~repo_people.RepoPeople.export_to_json` to write
one JSON object per line (JSONL / JSON Lines format). This is useful for
streaming large outputs:

.. code-block:: python

    path = rp.export_to_json(user_data, lines=True)
    # Writes /user_details.jsonl

You can also specify a custom filename:

.. code-block:: python

    path = rp.export_to_json(user_data, filename="users.jsonl", lines=True)
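
Because a JSONL file holds one JSON object per line, it can be read back one
record at a time without loading the whole file into memory. The following is
a minimal sketch using only the standard library. ``iter_jsonl`` is a
hypothetical helper, not part of ``repo-people``, and the exact shape of each
exported object (for example, whether it carries a ``login`` key) is an
assumption to verify against your own output:

.. code-block:: python

    import json
    from typing import Iterator


    def iter_jsonl(path: str) -> Iterator[dict]:
        """Yield one parsed JSON object per non-empty line of a JSONL file.

        Hypothetical helper for illustration; not part of repo-people.
        """
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines such as a trailing newline
                    yield json.loads(line)

A consumer can then loop over ``iter_jsonl("users.jsonl")`` and process each
record as it is read, which keeps memory use flat regardless of file size.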