Tools in Data Science - Project 1

Deadline:

Using the GitHub API, scrape all users in the city of ${city} with over ${followers} followers, and their repositories.
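One possible way to approach the scrape is sketched below, using only the Python standard library. It assumes the GitHub REST search endpoint with the `location:` and `followers:>` qualifiers, followed by one profile request per hit. The helper names (`build_search_url`, `build_repos_url`, `fetch_json`, `search_users`) and the optional `token` argument are illustrative, not part of the assignment.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_search_url(city, min_followers, page=1, per_page=100):
    """Return the user-search URL for one page of results."""
    # Quote multi-word cities so the whole name stays inside the qualifier.
    location = f'"{city}"' if " " in city else city
    query = f"location:{location} followers:>{min_followers}"
    params = urllib.parse.urlencode({"q": query, "per_page": per_page, "page": page})
    return f"{API_ROOT}/search/users?{params}"


def build_repos_url(login, page=1, per_page=100):
    """Return the URL for one page of a user's repos, most recently pushed first."""
    params = urllib.parse.urlencode({"sort": "pushed", "per_page": per_page, "page": page})
    return f"{API_ROOT}/users/{urllib.parse.quote(login)}/repos?{params}"


def fetch_json(url, token=None):
    """GET a URL and decode the JSON body; `token` is an optional access token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def search_users(city, min_followers, token=None):
    """Yield the full profile of every matching user, page by page."""
    for page in range(1, 11):  # the search API serves at most 1000 results per query
        items = fetch_json(build_search_url(city, min_followers, page), token)["items"]
        for item in items:
            # Search hits are summaries; the profile endpoint adds followers, bio, etc.
            yield fetch_json(f"{API_ROOT}/users/{item['login']}", token)
        if len(items) < 100:
            break
```

Calling `build_repos_url(login, page)` for pages 1 to 5 would cover the 500-repository cap described further down. Unauthenticated requests are rate-limited heavily, so passing a token is likely to be necessary in practice.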

Create a GitHub repo with these files in the main branch:

  1. users.csv. See below. Use the SAME values as in the API response. Write booleans as true and false, and nulls as empty strings.
  2. repositories.csv. See below. Use the SAME values as in the API response. Write booleans as true and false, and nulls as empty strings.
  3. README.md. See below.
  4. Optional but recommended: your code and/or spreadsheet, in whichever language you analyzed the data in.
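The boolean and null conventions for the two CSV files could be handled with a small helper like the one below. This is a minimal sketch: `to_csv_value` and `write_csv` are hypothetical names, and the columns in the usage note are only an illustrative subset.

```python
import csv
import io


def to_csv_value(value):
    """Map an API value to its CSV cell: true/false for booleans, empty for null."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return value  # everything else is written exactly as the API returned it


def write_csv(rows, fields, handle):
    """Write dict rows to an open text handle, keeping only the listed fields."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow([to_csv_value(row.get(field)) for field in fields])
```

For example, `write_csv(users, ["login", "hireable", "company"], open("users.csv", "w", newline="", encoding="utf-8"))` would write one row per user dict.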

users.csv has the following information about each user in ${city} with over ${followers} followers, with fields:

repositories.csv has these users' public repositories. For each user in users.csv, fetch up to the 500 most recently pushed repositories, with fields:

README.md must begin with 3 bullet points. Each bullet must be one sentence of no more than 50 words.

  1. An explanation of how you scraped the data
  2. The most interesting and surprising fact you found after analyzing the data
  3. An actionable recommendation for developers based on your analysis

Your peers will rank your README.md subjectively. You can add anything else you like in the README.md but your peers will only focus on the 3 bullet points.

We'll distribute 5 repos to each peer to rank based on:

Peer scores are calculated as follows:

Paste the link to your GitHub repo here. It should look like this: https://github.com/[login]/[repository]


Now, answer these questions using your dataset. Each correct answer gets 1%.

1. Who are the top 5 users in ${city} with the highest number of followers? List their login in order, comma-separated.

2. Who are the 5 earliest registered GitHub users in ${city}? List their login in ascending order of created_at, comma-separated.

3. What are the 3 most popular licenses among these users? Ignore missing licenses. List the license_name in order, comma-separated.

4. Which company do the majority of these developers work at?

5. Which programming language is most popular among these users?

6. Which programming language is the second most popular among users who joined on or after 1 Jan 2020?

7. Which language has the highest average number of stars per repository?

8. Let's define leader_strength as followers / (1 + following). Who are the top 5 in terms of leader_strength? List their login in order, comma-separated.
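The leader_strength definition translates directly into code. The sketch below assumes each user is a dict with numeric `followers` and `following` values (values read from a CSV would need converting with `int` first); `top_logins` is a hypothetical helper name.

```python
def leader_strength(user):
    """followers / (1 + following), as defined in the question."""
    return user["followers"] / (1 + user["following"])


def top_logins(users, key, n=5):
    """Return the logins of the n highest-scoring users, comma-separated."""
    ranked = sorted(users, key=key, reverse=True)
    return ",".join(user["login"] for user in ranked[:n])
```

The `1 +` in the denominator keeps the ratio defined for users who follow nobody.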

9. What is the correlation between the number of followers and the number of public repositories among users in ${city}?
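If "correlation" here is read as the Pearson coefficient (an assumption, since the question does not name a method), it can be computed without extra libraries. The function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)
```

Here `xs` would be each user's follower count and `ys` their public repository count. The function raises `ZeroDivisionError` if either column is constant.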

10. Does creating more repos help users get more followers? Using regression, estimate how many additional followers a user gets per additional public repository.
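One way to read "additional followers per additional public repository" is as the slope of a simple least-squares line with repositories on the x-axis and followers on the y-axis. The sketch below assumes that reading; `ols_fit` is a hypothetical name.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The returned slope is the estimated change in followers for each extra public repository.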

11. Do people typically enable projects and wikis together? What is the correlation between a repo having projects enabled and having wiki enabled?

12. Do hireable users follow more people than those who are not hireable?
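The question does not say how to measure the difference. A plausible sketch, assuming the answer is the gap between the average `following` of hireable users and that of everyone else, is shown below. Treating a null `hireable` as "not hireable" is also an assumption.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def following_gap(users):
    """Average following of hireable users minus that of all other users."""
    # Rows read back from a CSV hold the string "true", so accept both forms.
    def is_hireable(user):
        return user.get("hireable") in (True, "true")

    hireable = [user["following"] for user in users if is_hireable(user)]
    others = [user["following"] for user in users if not is_hireable(user)]
    return mean(hireable) - mean(others)
```

A positive result would suggest that hireable users follow more people on average.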

13. Some developers write long bios. Does that help them get more followers? What's the impact of the length of their bio (in Unicode words, split by whitespace) on followers? (Ignore people without bios)
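Counting words "split by whitespace" maps onto Python's argument-free `str.split()`, which splits on any run of Unicode whitespace. The sketch below prepares (word count, followers) pairs and skips users without a bio; the function names are illustrative.

```python
def bio_word_count(bio):
    """Number of whitespace-separated words in a bio."""
    return len(bio.split())


def bio_length_pairs(users):
    """Return (bio word count, followers) for every user with a non-empty bio."""
    return [
        (bio_word_count(user["bio"]), user["followers"])
        for user in users
        if user.get("bio") and user["bio"].strip()
    ]
```

The resulting pairs can then be fed to whichever regression is used to estimate the effect on followers.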

14. Who created the most repositories on weekends (UTC)? List the top 5 users' login in order, comma-separated.
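GitHub timestamps such as `created_at` are ISO 8601 strings in UTC (for example `2024-01-06T10:00:00Z`), so a weekend check needs no timezone conversion. The sketch below assumes each repository row carries the owner's `login` alongside `created_at`; that column layout is an assumption.

```python
from collections import Counter
from datetime import datetime, timezone


def is_weekend_utc(created_at):
    """True if the UTC timestamp falls on a Saturday or Sunday."""
    moment = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return moment.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend


def top_weekend_creators(repos, n=5):
    """Return the n logins with the most weekend-created repos, comma-separated."""
    counts = Counter(
        repo["login"] for repo in repos if is_weekend_utc(repo["created_at"])
    )
    return ",".join(login for login, _ in counts.most_common(n))
```

Ties are returned in first-seen order by `Counter.most_common`, which may or may not match the expected answer.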

15. Do people who are hireable share their email addresses more often?

16. Let's assume that the last word in a user's name is their surname (ignore missing names, trim and split by whitespace.) What's the most common surname? (If there's a tie, list them all, comma-separated, alphabetically)
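The surname rule (trim, split by whitespace, take the last word, list ties alphabetically) can be sketched as follows. `most_common_surnames` is a hypothetical name, and the input is assumed to be the `name` column with missing values given as `None` or empty strings.

```python
from collections import Counter


def most_common_surnames(names):
    """Return the most frequent last word of each name; ties sorted alphabetically."""
    surnames = [name.split()[-1] for name in names if name and name.strip()]
    counts = Counter(surnames)
    if not counts:
        return ""
    highest = max(counts.values())
    return ",".join(sorted(s for s, count in counts.items() if count == highest))
```

The comparison is case-sensitive, so "Singh" and "singh" would be counted separately unless the names are normalized first.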