Tools in Data Science - Project 1

Deadline:

Using the GitHub API, scrape all users in the city of ${city} with over ${followers} followers, and their repositories.
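One way to express the search above is with the GitHub user-search endpoint and its `location:` and `followers:` qualifiers. The sketch below only builds the request URL; the function name `build_search_url` and the choice of 100 results per page are assumptions for illustration. Search results carry only summary fields, so fields such as `name` or `company` would likely need a follow-up request per user.

```python
from urllib.parse import urlencode


def build_search_url(city: str, min_followers: int, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of a GitHub user search (hypothetical helper)."""
    # Quote multi-word cities so the whole name stays inside the location qualifier.
    location = f'"{city}"' if " " in city else city
    query = f"location:{location} followers:>{min_followers}"
    params = {"q": query, "per_page": per_page, "page": page}
    return "https://api.github.com/search/users?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("Mumbai", 100))
```

Requesting successive `page` values until a page comes back with fewer than `per_page` items is one simple way to collect every matching user.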

Create a GitHub repo with these files in the main branch:

users.csv. See below. Use the SAME values as in the API response. Write booleans as true and false, and null as an empty string.
repositories.csv. See below. Use the SAME values as in the API response. Write booleans as true and false, and null as an empty string.
README.md. See below.
Optional but recommended: your code and/or spreadsheet, in whichever language you analyzed the data in.
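The boolean and null rules for the two CSV files can be sketched with Python's `csv` module. The helper names `to_csv_value` and `write_rows` are hypothetical, and the sketch assumes each row is a dict already decoded from the API's JSON.

```python
import csv
import io


def to_csv_value(value):
    """Map a decoded JSON value to the text the CSV should contain."""
    if value is None:
        return ""  # null becomes an empty string
    if isinstance(value, bool):
        return "true" if value else "false"  # lowercase, not Python's True/False
    return value  # everything else is written as-is


def write_rows(rows, fieldnames, out):
    """Write a header plus one line per row, keeping only the listed fields."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: to_csv_value(row.get(name)) for name in fieldnames})


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows([{"login": "a", "hireable": True, "email": None}],
               ["login", "hireable", "email"], buffer)
    print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=""` avoids doubled line endings on Windows.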

users.csv has the following information about each user in ${city} with over ${followers} followers, with fields:

login: Their GitHub user ID
name: Their full name
company: The company they work at. Clean up company names. At least make sure:
1. They're trimmed of whitespace
2. Leading @ symbol is stripped (Note: ONLY the first one is stripped)
3. They are converted to UPPERCASE
location: The city they are in
email: Their email address
hireable: Whether they are open to being hired
bio: A short bio about them
public_repos: The number of public repositories they have
followers: The number of followers they have
following: The number of people they are following
created_at: When they joined GitHub
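The three company clean-up rules listed under `company` can be sketched as a small function. The name `clean_company` is hypothetical, and the order of operations (trim, then strip one leading `@`, then uppercase) is an assumption, since the rules do not say what to do with whitespace that follows the `@`.

```python
def clean_company(company):
    """Apply the three clean-up rules to one company value."""
    if company is None:
        return ""  # missing company stays empty
    cleaned = company.strip()  # rule 1: trim whitespace
    if cleaned.startswith("@"):
        cleaned = cleaned[1:]  # rule 2: drop only the first leading @
    return cleaned.upper()  # rule 3: convert to uppercase


if __name__ == "__main__":
    for raw in ["  @google ", "@@foo", "Acme Corp", None]:
        print(repr(raw), "->", repr(clean_company(raw)))
```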

repositories.csv has these users' public repositories. For each user in users.csv, fetch up to the 500 most recently pushed repositories, with fields:

login: The GitHub user ID (login) of the owner (which, BTW, is not directly in the API response)
full_name: Full name of the repository
created_at: When the repository was created
stargazers_count: Number of stars the repository has
watchers_count: Number of watchers the repository has
language: The programming language the repository is written in
has_projects: Whether the repository has projects enabled
has_wiki: Whether the repository has a wiki
license_name: Name of the license the repository is under (This is under license.key)
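A minimal sketch of how one repository object might be flattened into a repositories.csv row, plus the page URLs for the 500-repository limit. It assumes the "list repositories for a user" endpoint with `sort=pushed` and 100 items per page; `repo_page_urls` and `repo_to_row` are hypothetical names.

```python
import math


def repo_page_urls(login, limit=500, per_page=100):
    """URLs for the pages that together cover the `limit` most recently pushed repos."""
    pages = math.ceil(limit / per_page)
    return [
        f"https://api.github.com/users/{login}/repos"
        f"?sort=pushed&direction=desc&per_page={per_page}&page={page}"
        for page in range(1, pages + 1)
    ]


def repo_to_row(login, repo):
    """Pick the repositories.csv fields out of one decoded repository object."""
    license_info = repo.get("license") or {}  # license is null for unlicensed repos
    return {
        "login": login,  # taken from the user being processed
        "full_name": repo.get("full_name"),
        "created_at": repo.get("created_at"),
        "stargazers_count": repo.get("stargazers_count"),
        "watchers_count": repo.get("watchers_count"),
        "language": repo.get("language"),
        "has_projects": repo.get("has_projects"),
        "has_wiki": repo.get("has_wiki"),
        "license_name": license_info.get("key"),  # license.key, as noted above
    }
```

Stopping early when a page returns fewer than `per_page` items avoids needless requests for users with few repositories.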

README.md must begin with 3 bullet points. Each bullet must be one sentence of no more than 50 words.

An explanation of how you scraped the data
The most interesting and surprising fact you found after analyzing the data
An actionable recommendation for developers based on your analysis

Your peers will rank your README.md subjectively. You can add anything else you like in the README.md but your peers will only focus on the 3 bullet points.

We'll distribute 5 repos to each peer to rank based on:

Whose PROCESS of analysis looks best to you? (Peers will rank from 1 - best to 5 - worst)
Whose RESULTS did you find most interesting? (Peers will rank from 1 - best to 5 - worst)

Peer scores are calculated as follows:

PROCESS SCORE = 2% of project grade for rank 1, 1.5% for rank 2, 1% for rank 3, 0.5% for rank 4, 0% for rank 5
RESULT SCORE = 2% of project grade for rank 1, 1.5% for rank 2, 1% for rank 3, 0.5% for rank 4, 0% for rank 5

Paste the link to your GitHub repo here. It should look like this: `https://github.com/[login]/[repository]`

Your GitHub repo URL

Now, answer these questions using your dataset. Each correct answer gets 1%.

1. Who are the top 5 users in `${city}` with the highest number of followers? List their `login` in order, comma-separated.

Users

2. Who are the 5 earliest registered GitHub users in `${city}`? List their `login` in ascending order of `created_at`, comma-separated.

Users

3. What are the 3 most popular licenses among these users? Ignore missing licenses. List the `license_name` in order, comma-separated.

Licenses

4. Which company do the majority of these developers work at?

Company (cleaned up as explained above)

5. Which programming language is most popular among these users?

Language

6. Which programming language is the second most popular among users who joined on or after 1 Jan 2020?

Language

7. Which language has the highest average number of stars per repository?

Language
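For question 7, a group-by average can be sketched with plain dictionaries. The function name `top_language_by_avg_stars` is hypothetical, rows are assumed to have `stargazers_count` already converted to a number, and skipping repositories with no language is an assumption the question does not spell out.

```python
from collections import defaultdict


def top_language_by_avg_stars(repos):
    """Return the language whose repositories have the highest mean star count."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for repo in repos:
        language = repo["language"]
        if not language:
            continue  # assumption: repos without a language are ignored
        totals[language] += repo["stargazers_count"]
        counts[language] += 1
    return max(counts, key=lambda lang: totals[lang] / counts[lang])
```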

8. Let's define `leader_strength` as `followers / (1 + following)`. Who are the top 5 in terms of `leader_strength`? List their `login` in order, comma-separated.

User login
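The `leader_strength` formula from question 8 translates directly into a sort key. This sketch assumes each user row already has `followers` and `following` as integers (values read from a CSV arrive as strings and would need converting first); `top_by_leader_strength` is a hypothetical name.

```python
def top_by_leader_strength(users, n=5):
    """Return the logins of the n users with the highest followers / (1 + following)."""
    ranked = sorted(
        users,
        key=lambda user: user["followers"] / (1 + user["following"]),
        reverse=True,
    )
    return [user["login"] for user in ranked[:n]]
```

The `1 +` in the denominator keeps the ratio defined for users who follow nobody.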

9. What is the correlation between the number of followers and the number of public repositories among users in `${city}`?

Correlation between followers and repos (to 3 decimal places, e.g. 0.123 or -0.123)
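The question does not name a correlation measure; the sketch below assumes the Pearson coefficient and computes it from its definition. The function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    followers = [120, 340, 90, 560]
    public_repos = [10, 42, 7, 58]
    print(round(pearson(followers, public_repos), 3))
```

The same function would also cover question 11 if `has_projects` and `has_wiki` are mapped to 1 and 0 first.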

10. Does creating more repos help users get more followers? Using regression, estimate how many additional followers a user gets per additional public repository.

Regression slope of followers on repos (to 3 decimal places, e.g. 0.123 or -0.123)
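Assuming a simple least-squares line with followers as the dependent variable and public repositories as the predictor, the slope is the covariance of the two divided by the variance of the predictor. The function name `regression_slope` is hypothetical.

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    # Raises ZeroDivisionError if every x is identical.
    return covariance / var_x


if __name__ == "__main__":
    public_repos = [10, 42, 7, 58]
    followers = [120, 340, 90, 560]
    print(round(regression_slope(public_repos, followers), 3))
```

Note the argument order: the predictor (repos) comes first and the outcome (followers) second, since swapping them gives a different slope.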

11. Do people typically enable projects and wikis together? What is the correlation between a repo having projects enabled and having wiki enabled?

Correlation between projects and wiki enabled (to 3 decimal places, e.g. 0.123 or -0.123)

12. Do hireable users follow more people than those who are not hireable?

Average of following per user for hireable=true minus the average following for the rest (to 3 decimal places, e.g. 12.345 or -12.345)
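The difference in averages for question 12 can be sketched as below. It assumes `hireable` has been parsed to `True`, `False`, or `None`, and treats both `False` and `None` as "the rest", which is one reading of the answer description; `following_gap` is a hypothetical name.

```python
def following_gap(users):
    """Mean following for hireable users minus mean following for everyone else."""
    hireable = [u["following"] for u in users if u["hireable"] is True]
    rest = [u["following"] for u in users if u["hireable"] is not True]
    # Raises ZeroDivisionError if either group is empty.
    return sum(hireable) / len(hireable) - sum(rest) / len(rest)
```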

13. Some developers write long bios. Does that help them get more followers? What's the impact of the length of their bio (in Unicode words, split by whitespace) on `followers`? (Ignore people without bios)

Regression slope of followers on bio word count (to 3 decimal places, e.g. 12.345 or -12.345)
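For the bio length, Python's `str.split()` with no arguments splits on any run of Unicode whitespace, which matches the "split by whitespace" wording. The sketch below only prepares the (word count, followers) pairs for users who have a bio; `bio_length_pairs` is a hypothetical name, and the pairs can then go into whatever regression routine you use.

```python
def bio_length_pairs(users):
    """Return (bio word count, followers) for every user with a non-empty bio."""
    pairs = []
    for user in users:
        bio = user.get("bio")
        if not bio or not bio.strip():
            continue  # ignore people without bios
        pairs.append((len(bio.split()), user["followers"]))
    return pairs
```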

14. Who created the most repositories on weekends (UTC)? List the top 5 users' `login` in order, comma-separated

Users login
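A sketch for question 14, assuming `created_at` keeps the API's UTC format such as `2011-01-26T19:01:12Z` and that each repository row carries the owner's `login`. In Python's `weekday()`, Saturday is 5 and Sunday is 6. The function name `top_weekend_creators` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def top_weekend_creators(repos, n=5):
    """Logins of the n users who created the most repositories on Saturday or Sunday (UTC)."""
    counts = Counter()
    for repo in repos:
        # The trailing Z marks UTC, so no timezone conversion is needed.
        created = datetime.strptime(repo["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        if created.weekday() >= 5:
            counts[repo["login"]] += 1
    return [login for login, _ in counts.most_common(n)]
```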

15. Do people who are hireable share their email addresses more often?

[fraction of users with email when hireable=true] minus [fraction of users with email for the rest] (to 3 decimal places, e.g. 0.123 or -0.123)

16. Let's assume that the last word in a user's `name` is their surname (ignore missing names, trim and split by whitespace.) What's the most common surname? (If there's a tie, list them all, comma-separated, alphabetically)

Most common surname(s)
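The surname rule (trim, split by whitespace, take the last word, report ties alphabetically) can be sketched as follows. The function name `most_common_surnames` is hypothetical, and matching is case-sensitive here, which is an assumption.

```python
from collections import Counter


def most_common_surnames(names):
    """Return the most frequent last word(s) of the given names, sorted alphabetically."""
    counts = Counter(
        name.strip().split()[-1]
        for name in names
        if name and name.strip()  # ignore missing or blank names
    )
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(surname for surname, count in counts.items() if count == highest)


if __name__ == "__main__":
    print(", ".join(most_common_surnames(["Asha Rao", "Vikram Rao", "Li Wei"])))
```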
