{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","metadata":{"id":"Y30FK5CHTRZ9"},"source":["#**Workshop 6: Tabular Data**\n"]},{"cell_type":"markdown","metadata":{"id":"-69E-7YeedYt"},"source":["It is recommended that you download the Gapminder Dataset: https://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip"]},{"cell_type":"markdown","metadata":{"id":"cMHKRoQeTVbj"},"source":["##**Objectives**\n","At the end of this workshop, you will be able to:\n","* Import the Pandas library.\n","* Use Pandas to load a simple CSV data set.\n","* Get some basic information about a Pandas DataFrame.\n","* Select individual values from a Pandas dataframe.\n","* Select entire rows or entire columns from a dataframe.\n","* Select a subset of both rows and columns from a dataframe in a single operation.\n","\n","\n"]},{"cell_type":"markdown","metadata":{"id":"L1e6_Zo-Tj9O"},"source":["##**Loading Data into Python**\n","Pandas is a widely-used Python library for statistics, particularly on tabular data. Load it with the code below.\n","\n","Note: The alias pd is commonly used for Panda (you can also import as \"just\" pandas, but importing as pd allows for less typing later).\n"]},{"cell_type":"code","metadata":{"id":"zCsma8oNbvXm"},"source":["import pandas as pd"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"3jMP7smFaXF3"},"source":["##**Where is your data coming from? Getting it into Google Colab**\n","\n","###**GitHub**\n","The easiest way to upload a CSV file is from your GitHub repository.\n","1. Click on the dataset in your repository\n","2. Click on View Raw\n","3. Copy the link to the raw dataset and store it as a string variable called url in Colab as shown below (a cleaner method but it’s not necessary).\n","4. Load the url into Pandas read_csv to get the dataframe using the code below:\n","\n","```\n","data = pd.read_csv(url)\n","```\n","\n","###**Your Desktop**\n","The following code will prompt you to select a file. Click on “Choose Files” then select and upload the file. Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.\n","```\n","from google.colab import files\n","uploaded = files.upload()\n","data = pd.read_csv('Filename.csv')\n","```\n","Note: Make sure the filename matches the name of the uploaded file\n","\n","###**Google Drive**\n","This is the most complicated of the three methods, but works very well if you are storing your data in google sheets.\n","\n","```\n","# Code to read a google sheet into Colaboratory:\n","from google.colab import auth\n","auth.authenticate_user()\n","import gspread\n","from oauth2client.client import GoogleCredentials\n","gc = gspread.authorize(GoogleCredentials.get_application_default())\n","worksheet = gc.open('Your Sheet Name').sheet1\n","# get_all_values gives a list of rows.\n","rows = worksheet.get_all_values()\n","# Convert to a DataFrame and render.\n","import pandas as pd\n","data = pd.DataFrame.from_records(rows)\n","```\n","\n","###**Comparing the upload process to using Jupyter Notebooks**\n","\n","```\n","data = pd.read_csv('folder_location/file_name.csv')\n","```\n","Note: If you're importing a file from a folder on your desktop, it will look like this:\n","```\n","data = pd.read_csv('data/gapminder_gdp_oceania.csv')\n","```\n","If your file is in more than one folder, you will need to secify the entire file path:\n","```\n","data = pd.read_csv('documents/pythonPractice/data/gapminder_gdp_oceania.csv')\n","```"]},{"cell_type":"code","metadata":{"id":"SVcZ3ClJeVFd"},"source":["#The code below is to upload the data from your computer,\n","#but feel free to use the file upload of your choice below\n","\n","from google.colab import files\n","uploaded = files.upload()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"mXDtmv4Qf0YM"},"source":["#Make sure 'Filename.csv' matches the file you've chosen to upload\n","#The next few lines of code work with the oceania data set\n","#Feel free to use any set as long as you modify the prefilled code to match\n","\n","data = pd.read_csv('gapminder_gdp_oceania.csv')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"pI842Laiam5z"},"source":["###**Your Data is in Colab!**\n","\n","We uploaded our file(s) as the variable data and used **pd.read_csv** to store the data that was read.\n","\n","\n","The columns in a dataframe are the observed variables, and the rows are the observations.\n","Pandas uses backslash \\ to show wrapped lines when output is too wide to fit the screen."]},{"cell_type":"code","metadata":{"id":"9oerHazwZT8G"},"source":["print(data)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"07GovO2PgT_C"},"source":["The columns in a dataframe are the observed variables, and the rows are the observations.\n","Pandas uses backslash \\ to show wrapped lines when output is too wide to fit the screen.\n"]},{"cell_type":"markdown","metadata":{"id":"b1j5JwcQUCaj"},"source":["##**Use index_col to specify that a column’s values should be used as row headings**\n","\n"]},{"cell_type":"code","metadata":{"id":"QQ6uq9G3UQnE"},"source":["#Really want to index by country\n","data = pd.read_csv('gapminder_gdp_oceania.csv', index_col='country')\n","print(data)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"d41v0qQZlHxS"},"source":["##**Use the data.info() method to find out more about a data frame.**"]},{"cell_type":"code","metadata":{"id":"4UFlaIY-f27x"},"source":["data.info()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Y3Xka7Rbf9K5"},"source":["data.info() tells us this data set:\n","* Is a data frame\n","* Has two rows named 'Australia' and 'New Zealand'\n","* Has twelve columns, each of which has two actual 64-bit floating point values.\n","* Contains only \"Non-null\" (means that we are not missing data values)\n","* Uses 208 bytes of memory.\n"]},{"cell_type":"markdown","metadata":{"id":"lrhI1UIdgXxJ"},"source":["##**The data.columns variable stores information about the data frame’s columns.**"]},{"cell_type":"code","metadata":{"id":"12A861Z4gvon"},"source":["print(data.columns)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"g-wgtnn7g5Jg"},"source":["##**Use data.T to transpose or change the view of a data frame**"]},{"cell_type":"code","metadata":{"id":"jcO5sbbzg62q"},"source":["print(data.T)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Zv9IeKtwg9Tt"},"source":["##**Data.describe() gets the summary statistics of only the columns that have numerical data.**\n","\n","Note: All other columns are ignored, unless you use the argument include='all'"]},{"cell_type":"code","metadata":{"id":"iNDmIwathEwV"},"source":["print(data.describe())"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"VFj0gX9rhKs6"},"source":["##**Use data.iloc[..., ...] to select values by their (entry) position**\n","This is similar to using indexes to locate specific values in a list!"]},{"cell_type":"code","metadata":{"id":"CzK5rjZ7hTL7"},"source":["#import new data set (this code will use \"europe\")\n","from google.colab import files\n","uploaded = files.upload()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"GmxMv7NchxE3"},"source":["data = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')\n","print(data.iloc[0, 0])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"BFm3zL5ZoQ--"},"source":["##**Use data.loc[..., ...] to select values by their (entry) label.**"]},{"cell_type":"code","metadata":{"id":"wc56MPYDobPR"},"source":["print(data.loc[\"Albania\", \"gdpPercap_1952\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Lbn8ZubdoxCu"},"source":["##**Use : on its own to mean all columns or all rows.**"]},{"cell_type":"code","metadata":{"id":"5iV-Qseho129"},"source":["print(data.loc[\"Albania\", :])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"dn4qdIg3o7-Y"},"source":["##**Select multiple columns or rows using data.loc and a named slice.**"]},{"cell_type":"code","metadata":{"id":"MEp98Yv2o_gr"},"source":["print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"XpeBBD0vpBlf"},"source":["##**All the statistical operators that work on entire data frames work the same way on slices.**"]},{"cell_type":"code","metadata":{"id":"-fC81qr6pB7u"},"source":["print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"DbOUbV5vpLv4"},"source":["print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"2OaVCyrDpN4u"},"source":["##**Use comparisons to select data based on value.**\n","Note: Returns a dataframe of similar shape of \"true false\""]},{"cell_type":"code","metadata":{"id":"as7FbD8MpPgf"},"source":["# Create a subset of data to keep output readable.\n","subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']\n","print('Subset of data:\\n', subset)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"TEGg-R5OpTz9"},"source":["# Which values were greater than 10000 ?\n","print('\\nWhere are values large?\\n', subset > 10000)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"w4KDET9JqkKI"},"source":["##**Try it yourself!**\n","1. Import a new dataset into colab\n","2. Try some of the codes we worked with today\n","\n","Bonus: Take a look at the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) and try some of the other built in commands!"]},{"cell_type":"markdown","metadata":{"id":"yRVlFpi2ep7J"},"source":["##**Sources**\n","1. Google colaboratory. Accessed August 12, 2021. https://colab.research.google.com/notebooks/snippets/sheets.ipynb#scrollTo=JiJVCmu3dhFa\n","2. Reading tabular data into dataframes – plotting and programming in python. Accessed May 17, 2021. http://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html\n","3. Pandas dataframes – plotting and programming in python. Accessed May 17, 2021. https://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html\n","\n"]}]}