Common Data Types in Python
Working with data requires some basic knowledge of the types of data you may run into in Python. This tutorial focuses on data types as they apply to data analysis, and specifically on their parallels in R.
We will start with the built-in data types in Python and then expand to some data types in numpy and pandas. Finally, we will finish by covering how data types work in a dataframe.
Numbers in Python
Numbers can exist in a few different formats:
- Integers - This is your classic whole number class (int).
- Floats - These are numbers with a decimal point (float).
- Complex Numbers - These are numbers with an imaginary part (complex).
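As a quick illustration of the three numeric types listed above, here is a minimal sketch. The variable names are made up for this example; `type()` is the built-in that reports an object's class.

```python
# Hypothetical variable names, chosen only for this illustration.
whole = 42            # int
fraction = 3.14       # float
imaginary = 2 + 3j    # complex: real part 2, imaginary part 3

print(type(whole))      # <class 'int'>
print(type(fraction))   # <class 'float'>
print(type(imaginary))  # <class 'complex'>

# Mixing an int and a float in arithmetic gives a float.
total = whole + fraction
print(type(total))      # <class 'float'>

# True division always returns a float, even for two ints.
print(7 / 2)    # 3.5
# Floor division between two ints stays an int.
print(7 // 2)   # 3
```

For R users, the closest parallels are probably integer, numeric, and complex.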
Strings in Python
Strings are immutable sequences of Unicode characters. They can be created by putting characters into single, double, or triple quotes.
There is no separate character class in Python. A single character is just a string of length one, represented by the str class, so you can index a string to get a particular letter.
string = "python"
string[0]
Last Output:
'p'
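To make the quoting options and the "no character class" point concrete, here is a small sketch. The variable names are hypothetical.

```python
# Hypothetical names for illustration.
single = 'python'
double = "python"
triple = """python"""   # triple quotes can also span several lines

# All three spellings produce the same str value.
print(single == double == triple)  # True

# Indexing returns another str of length 1, not a separate character type.
first = single[0]
print(type(first))   # <class 'str'>
print(len(first))    # 1

# Strings are immutable: assigning to an index raises TypeError.
try:
    single[0] = "P"
except TypeError as err:
    print("cannot modify a string in place:", err)
```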
Lists
Lists in Python are kind of like vectors in R, i.e. the things you create with c(). They are ordered arrays of items. Unlike R vectors, lists are flexible, so items do not need to be of the same type.
See below for how to create a list and access some of its values:
# creating a list
list = ["lets", "learn", "python", "today"]
# will return "lets"
list[0]
# will return "today", the last item
list[-1]
# this will return the first two items (indexes 0 and 1)
# remember the stop index is not included!
list[0:2]
# this will also return the first two
list[:2]
# this will return the last 2
list[-2:]
Last Output:
['python', 'today']
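The flexibility mentioned above (items of different types in one list) and the slicing rules can be sketched as follows. The name `mixed` is hypothetical.

```python
# `mixed` is a hypothetical name used for illustration.
mixed = ["lets", 2, 3.5, True, None]

# Each element keeps its own type.
print([type(item).__name__ for item in mixed])
# ['str', 'int', 'float', 'bool', 'NoneType']

# Slicing returns a new list; the stop index is excluded.
first_two = mixed[0:2]
print(first_two)   # ['lets', 2]

# Changing the slice does not change the original list.
first_two[0] = "changed"
print(mixed[0])    # lets
```

In R, c("lets", 2, 3.5) would coerce everything to character; a Python list leaves each item as it is.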
You can add elements to a list and remove them again. Unlike in R, these methods modify the list in place, so you don't need to assign the result back to the list.
# adding an element to the end of a list
# this will add a 1 to the end of the list
list.append(1)
# this will insert "test" at index 3, the 4th position in the list
list.insert(3, "test")
# this will add multiple elements to the end of a list
list.extend(["more", "stuff"])
list
Last Output:
['lets', 'learn', 'python', 'test', 'today', 1, 'more', 'stuff']
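Because these methods edit the list in place, they return None. Assigning the result back, as you would in R, throws the list away. A minimal sketch with hypothetical names, which also shows how append and extend differ:

```python
# Hypothetical names for illustration.
words = ["lets", "learn"]

# append modifies `words` and returns None.
result = words.append("python")
print(result)   # None
print(words)    # ['lets', 'learn', 'python']

# append adds its argument as ONE element, even if it is a list.
nested = ["a"]
nested.append(["b", "c"])
print(nested)   # ['a', ['b', 'c']]

# extend adds each item of its argument separately.
flat = ["a"]
flat.extend(["b", "c"])
print(flat)     # ['a', 'b', 'c']
```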
You can also remove elements from a list.
# you can remove an element by value (only the first match is removed)
list.remove(1)
# this will remove everything from the list
list.clear()
list
Last Output:
[]
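Two details of remove are easy to trip over: it only removes the first matching value, and it raises ValueError when the value is not there. A short sketch, using a hypothetical list:

```python
# Hypothetical list for illustration.
values = [1, "test", 1, "more"]

# Only the first matching value is removed.
values.remove(1)
print(values)   # ['test', 1, 'more']

# Removing a value that is not present raises ValueError.
try:
    values.remove("missing")
except ValueError:
    print("'missing' is not in the list")

# Checking membership first avoids the exception.
if "test" in values:
    values.remove("test")
print(values)   # [1, 'more']
```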
Dictionaries
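Dictionaries are collections of key-value pairs. They are most similar to named lists in R, and bear some resemblance to JSON in web development. Since Python 3.7 a dictionary keeps its keys in insertion order, but you look values up by key rather than by position.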
# you create a dictionary with the curly brackets
dict = {"brand": "Ford", "model": "Mustang", "year": 1964}
A key value pair means you access the value by calling the key. The key goes before the : and the value goes after.
dict["brand"]
Last Output:
'Ford'
You can update a dictionary like so.
dict["year"] = 2020
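The same assignment syntax also adds a key that does not exist yet. The sketch below uses the hypothetical name `car` and a made-up "color" key, and shows `get` as a way to look up a key that may be missing.

```python
# `car` and the "color" key are hypothetical, for illustration only.
car = {"brand": "Ford", "model": "Mustang", "year": 1964}

car["year"] = 2020      # overwrites the existing value
car["color"] = "red"    # adds a new key-value pair

# car["owner"] would raise KeyError; get returns a default instead.
print(car.get("owner"))          # None
print(car.get("owner", "n/a"))   # n/a

# Keys come back in insertion order (guaranteed since Python 3.7).
print(list(car.keys()))   # ['brand', 'model', 'year', 'color']

# Loop over keys and values together.
for key, value in car.items():
    print(key, value)
```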
Dataframes
Dataframes in Python are just like dataframes in R. They are two-dimensional data structures with rows and columns. Looking at the code below, we can see that we are essentially creating a dictionary where the values are lists. Each list becomes a column, so the lists must all be the same length, and everything within a list should be the same type so the column gets a single dtype.
Let's create our first dataframe!
import pandas as pd
data = {
"Name": ["Goku", "Piccolo", "Vegeta"],
"Powerlevel": [9000, 200, 1000],
"Friendship Rating": [100, 100, 0],
"Drawability": [100, 80, 62]
}
# load data into a DataFrame object
df = pd.DataFrame(data)
df
|   | Name | Powerlevel | Friendship Rating | Drawability |
|---|---|---|---|---|
| 0 | Goku | 9000 | 100 | 100 |
| 1 | Piccolo | 200 | 100 | 80 |
| 2 | Vegeta | 1000 | 0 | 62 |
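Since each column carries one data type, it can help to inspect them. The sketch below rebuilds a smaller version of the dataframe and checks the column dtypes with `df.dtypes` and the helpers in `pd.api.types`. The `mixed` and `decimals` Series are hypothetical examples, and the exact dtype names pandas prints for text columns can differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# One dtype per column.
print(df.dtypes)

# A column of whole numbers gets an integer dtype.
print(pd.api.types.is_integer_dtype(df["Powerlevel"]))   # True

# Hypothetical mixed column: numbers and text together fall back
# to the generic object dtype.
mixed = pd.Series([9000, "over 9000"])
print(mixed.dtype)   # object

# Hypothetical column mixing ints and floats: everything becomes float.
decimals = pd.Series([100, 80.5])
print(pd.api.types.is_float_dtype(decimals))   # True
```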
Let's say that we want to observe either the head or tail of the data. We would use the df.head() or df.tail() method to do that. You can of course put the number of rows you want within the method; the default is 5.
df.head(5)
df.tail(2)
|   | Name | Powerlevel | Friendship Rating | Drawability |
|---|---|---|---|---|
| 1 | Piccolo | 200 | 100 | 80 |
| 2 | Vegeta | 1000 | 0 | 62 |
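A short sketch of how head and tail behave on a three-row dataframe, including what happens when you ask for more rows than exist. The dataframe here is a trimmed copy of the one above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# With no argument, head returns up to 5 rows.
print(len(df.head()))     # 3

# Asking for more rows than exist simply returns every row.
print(len(df.head(5)))    # 3

# tail(2) returns the last two rows and keeps their original index labels.
last_two = df.tail(2)
print(last_two["Name"].tolist())   # ['Piccolo', 'Vegeta']
print(last_two.index.tolist())     # [1, 2]

# A negative n means "all rows except the last |n|".
print(df.head(-1)["Name"].tolist())   # ['Goku', 'Piccolo']
```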
Accessing Columns
Columns are pandas Series, which behave a lot like lists! You can access them in much the same way:
# give me the column names
df.columns
# give me the values of a specific column
# we are limiting to the first two
# DON'T FORGET THE INDEX STARTS AT 0
df["Name"][0:2]
# you can do the below for the same result
df.Name[0:2]
# multiple columns
df[["Name", "Powerlevel"]][0:2]
|   | Name | Powerlevel |
|---|---|---|
| 0 | Goku | 9000 |
| 1 | Piccolo | 200 |
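The three access styles shown above return different kinds of objects, which this sketch makes explicit. It rebuilds part of the dataframe so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
    "Friendship Rating": [100, 100, 0],
})

# Single brackets with one name give a Series.
name_col = df["Name"]
print(type(name_col).__name__)    # Series

# Double brackets (a list of names) give a DataFrame.
two_cols = df[["Name", "Powerlevel"]]
print(type(two_cols).__name__)    # DataFrame

# A Series converts to a plain Python list with tolist().
print(name_col.tolist())          # ['Goku', 'Piccolo', 'Vegeta']

# Dot access only works for names that are valid Python identifiers,
# so a column with a space needs the bracket form.
print(df["Friendship Rating"].tolist())   # [100, 100, 0]
```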
Accessing Rows and Columns
Use iloc to access particular rows and columns by position. This is similar to the df[r,c] notation in R.
Unfortunately, selecting a single column this way returns a pandas Series, so the result is not a dataframe anymore.
# this will give us the first three rows and the SECOND column
new = df.iloc[0:3,1]
# you have to coerce this back into a df
new = pd.DataFrame(new)
new
|   | Powerlevel |
|---|---|
| 0 | 9000 |
| 1 | 200 |
| 2 | 1000 |
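One way to avoid the conversion step is to pass the column position as a list, which keeps the result a DataFrame. This is a sketch of that alternative, not the approach used above; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# A single integer for the column gives a Series.
as_series = df.iloc[0:3, 1]
print(type(as_series).__name__)   # Series

# A list of column positions keeps the result a DataFrame.
as_frame = df.iloc[0:3, [1]]
print(type(as_frame).__name__)    # DataFrame

# to_frame() is another way to turn a Series back into a DataFrame.
converted = as_series.to_frame()
print(list(converted.columns))    # ['Powerlevel']

# A single row and a single column give one value.
print(df.iloc[0, 1])              # 9000
```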
Describing the Data
You can also get basic summary stats for the numeric columns using df.describe().
df.describe()
|   | Powerlevel | Friendship Rating | Drawability |
|---|---|---|---|
| count | 3.000000 | 3.000000 | 3.000000 |
| mean | 3400.000000 | 66.666667 | 80.666667 |
| std | 4866.210024 | 57.735027 | 19.008770 |
| min | 200.000000 | 0.000000 | 62.000000 |
| 25% | 600.000000 | 50.000000 | 71.000000 |
| 50% | 1000.000000 | 100.000000 | 80.000000 |
| 75% | 5000.000000 | 100.000000 | 90.000000 |
| max | 9000.000000 | 100.000000 | 100.000000 |
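The result of describe is itself a dataframe, so individual statistics can be pulled out by row and column label. A minimal sketch, rebuilding the data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
    "Friendship Rating": [100, 100, 0],
    "Drawability": [100, 80, 62],
})

summary = df.describe()

# By default only numeric columns are summarised, so "Name" is left out.
print("Name" in summary.columns)           # False

# Pick one statistic with .loc[row_label, column_label].
print(summary.loc["mean", "Powerlevel"])   # 3400.0
print(summary.loc["max", "Drawability"])   # 100.0

# include="all" also summarises the text column (count, unique, top, freq).
full_summary = df.describe(include="all")
print("Name" in full_summary.columns)      # True
```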