Common Data Types in Python

Working with data requires a basic understanding of the data types you may run into in Python. This tutorial focuses on data types as they apply to data analysis, and specifically on their parallels in R.

We will start with the built-in data types in Python and then expand to some data types in NumPy and pandas. Finally, we will finish by covering how data types work in a dataframe.

Numbers in Python

Numbers come in three built-in types: int (integers), float (floating-point numbers), and complex (complex numbers).
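As a small sketch of how these behave, the snippet below creates one value of each built-in numeric type and checks what arithmetic returns. The variable names are illustrative and not part of the original tutorial.

```python
# One value of each built-in numeric type (names are illustrative)
whole = 42            # int: whole numbers of arbitrary size
decimal = 3.14        # float: double-precision floating point
imaginary = 2 + 3j    # complex: real and imaginary parts

print(type(whole).__name__)      # int
print(type(decimal).__name__)    # float
print(type(imaginary).__name__)  # complex

# Mixing an int and a float in arithmetic yields a float
mixed = whole + decimal

# True division always returns a float, even for two ints
quotient = 10 / 2         # 5.0

# Floor division of two ints stays an int
floor_quotient = 10 // 2  # 5
```

Coming from R, the main surprise is usually that whole numbers are int by default, whereas R treats a bare 42 as a double.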

Strings in Python

Strings are immutable sequences of Unicode characters. They can be created by putting characters inside single, double, or triple quotes.

Python has no separate character type: a single character is just a str of length 1. You can index a string to get a particular letter.

string = "python"
string[0]

Last Output:

'p'
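To make the quoting styles and the "no character type" point concrete, here is a short sketch. The variable names are made up for illustration.

```python
# All three quoting styles produce the same str type
single = 'python'
double = "python"
triple = """python
spans lines"""

print(single == double)        # True
print(type(triple).__name__)   # str

# A "character" is just a string of length 1
first = double[0]
print(type(first).__name__, len(first))  # str 1

# Strings are immutable: assigning to an index raises TypeError,
# so build a new string instead
try:
    double[0] = "P"
except TypeError:
    changed = "P" + double[1:]
print(changed)  # Python
```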

Lists

Lists in Python are kind of like vectors in R, i.e. the things you create with c(). They are ordered collections of items. Unlike R vectors, lists are flexible, so items do not need to be of the same type.

See below for how to create a list and access some of its values:

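# creating a list
# (naming a variable "list" hides Python's built-in list type; fine for a demo, avoid it in real code)
list = ["lets", "learn", "python", "today"]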
# will return "lets"
list[0]
# will return "today" (negative indexes count from the end)
list[-1]

# this will return the first two elements (indexes 0 and 1)
# remember the end index is not included!
list[0:2]
# this will also return the first two
list[:2]
# this will return the last 2
list[-2:]

Last Output:

['python', 'today']
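The claim that list items need not share a type can be checked directly. In the sketch below (names are illustrative), each element keeps its own type, whereas R's c(1, "two") would coerce everything to character.

```python
# A list can hold values of different types side by side
mixed = [1, "two", 3.0, True]

# Each element keeps its own type; nothing is coerced
types = [type(item).__name__ for item in mixed]
print(types)       # ['int', 'str', 'float', 'bool']
print(len(mixed))  # 4
```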

You can add elements to a list. Unlike in R, all of these methods edit the list in place, so you don't need to assign the result back to the list.

# adding elements to a list

# this will add a 1 to the end of the list
list.append(1)

# this will insert "test" at index 3 (the 4th position)
list.insert(3, "test")

# this will add multiple elements to the end of a list
list.extend(["more", "stuff"])

list

Last Output:

['lets', 'learn', 'python', 'test', 'today', 1, 'more', 'stuff']
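Because these methods change the list in place, they return None, and assigning the result back (as you would in R) is a common mistake. A minimal sketch, using a hypothetical list called words:

```python
words = ["lets", "learn", "python"]

# append modifies words in place and returns None
result = words.append("today")
print(result)   # None
print(words)    # ['lets', 'learn', 'python', 'today']

# Two names can point at the same list, so edits show through both
alias = words
alias.append("again")
print(len(words))  # 5

# Use copy() when you want an independent list
independent = words.copy()
independent.append("only here")
print(len(words), len(independent))  # 5 6
```

Writing words = words.append("today") would therefore replace the list with None.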

You can also remove elements from a list.

# you can remove an element by value (only the first match is removed)
list.remove(1)
# this will remove everything from the list
list.clear()
list

Last Output:

[]
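Besides remove and clear, lists can also drop elements by position. The sketch below uses pop and del on a hypothetical list called items, and shows that remove raises ValueError when the value is absent.

```python
items = ["lets", "learn", "python", 1]

# pop removes and returns an element: the last one by default
last = items.pop()      # 1
first = items.pop(0)    # "lets"

# del removes by position without returning anything
del items[0]            # drops "learn"
print(items)            # ['python']

# remove raises ValueError if the value is not in the list
try:
    items.remove("missing")
    found = True
except ValueError:
    found = False
print(found)            # False
```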

Dictionaries

Dictionaries are collections of key-value pairs. They are most similar to named lists in R, and bear some resemblance to JSON objects in web development. Since Python 3.7 they keep their keys in insertion order.

# you create a dictionary with the curly brackets
dict = {"brand": "Ford", "model": "Mustang", "year": 1964}

A key-value pair means you access the value by looking up its key. The key goes before the : and the value goes after.

dict["brand"]

Last Output:

'Ford'

You can update a dictionary like so.

dict["year"] = 2020
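A few other everyday dictionary operations, sketched with a hypothetical dictionary named car so the built-in dict type is not shadowed:

```python
car = {"brand": "Ford", "model": "Mustang", "year": 1964}

# Assigning to a new key adds it; assigning to an existing key updates it
car["color"] = "red"
car["year"] = 2020

# get returns a default instead of raising KeyError for a missing key
owner = car.get("owner", "unknown")
print(owner)           # unknown

# Membership tests check keys, not values
print("year" in car)   # True

# items() yields key-value pairs in insertion order
for key, value in car.items():
    print(key, value)

# pop removes a key and returns its value
removed = car.pop("color")
print(removed)         # red
print(list(car))       # ['brand', 'model', 'year']
```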

Dataframes

Dataframes in Python (from the pandas library) are just like dataframes in R. They are two-dimensional data structures with rows and columns. Looking at the code below, we can see that we are essentially creating a dictionary where the values are lists. Each list becomes a column, so everything within a list should be the same type.

Let's create our first dataframe!

import pandas as pd

data = {
  "Name": ["Goku", "Piccolo", "Vegeta"],
  "Powerlevel": [9000, 200, 1000],
  "Friendship Rating": [100, 100, 0],
  "Drawability": [100, 80, 62]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

df

Last Output:

      Name  Powerlevel  Friendship Rating  Drawability
0     Goku        9000                100          100
1  Piccolo         200                100           80
2   Vegeta        1000                  0           62
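Each column of a dataframe carries a single data type (dtype), which is why the values in a list should match. The sketch below inspects the dtypes of a trimmed-down copy of the dataframe. The exact dtype reported for text columns depends on your pandas version, so no output is shown for it, and the "unknown" value is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# One dtype per column, similar to str(df) or sapply(df, class) in R
print(df.dtypes)

# A list of Python ints becomes an integer column
is_int = pd.api.types.is_integer_dtype(df["Powerlevel"])
print(is_int)   # True

# Mixing numbers and text forces the generic object dtype
mixed = pd.Series([9000, "unknown", 1000])
is_obj = pd.api.types.is_object_dtype(mixed)
print(is_obj)   # True

# astype converts a column to another type
as_float = df["Powerlevel"].astype(float)
```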

Let's say that we want to look at either the head or tail of the data.

We would use the df.head() or df.tail() method to do that. Both return 5 rows by default; you can put the number of rows you want within the method.

df.head(5)
df.tail(2)

Last Output:

      Name  Powerlevel  Friendship Rating  Drawability
1  Piccolo         200                100           80
2   Vegeta        1000                  0           62

Accessing Columns

Columns behave a lot like lists! Each column is a pandas Series, and you can access its values in much the same way:

# give me the column names
df.columns

# give me the values of a specific column
# we are limiting to the first two rows
# don't forget the index starts at 0
df["Name"][0:2]

# you can do the below for the same result
df.Name[0:2]

# multiple columns
df[["Name", "Powerlevel"]][0:2]

Last Output:

      Name  Powerlevel
0     Goku        9000
1  Piccolo         200
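The difference between single and double brackets is worth seeing once: single brackets return a pandas Series, double brackets return a dataframe. A minimal sketch on a trimmed-down copy of the dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# Single brackets: one column as a Series
one = df["Name"]
print(type(one).__name__)   # Series

# Double brackets: a DataFrame, even with only one column
two = df[["Name"]]
print(type(two).__name__)   # DataFrame

# tolist() turns a column into a plain Python list
names = df["Name"].tolist()
print(names)                # ['Goku', 'Piccolo', 'Vegeta']
```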

Accessing Rows and Columns

Use iloc to access particular rows and columns by position. This is similar to the df[r, c] notation in R.

Unfortunately, selecting a single column this way returns a pandas Series rather than a dataframe.

# this will give us the first three rows and the SECOND column
new = df.iloc[0:3,1]

# convert the resulting Series back into a df
new = pd.DataFrame(new)
new

Last Output:

   Powerlevel
0        9000
1         200
2        1000
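One way to avoid the conversion step, sketched here as an alternative rather than the tutorial's own approach, is to pass the column position as a list: iloc then keeps the result as a dataframe. The sketch also shows loc, the label-based counterpart of iloc.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# A scalar column position returns a Series
as_series = df.iloc[0:3, 1]
print(type(as_series).__name__)   # Series

# A list of column positions keeps a DataFrame
as_frame = df.iloc[0:3, [1]]
print(type(as_frame).__name__)    # DataFrame

# Scalar row and column positions return a single value
single = df.iloc[0, 1]
print(single)                     # 9000

# loc selects by label, and its slices INCLUDE the end label
by_label = df.loc[0:1, ["Name", "Powerlevel"]]
print(by_label.shape)             # (2, 2)
```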

Describing the Data

You can also get basic summary stats for the numeric columns using df.describe().

df.describe()

Last Output:

        Powerlevel  Friendship Rating  Drawability
count     3.000000           3.000000     3.000000
mean   3400.000000          66.666667    80.666667
std    4866.210024          57.735027    19.008770
min     200.000000           0.000000    62.000000
25%     600.000000          50.000000    71.000000
50%    1000.000000         100.000000    80.000000
75%    5000.000000         100.000000    90.000000
max    9000.000000         100.000000   100.000000
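The Name column is missing from the summary because describe() only covers numeric columns by default. As a small sketch on a trimmed-down copy of the dataframe, passing include="all" brings the text column back, and individual statistics are also available as methods on a column.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# Default summary: numeric columns only
numeric_summary = df.describe()
print(list(numeric_summary.columns))   # ['Powerlevel']

# include="all" adds non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include="all")
print(list(full_summary.columns))      # ['Name', 'Powerlevel']

# Single statistics can be computed per column
mean_power = df["Powerlevel"].mean()
print(mean_power)                      # 3400.0
```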