Common Data Types in Python

Working with data requires some basic knowledge of the data types you may run into in Python. This tutorial focuses on data types as they apply to data analysis, and specifically on how they parallel the types you know from R.

We will start with the built-in data types in Python and then expand to some data types in NumPy and pandas. Finally, we will cover how data types work in a dataframe.

Numbers in Python

Numbers can exist in a few different formats:

Integers (int) - This is your classic whole number class.
Floats (float) - These are numbers with a decimal point.
Complex Numbers (complex) - These are numbers with an imaginary part, written with a j suffix.
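As a quick illustration of the three numeric types above, here is a minimal sketch. The variable names and values are made up for the example; type() is the built-in used to inspect what Python has stored.

```python
# Hypothetical example values, chosen only for illustration.
whole = 42            # int
decimal = 3.14        # float
imaginary = 2 + 3j    # complex: real part 2, imaginary part 3

print(type(whole).__name__)      # int
print(type(decimal).__name__)    # float
print(type(imaginary).__name__)  # complex

# Mixing an int and a float in arithmetic produces a float.
mixed = whole + decimal
print(type(mixed).__name__)      # float

# Dividing two ints with / always gives a float; // gives an int.
print(7 / 2)    # 3.5
print(7 // 2)   # 3
```

Coming from R, the main difference is that whole numbers are int by default in Python, whereas R stores a bare 42 as a double unless you write 42L.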

Strings in Python

Strings are immutable sequences of Unicode characters. They can be created by putting characters inside single, double, or triple quotes.

There is no separate character type in Python: a single character is just a str of length 1. You can index a string to get a particular letter.

string = "python"
string[0]

Last Output:

'p'
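To make the "no character type" point concrete, the sketch below inspects the result of indexing a string and shows that strings cannot be changed in place. The variable names are illustrative only.

```python
word = "python"

# A single character is just a str of length 1.
first = word[0]
print(type(first).__name__, len(first))  # str 1

# Slicing and negative indexes work on strings too.
print(word[0:3])   # pyt
print(word[-1])    # n

# Strings are immutable: assigning to an index raises TypeError.
try:
    word[0] = "P"
except TypeError as err:
    print("cannot modify in place:", err)

# Build a new string instead of editing the old one.
capitalized = "P" + word[1:]
print(capitalized)  # Python
```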

Lists

Lists in Python are kind of like vectors in R, i.e. the things you create with c(). They are ordered, mutable collections of items. Unlike R vectors, lists are flexible, so items do not need to be of the same type.

See below for how to create a list and access some of its values:

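# creating a list
# (naming a variable "list" hides Python's built-in list type;
# fine for a short demo, but avoid it in real code)
list = ["lets", "learn", "python", "today"]
# will return "lets"
list[0]
# will return "today" (negative indexes count from the end)
list[-1]

# this will return the first two items (indexes 0 and 1)
# remember the end index is not included!
list[0:2]
# this will also return the first two
list[:2]
# this will return the last 2
list[-2:]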

Last Output:

['python', 'today']
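The claim that list items do not need to share a type is easy to check. The following is a small sketch with made-up values; in R, c() would coerce a mix like this to a single common type, while a Python list keeps each item as it is.

```python
# Hypothetical mixed-type list, for illustration only.
mixed = ["text", 3, 2.5, True, [1, 2]]

# Each item keeps its own type.
kinds = [type(item).__name__ for item in mixed]
print(kinds)        # ['str', 'int', 'float', 'bool', 'list']

# Lists can be nested, and indexing chains naturally.
print(mixed[4][0])  # 1

# len() gives the number of top-level items.
print(len(mixed))   # 5
```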

You can add elements to a list and remove them again. Unlike in R, all of these commands edit the list in place, so you don't need to assign the result back to the list.

# this will add a 1 to the end of the list
list.append(1)

# this will insert "test" at index 3 (the 4th position), shifting later items right
list.insert(3, "test")

# this will add multiple elements to the end of the list
list.extend(["more", "stuff"])

list

Last Output:

['lets', 'learn', 'python', 'test', 'today', 1, 'more', 'stuff']

You can also remove elements from a list.

# you can remove an element by value (only the first match is removed)
list.remove(1)
# this will remove everything from the list
list.clear()
list

Last Output:

[]
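Because list methods edit the list in place, they return None rather than a new list. The sketch below, using an illustrative variable called words, shows what that means in practice and why assigning the result back (the R habit) goes wrong.

```python
words = ["lets", "learn"]

# append() changes the list itself and returns None.
result = words.append("python")
print(result)  # None
print(words)   # ['lets', 'learn', 'python']

# A common mistake when coming from R:
# words = words.append("today")   # words would now be None

# pop() removes by position (the last item by default) and returns it.
last = words.pop()
print(last)    # python
print(words)   # ['lets', 'learn']

# The + operator builds a new list and leaves the original untouched.
longer = words + ["today"]
print(longer)  # ['lets', 'learn', 'today']
print(words)   # ['lets', 'learn']
```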

Dictionaries

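Dictionaries are collections of key-value pairs. They are most similar to named lists in R and bear some resemblance to JSON objects in web development. Since Python 3.7 they keep their keys in insertion order, but you look values up by key rather than by position.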

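# you create a dictionary with curly brackets
# (naming a variable "dict" hides Python's built-in dict type;
# fine for a short demo, but avoid it in real code)
dict = {"brand": "Ford", "model": "Mustang", "year": 1964}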

A key-value pair means you access the value by calling the key. The key goes before the : and the value goes after.

dict["brand"]

Last Output:

'Ford'

You can update a value by assigning to its key, like so.

dict["year"] = 2020
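A few more everyday dictionary operations are sketched below. The variable name car and the extra "color" and "owner" keys are made up for the example; the methods used (get, items, del) are standard dict features.

```python
car = {"brand": "Ford", "model": "Mustang", "year": 1964}

# Assigning to a key that does not exist yet adds a new pair.
car["color"] = "red"
print(len(car))  # 4

# Looking up a missing key with [] raises KeyError;
# get() returns None or a default of your choice instead.
print(car.get("owner"))             # None
print(car.get("owner", "unknown"))  # unknown

# items() lets you loop over keys and values together.
for key, value in car.items():
    print(key, "->", value)

# del removes a pair by key.
del car["color"]
print(list(car.keys()))  # ['brand', 'model', 'year']
```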

Dataframes

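Dataframes in Python (provided by the pandas library) are much like data frames in R. They are two-dimensional data structures with rows and columns. In the code below, we are essentially creating a dictionary where each key is a column name and each value is a list of that column's entries. The lists must all be the same length, and the items within each list should share a type so that pandas can give the column a single dtype.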

Let's create our first dataframe!

import pandas as pd

data = {
  "Name": ["Goku", "Piccolo", "Vegeta"],
  "Powerlevel": [9000, 200, 1000],
  "Friendship Rating": [100, 100, 0],
  "Drawability": [100, 80, 62]
}

# load data into a DataFrame object
df = pd.DataFrame(data)

df

	Name	Powerlevel	Friendship Rating	Drawability
0	Goku	9000	100	100
1	Piccolo	200	100	80
2	Vegeta	1000	0	62
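Since each column carries its own data type, it helps to see how pandas reports them. The sketch below rebuilds a two-column version of the dataframe and inspects its dtypes; the mixed Series and its "unknown" entry are hypothetical, added only to show what happens when a column mixes types. The exact label printed for text columns depends on your pandas version, so no output is shown for it.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# One dtype per column.
print(df.dtypes)

# Whole numbers are stored as an integer dtype.
print(pd.api.types.is_integer_dtype(df["Powerlevel"]))  # True

# Mixing numbers and text in one column falls back to the
# generic "object" dtype, which is slower and less useful.
mixed = pd.Series([9000, "unknown", 1000])
print(mixed.dtype)  # object

# astype() converts a column to another dtype.
as_float = df["Powerlevel"].astype("float64")
print(as_float.dtype)  # float64
```

This is roughly the pandas counterpart of calling str(df) or sapply(df, class) in R.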

Let's say that we want to look at either the head or the tail of the data.

We would use the df.head() or df.tail() method to do that. Both return 5 rows by default, and you can pass the number of rows you want as an argument.

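# first 5 rows (this dataframe only has 3, so all of them come back)
df.head(5)
# last 2 rows; only this final expression is displayed below
df.tail(2)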

	Name	Powerlevel	Friendship Rating	Drawability
1	Piccolo	200	100	80
2	Vegeta	1000	0	62

Accessing Columns

Columns behave a lot like lists! Each one is a pandas Series, and you can index and slice it in much the same way:

# give me the column names
df.columns

# give me the values of a specific column
# we are limiting to the first two rows
# don't forget the index starts at 0
df["Name"][0:2]

# attribute access gives the same result, but only works when the
# column name is a valid Python identifier (no spaces, for example)
df.Name[0:2]

# multiple columns: pass a list of names
df[["Name", "Powerlevel"]][0:2]

	Name	Powerlevel
0	Goku	9000
1	Piccolo	200
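To see what a column actually is, the sketch below checks the type returned by single and double brackets. It rebuilds a two-column version of the dataframe so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# Single brackets with one name give a Series (one column).
names = df["Name"]
print(type(names).__name__)   # Series

# Double brackets (a list of names) give a DataFrame,
# even when the list holds only one name.
names_frame = df[["Name"]]
print(type(names_frame).__name__)  # DataFrame

# tolist() turns a Series into a plain Python list.
print(names.tolist())  # ['Goku', 'Piccolo', 'Vegeta']

# Unlike a list, a Series supports element-wise arithmetic,
# much like a vector in R.
doubled = df["Powerlevel"] * 2
print(doubled.tolist())  # [18000, 400, 2000]
```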

Accessing Rows and Columns

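Use iloc to access particular rows and columns by integer position. This is similar to the df[r,c] notation in R, except that positions start at 0.

Note that selecting a single column by integer position returns a pandas Series rather than a DataFrame, so the result has to be converted back if you need a DataFrame.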

# this will give us the first three rows and the SECOND column (position 1)
# the result is a Series, not a DataFrame
new = df.iloc[0:3,1]

# convert the Series back into a one-column DataFrame
new = pd.DataFrame(new)
new

	Powerlevel
0	9000
1	200
2	1000
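There are other ways to end up with a DataFrame here. The sketch below compares the Series result with two alternatives: passing the column position as a list, and calling to_frame() on the Series. It also shows loc, the label-based counterpart of iloc. The dataframe is rebuilt so the example is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# A bare integer for the column returns a Series.
as_series = df.iloc[0:3, 1]
print(type(as_series).__name__)  # Series

# Wrapping the position in a list keeps the result a DataFrame.
as_frame = df.iloc[0:3, [1]]
print(type(as_frame).__name__)   # DataFrame

# to_frame() converts an existing Series to a one-column DataFrame.
converted = as_series.to_frame()
print(as_frame.equals(converted))  # True

# loc selects by label instead of position.
# Unlike iloc, a loc slice includes its end label.
by_label = df.loc[0:2, ["Powerlevel"]]
print(by_label.shape)  # (3, 1)
```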

Describing the Data

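You can also get basic summary statistics using df.describe(). By default it summarizes only the numeric columns, which is why Name is missing from the output.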

df.describe()

	Powerlevel	Friendship Rating	Drawability
count	3.000000	3.000000	3.000000
mean	3400.000000	66.666667	80.666667
std	4866.210024	57.735027	19.008770
min	200.000000	0.000000	62.000000
25%	600.000000	50.000000	71.000000
50%	1000.000000	100.000000	80.000000
75%	5000.000000	100.000000	90.000000
max	9000.000000	100.000000	100.000000
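The summary returned by describe() is itself a pandas object, so individual statistics can be pulled out of it. The sketch below rebuilds a two-column version of the dataframe, describes a single column, and uses the include="all" option to bring non-numeric columns into the summary. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Goku", "Piccolo", "Vegeta"],
    "Powerlevel": [9000, 200, 1000],
})

# Describing one column returns a Series indexed by statistic name.
power_stats = df["Powerlevel"].describe()
print(power_stats["mean"])  # 3400.0
print(power_stats["max"])   # 9000.0

# The full summary is a DataFrame, so loc works on it.
summary = df.describe()
print(summary.loc["50%", "Powerlevel"])  # 1000.0

# include="all" adds non-numeric columns such as Name,
# filling statistics that do not apply with NaN.
full_summary = df.describe(include="all")
print(list(full_summary.columns))  # ['Name', 'Powerlevel']
```

This plays the same role as summary(df) in R.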