Create Your Own Kotlin Playground (and Get a Data Science Head Start) with Jupyter Notebook
Learn the basics of Jupyter Notebook and how to turn it into an interactive interpreter for Kotlin. You’ll also learn about Data Frames, an important data structure for data science applications. By Joey deVilla.
Contents
30 mins
- Kotlin? For Data Science?
- Introducing Jupyter Notebook
- Getting Started
- Creating Your First Notebook
- Understanding Code Cells
- Working With Markdown Cells
- Initializing krangl
- Diving into Data Frames
- Introducing Data Frames
- Creating a Data Frame from Scratch
- Getting the Data Frame’s Schema
- Getting the Data Frame’s Dimensions and Column Names
- Examining the Data Frame’s Columns
- Examining the Data Frame’s Rows
- Accessing Data Frame “Cells” by Column and Row
- Where to Go From Here?
Getting the Data Frame’s Schema
In the world of databases, the term “schema” has a specific meaning: It’s a description of how the data in a database is organized. In a DataFrame, a schema is a description of how the data in the data frame is organized, accompanied by a small sample of the data. You can see the schema of a data frame with DataFrame’s schema() method.
Look at df’s schema. Run the following in a new code cell:
    df.schema()
You’ll see the following output:
    DataFrame with 5 observations
    language             [Str] Kotlin, Java, Swift, Objective-C, Dart
    developer            [Str] JetBrains, James Gosling, Chris Lattner et al., Tom Love and Brad Cox, Lars Bak and Kasper Lund
    year_first_appeared  [Int] 2011, 1995, 2014, 1984, 2011
    preferred            [Bol] true, false, true, false, true
schema() is useful for getting a general idea about the data contained within a DataFrame. It prints the following:
- The number of rows in the data frame, which schema() refers to as “observations”.
- The name of each column in the data frame.
- The type of each column in the data frame.
- The first values stored in each column. Because df is a small data frame, schema() printed out all the values for all the columns.
You might remember that when you instantiated df, you never specified the column types. But schema() clearly shows each column has a type: language and developer are columns that contain string values, year_first_appeared contains integers, and preferred is a column of Booleans!
krangl’s dataFrameOf() method inferred the column types. You can specify column types when creating a DataFrame, but krangl uses the data you provide to determine the appropriate types so you don’t have to. This feature makes krangl feel more dynamically typed, like pandas and dplyr, providing a more Python- or R-like experience.
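To make the idea of type inference concrete, here is a minimal plain-Kotlin sketch of how a column type could be guessed from the values supplied. The inferColumnType() helper and its fallback label "Any" are hypothetical names made up for this illustration. This is not krangl's actual implementation, only the general idea.

```kotlin
// Hypothetical helper: NOT krangl's real code, just an illustration of
// guessing a column type by inspecting the values in that column.
fun inferColumnType(values: List<Any?>): String {
    val nonNull = values.filterNotNull()
    return when {
        nonNull.isEmpty() -> "Any"                 // nothing to inspect
        nonNull.all { it is Int } -> "Int"
        nonNull.all { it is Boolean } -> "Bol"
        nonNull.all { it is String } -> "Str"
        else -> "Any"                              // mixed or unrecognized types
    }
}

fun main() {
    // Three of the columns from df, written out as plain lists.
    val columns: Map<String, List<Any?>> = mapOf(
        "language" to listOf("Kotlin", "Java", "Swift", "Objective-C", "Dart"),
        "year_first_appeared" to listOf(2011, 1995, 2014, 1984, 2011),
        "preferred" to listOf(true, false, true, false, true)
    )
    for ((name, values) in columns) {
        println("$name [${inferColumnType(values)}]")
    }
}
```

The labels mirror the ones schema() printed, but the matching rules here are deliberately simplistic.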
Getting the Data Frame’s Dimensions and Column Names
schema() is good for diagnostics, but it isn’t useful if you want to programmatically find how many rows and columns are in a DataFrame or what its column names are. Fortunately, DataFrame has useful properties for this purpose:
- nrow: The number of rows in the data frame.
- ncol: The number of columns in the data frame.
- names: A list of strings specifying the names of the columns, going from left to right.
Use these properties. Run the following in a new code cell:
    println("The data frame has ${df.nrow} rows and ${df.ncol} columns.")
    println("The column indices and names are:")
    df.names.forEachIndexed { index, name ->
        println("$index: $name")
    }
You’ll see this output:
    The data frame has 5 rows and 4 columns.
    The column indices and names are:
    0: language
    1: developer
    2: year_first_appeared
    3: preferred
Examining the Data Frame’s Columns
The cols property of DataFrame returns a list of objects representing each column, going from left to right. Use it to take a closer look at df’s columns.
Run the following in a new code cell:
    df.cols.forEachIndexed { index, column ->
        println("$index: $column")
    }
You’ll see this result:
    0: language [Str][5]: Kotlin, Java, Swift, Objective-C, Dart
    1: developer [Str][5]: JetBrains, James Gosling, Chris Lattner et al., Tom Love and Brad Cox, Lars Bak ...
    2: year_first_appeared [Int][5]: 2011, 1995, 2014, 1984, 2011
    3: preferred [Bol][5]: true, false, true, false, true
Each column object in the list returned by the cols property is an instance of the DataCol class. DataCol has properties and methods that let you examine a column in greater detail and even perform some analysis on its contents.
For now, stick to using two DataCol properties:
- name: The name of the column.
- length: The number of items or rows in the column.
Run the following in a new code cell:
    df.cols.forEachIndexed { index, column ->
        println("$index: name: ${column.name} length: ${column.length}")
    }
It will produce the following output:
    0: name: language length: 5
    1: name: developer length: 5
    2: name: year_first_appeared length: 5
    3: name: preferred length: 5
DataFrame has some syntactic sugar that makes it easier to work with columns. Although you could access df’s first column using the syntax df.cols[0], it’s much simpler to access it using array syntax:
    df[0] // Same thing as df.cols[0]
If you’d rather access a column by name, DataFrame also implements map syntax. For example, to access df’s first column, which is named language, you can use this code:
    df["language"] // Column 0's name is language,
                   // so this is equivalent to
                   // df.cols[0] and df[0]
Examining the Data Frame’s Rows
Just as DataFrame has a cols property to access its columns, it also has a rows property. It returns an Iterable that lets you access a collection object representing each row, going from top to bottom. Use it to take a closer look at df’s rows.
Run the following in a new code cell:
    df.rows.forEachIndexed { index, row ->
        println("$index: $row")
    }
You should see this output:
    0: {language=Kotlin, developer=JetBrains, year_first_appeared=2011, preferred=true}
    1: {language=Java, developer=James Gosling, year_first_appeared=1995, preferred=false}
    2: {language=Swift, developer=Chris Lattner et al., year_first_appeared=2014, preferred=true}
    3: {language=Objective-C, developer=Tom Love and Brad Cox, year_first_appeared=1984, preferred=false}
    4: {language=Dart, developer=Lars Bak and Kasper Lund, year_first_appeared=2011, preferred=true}
Each row object is an instance of DataFrameRow, which is simply an alias for Map<String, Any?>, where each key-value pair represents the name of a column and its corresponding value. For example, you could modify the loop you just ran to print only each programming language and the year in which it first appeared using this code:
    df.rows.forEachIndexed { index, row ->
        println("$index: name: ${row["language"]} premiered: ${row["year_first_appeared"]}")
    }
You’ll see this output:
    0: name: Kotlin premiered: 2011
    1: name: Java premiered: 1995
    2: name: Swift premiered: 2014
    3: name: Objective-C premiered: 1984
    4: name: Dart premiered: 2011
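Because each value in a row is typed Any?, printing it works as-is, but using it as a number or a string requires a cast first. The sketch below uses a plain Kotlin Map<String, Any?> as a stand-in for a krangl row so that it runs without the library. The languageAge() helper and the 2021 reference year are hypothetical choices made for illustration.

```kotlin
// Hypothetical helper for illustration; a plain Map stands in for a krangl row.
fun languageAge(row: Map<String, Any?>, currentYear: Int): Int? {
    // Values are typed Any?, so a safe cast is needed before doing arithmetic.
    // A missing key or a non-Int value makes the function return null.
    val year = row["year_first_appeared"] as? Int ?: return null
    return currentYear - year
}

fun main() {
    val row: Map<String, Any?> = mapOf(
        "language" to "Kotlin",
        "developer" to "JetBrains",
        "year_first_appeared" to 2011,
        "preferred" to true
    )
    val name = row["language"] as? String ?: "unknown"
    println("$name age in 2021: ${languageAge(row, 2021)}")
}
```

The safe cast (as?) returns null instead of throwing when a value isn't of the expected type, which is often the gentler choice while exploring data interactively.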
Because rows returns an Iterable rather than a List, you need to use the elementAt() method to access a row by its index number. For example, the following code retrieves row 1 of df:
    df.rows.elementAt(1) // Retrieve row 1
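The difference between Iterable and List is standard Kotlin behavior, so it can be shown without krangl. In this sketch, sampleRows() is a hypothetical stand-in for df.rows that returns two of the rows shown earlier.

```kotlin
// Hypothetical stand-in for df.rows: an Iterable of rows, each a Map<String, Any?>.
fun sampleRows(): Iterable<Map<String, Any?>> = listOf(
    mapOf("language" to "Kotlin", "year_first_appeared" to 2011),
    mapOf("language" to "Java", "year_first_appeared" to 1995)
)

fun main() {
    val rows = sampleRows()

    // rows[1] would not compile: Iterable has no index operator.
    println(rows.elementAt(1)["language"])

    // elementAt() throws if the index is out of bounds;
    // elementAtOrNull() returns null instead.
    println(rows.elementAtOrNull(10))

    // Converting to a List once allows bracket indexing afterwards.
    val rowList = rows.toList()
    println(rowList[0]["language"])
}
```

If you expect to look up many rows by index, converting once with toList() is likely to be more convenient than calling elementAt() repeatedly.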
Accessing Data Frame “Cells” by Column and Row
DataFrame provides a convenient column-row syntax for accessing individual “cells”.
Suppose you wanted to get the value in the year_first_appeared column for row 3. As mentioned before, you could access that column in several ways:
    // These all produce the same result
    df.cols[2]
    df[2]
    df["year_first_appeared"]
By adding a subscript to any of the lines above, you can access a specific row for that column. Here’s how you can access row 3 of the year_first_appeared column:
    // These all access the value in the "year_first_appeared" column
    // of row 3
    df.cols[2][3]
    df[2][3]
    df["year_first_appeared"][3]
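One way to picture why the column comes first in this syntax is to model a data frame as a map from column names to lists of values. The sketch below is a plain-Kotlin mental model, not a description of krangl's internals. The columns map and the cell() helper are hypothetical.

```kotlin
// Hypothetical mental model: each column name maps to that column's values.
val columns: Map<String, List<Any?>> = mapOf(
    "language" to listOf("Kotlin", "Java", "Swift", "Objective-C", "Dart"),
    "year_first_appeared" to listOf(2011, 1995, 2014, 1984, 2011)
)

// Look up the column first, then the row within it.
// Returns null if the column name or the row index doesn't exist.
fun cell(columnName: String, rowIndex: Int): Any? =
    columns[columnName]?.getOrNull(rowIndex)

fun main() {
    println(cell("year_first_appeared", 3)) // 1984
    println(cell("language", 3))            // Objective-C
}
```

Row 3 is the fourth row, because indices start at 0, which is why the lookups land on Objective-C and 1984.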