Working through 「入門機械学習」: today's topic is Chapter 2, "Data Exploration".

This chapter covers numerical summaries of data and visualization techniques.

Loading the test data

> setwd("02-Exploration/")
> data.file <- file.path('data', '01_heights_weights_genders.csv')
> heights.weights <- read.csv(data.file, header = TRUE, sep = ',') 
> head(heights.weights)
  Gender   Height   Weight
1   Male 73.84702 241.8936
2   Male 68.78190 162.3105
3   Male 74.11011 212.7409
4   Male 71.73098 220.0425
5   Male 69.88180 206.3498
6   Male 67.25302 152.2122

Numerical summaries of the data

summary summarizes the values of a numeric vector.

> summary(heights.weights$Height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54.26   63.51   66.32   66.37   69.17   79.00

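From left to right, the columns are:

Min. .. minimum
1st Qu. .. first quartile (the value 25% of the way up from the bottom of the data)
Median .. median (the value at the 50% position of the data)
Mean .. mean
3rd Qu. .. third quartile (the value 75% of the way up from the bottom of the data)
Max. .. maximum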
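
As a small check of what each column means, the sketch below recomputes the same six numbers one at a time. The vector x is made-up data for illustration, not the heights data set.

```r
# Made-up sample vector (an assumption for illustration only)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# Recompute each field of summary() individually
by.hand <- c(min(x),
             quantile(x, 0.25, names = FALSE),
             median(x),
             mean(x),
             quantile(x, 0.75, names = FALSE),
             max(x))

# The two printouts should show the same six values
print(by.hand)
print(summary(x))
```

The quartiles here use R's default quantile definition, which is also what summary relies on.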

Finding the minimum and maximum

min and max compute the minimum and maximum.

# Create a vector containing only Height
> heights <- with(heights.weights, Height)
> head(heights)
[1] 73.84702 68.78190 74.11011 71.73098 69.88180 67.25302
> min(heights)
[1] 54.26313
> max(heights)
[1] 78.99874

range computes both at once.

> range(heights)
[1] 54.26313 78.99874

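Finding quantiles

quantile returns the values at given positions in the sorted data. By default it reports the 0%, 25%, 50%, 75%, and 100% points.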

> quantile(heights)
      0%      25%      50%      75%     100% 
54.26313 63.50562 66.31807 69.17426 78.99874

You can also specify the positions yourself, for example in steps of 20%.

> quantile(heights, probs = seq(0, 1, by = 0.20))
      0%      20%      40%      60%      80%     100% 
54.26313 62.85901 65.19422 67.43537 69.81162 78.99874
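
One common use of probs is to pick out the interval that covers a given share of the data. The following is a minimal sketch on simulated values (an assumption, since it does not read the real CSV); the variable names are hypothetical.

```r
set.seed(1)
# Simulated stand-in for the heights vector (hypothetical values)
x <- rnorm(10000, mean = 66, sd = 4)

# Central 50% of the data: between the 25% and 75% quantiles
middle.50 <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# Central 95% of the data: between the 2.5% and 97.5% quantiles
middle.95 <- quantile(x, probs = c(0.025, 0.975), names = FALSE)

# Share of values that actually fall inside each interval
share.50 <- mean(x >= middle.50[1] & x <= middle.50[2])
share.95 <- mean(x >= middle.95[1] & x <= middle.95[2])
print(c(share.50, share.95))
```

Replacing x with heights would give the corresponding intervals for the real data.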

Finding the variance and standard deviation

Use var and sd.

# Variance
> var(heights)
[1] 14.80347
# Standard deviation
> sd(heights)
[1] 3.847528
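
The two numbers are related: the standard deviation is the square root of the variance, which matches the output above (3.847528 squared is about 14.80347). As a rough sketch of what var and sd compute, here is the same calculation by hand on a made-up vector (x is an assumption for illustration).

```r
# Made-up sample vector (an assumption for illustration only)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var.by.hand <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd.by.hand <- sqrt(var.by.hand)

print(c(var.by.hand, var(x)))
print(c(sd.by.hand, sd(x)))
```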

Visualizing the data

Load the required library.

> library('ggplot2')

Histogram

> plot = ggplot(heights.weights, aes(x = Height)) + geom_histogram(binwidth = 1)
> ggsave(plot = plot, filename = "histgram.png", width = 6, height = 8)


Next, a density plot. Its advantage is that the shape of the data set is easy to read even when there is little data.

> plot = ggplot(heights.weights, aes(x = Height)) + geom_density()
> ggsave(plot = plot, filename = "kde_histgram.png", width = 6, height = 8)


To see how the genders differ, draw a separate density plot for each gender.

> plot = ggplot(heights.weights, aes(x = Height, fill = Gender)) + geom_density() + facet_grid(Gender ~ .)
> ggsave(plot = plot, filename = "gender_histgram.png", width = 6, height = 8)


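A summary of the main distribution shapes. See Wikipedia for details.

Normal distribution
- Unimodal: only one peak (= mode)
- Symmetric
- Thin tails (the data are not widely spread)
Cauchy distribution
- Unimodal: only one peak (= mode)
- Symmetric
- Heavy tails (the data are widely spread)
Gamma distribution
- Asymmetric; the mean and median can differ substantially
Exponential distribution
- Asymmetric, with the mode at zero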
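
The tail and skew descriptions above can be checked numerically with R's built-in quantile functions, without sampling. This is only a sketch; the probabilities and the gamma parameters are chosen for illustration.

```r
# Upper quantiles of the standard normal and standard Cauchy distributions
probs <- c(0.75, 0.95, 0.99, 0.999)
tails <- data.frame(prob = probs,
                    normal = qnorm(probs),
                    cauchy = qcauchy(probs))
print(tails)

# Gamma distribution with shape = 1 and rate = 0.001 (illustrative parameters)
gamma.mean <- 1 / 0.001                               # mean = shape / rate
gamma.median <- qgamma(0.5, shape = 1, rate = 0.001)  # 50% quantile
print(c(mean = gamma.mean, median = gamma.median))
```

The Cauchy quantiles grow far faster than the normal ones as the probability approaches 1, which is what "heavy tails" means, and the gamma mean sits well above its median because of the long right tail.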

An example of a normal distribution.

> set.seed(1)

> normal.values <- rnorm(250, 0, 1)
> plot = ggplot(data.frame(X = normal.values), aes(x = X)) + geom_density()
> ggsave(plot = plot, filename = "normal_histgram.png", width = 6, height = 8)


The Cauchy distribution.

> cauchy.values <- rcauchy(250, 0, 1)
> plot = ggplot(data.frame(X = cauchy.values), aes(x = X)) + geom_density()
> ggsave(plot = plot, filename = "cauchy_histgram.png", width = 6, height = 8)


The gamma distribution.

> gamma.values <- rgamma(100000, 1, 0.001)
> plot = ggplot(data.frame(X = gamma.values), aes(x = X)) + geom_density()
> ggsave(plot = plot, filename = "gamma_histgram.png", width = 6, height = 8)


There is no example for the exponential distribution.
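
For completeness, an exponential sample could be generated with rexp in the same way as the others. The following is a hedged sketch, not something taken from the book: the rate of 0.001 and the variable names are assumptions. Incidentally, a gamma distribution with shape 1 is an exponential distribution, so the rgamma(100000, 1, 0.001) sample above already has this shape.

```r
set.seed(1)
# rate = 0.001 is an arbitrary choice for illustration
exp.values <- rexp(100000, rate = 0.001)

# The long right tail pulls the mean above the median
print(c(mean = mean(exp.values), median = median(exp.values)))

# Kernel density estimate with base R, evaluated from 0 upward
exp.density <- density(exp.values, from = 0)

# Location of the highest point of the estimated density
peak.location <- exp.density$x[which.max(exp.density$y)]
print(peak.location)
```

The estimated peak lands close to zero relative to the spread of the data, though not exactly at zero, because the kernel estimate is smoothed across the boundary. Passing data.frame(X = exp.values) to ggplot() with geom_density(), as in the other examples, would draw the curve.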

Scatter plots

Draw a scatter plot of height against weight.

> plot = ggplot(heights.weights, aes(x = Height, y = Weight)) + geom_point()
> ggsave(plot = plot, filename = "scatterplots.png", width = 6, height = 8)


Height and weight look correlated. Adding geom_smooth() draws a smoothed trend line with a shaded band showing the plausible range of the prediction.

> plot = ggplot(heights.weights, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()
> ggsave(plot = plot, filename = "scatterplots2.png", width = 6, height = 8)

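
To put a number on "look correlated", one option is the correlation coefficient from cor, together with a straight-line fit from lm. The sketch below runs on simulated data, because it does not read the real CSV; the coefficients used to generate the data are made up for illustration.

```r
set.seed(1)
# Simulated stand-in for the data set (hypothetical numbers, not the real CSV)
height <- rnorm(1000, mean = 66, sd = 4)
weight <- -350 + 7.7 * height + rnorm(1000, mean = 0, sd = 12)
sim <- data.frame(Height = height, Weight = weight)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(sim$Height, sim$Weight)
print(r)

# Straight-line fit; the slope estimates the change in Weight per unit of Height
fit <- lm(Weight ~ Height, data = sim)
print(coef(fit))
```

With the real data, cor(heights.weights$Height, heights.weights$Weight) would give the corresponding value, and geom_smooth(method = "lm") would draw the straight-line fit instead of the default smoother.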

Finally, draw a scatter plot colored by gender, and that's it.

> plot = ggplot(heights.weights, aes(x = Height, y = Weight, color = Gender)) + geom_point()
> ggsave(plot = plot, filename = "gender_scatterplots.png", width = 6, height = 8)