Machine learning introduction

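$X$: feature domain (input); $Y$: range (output).
$f: X \rightarrow Y$ is the function that maps features to class labels (the task is regression when $Y \subseteq \mathbb{R}$). Its analytic form is unknown; only a partial dataset $D$ is observed.
Assume the features $\{ x_1, x_2, \cdots, x_N\}$ of the dataset $D$ follow a stationary probability distribution $P$.
The training data $\mathbf{x} = \{x_1, x_2, \cdots, x_N\}$ are generated by sampling at random from the unknown data distribution $P$. Then $\mathbf{x}$ and the corresponding target values $y$ are given to the learner, which searches a hypothesis set $H$ (a set of functions) when learning the target function $f$.
After observing the training data $\mathbf{x}$, the learner must select a final hypothesis (function) $g$ from the hypothesis set $H$; ideally $g$ is a good estimate of the unknown target function $f$ underlying the dataset $D$.
Finally, we evaluate the learner by how well the trained hypothesis $g$ performs on new data from $X$ (leave-one-out, K-fold cross-validation, or other unobserved samples).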
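
The setup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target `target_f`, the uniform distribution standing in for $P$, the grid of threshold classifiers standing in for $H$, and the "lowest training error" selection rule are all hypothetical choices made for the example, not something the notes prescribe.

```python
import random

def target_f(x):
    # Hypothetical target f; in a real problem the learner never sees this.
    return 1 if x > 0.5 else 0

def make_hypothesis(threshold):
    # One member h of the hypothesis set H: a 1-D threshold classifier.
    return lambda x: 1 if x > threshold else 0

# Assumed hypothesis set H: thresholds 0.00, 0.05, ..., 1.00.
THRESHOLDS = [i / 20 for i in range(21)]

def error(h, data):
    # Fraction of (x, y) pairs that h misclassifies.
    return sum(h(x) != y for x, y in data) / len(data)

def learn(data):
    # Pick the threshold in H with the lowest training error.
    return min(THRESHOLDS, key=lambda t: error(make_hypothesis(t), data))

def k_fold_cv(data, k=5):
    # Average held-out error over k folds.
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        g = make_hypothesis(learn(train))
        errs.append(error(g, folds[i]))
    return sum(errs) / k

rng = random.Random(0)                      # fixed seed for repeatability
xs = [rng.random() for _ in range(200)]     # samples drawn from P (uniform here)
D = [(x, target_f(x)) for x in xs]          # observed dataset D

g_threshold = learn(D)                      # final hypothesis g, chosen from H
cv_error = k_fold_cv(D, k=5)                # estimate of g's error on unseen data
print("chosen threshold:", g_threshold, "5-fold CV error:", cv_error)
```

The cross-validation error is computed only on points held out of training, which is what makes it an estimate of performance on new data rather than on the data the learner already saw.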

Example: credit card approval problem

- input $x = \lbrace 23, \text{female}, 1000000, \text{1 year}, \text{3 years}, 200000 \rbrace$.
- output $y = 1$: approve; $y = 0$: reject.
- unknown target function $f: x \rightarrow y$ .
- hypothesis: $g: x \rightarrow y$ .
- The goal is for the machine learning model's function $g$ to approximate the unknown true target function $f$.
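
As a rough illustration of the credit card example, the sketch below encodes the applicant as a record and applies one candidate hypothesis. The field names in `Applicant` are guesses at what the six values of $x$ stand for, and the rule inside `g` (approve when debt is under half of income) is made up for this example; neither comes from the notes.

```python
from dataclasses import dataclass

@dataclass
class Applicant:
    # Hypothetical field names: the notes give only the six raw values.
    age: int
    gender: str
    annual_income: float
    years_at_residence: float
    years_in_job: float
    current_debt: float

def g(x: Applicant) -> int:
    # Hypothetical hypothesis g: approve (1) when debt is under half of
    # income, otherwise reject (0). A learner would pick such a rule from H.
    return 1 if x.current_debt < 0.5 * x.annual_income else 0

x = Applicant(23, "female", 1_000_000, 1, 3, 200_000)
decision = g(x)
print("approve" if decision == 1 else "reject")
```

In practice the rule would not be written by hand: the learner would choose it from $H$ using past applicants and their known outcomes.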

ML: use data to compute a hypothesis that approximates the target f.
DM: use (huge) data to find properties that are interesting.
If "interesting property" == "hypothesis that approximates the target"
- then ML == DM

AI: compute something that shows intelligent behavior.
Having g approximate f is something that shows intelligent behavior.
- ML can realize AI, among other routes.
- e.g. chess playing.

Statistics: use data to make inferences about an unknown process.
g is an inference outcome;
- f is something unknown
- statistics can be used to achieve ML
- traditional statistics also focuses on provable results under mathematical assumptions, and cares less about computation.