Project 1 — Dataset Exploration
Goal: get comfortable reading a dataset before modeling. I focus on shape, missing values, class balance, distributions, and correlations — then I write down practical observations.
Dataset: Breast Cancer Wisconsin (built into scikit-learn).
Quick stats
Rows: 569
Columns: 31 (30 numeric features plus the diagnosis label)
Missing values (total): 0
Diagnosis counts: {1: 357, 0: 212} (in scikit-learn's encoding, 1 = benign and 0 = malignant)
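A minimal sketch of how these numbers could be reproduced. It assumes the "diagnosis" column is scikit-learn's "target" column renamed; the variable names are illustrative and not taken from the app.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns a pandas DataFrame with 30 features plus "target".
frame = load_breast_cancer(as_frame=True).frame

# Assumption: the report's "diagnosis" column is "target" under another name.
df = frame.rename(columns={"target": "diagnosis"})

n_rows, n_cols = df.shape
missing_total = int(df.isna().sum().sum())  # missing cells across the whole table
diagnosis_counts = {
    int(label): int(count)
    for label, count in df["diagnosis"].value_counts().items()
}

print(f"Rows: {n_rows}")
print(f"Columns: {n_cols}")
print(f"Missing values (total): {missing_total}")
print(f"Diagnosis counts: {diagnosis_counts}")
```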
My observations
  • The dataset has a clean structure (no missing values), which makes it good for a first ML pipeline.
  • Some features are strongly correlated (for example, radius, perimeter, and area all measure tumor size), so I need to watch for multicollinearity when I fit linear models later.
  • The classes are only moderately imbalanced (357 vs. 212, roughly 63% to 37%), which helps for a beginner baseline, though I will still evaluate with more than accuracy.
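The observations about correlation and class balance can be checked with a short sketch like the one below. The 0.9 cutoff is a hypothetical threshold chosen for illustration, not a value from the app.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
features = data.data    # 30 numeric feature columns
target = data.target    # 0 = malignant, 1 = benign in scikit-learn's encoding

# Absolute Pearson correlation between every pair of features.
corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

THRESHOLD = 0.9  # hypothetical cutoff for "strongly related"
high_pairs = pairs[pairs > THRESHOLD].sort_values(ascending=False)
print(high_pairs.head(10))

# Share of rows per class, to judge how imbalanced the labels are.
class_share = target.value_counts(normalize=True)
print(class_share.round(3))
```

Pairs that show up here are candidates for dropping or regularizing before fitting a linear model.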
Preview (first 8 rows)
This is just a quick look to understand what the table looks like.
mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension, diagnosis
17.99, 10.38, 122.8, 1001.0, 0.118, 0.278, 0.3, 0.147, 0.242, 0.079, 1.095, 0.905, 8.589, 153.4, 0.006, 0.049, 0.054, 0.016, 0.03, 0.006, 25.38, 17.33, 184.6, 2019.0, 0.162, 0.666, 0.712, 0.265, 0.46, 0.119, 0
20.57, 17.77, 132.9, 1326.0, 0.085, 0.079, 0.087, 0.07, 0.181, 0.057, 0.544, 0.734, 3.398, 74.08, 0.005, 0.013, 0.019, 0.013, 0.014, 0.004, 24.99, 23.41, 158.8, 1956.0, 0.124, 0.187, 0.242, 0.186, 0.275, 0.089, 0
19.69, 21.25, 130.0, 1203.0, 0.11, 0.16, 0.197, 0.128, 0.207, 0.06, 0.746, 0.787, 4.585, 94.03, 0.006, 0.04, 0.038, 0.021, 0.022, 0.005, 23.57, 25.53, 152.5, 1709.0, 0.144, 0.424, 0.45, 0.243, 0.361, 0.088, 0
11.42, 20.38, 77.58, 386.1, 0.142, 0.284, 0.241, 0.105, 0.26, 0.097, 0.496, 1.156, 3.445, 27.23, 0.009, 0.075, 0.057, 0.019, 0.06, 0.009, 14.91, 26.5, 98.87, 567.7, 0.21, 0.866, 0.687, 0.258, 0.664, 0.173, 0
20.29, 14.34, 135.1, 1297.0, 0.1, 0.133, 0.198, 0.104, 0.181, 0.059, 0.757, 0.781, 5.438, 94.44, 0.011, 0.025, 0.057, 0.019, 0.018, 0.005, 22.54, 16.67, 152.2, 1575.0, 0.137, 0.205, 0.4, 0.162, 0.236, 0.077, 0
12.45, 15.7, 82.57, 477.1, 0.128, 0.17, 0.158, 0.081, 0.209, 0.076, 0.334, 0.89, 2.217, 27.19, 0.008, 0.033, 0.037, 0.011, 0.022, 0.005, 15.47, 23.75, 103.4, 741.6, 0.179, 0.525, 0.536, 0.174, 0.398, 0.124, 0
18.25, 19.98, 119.6, 1040.0, 0.095, 0.109, 0.113, 0.074, 0.179, 0.057, 0.447, 0.773, 3.18, 53.91, 0.004, 0.014, 0.023, 0.01, 0.014, 0.002, 22.88, 27.66, 153.2, 1606.0, 0.144, 0.258, 0.378, 0.193, 0.306, 0.084, 0
13.71, 20.83, 90.2, 577.9, 0.119, 0.164, 0.094, 0.06, 0.22, 0.075, 0.584, 1.377, 3.856, 50.96, 0.009, 0.03, 0.025, 0.014, 0.015, 0.005, 17.06, 28.14, 110.6, 897.0, 0.165, 0.368, 0.268, 0.156, 0.32, 0.115, 0
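A preview like the one above could be produced with a sketch along these lines. Rounding to three decimals and renaming "target" to "diagnosis" are assumptions inferred from the values shown, not confirmed details of the app.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

frame = load_breast_cancer(as_frame=True).frame

# Assumption: "diagnosis" is scikit-learn's "target" column renamed.
df = frame.rename(columns={"target": "diagnosis"})

# First 8 rows, rounded to 3 decimals to keep the wide table readable.
preview = df.head(8).round(3)

# Show every column instead of letting pandas elide the middle ones.
with pd.option_context("display.max_columns", None, "display.width", 0):
    print(preview)
```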
Charts
These charts are generated by the app and saved into static/figures.
Class balance
[Figure: class balance chart]
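One way such a chart could be generated is sketched below. The function name, the file name class_balance.png, and the styling are hypothetical; only the static/figures folder comes from the text above.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer


def save_class_balance_chart(out_dir="static/figures"):
    """Save a bar chart of diagnosis counts and return the file path."""
    data = load_breast_cancer(as_frame=True)
    counts = data.target.value_counts().sort_index()  # label 0 first, then 1
    labels = [str(data.target_names[i]) for i in counts.index]

    out_path = Path(out_dir) / "class_balance.png"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(labels, counts.values)
    ax.set_xlabel("Diagnosis")
    ax.set_ylabel("Number of rows")
    ax.set_title("Class balance")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # free the figure so repeated calls do not accumulate memory
    return out_path
```

Calling save_class_balance_chart() with no arguments writes the image under static/figures.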
Feature distributions (selected columns)
[Figure: distributions chart]
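A hedged sketch of a distributions chart follows. The text does not say which columns were selected, so the four used here are placeholders, as are the function name, file name, and bin count.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Placeholder selection; the app's actual column choice is not stated.
DEFAULT_COLUMNS = ("mean radius", "mean texture", "mean area", "mean smoothness")


def save_distribution_chart(columns=DEFAULT_COLUMNS, out_dir="static/figures"):
    """Save one histogram per selected column and return the file path."""
    features = load_breast_cancer(as_frame=True).data

    out_path = Path(out_dir) / "distributions.png"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # squeeze=False keeps axes two-dimensional even for a single column.
    fig, axes = plt.subplots(
        1, len(columns), figsize=(3.5 * len(columns), 3), squeeze=False
    )
    for ax, column in zip(axes[0], columns):
        ax.hist(features[column], bins=30)
        ax.set_title(column)
        ax.set_ylabel("Rows")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```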
Correlation heatmap (top features)
[Figure: correlation heatmap chart]
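A possible sketch of the heatmap is below. The text does not define "top features", so this version assumes it means the features with the highest absolute correlation to the label; the function name, top_n default, and file name are hypothetical too.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer


def save_correlation_heatmap(top_n=10, out_dir="static/figures"):
    """Save a heatmap of correlations among the top_n features; return the path."""
    frame = load_breast_cancer(as_frame=True).frame

    # Assumption: rank features by absolute correlation with the label.
    target_corr = frame.corr()["target"].drop("target").abs()
    top = list(target_corr.sort_values(ascending=False).head(top_n).index)
    corr = frame[top].corr()

    out_path = Path(out_dir) / "correlation_heatmap.png"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)

    fig, ax = plt.subplots(figsize=(7, 6))
    image = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(top)))
    ax.set_xticklabels(top, rotation=90)
    ax.set_yticks(range(len(top)))
    ax.set_yticklabels(top)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```

Fixing the color scale to the range -1 to 1 keeps the colors comparable if the feature selection changes.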