-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathUnsupervised_met_ml.Rmd
More file actions
1313 lines (938 loc) · 60.3 KB
/
Unsupervised_met_ml.Rmd
File metadata and controls
1313 lines (938 loc) · 60.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Unsupervised methods in machine learning"
author: "Anton Barrera Mora (me@antonio-barrera.cyou)"
date: "March 2023"
output:
github_document:
preserve_yaml: true
word_document: default
html_document:
highlight: default
number_sections: yes
theme: cosmo
toc: yes
toc_depth: 3
includes:
in_header: p_brand.html
pdf_document:
highlight: zenburn
toc: yes
editor_options:
markdown:
wrap: sentence
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
## Presentation
In this project we will work with unsupervised machine learning models, with special emphasis on the modification of parameters to refine the results.
## Objetives
The objectives are generation, interpretation, and evaluation of a k-means clustering model and a DBSCAN model.
In doing so, we will not overlook the phases of data preparation, model quality, and initial knowledge extraction.
# Example 1
## k-means aggregation method with real data
We will use the 'Hawks' dataset, which is part of the 'Stat2Data' package that includes datasets used in the book 'Stat2: Building Models for a World of Data' by @cannon2013
The dataset originated from a research project as referenced earlier @minería2023 where measurements were taken on various characteristics of captured and subsequently released hawks.
The data source is not clearly specified or documented in the book or on the Oberlin University website @digital .
The dataset package is used in the aforementioned book to provide an example of nonparametric statistics @cannon2013, p. 86.
It is widely used in statistics courses.
We have selected this dataset because of its potential for unsupervised data mining algorithms.
The numerical variables on which we will be based are Wing, Weight, Culmen, Hallux
```{r message= FALSE, warning=FALSE}
if (!require('Stat2Data')) install.packages('Stat2Data')
library(Stat2Data)
data("Hawks")
summary(Hawks)
```
The 'summary()' function in the R language provides a basic statistical summary of variables in a dataset.
For numeric variables, it returns statistics such as minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum.
For categorical variables, it returns a frequency table.
We will complete a first analysis using the head() function that will allow us to obtain a snapshot of the first records in the database:
```{r first_rows_analysis, echo=TRUE, message=FALSE, warning=FALSE}
head(Hawks)
```
From the analysis of the attributes, we obtain the following information about the variables:
- Age: Age of the hawk - young (I) or adult (A) -.
- Species: Species of the hawk - Cooper's Hawk (CH), Red-Tailed Hawk (RT), or Sharp-Shinned Hawk (SS) -.
- Sex: Gender of the hawk - male (M) or female (F) -.
- Wing: Wing length of the hawk - in millimeters -.
- Weight: Weight of the hawk - in grams -.
- Culmen: Culmen length of the hawk's beak - in millimeters -.
- Hallux: Hallux length of the hawk's claw - in millimeters -.
At this point, there are other variables or columns in the dataset that are irrelevant for the requirements of this work, so we will filter them later.
Additionally, thanks to the information provided by the summary function, we know the following:
1. The 'BandNumber' variable contains unique records that help identify specific individuals and can be used as a primary key.
We observe that there are two blank records without a number, so we will opt to create a new numeric variable.
2. The 'Wing' variable contains 1 infinite or NA value.
Considering that there are noticeable size differences depending on the age and species of the specimens, but this data does not correspond to a continuous variable but a discrete one, we will assign the mean wing size based on the age of the specific specimen and its species.
3. The 'Weight' variable also has 10 instances with NA values.
We will apply the same criterion as the 'Wing' variable.
That is, we will assign the mean weight corresponding to its age group to each record.
4. The 'Culmen' variable contains 7 records with NA values.
We will apply the previously described criterion of assigning means according to age.
5. The 'Hallux' variable also contains 6 NA records to which we will apply the corrective policies described earlier.
We will select the dataset of interest from the current dataset.
```{r library_load, echo=FALSE, message=FALSE, warning=FALSE}
# Para otras librerias necesarias
if(!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if(!require('Rmisc')) install.packages('Rmisc'); library('Rmisc')
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
if(!require('xfun')) install.packages('xfun'); library('xfun')
if(!require('magrittr')) install.packages('magrittr'); library('magrittr')
if (!require('factoextra')) install.packages('factoextra'); library('factoextra')
if(!require('pracma')) install.packages('pracma'); library('pracma')
if (!require('cluster')) install.packages('cluster'); library(cluster)
if (!require('dendextend')) install.packages('dendextend'); library(dendextend)
if (!require('knitr')) install.packages('knitr'); library(knitr)
```
```{r seleccion de columnas, echo=TRUE, message=FALSE, warning=FALSE}
# We select only the variables that we are interested in working with.
hawks_select <- Hawks %>% select(BandNumber, Year, Month, Day, Species, Age, Wing, Weight, Culmen, Hallux)
```
Point 1.
We correct the primary key according to the parameters set in the plan:
```{r primary_key_corrections, echo=TRUE, message=FALSE, warning=FALSE}
# creating a new record with numbers from 1 to the total number of rows in the dataset.
hawks_select$Num_id <- 1:nrow(hawks_select)
# Verifying that all values are unique
if (length(unique(hawks_select$Num_id)) == nrow(hawks_select)) {
cat("The values in the column 'Num_ID' are unique.\n")
} else {
cat("The values in the column 'Num_ID' are NOT unique.\n")
}
hawks_select <- subset(hawks_select, select = -BandNumber)
```
Point 2.
We are going to correct the variable Wing that had a record with no value:
```{r Wing_imput, echo=TRUE, message=FALSE, warning=FALSE}
# El atributo tiene un valor NA, vamos a calcular una media condicional:
media_alas_especie_edad <- aggregate(Wing ~ Species + Age, data = hawks_select, FUN = function(x) mean(x, na.rm= TRUE))
# Calculamos las medias de la columna 'Wing' para cada especie y edad, excluyendo los valores NA
# Imputamos el valor faltante en la columna 'Wing' segun la especie y la edad
# iteramos sobre cada registro usdando un bucle for empezando por 1 hasta el valor del numero de filas
# usando 'nrow'
for (i in 1:nrow(hawks_select)) {
# Si el valor de 'Wing' que estamos buscando es NA, buscamos la media que corresponde a la especie y edad
if (is.na(hawks_select$Wing[i])) {
# Filtramos el 'data.frame' según la especie y edad
media_especie_edad <- media_alas_especie_edad$Wing[media_alas_especie_edad$Species == hawks_select$Species[i] & media_alas_especie_edad$Age == hawks_select$Age[i]]
# imputamos la media que corresponda
hawks_select$Wing[i] <- media_especie_edad
}
}
# Verificamos que no haya valores NA en la columna
if (!any(is.na(hawks_select$Wing))) {
cat("Todos los valores NA en la columna 'Wing' han sido correctamente imputados")
} else {
cat("Aun quedan valores NA en la columna 'Wing'. \n")
}
```
Point 3.
We go on to correct the NA values in the 'Weight' table using the same method as above.
That is, we impute the mean according to age and species:
```{r Weight_imputs, echo=TRUE, message=FALSE, warning=FALSE}
# El atributo tiene diez registros con NA, vamos a calcular una media condicional:
media_peso_especie_edad <- aggregate(Weight ~ Species + Age, data = hawks_select, FUN = function(x) mean(x, na.rm= TRUE))
# Calculamos las medias de la columna 'Weight' para cada especie y edad, excluyendo los valores NA
# Imputamos el valor faltante en la columna 'Weight' según la especie-edad
# iteramos sobre cada registro usando un bucle 'for' empezando por 1 hasta el valor del numero de filas
# empleando 'nrow'
for (i in 1:nrow(hawks_select)) {
# Si el valor de 'Weight' que estamos buscando es NA, buscamos la media que corresponde a la especie y edad
if (is.na(hawks_select$Weight[i])) {
# Filtramos el 'data.frame' según la especie y edad
media_especie_edad <- media_peso_especie_edad$Weight[media_peso_especie_edad$Species == hawks_select$Species[i] & media_peso_especie_edad$Age == hawks_select$Age[i]]
# imputamos la media que corresponda al registro NA
hawks_select$Weight[i] <- media_especie_edad
}
}
# Verificamos que no haya valores NA en la columna
if (!any(is.na(hawks_select$Weight))) {
cat("Todos los valores NA en la columna 'Weight' han sido correctamente imputados")
} else {
cat("Aun quedan valores NA en la columna 'Weight'. \n")
}
```
Point 4.
The variable 'Culmen' contains 7 NA records which we will correct using the same procedure as above, i.e. calculating the mean of the beak dimensions according to age \> and species.
```{r culmen_imputations, echo=TRUE, message=FALSE, warning=FALSE}
# El atributo tiene 7 registros NAs, vamos a calcular una media condicional:
media_pico_especie_edad <- aggregate(Culmen ~ Species + Age, data = hawks_select, FUN = function(x) mean(x, na.rm= TRUE))
# Calculamos las medias de la columna 'Culmen' para cada especie y edad, excluyendo los valores NA
# Imputamos el valor faltante en la columna 'Culmen' según la especie-edad
# iteramos sobre cada registro usando un bucle 'for' empezando por 1 hasta el valor del numero de filas
# empleando 'nrow'
for (i in 1:nrow(hawks_select)) {
# Si el valor de 'Culmen' que estamos buscando es NA, buscamos la media que corresponde a la especie y edad
if (is.na(hawks_select$Culmen[i])) {
# Filtramos el 'data.frame' según la especie y edad
media_especie_edad <- media_pico_especie_edad$Culmen[media_pico_especie_edad$Species == hawks_select$Species[i] & media_pico_especie_edad$Age == hawks_select$Age[i]]
# imputamos la media que corresponda al registro NA
hawks_select$Culmen[i] <- media_especie_edad
}
}
# Verificamos que no haya valores NA en la columna
if (!any(is.na(hawks_select$Culmen))) {
cat("Todos los valores NA en la columna 'Culmen' han sido correctamente imputados")
} else {
cat("Aun quedan valores NA en la columna 'Culmen'. \n")
}
```
Point 5.
The variable 'Hallux' contains another suitable number of NA records.
We apply the corrective policies previously implemented:
```{r Hallux_imputations, echo=TRUE, message=FALSE, warning=FALSE}
# El atributo tiene 6 registros NAs, vamos a calcular una media condicional.
# Calculamos las medias de la columna 'Hallux' para cada especie y edad
media_espolon_especie_edad <- aggregate(Hallux ~ Species + Age, data = hawks_select, FUN = function(x) mean(x, na.rm= TRUE))
# Imputamos el valor faltante en la columna 'Hallux' según la especie-edad, iteramos sobre cada registro # usando un bucle 'for' empezando por 1 hasta el valor del numero de filas empleando 'nrow'
for (i in 1:nrow(hawks_select)) {
# Si el valor de 'Hallux' que estamos buscando es NA, buscamos la media que corresponde a la especie y edad
if (is.na(hawks_select$Hallux[i])) {
# Filtramos el 'data.frame' según la especie y edad
media_especie_edad <- media_espolon_especie_edad$Hallux[media_espolon_especie_edad$Species == hawks_select$Species[i] & media_espolon_especie_edad$Age == hawks_select$Age[i]]
# imputamos la media que corresponda al registro NA
hawks_select$Hallux[i] <- media_especie_edad
}
}
# Verificamos que no haya valores NA en la columna
if (!any(is.na(hawks_select$Hallux))) {
cat("Todos los valores NA en la columna 'Hallux' han sido correctamente imputados")
} else {
cat("Aun quedan valores NA en la columna 'Hallux'. \n")
}
```
Before going on to check the tables and variables created and modified, let's round to one decimal place.
```{r tables_adecuation, echo=TRUE, message=FALSE, warning=FALSE}
# Copia de seguridad
hawks_select_copy <- hawks_select
# Redondeamos a un decimal en la tabla 'hawks_select':
hawks_select$Weight <- round(hawks_select$Weight, 1)
hawks_select$Wing <- round(hawks_select$Wing, 1)
hawks_select$Culmen <- round(hawks_select$Culmen, 1)
hawks_select$Hallux <- round(hawks_select$Hallux, 1)
# Redondeamos a dos decimales en la tabla 'media_alas_especie_X'(alas, espolon, peso, pico)
media_alas_especie_edad$Wing <- round(media_alas_especie_edad$Wing, 1)
media_espolon_especie_edad$Hallux <- round(media_espolon_especie_edad$Hallux, 1)
media_peso_especie_edad$Weight <- round(media_peso_especie_edad$Weight, 1)
media_pico_especie_edad$Culmen <- round(media_pico_especie_edad$Culmen, 1)
```
Based on the summary() performed at the beginning, there seem to be outliers in the records, aside from the issues with infinite or 'NA' values.
At first, we may be tempted to impute these apparent outliers or attribute the variability to age differences.
Age is the trap in this database.
It can greatly influence physiological measurements and may be, as we maintain, the main cause of variability.
However, clearly from the results of the summary() function, we can see extreme values in the 'Hallux' variable.
A minimum of 9.50 and a maximum of 341.40, unless some of the analyzed specimens are prehistoric birds, doesn't seem very normal.
And as for the 'Wing' variable, we can make the same assertion, although the values are less extreme.
Therefore, we will need to identify outliers beyond the normal range using the interquartile range or IQR.
It is a statistical dispersion measure that represents the difference between the third quartile (Q3) and the first quartile (Q1).
Quartiles divide a data distribution into four equal parts.
The first quartile is the value below which 25% of the data lies, the third quartile is the value below which 75% of the data lies, and the IQR represents the range where half of the data lies.
Outliers are therefore those that are above or below Q1 or Q3 in that distribution.
In our case, we need to expand the margins, meaning extend the range beyond Q1 and Q3, so multiply by +- 1.5.
The lower limit will be negative, so unless it's a typographical error, we doubt that the claw or spur has a negative length.
However, we don't know what the minimum is because we are talking about birds in a developmental period.
This allow us to observe the lower and upper limits of each variable of interest or those we suspected to have issues.
Also include the other variables to ensure that there are no problems.
In theory, there should be no infinite values, so the use of the na.rm() function to remove them will not be necessary.
Subsequently, proceding to represent them in box plots.
```{r outliers_indetification,echo=TRUE, message=FALSE, warning=FALSE}
print("Aplicando IQR")
## Aplicando IQR ##
###################
# Aplicamos IQR #
#################
wing_iqr <- IQR(hawks_select$Wing)
hallux_iqr <- IQR(hawks_select$Hallux)
weight_iqr <- IQR(hawks_select$Weight)
culmen_iqr <- IQR(hawks_select$Culmen)
# Establecemos los limites #
############################
# Nota: El limite inferior queda en negativo, asi que salvo error tipográfico, dudamos que la garra o espolon sea negativa. Ahora bien no sabemos cual es el minimo porque estamos hablando de aves en desarrollo.
# Variable 'wing'
wing_limite_bajo <- quantile(hawks_select$Wing, 0.25) -1.5 * wing_iqr
wing_limite_alto <- quantile(hawks_select$Wing, 0.75) + 1.5 * wing_iqr
# Variable 'Hallux'
hallux_limite_bajo <- quantile(hawks_select$Hallux, 0.25) -1.5 * hallux_iqr
hallux_limite_alto <- quantile(hawks_select$Hallux, 0.75) +1.5 * hallux_iqr
# Variable 'Weight'
weight_limite_bajo <- quantile(hawks_select$Weight, 0.25) -1.5 * weight_iqr
weight_limite_alto <- quantile(hawks_select$Weight, 0.75) +1.5 * weight_iqr
# Variable 'Culmen'
culmen_limite_bajo <- quantile(hawks_select$Culmen, 0.25) -1.5 * culmen_iqr
culmen_limite_alto <- quantile(hawks_select$Culmen, 0.75) +1.5 * culmen_iqr
# identificamos los valores atipicos en cada una de las variables:
wing_outliers <- hawks_select$Wing[hawks_select$Wing < wing_limite_bajo | hawks_select$Wing > wing_limite_alto]
hallux_outliers <- hawks_select$Hallux[hawks_select$Hallux < hallux_limite_bajo | hawks_select$Hallux > hallux_limite_alto]
weight_outliers <- hawks_select$Weight[hawks_select$Weight < weight_limite_bajo | hawks_select$Weight > weight_limite_alto]
culmen_outliers <- hawks_select$Culmen[hawks_select$Culmen < culmen_limite_bajo | hawks_select$Culmen > culmen_limite_alto]
## Visualizamos los valores fuera de rango, de haberlos:
########################################################
print("Visualizamos los valores fuera de rango")
# Convertimos los vectores en 'data.frames'#
hallux_outliers_df <- data.frame(hallux_outliers)
weight_outliers_df <- data.frame(weight_outliers)
wing_outliers_df <- data.frame(wing_outliers)
culmen_outliers_df <- data.frame(culmen_outliers)
# Imprimimos tablas en el documento
kable(hallux_outliers_df, caption = "Outliers en 'Hallux'")
kable(weight_outliers_df, caption = "Outliers en 'Weight'")
kable(wing_outliers_df, caption = "Outliers en 'Wing'")
kable(culmen_outliers_df, caption = "Outliers en 'Culmen'")
```
We proceed to the graphical representation, but considering that the only outliers are in the variable 'hallux':
```{r outliers_representation, echo=TRUE, message=FALSE, warning=FALSE}
# Representación de la variable 'Hallux'
boxplot(hawks_select$Hallux, main = "Hallux", ylab = "Tamaño del espolón (mm)", outline = TRUE)
```
The graphical representation shows us that, indeed, we have 7 values outside the dispersion ranges.
At this point, we must decide what to do with these values and we choose to impute the species- and age-dependent mean to them:
```{r hallux_outliers_means, echo=TRUE, message=FALSE, warning=FALSE}
print("imputamos las medias en 'Hallux' a outliers segun especie y edad")
for (i in 1:nrow(hawks_select)) {
# Si el valor de 'Hallux' que estamos buscando es superior a 55.9, buscamos la media que corresponde a la especie y edad. No es necesario buscar en el rango inferior porque sabemos que todos los valores son superiores a 55.9
if (hawks_select$Hallux[i] > hallux_limite_alto) {
# Filtramos el 'data.frame' según la especie y edad
media_especie_edad <- media_espolon_especie_edad$Hallux[media_espolon_especie_edad$Species == hawks_select$Species[i] & media_espolon_especie_edad$Age == hawks_select$Age[i]]
# imputamos la media que corresponda al registro NA
hawks_select$Hallux[i] <- media_especie_edad
}
}
# Verificamos que no haya valores NA en la columna
if (!any(hawks_select$Hallux[i] > hallux_limite_alto)) {
cat("Todos los valores Outliers en la columna 'Hallux' han sido correctamente imputados")
} else {
cat("Aun quedan valores Outliers en la columna 'Hallux'. \n")
}
```
The operation appears to have been successful, we repeat with the box chart:
```{r Hallux_revised, echo=TRUE, message=FALSE, warning=FALSE}
# Representación de la variable 'Hallux' en busca de confirmar que no quedan outliers:
boxplot(hawks_select$Hallux, main = "Hallux", ylab = "Tamaño del espolón (mm)", outline = TRUE)
```
Indeed, no out-of-range values left in the variable.
### Table Integrity revision
```{r summary hawks_select, echo=TRUE, message=FALSE, warning=FALSE}
summary(hawks_select)
```
No infinite values, and the means, although quite disparate as in the case of 'wing' - considering that it does not discriminate developing birds - do not contain extreme values.
Continue examining the table, now analyzing the variables using the name() function.
```{r names_head hawks_select, echo=TRUE, message=FALSE, warning=FALSE}
# Analizamos con `name()` las variables:
names(hawks_select)
# Analizamos con `head()`
head(hawks_select, 3)
```
Next we will analyse the structure with the str() function which gives us a more complete overview of the table structure:
```{r str_hawks_select, echo=TRUE, message=FALSE, warning=FALSE}
# Realizamos un resumen estadistico de todas la variables:
str(hawks_select)
```
We observe that we have 908 records in 10 variables.
At this point, we do not consider changing the measurement scale to kilograms for the 'Weight' variable, which seems more appropriate, especially for the juvenile bird population.
We also do not consider any changes necessary in the types of variables present in the table.
Regarding the "additional tables" that contain the means of different physiological measurements for different species based on their age, we do not believe it is necessary to analyze them using any R function, as they are highly summarized tables with 6 records and 3 variables.
### Inventory
At this point, our dataset consists of:
Table 'hawks_select' with all the relevant variables for the analysis and with the cleaning process applied.
Additional tables with means of wing length, hallux size, weight, and culmen length in relation to species and age.
These tables contain 6 records and 3 variables each.
### Distribution of variables in the 'hawks_select' table.
- Histograms.
```{r tema custom, echo=FALSE, message=FALSE, warning=FALSE}
mi_tema <- function() {
theme(
panel.border = element_rect(colour = "black",
fill = NA,
linetype = 1),
panel.background = element_rect(fill = "white",
color = 'grey50'),
panel.grid.major = element_line(colour = "grey80", linetype = "dashed"),
panel.grid.minor = element_blank(),
axis.text = element_text(colour = "black",
face = "plain",
family = "serif",
size = 12),
axis.title = element_text(colour = "black",
family = "serif",
face = "bold",
size = 14),
axis.ticks = element_line(colour = "black"),
axis.ticks.length = unit(0.15, "cm"),
plot.title = element_text(size = 23,
hjust = 0.5,
family = "serif",
face = "bold",
margin = margin(0, 0, 10, 0)),
plot.subtitle=element_text(size=16,
hjust = 0.5,
margin = margin(0, 0, 10, 0)),
plot.caption = element_text(colour = "black",
face = "italic",
family = "serif",
size = 10,
margin = margin(10, 0, 0, 0)),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12),
legend.position = "right"
)
}
```
```{r hawks_select_distribution, echo=TRUE, message=FALSE, warning=FALSE}
# Histogramas para variables numéricas ##
#########################################
# Histograma para 'Wing'
print("Distribucion del tamaño de las alas")
ggplot(hawks_select, aes(x= Wing)) +
geom_histogram(binwidth = 10, color = "black", fill= "blue", alpha= 0.7) +
mi_tema() +
labs(title = "Distribucion 'Wing'", x = "Wing", y = "Frecuencia")
# Histograma paea 'Weight'
print("Distribucion del tamaño del peso")
ggplot(hawks_select, aes(x = Weight)) +
geom_histogram(binwidth = 50, color= "black", fill= "red", alpha = 0.7) +
mi_tema() +
labs(title = "Distribucion 'Weight'", x = "Weight", y = "Frecuencia")
# Histograma para 'Culmen'
print("Distribucion del tamaño del pico")
ggplot(hawks_select, aes(x = Culmen)) +
geom_histogram(binwidth = 1, color = "black", fill = "green", alpha = 0.7) +
mi_tema() +
labs(title = "Distribucion 'Culmen'", x = "Culmen", y = "Frecuencia")
# Histograma para 'Hallux'
print("Distribucion del tamaño de los espolones")
ggplot(hawks_select, aes(x = Hallux)) +
mi_tema() +
# scale_y_log10() + # aplicamos escala logaritmica para poder representar los datos de frecuencia (y)
# scale_x_log10() + # aplicamos escala logaritmica para poder representar las medidas adecuadamente (x)
geom_histogram(binwidth = 1, color = "black", fill = "purple", alpha = 0.7) +
labs(title = "Distribucion 'Hallux'", x = "Hallux", y = "Frecuencia")
# labs(caption = "Escalas logatimica en base 10 en los ejes X-Y")
```
- Bar charts for categorical variables in 'hawks_select'.
```{r Hawks_select_distr_II, echo=TRUE, message=FALSE, warning=FALSE}
## Grafico de barras para 'species' ###
#######################################
ggplot(hawks_select, aes(x= Species)) +
mi_tema() +
geom_bar(color= "black", fill = "blue", alpha = 0.7) +
labs(title = "Distribucion 'Species'", x = "Species", y = "Frecuencia") +
labs(caption = "Cooper's Hawk (CH), Red-Tailed Hawk (RT) o Sharp-Shinned Hawk (SS)")
## Grafico de barras para 'Age' ####
####################################
ggplot(hawks_select, aes(x = Age)) +
geom_bar(color= "black", fill = "gray", alpha = 0.7) +
mi_tema() +
labs(title = "Distribucion 'Age'", x= "Age", y= "Frecuencia") +
labs(caption = "A= Adulto I= Infantes")
```
On the one hand, we observed an over-representation of the RT species and an under-representation of CH.
On the other hand, the child population of the sample is higher than the adult population.
- Other visualisations of the variable distributions:
```{r distribucion en Hawks_select III, echo=TRUE, message=FALSE, warning=FALSE}
## Distribucion de las variables numéricas ##
#############################################
print("Distribuciones de variables en la tabla 'hawks_select' \n")
# Establecemos las opciones gráficas con `par()` y con `mfrow()`configuramos la matriz de representacion
par(mfrow = c(2,2))
hist(hawks_select$Wing, main= "Wing", xlab = "Longitud del ala (milimetros)")
hist(hawks_select$Weight, main = "Weight", xlab = "Peso (gramos)")
hist(hawks_select$Culmen, main = "Culmen", xlab = "Longitud del pico (milimetros)")
hist(hawks_select$Hallux, main = "Hallux", xlab = "Longitud de los espolones (milimetros)")
## Distribucion de las variables categóricas ##
###############################################
# Restablecemos las opciones gráficas con `par()` y con `mfrow()`configuramos la matriz de representación para el tipo de grafico "Barplot"
par(mfrow = c(1,2))
barplot(table(hawks_select$Age), main = "Age", xlab = "Edad", ylab = "Frecuencia")
barplot(table(hawks_select$Species), main = "Species", xlab = "Especie", ylab = "Frecuencia")
```
### Distribution of variables in annexed tables - averages according to age and species
```{r distribucion de medias, echo=TRUE, message=FALSE, warning=FALSE}
## Grafica de barras agrupadas para 'media_alas_especie_edad' ##
#################################################################
ggplot(media_alas_especie_edad, aes(x = Species, y = Wing, fill= Age)) +
mi_tema() +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Medias de alas por especie y edad", x = "Especies", y = "Media 'Wing'") +
labs(caption = "Cooper's Hawk (CH), Red-Tailed Hawk (RT) o Sharp-Shinned Hawk (SS)")
## Grafica de barras agrupadas para 'media_espolon_especie_edad' ##
###################################################################
ggplot(media_espolon_especie_edad, aes(x = Species, y= Hallux, fill = Age)) +
mi_tema() +
geom_bar(stat= "identity", position = "dodge") +
labs(title = "Medias de espolones por especie y edad", x = "Especies", y= "Media 'Hallux' ") +
labs(caption = "Cooper's Hawk (CH), Red-Tailed Hawk (RT) o Sharp-Shinned Hawk (SS)")
## Grafica de barras agrupadas para 'media_peso_especie_edad' ##
###################################################################
ggplot(media_peso_especie_edad, aes(x = Species, y= Weight, fill = Age)) +
mi_tema() +
geom_bar(stat= "identity", position = "dodge") +
labs(title = "Medias de peso por especie y edad", x = "Especies", y= "Media 'Weight' ") +
labs(caption = "Cooper's Hawk (CH), Red-Tailed Hawk (RT) o Sharp-Shinned Hawk (SS)")
## Grafica de barras agrupadas para 'media_pico_especie_edad' ##
###################################################################
ggplot(media_pico_especie_edad, aes(x = Species, y= Culmen, fill = Age)) +
mi_tema() +
geom_bar(stat= "identity", position = "dodge") +
labs(title = "Medias de pico por especie y edad", x = "Especies", y= "Media 'Culmen' ") +
labs(caption = "Cooper's Hawk (CH), Red-Tailed Hawk (RT) o Sharp-Shinned Hawk (SS)")
```
### Study using k-means aggregation method
Using the 'hawks_select' table, where we have selected the relevant variables for our study, we will employ the k-means clustering method.
This method will group observations into 'k' clusters based on the nearest mean.
It is an unsupervised clustering method used when there are no pre-defined labels for the data [@james2013; @gareth2013; @cannon2013a].
By "no pre-defined labels," we mean that the data does not have a predetermined category or class assigned to it.
In other words, we do not know which group each data point belongs to, as there are no right or wrong answers in the variable(s) @gironésroig2017 .
The most common uses of this technique are in segmentation - e.g., customers, to understand their needs -, dimensionality reduction - or to group into smaller clusters -, data understanding - to improve data representation -, defect or problem detection - e.g., outliers or when there are values far from the mean and logic -, and classification - after applying the algorithm, labels can be assigned.
The main objectives, but not exclusive, for applying this algorithm in this case would be:
1. Segmentation: Although we already know the labels - CH, RT, SS -, we can explore if these species have notable differences in their characteristics and if coherent data groups are formed.
It will also allow us to gain a better understanding of the data and identify potential subgroups.
2. Analysis of physical characteristics and age: Clustering will reveal differences between adults and immature individuals.
3. Problem detection: By applying the summary() function earlier, we suspected the presence of possible outliers or extreme values.
With this algorithm, we can confirm or dismiss this hypothesis.
#### Application of the k-means method to the numerical variables in the 'hawks_select' dataset
When applying the algorithm, we will use the normalized variables 'Wing', 'Weight', 'Culmen', and 'Hallux'.
This ensures that all features have equal weighting in the formation of clusters or groups [\@montoliucolásraúl].
We will carry this out using the scale() function in R, which performs normalization by default and follows the formula: z = (x-m)/d
Here, x is the original value, m is the mean of the variable, d is the standard deviation, and z is the standardized or z-score value [\@scalefu].
We aim to find 3 clusters, one for each hawk species, but it is also possible to have 6 clusters considering the age groups.
Therefore, we will apply the k-means algorithm twice, with 3 and 6 clusters respectively.
```{r k-means_application,echo=TRUE, message=FALSE, warning=FALSE }
print("Aplicando k-means")
# Preprocesamiento de los datos: z-score #
##########################################
hawks_z <- scale(hawks_select[, c("Wing", "Weight", "Culmen", "Hallux")])
# Aplicamos k-means con 3 agrupaciones:
# La semilla se fija para reproducibilidad. el valor inicializa el RNG. Si se mantiene constante, aunque el k-means use numeros aleatorios, se garantiza que se puede reproducir
# Aplicamos los k-means:
print("para tres grupos")
set.seed(131)
tres_grupos <- 3
kmeans3_result <- kmeans(hawks_z, centers = tres_grupos)
print("para seis grupos")
set.seed(131)
seis_grupos <- 6
kmeans6_result <- kmeans(hawks_z, centers = seis_grupos)
# Agregamos la información de los clusteres a los datos de la tabla original:
hawks_select$Cluster3 <- as.factor(kmeans3_result$cluster)
hawks_select$Cluster6 <- as.factor(kmeans6_result$cluster)
```
At this point, it is necessary to clarify that by using set.seed(), we are ensuring randomness.
Seeds are numbers used to initialize the random number generator (RNG).
Regardless of the chosen number, it guarantees that any procedure involving random numbers is replicable.
This is particularly important in algorithms like k-means, which are sensitive to random initialization and can produce different results in each execution [@macqueen1967; @kaufman2009a]
Next, we will visualize each combination of variables and the corresponding results for analysis:
- Graphs for 3 groups:
```{r k-means3_graphs, echo=TRUE, message=FALSE, warning=FALSE}
print("Graficando k-means para tres grupos")
# Tres grupos #
###############
# Wing vs Weight
print("Tres grupos: Wing vs Weight")
ggplot(hawks_select, aes(x = Wing, y = Weight, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Tamaño del ala (mm)", y = "Peso (g)") +
theme_minimal()
# Wing vs Culmen
print("Tres grupos: Wing vs Culmen")
ggplot(hawks_select, aes(x = Wing, y = Culmen, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Tamaño del ala (mm)", y = "Tamaño del pico (mm)") +
theme_minimal()
# Wing vs Hallux
print("Tres grupos: Wing vs Hallux")
ggplot(hawks_select, aes(x = Wing, y = Hallux, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Tamaño del ala (mm)", y = "Tamanyo del espolon (mm)") +
theme_minimal()
# Weight vs Culmen
print("Tres grupos: Weight vs Culmen")
ggplot(hawks_select, aes(x = Weight, y = Culmen, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Peso (g)", y = "Tamaño del pico (mm)") +
theme_minimal()
# Weight vs Hallux
print("Tres grupos: Weight vs Hallux")
ggplot(hawks_select, aes(x = Weight, y = Hallux, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Peso (g)", y = "Tamaño de la garra (mm)") +
theme_minimal()
# Culmen vs Hallux
print("Tres grupos: Culmen vs Hallux")
ggplot(hawks_select, aes(x = Culmen, y = Hallux, color = Cluster3)) +
geom_point() +
labs(title = "Clustering k-means (3 grupos)", x = "Tamaño del pico (mm)", y = "Tamaño de la garra (mm)") +
theme_minimal()
# Visualizamos los clusters #
#############################
# cluster 3 #
#############
clusplot(hawks_z, kmeans3_result$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
title("\n \n Clustering k-means (3 grupos)")
```
The measurements of 'wing' and 'Culmen', that is, beak and wings, appear to tend to separate the three bird species.
Weight also does not seem to be the most consistent grouping measure, as there is a lot of overlap.
In general, it tends to mix observations.
We assume that the problem is the mixing of juvenile and adult populations.
Charts for 6 groups:
```{r graficando k-means6, echo=TRUE, message=FALSE, warning=FALSE}
print("Graficando k-means para seis grupos")
# Seis grupos#
##############
# Wing vs Weight
print("Seis grupos: Wing vs Weight")
ggplot(hawks_select, aes(x = Wing, y = Weight, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Tamaño del ala (mm)", y = "Peso (g)") +
theme_minimal()
# Wing vs Culmen
print("Seis grupos: Wing vs Culmen")
ggplot(hawks_select, aes(x = Wing, y = Culmen, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Tamaño del ala (mm)", y = "Tamaño del pico (mm)") +
theme_minimal()
# Wing vs Hallux
print("Seis grupos: Wing vs Hallux")
ggplot(hawks_select, aes(x = Wing, y = Hallux, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Tamaño del ala (mm)", y = "Tamaño de ls espolones (mm)") +
theme_minimal()
# Weight vs Culmen
print("Seis grupos: Weight vs Culmen")
ggplot(hawks_select, aes(x = Weight, y = Culmen, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Peso (g)", y = "Tamaño del pico (mm)") +
theme_minimal()
# Weight vs Hallux
print("Seis grupos: Weight vs Hallux")
ggplot(hawks_select, aes(x = Weight, y = Hallux, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Peso (g)", y = "Tamaño de la garra (mm)") +
theme_minimal()
# Culmen vs Hallux
print("Seis grupos: Culmen vs Hallux")
ggplot(hawks_select, aes(x = Culmen, y = Hallux, color = Cluster6)) +
geom_point() +
labs(title = "Clustering k-means (6 grupos)", x = "Tamaño del pico (mm)", y = "Tamaño de la garra (mm)") +
theme_minimal()
# Visualizamos los clusters #
#############################
# cluster 6 #
############
clusplot(hawks_z, kmeans6_result$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
title("\n \n Clustering k-means (6 grupos)")
```
We observe that forcing the creation of 6 groups does not result in convincing outcomes.
In general, everything suggests that there are 3 groups, although not very homogeneous, which tend to mix and confuse the algorithm when trying to group them.
#### Analysis of the results
At this point, as we have observed throughout, the graphs of the different variables suggest the existence of 3 groups, although not very homogeneous.
Ideally, we would have been able to make the clusters comparable across different age groups and species by including age as an analytical variable.
However, the k-means algorithm is designed to work with continuous variables, and both species and age are categorical.
One possible solution that was not addressed at the beginning of this exercise would have been to create indicator variables (dummy variables) with values of 1 and 0.
However, this approach would have also brought problems, as k-means assigns equal importance to all variables, and the indicator variables would have played an inappropriate role in the analysis.
Additionally, the analysis seemed much more complex with dummy variables.
Since we still do not have a clear idea of how many groups we can label the data with, we will try to perform hierarchical clustering.
Given the nature of the dataset, we made the mistake of assuming that we could guess the number of clusters based on the nature of the data.
Therefore, we will attempt to use hierarchical clustering based on the Euclidean method.
```{r clustering jerarquico, echo=TRUE, message=FALSE, warning=FALSE}
print("clustering jerarquico")
# Calculamos la matriz de distancias
distancias <- dist(hawks_z, method = "euclidean")
# Realizamos el clustering jerárquico
clustering_jerarquico <- hclust(distancias, method = "complete")
# Cortamos el árbol para obtener 3 clusters
clusters_jerarquico3 <- cutree(clustering_jerarquico, k = 3)
# Añadimos la información de los clusters al data frame original
hawks_select$ClusterH3 <- as.factor(clusters_jerarquico3)
# Visualizamos el dendrograma
dend <- as.dendrogram(clustering_jerarquico)
# Coloreamos las ramas del dendrograma según los grupos
colors <- rainbow(3)[clusters_jerarquico3]
labels_colors(dend) <- colors[order.dendrogram(dend)]
plot(dend)
colored_bars(colors[order.dendrogram(dend)], dend, rowLabels = FALSE)
# Graficamos un ejemplo de dispersión para los clusters jerárquicos
ggplot(hawks_select, aes(x = Wing, y = Weight, color = ClusterH3)) +
geom_point() +
labs(title = "Clustering jerárquico (3 grupos)", x = "Longitud del ala (mm)", y = "Peso (g)") +
theme_minimal()
# visualizamos la tabla con la agrupacion de los resultados:
table(clusters_jerarquico3)
```
We observe that it has created two groups.
In general, we can conclude that the dataset was not particularly well-suited for the k-means method due to the aforementioned aspects.
With limitations, we can affirm that the data, when applying the algorithm, seems to fall into 3 heterogeneous groups, which would not allow us to confidently assign an observation to a specific group or label these observations.
# Example 2
## Analysis applying DBSCAN and OPTICS methods.
Based on the same dataset that has been created, including the normalized scores, we will proceed to perform an analysis using DBSCAN and OPTICS.
We will use the 'DBSCAN' library, which focuses on efficiently implementing DBSCAN and OPTICS.
Additionally, we will complement it with the use of the 'fpc' library, which includes a 'cluster.stats' function that can evaluate the results of an OPTICS clustering.
```{r library_II, echo=FALSE, message=FALSE, warning=FALSE}
if (!require('fpc')) install.packages('fpc'); library(fpc)
if (!require('dbscan')) install.packages('dbscan'); library(dbscan)
```
OPTICS (Ordering Points to Identify the Clustering Structure), introduced by Ankerst et al. in 1999, is an extension of the DBSCAN algorithm that can help us better handle the problem of variable density among clusters, which was evident when applying k-means.
OPTICS does not directly assign clusters but orders the points to highlight their density, facilitating visualization with a reachability plot that shows the distance between each point.
While theoretically, we do not have outliers in the dataset, if present, OPTICS does not exclude outliers but identifies them by assigning them a higher reachability value.
Let's proceed with OPTICS:
```{r OPTICS, echo=TRUE, message=FALSE, warning=FALSE}
print("aplicamos optics al dataset que ya tenemos")
# Configuramos eps= x o la distancia máxima entre dos puntos:
# El grupo debe tener como mínimo 'minPts'= x observaciones para ser agrupado
optics_result <- dbscan::optics(hawks_z, minPts = 10)
optics_result
# Obtenemos el orden de las observaciones o puntos
print("Obtenemos la odenacion de los puntos")
optics_result$order
```
Creating a reachability diagram:
```{r diagrama de alcanzabilidad, echo=TRUE, message=FALSE, warning=FALSE}
# Diagrama o gráfica de alcance
plot(optics_result, main= "Diagrama de alcanzabilidad de OPTICS")
```
We have identified one large cluster, two small clusters, and two medium-sized clusters.
This conclusion was reached after experimenting with different parameter values for eps (maximum distance between two points) and minPts (minimum number of points in a neighborhood) in the clustering of observations.
There is a reachability order of approximately 600 points that marks the valley of a large cluster.
Valleys represent high reachability distances, indicating clusters, while peaks represent the separation between clusters.
We observe that the clusters are not uniform, and the reachability plot reveals a total of 5 clusters.
To further analyze the clusters, we can create a reachability plot that shows the distances between nearby points within the same cluster and between different clusters.
```{r trazas alas-peso OPTICS, echo=TRUE, message=FALSE, warning=FALSE}
## Dibujo de trazas variables alas-peso
# Selección de columnas
a <- hawks_select$Wing
b <- hawks_select$Weight
c <- hawks_select$Culmen
d <- hawks_select$Hallux
# Creación de la grafica
plot(a, b, col="grey")
# agregamos trazos al par 'Alas-Peso'
title("Trazas Alas-peso")
polygon(a[optics_result$order], b[optics_result$order])
```
And now the graph for the traces:
```{r trazas pico-espolon OPTICS, echo=TRUE, message=FALSE, warning=FALSE}
# Dibujo de trazas pico-espolon
c <- hawks_select$Culmen
d <- hawks_select$Hallux