The analysis of proportions is of two primary types.

- Focus on a single value of a categorical variable, termed a “success” when it occurs, for one or more samples of data. Analyze the resulting proportion of occurrence for a single sample, or, for a test of *homogeneity*, compare proportions of successes across distinct data samples for a single variable.
- Compare the obtained proportions across the values of one or more categorical variables for a single sample. Applied to a single variable, the analysis is a test of *goodness-of-fit*. Or, evaluate a potential relationship between two categorical variables, a test of *independence*.

Built on standard base R functions, the **lessR** function `Prop_test()`, abbreviated `prop()`, provides either type of analysis. To use it, enter either the original data, from which the frequencies and then the sample proportions are computed, or the already computed frequencies. For the analysis of multiple categorical variables across two levels of one of the variables, the test of *homogeneity* and the test of *independence* yield the identical statistical result.
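The equivalence of the two tests can be checked with a minimal base R sketch that calls `prop.test()` and `chisq.test()` directly rather than `Prop_test()`. The counts are copied from the *Jacket* by *Bike* cross-tabulation reported later in this vignette; the object names `bmw`, `honda`, `homog`, and `indep` are illustrative only.

```r
# Joint frequencies of Bike by Jacket, copied from the cross-tabulation
# reported later in this vignette
bmw   <- c(Lite = 89,  Med = 135, Thick = 194)
honda <- c(Lite = 283, Med = 207, Thick = 117)

# Homogeneity: is the proportion of BMW riders equal across jacket groups?
homog <- prop.test(x = bmw, n = bmw + honda)

# Independence: are Bike and Jacket related?
indep <- chisq.test(rbind(BMW = bmw, Honda = honda))

# Both tests report the same chi-square statistic, df, and p-value
rbind(
  homogeneity  = c(chisq = unname(homog$statistic),
                   df    = unname(homog$parameter),
                   p     = homog$p.value),
  independence = c(chisq = unname(indep$statistic),
                   df    = unname(indep$parameter),
                   p     = indep$p.value)
)
```

Because *Bike* has only two levels, treating the BMW count as the number of successes within each jacket group poses the same question as asking whether the two variables are independent.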

The following table summarizes the values of the `Prop_test()` parameters for different analyses of proportions. Each function call for the analysis of data begins with the name of a categorical variable, generically referred to as `X`. The value of `X` is the first parameter in the function definition, and so does not need its parameter name, `variable`. If needed, indicate a second categorical variable, generically referred to as `Y`, with the `by` parameter. If focused on a specific value of `X` as a success, referred to as `X_value`, indicate that value with the `success` parameter.

Run each analysis either directly from pre-computed values of the sample proportions, or from the original data from which the sample proportions are calculated.

Evaluate | Data Parameters | Count Parameters |
---|---|---|
A hypothesized proportion | X, `success`=X_value | `n_succ`, `n_tot` [scalars] |
Equal proportions across samples | X, `success`=X_value, `by`=Y | `n_succ`, `n_tot` [vectors] |
Uniform goodness-of-fit | X | `n_tot` [vector] |
Independence of two variables | X, `by`=Y | `n_table` |

The remainder of this vignette illustrates these applications of `Prop_test()`.

Define the occurrence of a designated value of the `variable` as a `success`. Define all other values of the variable as failures. Of course, success or failure in this context does not necessarily mean good or bad, desired or undesired, but instead, a designated value either occurred or did not.

When analyzing proportions from data, first indicate the categorical variable, the value of the parameter `variable`. Next, indicate the designated value of `variable` with the parameter `success`. When entering proportions directly, indicate the number of successes and the total number of trials with the `n_succ` and `n_tot` parameters. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for `success` or `n_succ`, the analysis is of goodness-of-fit or independence.

The example below is from the documentation for the base R function `binom.test()`, which provides the exact test of a null hypothesis regarding the probability of success. `Prop_test()` uses that base R function to compare a sample proportion to a hypothesized population value.

For a given categorical variable of interest, a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis tested is that the specified value occurs for 3/4 of the population, as set with the `pi` parameter.

```
##
## <<< Exact binomial test of a proportion
##
## ------ Describe ------
##
## Number of successes: 682
## Number of failures: 243
## Number of trials: 925
## Sample proportion: 0.737
##
## ------ Infer ------
##
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
```
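The reported values can be cross-checked with the base R function named above. This is a minimal sketch that calls `binom.test()` directly. The commented `Prop_test()` line is an assumed form of the call, pieced together from the parameter names `n_succ`, `n_tot`, and `pi` described in this vignette, and is not run here.

```r
# Assumed lessR form of the call (not run here):
#   Prop_test(n_succ=682, n_tot=925, pi=0.75)

# Exact binomial test: 682 successes ("giant") in 925 trials, null value 0.75
fit <- binom.test(x = 682, n = 925, p = 0.75)

round(unname(fit$estimate), 3)   # sample proportion
round(fit$p.value, 3)            # p-value for the null of 0.75
round(fit$conf.int, 3)           # 95% confidence interval
```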

To illustrate with data, read the *Jackets* data file included with **lessR** into the data frame *d*. The file contains two categorical variables. The variable *Bike* represents two different types of motorcycle: BMW and Honda. The second variable is *Jacket* with three values of jacket thickness: Lite, Med, and Thick. Because *d* is the default name of the data frame that contains the variables for analysis, the `data` parameter that names the input data frame need not be specified.

```
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Bike character 1025 0 2 BMW Honda Honda ... Honda Honda BMW
## 2 Jacket character 1025 0 3 Lite Lite Lite ... Lite Med Lite
## ------------------------------------------------------------------------------------------
```

In the following example, for the `variable` *Bike* from the default *d* data frame, define the parameter `success` as the value *“BMW”*. The default null hypothesis is a population value of 0.5, but here it is specified explicitly with the parameter `pi`.

For clarity, the following example includes the parameter names listed with their corresponding values. These names are unnecessary in this example, however, because the values are listed in the same order as the parameters appear in the definition of the `Prop_test()` function.

```
##
## <<< Exact binomial test of a proportion
##
## variable: Bike
## success: BMW
##
## ------ Describe ------
##
## Number of missing values: 0
## Number of successes: 418
## Number of failures: 607
## Number of trials: 1025
## Sample proportion: 0.408
##
## ------ Infer ------
##
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439
```

Reject the null hypothesis, with a \(p\)-value that rounds to 0.000, less than \(\alpha = 0.05\). The sample proportion \(p=0.408\) is considered far from the hypothesized value of \(0.5\) for the proportion of `"BMW"` values for *Bike*. Conclude that the data were sampled from a population with a proportion of BMW different from 0.5.
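When working from data, the analysis first counts the occurrences of the designated value and then tests that count. The following base R sketch mirrors those two steps. The vector `bike` is a stand-in rebuilt from the counts reported above (418 BMW, 607 Honda), not the actual *Jackets* file, and the object names are illustrative.

```r
# Stand-in for the Bike variable, rebuilt from the reported counts;
# in practice this column comes from the data frame d
bike <- c(rep("BMW", 418), rep("Honda", 607))

n_succ <- sum(bike == "BMW")   # occurrences of the designated value
n_tot  <- length(bike)         # total number of trials

# Exact binomial test against the null value of 0.5
fit <- binom.test(x = n_succ, n = n_tot, p = 0.5)

round(n_succ / n_tot, 3)       # sample proportion
fit$p.value < 0.0005           # TRUE, so the p-value prints as 0.000
round(fit$conf.int, 3)         # 95% confidence interval
```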

The following example is from the base R `prop.test()` documentation, which the **lessR** `Prop_test()` relies upon to compare proportions across different groups.

The null hypothesis in this example is that the four populations of
*patients* from which the samples were drawn have the same
population proportion of *smokers*. The alternative is that at
least one population proportion is different. Label the groups in the
output by providing a named vector for the successes.

To indicate multiple proportions across groups, provide multiple values for the `n_succ` and `n_tot` parameters. Optionally, name the groups.

```
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
```

```
##
## <<< 4-sample test for equality of proportions without continuity correction
##
##
## --- Description
##
## Group1 Group2 Group3 Group4
## ----------- ------- ------- ------- -------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## --- Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
```

The result of the test is \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal proportions across the corresponding four populations. Conclude that at least one of the population proportions of smokers differs from the others.
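To see where the chi-square statistic comes from, the test can be worked by hand: pool the samples to estimate the common proportion under the null hypothesis, compute the expected successes and failures for each group, and sum the squared deviations relative to the expected values. The sketch below is plain base R arithmetic under that standard formulation, not the internal code of `Prop_test()`; the object names are illustrative.

```r
smokers  <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

# Pooled proportion of successes, the common value under the null hypothesis
p_pool <- sum(smokers) / sum(patients)

# Expected successes and failures in each group under the null
exp_succ <- patients * p_pool
exp_fail <- patients * (1 - p_pool)

# Pearson chi-square over successes and failures, df = number of groups - 1
chisq <- sum((smokers - exp_succ)^2 / exp_succ) +
         sum(((patients - smokers) - exp_fail)^2 / exp_fail)
df    <- length(smokers) - 1
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)

round(c(chisq = chisq, df = df, p_value = p_val), 3)
```

Most of the statistic comes from the fourth group, whose sample proportion of 0.854 lies well below the pooled value.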

In the following example, duplicate the previous results, but in this
example from data. To illustrate, create the data frame *d*
according to the proportions of smokers and non-smokers with respective
values “smoke” and “nosmoke”. Of course, in actual data analysis the
data would already be available.

```
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
```

To test if the different groups have the same population proportion of `success`, retain the syntax for a single proportion for the categorical `variable` of interest. Define success by the value of this variable, here *“smoke”*. An additional parameter, `by`, indicates the variable that defines the groups, a variable that contains a label that identifies the corresponding group for each row of data. The grouping variable in this example is *grp*, with values the first four uppercase letters of the alphabet. The first six rows of data are shown below.

```
## sm grp
## 1 smoke A
## 2 smoke A
## 3 smoke A
## 4 smoke A
## 5 smoke A
## 6 smoke A
```

The relevant parameters `variable`, `success`, and `by` are listed in their given order in this example, so the parameter names are unnecessary. They are included here for clarity.

```
##
## <<< 4-sample test for equality of proportions without continuity correction
##
## variable: sm
## success: smoke
## by: grp
##
## --- Description
##
## A B C D
## ----------- ------ ------ ------ ------
## n_smoke 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## --- Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
```

The analysis of data that matches the previously input proportions, of course, provides the same results as providing the proportions directly.
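The step from raw data to counts can be sketched in base R: tabulate the successes and the totals within each group, then pass the two vectors to `prop.test()`. The data frame is rebuilt here so the block runs on its own. The commented `Prop_test()` line is an assumed form of the call based on the parameters `variable`, `success`, and `by` echoed in the output above, and is not run here.

```r
# Assumed lessR form of the call (not run here):
#   Prop_test(variable=sm, success="smoke", by=grp)

# Rebuild the illustrative data frame from the example above
sm  <- c(rep("smoke", 83),  rep("nosmoke", 3),
         rep("smoke", 90),  rep("nosmoke", 3),
         rep("smoke", 129), rep("nosmoke", 7),
         rep("smoke", 70),  rep("nosmoke", 12))
grp <- c(rep("A", 86), rep("B", 93), rep("C", 136), rep("D", 82))
d   <- data.frame(sm, grp)

# Tabulate successes and totals by group
n_succ <- tapply(d$sm == "smoke", d$grp, sum)
n_tot  <- tapply(d$sm, d$grp, length)
rbind(n_smoke = n_succ, n_total = n_tot,
      proportion = round(n_succ / n_tot, 3))

# Same test as with the counts entered directly
fit <- prop.test(x = as.vector(n_succ), n = as.vector(n_tot))
round(c(chisq = unname(fit$statistic), df = unname(fit$parameter),
        p_value = fit$p.value), 3)
```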

For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, instead calculate the proportion of occurrence for each value from the total number of occurrences, as one sample from a single population. In addition to the inference test, the following are also reported:

- The observed and expected frequencies
- The residual of expected from observed
- The standardized version of the residual

For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter `n_tot`. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

In this example, enter three frequencies as a vector for the `n_tot` parameter value. Optionally, make the vector a named vector to label the output accordingly.

```
##
## <<< Chi-squared test for given probabilities
##
##
## --- Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## --- Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
```

This example does not quite attain significance at the customary 5% level, with \(p\)-value \(= 0.066 > \alpha = 0.05\). A difference of the corresponding population proportions was not detected.
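The same quantities can be reproduced with the base R `chisq.test()` function, which tests equal probabilities by default when given a single vector of counts. This is a cross-check sketch, not the `Prop_test()` source. The commented `Prop_test()` line is an assumed form of the call based on the description of `n_tot` above, and is not run here.

```r
# Assumed lessR form of the call (not run here):
#   Prop_test(n_tot=c(Lite=372, Med=342, Thick=311))

jackets <- c(Lite = 372, Med = 342, Thick = 311)

# Default null for a vector of counts: equal probabilities for all categories
fit <- chisq.test(jackets)

round(rbind(observed = fit$observed,
            expected = fit$expected,
            residual = fit$residuals,   # Pearson: (obs - exp) / sqrt(exp)
            stdn_res = fit$stdres), 3)
round(c(chisq = unname(fit$statistic), df = unname(fit$parameter),
        p_value = fit$p.value), 3)
```

The row labeled `residual` in the output above matches the Pearson residual, the difference between observed and expected divided by the square root of the expected frequency.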

The same analysis follows from the data. Just specify the name of the categorical `variable` of interest.

```
##
## <<< Chi-squared test for given probabilities
##
## variable: Jacket
##
## --- Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## --- Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
```

Tests of independence evaluated here rely upon a two-dimensional contingency table, also called a cross-tabulation table or joint frequency table. Enter the joint frequencies directly or compute them from the data. The corresponding analysis provides the chi-square test for the null hypothesis of independence.

Also provided is Cramer’s V to indicate the extent of the relationship of the two categorical variables. For each cell frequency, the expected value given the independence assumption is provided, along with the corresponding residual from the observed frequency and the corresponding standardized residual.

To enter the joint frequency table directly, store the frequencies in a file accessible from your computer system. One possibility is to enter the numbers into a text file with file type `.csv` or `.txt`. Enter the numbers with a text editor, or with a word processor saving the file as a text file. This file format separates the adjacent values in each row with a comma, as indicated below. Or, enter the numbers into an MS Excel formatted file with file type `.xlsx`. Enter only the numeric frequencies, no labels.

For example, consider the following joint frequency table with four levels of the column variable and four levels of the row variable, here in `csv` format.

```
3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824
```

After saving the file, call `Prop_test()` using the parameter `n_table` to indicate the path name to the file, enclosed in quotes. Or, leave the quotes empty to browse for the joint frequency table.

This table is included in a file downloaded with
**lessR** with the name *FreqTable99*. That name
triggers an internal process that locates the file within the
*lessR* installation without needing to construct a rather
complicated path name as part of this example. That also means that the
name becomes a reserved key word with its use always triggering the
following example.

In general, replace *FreqTable99* in this example with your
own path name to your file of joint frequencies, or just delete the name
leaving only the two quotes to indicate to browse for the file.

```
##
## <<< Pearson's Chi-squared test
##
## --- Description
##
## Cell Frequencies
## 3 58 6 105
## 41 79 9 207
## 86 179 27 484
## 143 214 31 824
##
## Cramer's V: 0.075
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 3 18.812 -15.812 -4.003
## 1 2 58 36.522 21.478 4.150
## 1 3 6 5.030 0.970 0.455
## 1 4 105 111.635 -6.635 -1.098
## 2 1 41 36.750 4.250 0.799
## 2 2 79 71.346 7.654 1.098
## 2 3 9 9.827 -0.827 -0.288
## 2 4 207 218.077 -11.077 -1.361
## 3 1 86 84.875 1.125 0.156
## 3 2 179 164.776 14.224 1.504
## 3 3 27 22.696 4.304 1.105
## 3 4 484 503.654 -19.654 -1.781
## 4 1 143 132.562 10.438 1.339
## 4 2 214 257.356 -43.356 -4.246
## 4 3 31 35.447 -4.447 -1.057
## 4 4 824 786.635 37.365 3.135
##
## --- Inference
##
## Chi-square statistic: 41.732
## Degrees of freedom: 9
## Hypothesis test of equal population proportions: p-value = 0.000
```
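As a cross-check that does not depend on a file path, the same comma-separated table can be read from an in-line string and passed to the base R `chisq.test()` function. This is a sketch under the assumption that Cramer's V follows the usual definition, the square root of the chi-square statistic divided by the product of the sample size and one less than the smaller table dimension. The names `csv_text`, `freq`, and `cramers_v` are illustrative.

```r
# The joint frequencies in csv format, label-free, as in the file above
csv_text <- "3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824"

freq <- as.matrix(read.csv(text = csv_text, header = FALSE))

# Pearson chi-square test of independence
fit <- chisq.test(freq)

# Cramer's V: sqrt(chi-square / (n * (min(rows, cols) - 1)))
n <- sum(freq)
cramers_v <- sqrt(unname(fit$statistic) / (n * (min(dim(freq)) - 1)))

round(c(chisq = unname(fit$statistic), df = unname(fit$parameter),
        p_value = fit$p.value, cramers_v = cramers_v), 3)
round(unname(fit$expected[1, ]), 3)   # expected frequencies, first row
```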

Do not have the path name to your file readily available? Then browse for the file. The following example is not run as it cannot run in this vignette.

`Prop_test(n_table="")`

The full path name for the file is provided as part of the output.

The \(\chi^2\) test of independence evaluated here applies to two categorical variables. The first categorical variable listed in this example is the value of the parameter `variable`, the first parameter in the function definition, so does not need the parameter name. The second categorical variable listed must include the parameter name `by`.

The question for the analysis is whether the observed frequencies of *Jacket* thickness and *Bike* ownership differ sufficiently from the frequencies expected under the null hypothesis of independence to conclude that the variables are related.

```
## variable: Jacket
## by: Bike
##
## <<< Pearson's Chi-squared test
##
## --- Description
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 89 151.703 -62.703 -8.288
## 1 2 135 139.469 -4.469 -0.602
## 1 3 194 126.827 67.173 9.287
## 2 1 283 220.297 62.703 8.288
## 2 2 207 202.531 4.469 0.602
## 2 3 117 184.173 -67.173 -9.287
##
## --- Inference
##
## Chi-square statistic: 104.083
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.000
```

The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null
hypothesis of independence. Conclude that the type of *Bike* a
person rides and the thickness of their *Jacket* are related.
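The expected frequencies and the test statistic in the output above follow from the marginal totals alone. The sketch below works through that arithmetic in base R from the reported joint frequencies; it illustrates the standard Pearson computation and is not the `Prop_test()` source. Object names are illustrative.

```r
# Observed joint frequencies, copied from the output above
obs <- rbind(BMW   = c(Lite = 89,  Med = 135, Thick = 194),
             Honda = c(Lite = 283, Med = 207, Thick = 117))
n <- sum(obs)

# Expected cell frequency under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson chi-square, degrees of freedom, p-value, and Cramer's V
chisq     <- sum((obs - expected)^2 / expected)
df        <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_val     <- pchisq(chisq, df = df, lower.tail = FALSE)
cramers_v <- sqrt(chisq / (n * (min(dim(obs)) - 1)))

round(expected, 3)
round(c(chisq = chisq, df = df, cramers_v = cramers_v), 3)
p_val < 0.0005   # TRUE, so the p-value prints as 0.000
```

The Lite and Thick columns contribute nearly all of the statistic, consistent with their large standardized residuals.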

To visualize the relationship of the two variables, use the same function call syntax, but now to `BarChart()` instead of `Prop_test()`. The visualization is accompanied by the same \(\chi^2\) test of independence.

```
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Chi-square Test of Independence:
## Chisq = 104.083, df = 2, p-value = 0.000
```

The visualization depicts the relationship between motorcycle and jacket: Honda riders prefer thinner jackets, and BMW riders prefer thicker jackets. To speculate, perhaps because the BMW bikes are sportier, their riders are more concerned with going down on the pavement.

This relationship becomes even clearer with the corresponding 100% stacked bar graph. Each bar in this visualization represents a jacket choice and shows the percentage of riders with each type of motorcycle for that jacket.

```
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Chi-square Test of Independence:
## Chisq = 104.083, df = 2, p-value = 0.000
##
## Cell Proportions within Each Column
## -----------------------------------
##
## Jacket
## Bike Lite Med Thick
## BMW 0.239 0.395 0.624
## Honda 0.761 0.605 0.376
## Sum 1.000 1.000 1.000
```

From this visualization we see that 24% of Lite jacket owners are BMW riders, and, in contrast, 62% of the owners of Thick jackets are BMW riders.
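The column proportions behind these percentages can be recomputed from the joint frequencies with the base R `prop.table()` function. A minimal sketch, with the counts copied from the output above and illustrative object names:

```r
obs <- rbind(BMW   = c(Lite = 89,  Med = 135, Thick = 194),
             Honda = c(Lite = 283, Med = 207, Thick = 117))

# Proportion of each Bike type within each Jacket column (margin = 2)
col_prop <- prop.table(obs, margin = 2)

round(col_prop, 3)
colSums(col_prop)   # each column sums to 1
```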