Detecting and reducing heterogeneity of error in acoustic classification

Passive acoustic monitoring can be an effective method for monitoring species, allowing the assembly of large audio datasets, removing logistical constraints in data collection and reducing anthropogenic monitoring disturbances. However, the analysis of large acoustic datasets is challenging and fully automated machine learning processes are rarely developed or implemented in ecological field studies. One of the greatest uncertainties hindering the development of these methods is spatial generalisability—can an algorithm trained on data from one place be used elsewhere? We demonstrate that heterogeneity of error across space is a problem that could go undetected using common classification accuracy metrics. Second, we develop a method to assess the extent of heterogeneity of error in a random forest classification model for six Amazonian bird species. Finally, we propose two complementary ways to reduce heterogeneity of error, by (i) accounting for it in the thresholding process and (ii) using a secondary classifier that uses contextual data. We found that using a thresholding approach that accounted for heterogeneity of precision error reduced the coefficient of variation of the precision score from a mean of 0.61 ± 0.17 (SD) to 0.41 ± 0.25 in comparison to the initial classification with threshold selection based on F‐score. The use of a secondary, contextual classification with thresholding selection accounting for heterogeneity of precision reduced it further still, to 0.16 ± 0.13, and was significantly lower than the initial classification in all but one species. Mean average precision scores increased, from 0.66 ± 0.4 for the initial classification, to 0.95 ± 0.19, a significant improvement for all species. We recommend assessing—and if necessary correcting for—heterogeneity of precision error when using automated classification on acoustic data to quantify species presence as a function of an environmental, spatial or temporal predictor variable.


| INTRODUC TI ON
Passive acoustic monitoring (PAM) is an increasingly common ecological survey tool. PAM has many advantages over traditional survey methods, facilitating sampling across larger spatiotemporal scales, and in places where access is logistically challenging, thus increasing costefficiency (Darras et al., 2018;Gibb et al., 2019). Given this capacity to record audio data autonomously, PAM can accrue very large datasets, often too large to analyse manually. Automated classification presents a solution to this challenge. A variety of approaches have been trialled including template matching, machine learning techniques such as clustering and random forests along with deep-learning algorithms such as convolutional neural networks (Priyadarshani et al., 2018).
However, outside of chiropterology, few studies have used fully automated classification to answer applied ecological questions in terrestrial landscapes, and especially for the challenge of multi-species classification across large audio datasets from tropical forests.
One of the issues facing the applied use of automated classification methods is how readily algorithms can be generalised-how well they can be applied to new data across time and space (Priyadarshani et al., 2018;Stowell, Petrusková, et al., 2019). The accuracy and generalisability of supervised machine learning techniques-those using labels generated by humans as training data-are heavily dependent on the nature of the labelled training data used (Knight et al., 2019(Knight et al., , 2020Towsey et al., 2012). Achieving high classification performance generalised across a range of conditions requires training data that is representative of the variation in the data on which the classifier is to be used, while also achieving a balance between classes (Towsey et al., 2012;Zhong et al., 2020). Classification performance is impacted by variation in the underlying audio data which may be intrinsic or extrinsic to the target species. Intrinsic sources include variation in an individual animal's vocalisations-between individuals, populations, geographically and temporally. For classifiers to perform well considering these forms of intrinsic variation, a representative sample of vocal variation for each species as training data is needed. Although potentially challenging for many species, obtaining such representative data is facilitated when ecological knowledge can be used to anticipate when and where training data can be obtained.
Sources of potential extrinsic variation include; other sounds that overlap with target signals in time and/or frequency such as vocalisations from other species, the prevalence of which depends on variation in co-occurrence and abundance of species with similar vocalisations (Tobias et al., 2014), plus other sources of biophony, geophony or anthropophony; and environmental factors that can impact sound propagation, such as weather conditions and the density of surrounding vegetation (Yip et al., 2017). There are also potential sources of audio variation that fit between the two categories-for example when a source extrinsic to the target animal causes an intrinsic change to the vocalisation of the target species.
Responses to predators or competitors, such as duetting or lekking birds can fit in this category (Mennill & Vehrencamp, 2008), as do broadband sources of noise such as cicadas or vehicle engines that may cause a change in the frequency at which calls are made.
In comparison to intrinsic variation, these sources may be more challenging to represent well in a training dataset, as they are likely to be both more variable and less predictable. In particular, the use of online libraries such as Macaulay Library (https://www. macau layli brary.org), xeno-canto (https://www.xeno-canto.org) or AmphibiaWeb (https://amphi biaweb.org) could cause training datasets to be less representative, as recordings made with directional microphones of single species have high signal-to-noise ratios (Priyadarshani et al., 2018, Towsey et al., 2012, and are unrepresentative of external sources of acoustic variation. Errors associated with variation in noise could be resolved by applying noise reduction techniques (e.g. Priyadarshani et al., 2020).
While this approach is undoubtedly effective in some or even many, cases, it is a difficult approach when dealing with a multi-species classifier. Here, one target species call is 'signal' when considering its own classification, but can also be considered as 'noise', and impact classification accuracy, for all the other species included in the classifier. Therefore, even removing all of the sound that could be considered 'noise' in all circumstances still leaves a considerable amount of variable sonotypes in the training data. Furthermore, noise reduction is not a universally agreed approach to improve classification accuracy, for instance the Open SOundScape package (Lapp et al., 2022) offers many approaches to augment and increase noise in training data-for instance overlaying training samples on top of each other.
Intrinsic and extrinsic variation in audio data make obtaining truly representative datasets extremely difficult if the classifier is intended to operate across large spatial extents, long periods of time or across heterogeneous habitats . In these cases, providing representative data labels at local scales would require huge increases in labelling effort (LeBien et al., 2020). The inevitable 4. We recommend assessing-and if necessary correcting for-heterogeneity of precision error when using automated classification on acoustic data to quantify species presence as a function of an environmental, spatial or temporal predictor variable.

K E Y W O R D S
automated signal recognition, autonomous recording unit, bioacoustics, ecoacoustics, machine-learning shortfalls in obtaining representative datasets mean that classification accuracies obtained on test datasets may not translate to field conditions (Stowell, Wood, et al., 2019)-a common problem in supervised learning fields termed covariate shift (Shimodaira, 2000).
Many classification algorithms may exhibit biases that will lead to heterogeneous error structure when exposed to these variations.
Heterogeneity of error could be especially problematic if the covariate responsible for shifting classification accuracy is the same as that being studied for ecological purposes. This may occur, for example, when error varies; spatially in a space-for-time swap experimental design, temporally in a phenological study or at ecotones where replacement species with similar vocalisations may overlap when examining habitat preference.
Automated classification models are typically assessed by deriving a range of accuracy metrics from a manually labelled test dataset that has been randomly subsampled from the training data, or independently subsampled from the dataset on which the algorithm is to be applied (Knight et al., 2017;Priyadarshani et al., 2018). Following Knight et al. (2017), precision, recall, F-score and area under the curve (AUC) have been widely adopted to determine classification algorithm performance (Table 1). However, these methods fail to explicitly test the generalisability of the algorithm across the gradient of a shifting covariate. Consequently, using only these metrics to assess classification performance risks masking high variability in false-positive error and subsequent confounding of results if error covaries with a variable of ecological interest.
Although both heterogeneity in false-positive and false-negative errors can be detrimental, we focus here on false-positive errors, and consequently the precision metric, in keeping with other studies highlighting precision as important for acoustic surveys (e.g. Juodakis et al., 2021). Unlike false negatives, false positives cannot be mitigated by summarising presence across longer time periods or spatial extents (Metcalf et al., 2019), violate the assumptions of many standard methods for modelling detection probability and can lead to poor model inference (Royle & Link, 2006). While research has been conducted into the overall reduction in false-positive error in ecoacoustic datasets (e.g. Balantic & Donovan, 2020;Clare et al., 2021;Knight et al., 2020), and in reducing the variation in error when analysing the ecological variables through the use of occupancy models (e.g. Chambert et al., 2018;Rempel et al., 2019), there has been limited research into methods to reduce variability of falsepositive error at the classification stage.
We present a case study using automated classification of an Amazonian PAM dataset to highlight the challenges in detecting heterogeneity of precision error. First, by incorporating a measure of heterogeneity into the metric used to set a threshold for classification confidence score, so that thresholding goes beyond a twoway trade-off between precision and recall, and instead becomes a three-way trade-off between precision, recall and heterogeneity of (precision) error. Second, by incorporating a secondary classification that accounts for the context in which the classification is made, through the use of neighbouring classification scores for the target and other species, environmental and temporal data.

| Data collection
We used PAM to collect data from 29 survey points from a 10,000 km 2 study area in eastern Amazonia encompassing parts of the municipalities of Santarém, Belterra and Mojuí dos Campos (≈3°2′45.6″S, 54°56′49.2″W) in the Brazilian state of Pará. Survey points located in upland terra firme rainforest had a minimum separation of 2 km to minimise spatial dependence. The experimental design was intended to investigate differences in socioecological responses to forest disturbance (see Metcalf et al., 2021 for full details), and the survey points are thus distributed across an anthropogenic disturbance gradient. Consequently, we address spatial heterogeneity of error in this study across the 29 points.

Recall
The percentage of all possible positive events that are correctly predicted TP (TP + FN)

Precision-Recall
Area-Underthe-Curve (PR-AUC) Precision-Recall curves summarise the trade-off between precision and recall at different classification confidence score thresholds. The area-under-the-curve is therefore a good metric for comparing classification accuracy independent of threshold selection F-Score The harmonic mean of precision and recall = 1 The harmonic mean of precision and recall, weighting precision as twice as important as recall We collected two sets of acoustic data. The first was used exclusively for the purpose of training the classification algorithm (hereafter 'Acoustic Dataset 1') between 20 November and 30 December 2017 from two of our survey points, each with three recorders 150 m apart. The second set of acoustic data (hereafter 'Study Dataset') was collected from a single recording location at each of the 29 survey points between 12 June and 16 August 2018. Recordings for the Study Dataset were made over 1-2 recording periods at each point, with recording period varying in length between 3 and 22 days. This gave as optimal a coverage of nocturnal species as logistical limitations would allow (nocturnal species vocalisation rate may be impacted by the lunar cycle). A minimum of 13 days were surveyed at each location. Both datasets were collected using Frontier Labs Bioacoustic Recorders (Frontier Labs, 2015). Recordings were continuous across the diel cycle, and were filtered afterwards to only include astronomical night, measured using the Suncalc package in R (Thieurmel & Elmarhraoui, 2019) using the location of our field station (coordinates given above). Full details of recording periods, equipment and protocols for each location are given in SOM Appendix 1. This work was conducted under SISBIO permit number 53271-13.

| Automated classification and verification datasets
Tadarida is an open-source toolbox for detecting, labelling and classifying sounds  and shown to be effective at classifying various European species of insects and mammals (Barré et al., 2019;Newson et al., 2017). We used Tadarida to build a classifier in R (R Core Team, 2020) for seven nocturnal bird species-four owl species; Southern Tawny-bellied Screech-owl Megascops usta, Crested Owl Lophostrix cristata, Spectacled Owl Pulsatrix perspicillata, Amazonian Pygmy-owl Glaucidium hardyi and three nightjar species; Ocellated Poorwill Nyctiphrynus ocellatus, Silky-tailed Nightjar Antrostomus sericocaudatus and Common Pauraque Nyctidromus albicollis. Tadarida first identifies sound events using a hysteresis function; the sound event starts when a high amplitude threshold is passed and ends when the signal-to-noise ratio drops below a second lower threshold. The program extracts 269 acoustic features (e.g. minimum and maximum frequency, peak frequency, duration) from each detected sound event and facilitates feature labelling for use as training data in a random forest classifier (see Bas et al., 2017 for full details). As multiple detected sound events may be identified from a single animal vocalisation, Tadarida uses simple rules to group events and makes classification predictions. Consequently, Tadarida works best over short-duration sound files, so we split all the recordings into 15 s files for all further processes. We limited all detections to those with the point of highest amplitude between 0.2 and 4.2 kHz which includes most terrestrial nocturnal vertebrates in the region.
To create training data for the classification algorithm, we undertook manual labelling of sound events detected by Tadarida (this labelled dataset hereafter referred to as 'Training Dataset 1') ( Figure 1). Tadarida classifies every detected sound event, potentially comprising tens of millions of sound events of which only a fraction are made by target species. Consequently, we chose to label additional classes beyond those of the target species so that common non-target sounds would be classified into those groups. We were unconcerned about classification accuracy for these non-target classes. During the labelling process, in addition to vocalisations of the seven target species, we identified 293 potential non-target classes by grouping similar sounds together, which included a range of biophony, geophony and rarely anthropophony. These sound types were simplified to a final set of sonotypes, either by merger or removal to give a final set of 59 sound types, including the seven classes for target species, as the classes the Tadarida algorithm classified detected sound events into. We identified each sonotype to species level where appropriate and possible. Where identification was not apparent, online resources such as the Macaulay Library, xeno-canto and AmphibiaWeb were consulted, and some call types were shared with relevant regional experts. If identification was still not possible, the sound type was left unidentified.
To obtain training data, we systematically searched for discrete sound types in our recording datasets. First, we labelled data from a subset of Acoustic Dataset 1. This subset consisted of 3 hr of recording per night-1 hr up to 30 min before sunrise, 1 hr commencing 30 min after sunset and 00:00-01:00, every 3rd night from each of the three recording units deployed, totalling 96 hr of data. While each sound file was searched systematically, training data were added based on the labeller's discretion so that not all calls in an extended vocalisation bout were necessarily included, especially for common sound types. As this data only came from three survey points, we additionally labelled data from eight other survey points in the Study Dataset, to increase spatial coverage and representation of forest disturbance, which can impact species abundance and composition. As the systematic search method adopted for Acoustic Dataset 1 was extremely time-consuming, we adopted a more targeted approach to labelling the additional data from the Study Dataset, choosing data from periods of time and places which we knew were most likely to contain vocalisations of species that were known to be present in the region (Lees et al., 2013), but were thus far under-represented in Training Set 1.
Finally, we supplemented labels generated from our own audio data using recordings from online archives (Macaulay Library, xenocanto), also split into 15 s .wav files. We augmented all recordings by adding noise at six amplitude levels by combining each labelled file with three files identified in our own recordings as containing only heavy rainfall, and manipulating each by increasing the amplitude, giving six 'rain' files. Each rain file was then combined with each labelled file, increasing six-fold the number of labels in the training dataset. For full details of the Tadarida labelling and data augmentation process, see SOM Appendix 2.

| Assessing classifier performance
To assess classifier performance, we followed Knight et al. (2017) in using precision, recall, F-score and the area under the precision/ recall curve (hereafter 'PR-AUC'). We calculated these scores from two test subsets of the Study Dataset by comparing Tadarida predictions against manual assessment of the same audio files. We detect the presence or absence of a species at 15 s resolution.
Tadarida gives a confidence score for each class included in the classification algorithm (n = 59) for each group of detected sound events (Tadarida made classification predictions on n = 28,030,710 grouped sound events in the Study Dataset), and we considered the species with the highest score in each grouped sound event to be the species predicted as present by Tadarida. We then summarised the predictions to the 15-s file level, taking the maximum score for each predicted species per file so that a species only had one score per file. All predictions for non-target species were discounted. Manual assessment of target species present within 15 s files was conducted using Raven Pro (Center for Conservation Bioacoustics, 2019).
When creating a test dataset for manual labelling, it is vital to use a sample large enough to be representative of the original dataset to accurately assess classification performance (Knight et al., 2017).
However, test sets of 10% or even 1% may not be practical for very large datasets, as the subsample would take too long for manual assessment, as is the case here (n = 1,081,780 fifteen second files in the Study Dataset). Instead, we made a subjective decision on test set sample size based on the trade-off between manual labelling effort and representativeness. Consequently, the first test set (Test Set 1) consisted of 2900 15 s files stratified such that 100 files were taken from each survey point, just under 0.3% of the total number of unique 15 s files in the study dataset.
Test Set 1 was randomly subsampled from the study dataset prior to classification following Knight et al. (2017), ensuring independence from the training dataset. A 15 s file was considered to be a true positive when Tadarida predicted a species presence within that file and the species was also detected in the file during manual assessment. A 15 s file was considered a false positive when Tadarida predicted a species presence within that file, but the species could not be detected in the file during manual assessment, and so on, respectively, for true negative and false & Donovan, 2020; Royle & Link, 2006), and as we used short files likely to be summarised over larger time-scales to avoid temporal autocorrelation in a later analysis, thus mitigating the impact of false negatives. Calculation of mean accuracy metrics was done by first calculating the scores for each target species across all sites, as opposed to at each survey site, and taking the unweighted mean from the seven values generated. Knight et al. (2017) highlight the importance of score thresholds in assessing classifier performance, so we calculated F 0.5 -score at each possible threshold between 0 and 100 with intervals of 0.001.
We used the maximum F 0.5 -score to determine the optimal threshold for each target species, and recalculated precision, recall and F 0.5 -score with that threshold applied.

| Detecting heterogeneous error
To look for heterogeneity in precision across our survey points, we calculated precision values for each target species at each survey point and computed the sample coefficient of variation (CV) using the envStatS R package (Millard, 2013). However, we were concerned that the random subsampling approach used to select Test Set 1 may result in a non-representative subsample. In particular, selecting too few true-positive data points (e.g. instances in which rarer species were present) at intervals across the gradient of the predictor variable to be sensitive to variance in error. Consequently, the second test set (Test Set 2) was subsampled from the study dataset after classification, and consisted of 50 15 s files from each survey point per target species, with sampling based on the probability distribution of the classification score of the target species, using the createDataPartition function in the caret R package (Kuhn, 2021). All files were manually assessed for the presence or the absence of the predicted target species. Optimal thresholds were determined in the same manner, and Precision, Recall and F 0.5 -Score were calculated.
However, while this approach allows a more accurate assessment of the number of true and false positives, it comes at the expense of precise estimation of true and false negatives, which are necessarily excluded from the test dataset. To compensate, the recall scores for each species generated for Test Set 2 were multiplied by the recall score for Test Set 1 prior to the application of a threshold (hereafter 'corrected recall').
To see if heterogeneity of precision error detected by Test Set 2 varied from Test Set 1, we used Levene's test with Benjamini-Hochberg correction to compare the variance in precision score of each survey point for each species.

| Reducing heterogeneity of error
We used two methods to reduce both the absolute number of false positives and for comparability across study treatmentsheterogeneity of precision error. First, we incorporated a measure of variance-CV, in the threshold selection process (hereafter 'CV-optimised threshold'). We normalised both the F 0.5 -score and CV values from each threshold interval to within the range zero to one. Instead of using the maximum possible F 0.5 -score, we included a term to favour threshold intervals with lower CVs, using the maximum of (F 0.5 -Score 2) + (1 -CV), but weighted in favour of F 0.5 score.
We recalculated accuracy metrics for Test Sets 1 and 2, comparing the CV scores for the two thresholding approaches.
Second, we built a second random forest model-a 'contextual classifier', trained on predictions from the Tadarida algorithm, time, date and acoustically derived environmental data. This secondary classification process was explicitly designed to reflect the contextual and environmental information used by experienced field observers to support identifications. This includes the environmental conditions, such as background noise levels (Simons et al., 2007), the presence of certain indicative species or groups which increases or decreases the likelihood of other species being present, and the observer's own capacity to overlook or ignore certain species (Kepler & Scott, 1981). We argue that an experienced observer uses an awareness of all these factors and adjusts identification confidence accordingly (Robinson et al., 2018). We have attempted to artificially replicate this process by providing a random forest both with the initial confidence scores made by the first classifier, and a wide array of contextual data which can be used to modify that initial prediction.
As we were primarily concerned with rectifying problems with precision, we designed the contextual classifier to operate only on those 15 s files already classified by Tadarida as having a target species present, similar to the secondary classification method used by Balantic and Donovan (2020) to reduce overall false-positive rates for template-matching. We took a random, stratified sample of files (n = 2,900, henceforth 'Training Set 2') in which Tadarida had classified the target species as present. We stratified the sample, taking 100 sound files from each location, further stratified into uneven quintiles of confidence score: 0-0.29, 0.3-0.49, 0.5-0.69, 0.7-0.84 and 0.85-1. These ranges were chosen to include a full range of confidence scores, while taking most samples from scores that were most likely to have a mix of true and false positives. When there were not enough samples within a quintile, which occurred mostly at high confidence ranges, additional samples were taken randomly.
We manually checked for vocalisations of the target species in each sampled file and calculated the specificity of the classifier for each species at each survey location.
We built individual contextual classifiers for each of our seven target species using the stratified sample as training data. From each manually checked 15 s file, we calculated a series of variables to be used to train a new random forest. This included environmental data about each 15 s file; time, date, root mean square of the sound envelope calculated utilising the Seewave package (Sueur et al., 2008) as a measure of background noise levels and the 'rainQ2' and 'rain_min' prediction of rainfall from the hardrain package (Metcalf et al., 2020). We also used Tadarida confidence scores for each 15 s file as predictors. These included the maximum Tadarida confidence score of the target species, and for every class in the Tadarida classifier (n = 59), the minimum, maximum, mean, 90th and 95th quantiles of the confidence scores. We also included the summed confidence score of each class per 15 s file, the ratio of classified sound events to the target species, and the three species most commonly detected in the file. In addition, we calculated the same confidence score variables across both 10 min and 1-hr periods, with the time centred around the file being classified. For the latter, we also calculated the 98th percentile of the classifier score for each class. This gave us a feature set of 716 predictors for each target species.
We used this feature set to build a distributed random forest

| Classification performance
In general, the Tadarida classifier performed poorly prior to thresholding (Table 2). For most species, precision was low (0.27 ± 0.16 [Mean ± SD])-as expected in a process that is classifying every sound event. Recall fared better (0.65 ± 0.2), with a minimum 0.42 for P. perspicillata. PR-AUC scores should better account for a large number of false positives at low threshold scores, but these were also low (0.54 ± 0.23). We found the classifier performed well on two species, L. cristata and A. sericocaudatus (PR-AUC scores = 0.78 and 0.84, respectively), while P. perspicillata had a PR-AUC of just 0.17. However, PR-AUC scores weight precision and recall equally, and as here we prioritised precision over recall, it can still be possible to find thresholds that allow for high precision at the expense of recall. We found a dramatic increase in the precision and the F 0.5 -score of the classifier once an F 0.5 -score based threshold has been applied (Figure 2a in blue). Precision increases from 0.27 ± 0.16 to 0.83 ± 0.13, while recall decreases to a mean of 0.38 ± 0.16. F 0.5 -Score, reflecting the weighting of precision over recall, also increases substantially to 0.64 ± 0.17 from 0.54 ± 0.23.
However, some of the recall scores are particularly low-for example, just 0.12 for P. perspicillata.

| Detecting heterogeneity of error
We found considerable heterogeneity in precision when applying optimised F 0.5 -score thresholds to classification metrics derived from both test sets. Precision ranged from zero to one between survey sites for every species except G. hardyi with Test Set 2, which had a maximum precision of 0.95 (Figure 2b). Both test sets had high coefficients of variation in precision for many species, 0.57 ± 0.42 for Test Set 1 and 0.61 ± 0.17 for Test Set 2 (Figure 2c). High precision scores still masked much variation, for instance in the case of A. sericocaudatus which has precision scores of 0.96 and 0.89 for Test Sets 1 and 2, respectively, but precision CV of 0.43 and 0.50, respectively.
Levene's tests revealed a significantly higher estimate of precision variance with Test Set 2 compared to Test Set 1 for P. perspicillata (F = 12.54, p = 0.006). No significant difference was found between the precision variance in the other six target species, although Test Set 2 showed higher standard deviation of precision for four of the remaining six species, the exceptions being M. usta and G. hardyi (Figure 2b). Additionally, precision CV was higher with Test Set 2 than Test Set 1 for five species, the exceptions being M. usta and N. albicollis (Figure 2c). In addition, Test Set 1 produced the highest and lowest CV estimates, 0.28 for P. perspicillata and 1.50 for N. albicollis, perhaps indicative of low sample sizes causing relatively extreme estimations. A comparison of the other estimated accuracy metrics from Test Set 1 and Test Set 2 can be found in Figure 2a.
Results hereafter refer to metrics derived from Test Set 2. TA B L E 2 Tadarida classification accuracy metrics without thresholds. F 0.5 -score is weighted twice as heavily in favour of precision than recall. These accuracy metrics are based on a randomly sampled test set of n = 2,900 15 s files stratified to sample 100 files per survey point

| Reducing heterogeneity of error
The use of CV-optimised thresholds resulted in the selection of higher thresholds by a mean of 0.08 ± 0.05, with no thresholds decreasing, the threshold for N. albicollis stayed the same and a maxi- Additionally, we found significant improvements in corrected recall and F 0.5 -score with the contextual classifier compared to the Tadarida classifier with CV-optimised thresholds ( Figure 5). F 0.5score also significantly improved, from 0.59 ± 0.11 to 0.75 ± 0.09 F I G U R E 2 (a) A comparison of classification accuracy metrics for the Tadarida classifier, derived from Test Set 1 (randomly sampled prior to classification) and Test Set 2 (sampled post-classification based on the probability distribution of positive classifications), respectively, with F 0.5 -score optimised thresholds applied. Recall scores for Test Set 2 were multiplied by the recall score for Test Set 1 prior to the application of a threshold to compensate for Test Set 2 being sampled from Tadarida

| DISCUSS ION
Heterogeneity of error is rarely tested for in ecoacoustic studies using automated classification (Wright et al., 2020) and could potentially confound ecological interpretation. We found strong evidence that classification of sound events with a random forestbased Tadarida algorithm exhibited strong indications of heterogeneous false-positive error. Six of seven target species had precision ranging from zero to one across 29 survey sites, and mean CV across all species was greater than 0.5 regardless of the test set used to estimate accuracy metrics. This highlights the need for accuracy metrics that better reflect the performance of machine learning classification under field conditions, and that do not rely solely on those metrics optimised for machine learning use in 'laboratory conditions' (Wearn et al., 2019). In particular, we emphasise the need to include metrics to detect error variance across prediction variables, here successfully undertaken using precision CV.
Additionally, we found that using post-classification sampling of files from only positive classifications can provide a more reliable estimate of precision error heterogeneity, while providing better estimations of standard accuracy metrics. In comparison, standard approaches to drawing test sets (Knight et al., 2017)  In other circumstances, ensuring a large enough pre-classification sample to obtain a good number of positive predictions may suffice.

F I G U R E 3
Threshold selection approaches with (red circles) and without (black circles) including the coefficient of variation (CV) of precision. Scores shown within each plot are precision (PREC) and recall (REC) for the CV-optimised threshold selection based on Test Set 2. Nyctidromus albicollis is not shown, but both threshold approaches selected the same threshold score giving a precision score of 0.7 and recall of 0.25.
We also demonstrate two methods of reducing precision error heterogeneity. First, by incorporating a measure of heterogeneity of error in the thresholding process. This is straightforward to implement and resulted in substantial reductions of CV by 0.2 ± 0.16 per species. However, Levene's tests only found A. sericocaudatus to have a significantly lower precision variance than scores derived with a threshold optimised for F 0.5 -score alone. In addition, this method reduces error heterogeneity by increasing the confidence F I G U R E 4 A comparison of the variance in precision at 29 survey points by classification method and threshold optimisation approach. Blue brackets and stars show the significant results of pairwise Levene's tests with Benjamini-Hochberg correction on the homogeneity of variance. Black brackets and stars show the significant results of pairwise Wilcoxon signed-rank tests with Benjamini-Hochberg correction on precision values. *p < 0.05 and >0.01, **p < 0.01 and >0.001, ***p < 0.001 and >0.0001, ****p < 0.0001. Yellow boxplots show results for the contextual classifier with a CV-optimised threshold, green for the Tadarida classifier alone with a CV-optimised threshold and blue results for the Tadarida classifier with an F 0.5 -score optimised threshold.

F I G U R E 5
A comparison of accuracy metrics for Tadarida classification with CV-optimised thresholds applied, and contextual classification with CVoptimised thresholds applied. F 0.5 -score is weighted twice as heavily in favour of precision and recall is corrected based on the recall values from Test Set 1.
score threshold, thus also requiring a greater reduction in recall. This suggests that incorporating heterogeneity of error in the thresholding process should only be used on its own for classifiers that already have high traditional performance metrics, in particular a high F-score and PR-AUC, requiring a limited reduction in precision error heterogeneity. This is further emphasised by the incorporation of CV into threshold selection not resulting in an increase in threshold for N. albicollis despite this species showing the highest precision error heterogeneity, because the classifier performed poorly enough in general that it did not justify the decrease in F 0.5 -scores.
The second method to reduce precision error heterogeneity was a secondary contextual classifier. In contrast, this required considerably more effort to incorporate into a classification workflow, but substantially reduced CV of precision for all species, and significantly reduced the variance of precision scores for all species except L. cristata. In keeping with previous studies that also used a secondary classifier (Balantic & Donovan, 2020), it also improved overall classification performance, significantly improving precision for all target species and simultaneously increasing recall in comparison to only using a CV-optimised threshold. This suggests that users of all but the best performing classification models, and those with particular concerns about error heterogeneity confounding ecological results, should consider using an additional contextual classification to redress variance in precision.
Concern around precision error heterogeneity led us to train the contextual classifier on positive predictions. This is because we believe variance in precision is more likely to confound ecological findings and be harder to mitigate against in other ways, such as summarising results over longer time periods, or the use of occupancy models, than heterogeneity of recall. However, for some uses of audio classification, homogeneity of recall may be as, or more, important-and previous research has suggested that a minimum level of recall is required for studies to be reliable (Knight et al., 2020), for instance recall can impact detectability in occupancy models, which could bias occupancy estimates (MacKenzie et al., 2017). In these cases, the use of a contextual classifier that allows for a lower threshold and increased recall is clearly beneficial but may be improved further by also training it on negative predictions. However, this does entail a higher degree of effort to implement, especially in cases of low call density due to the class imbalances inherent in detecting bird calls. In such cases, there are always likely to be many more true negatives than either false negatives or true positives, so finding sufficient instances of false negatives to train a secondary classifier may prove challenging without a priori knowledge of when and where positive instances are likely to occur.
There is no reason to think heterogeneity of error is unique to this dataset, or even to random forest classifications. It is likely to occur broadly in supervised learning classification methods, with parallels to image classification in camera-trapping (Wearn et al., 2019), including convolutional neural networks currently producing the best classification accuracy. Covariate shift from training datasets can be caused by underlying ecological factors impacting the soundscape varying across a range of gradients including space, time, light and temperature (Royle, 2018;Yip et al., 2017). We therefore strongly recommend that future studies explicitly test for, and take measures to reduce variance in error.
The methods proposed here are just two possibilities, and are not necessarily optimal. Consideration should be afforded to study objectives and the ecology of the target species-for example the use of contextual classification here could bias against rarer species or events, so may only be appropriate for species with regular calling bouts, and for occupancy or abundance estimations rather than presence/absence of extremely rare species. Other approaches could instead be used to reduce heterogeneity of error, for instance with a suitably large initial training set, a naive classifier could be trained at short time-scales and then the naive classification scores used as features in a contextual classifier at longer time-scales with boot-strapping to maintain independence. This approach would remove the need to generate a second training dataset, although would probably make the collection of a suitable initial dataset more challenging. Other machine learning methods, or even ensembles, could outperform Distributed Random Forest algorithms, and other useful contextual variables could be used-for instance acoustic indices to better characterise and contextualise the soundscape. For those wishing to work within an occupancy model framework, it would be useful to undertake further research to compare the efficacy of methods to resolve precision and recall error within the occupancy model (e.g. Rempel et al., 2019) to reduce error heterogeneity during the classification process itself.
Nonetheless, by revealing for the first time the potential importance and a solution for dealing with error heterogeneity we hope to stimulate further research, and to encourage those who use machine learning classification in ecoacoustics to carefully consider the implications of classification error on ecological inference.

AUTH O R CO NTR I B UTI O N S
All authors made substantial contributions. Oliver C. Metcalf, Jos Barlow, Yves Bas and Alexander C. Lees made substantial contributions to conception and design. Oliver C. Metcalf, Christian Devenish and Yves Bas contributed to code development and testing. Oliver C. Metcalf, Erika Berenguer and Filipe França contributed to acquisition of data. All authors contributed to analysis and interpretation of data, and drafting the article. All authors had final approval of the version to be published, agree to be accountable for the aspects of the work that they conducted and ensuring that questions related to the accuracy or integrity of any part of their work are appropriately investigated and resolved.

ACK N OWLED G EM ENTS
We would like to thank the RAS field and laboratory assistants: Marcos Oliveira, Gilson Oliveira, Renílson Freitas and Josué Jesus de Oliveira for their hard work and assistance, without whom this would not be possible. We are also grateful to Joice Ferreira and

CO N FLI C T O F I NTE R E S T S
The authors have no conflicts of interest to declare.

PEER R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/2041-210X.13967.