Improving imbalance classification via ensemble learning based on two-stage learning

The excellent performance of deep neural networks on image classification tasks depends on a large-scale high-quality dataset. However, datasets collected from the real world are typically biased in their distribution, which leads to a sharp decline in model performance, mainly because an imbalanced distribution results in prior shift and covariate shift. Recent studies have typically used a two-stage learning method consisting of two rebalancing strategies to solve these problems, but some combinations of rebalancing strategies damage the representational ability of the networks. In addition, the two-stage learning method is of little help in addressing the problem of covariate shift. To solve these two issues, we first propose a sample logit-aware reweighting method called SLA, which not only repairs the weights of majority class hard samples and minority class samples but also integrates with logit adjustment to form a stable two-stage learning strategy. Second, to solve the covariate shift problem, inspired by ensemble learning, we propose a multi-domain expert specialization model, which achieves a more comprehensive decision by averaging expert classification results from multiple different domains. Finally, we combine SLA and logit adjustment into a two-stage learning method and apply our model to the CIFAR-LT and ImageNet-LT datasets. Compared with state-of-the-art methods, our experimental results show excellent performance.


Introduction
Benefiting from the development of computing resources in recent years, deep neural networks (DNNs) have been widely used in image classification (He et al., 2016), image segmentation (Zhou et al., 2019), object detection (Tian et al., 2019), etc. These successful applications usually require large-scale high-quality labeled data, such as ImageNet (Russakovsky et al., 2015) and COCO (Lin et al., 2014), in which the sample distributions of the training and test datasets are almost consistent. However, training datasets collected from the real world generally have a biased distribution, i.e., the number of samples in each class varies greatly. Models trained on biased datasets not only misidentify minority class samples as majority class samples but also confuse minority class samples with hard samples from the majority class, eventually leading to a sharp drop in network performance.
The prior shift and covariate shift resulting from an imbalanced distribution are the primary causes of the decline in network performance. Prior shift refers to the phenomenon that the label distribution of a class is inconsistent between the training dataset and the test dataset. Covariate shift mainly refers to the phenomenon that the data distribution of a class is inconsistent between the training dataset and the test dataset. These shifts make the network parameters overfit to some majority class samples, resulting in the model's overconfidence in these examples and poor performance on the test dataset. For a long time, many studies have concentrated on developing rebalancing strategies to alleviate this overfitting, such as reweighting the loss function (Ren et al., 2018; Cui et al., 2019), resampling the training samples (Pouyanfar et al., 2018; Zhou et al., 2020), and adjusting the output logits (Menon et al., 2021; Xu et al., 2021). These strategies provide some good ideas for solving the problems caused by the imbalanced distribution. However, although reweighting and resampling can address class imbalance issues to some extent, the direct application of these methods damages the deep feature representation ability of the network, making it difficult for the network parameters to reach their theoretical optimal solution (Zhou et al., 2020).
Adopting a two-stage learning strategy, which typically uses two separate rebalancing strategies in two training stages to decouple feature representation learning from classifier learning, is a common way to overcome the issues mentioned above. However, some rebalancing strategies are incompatible, e.g., using resampling in the first stage and reweighting in the second stage. Reweighting promotes classifier learning by encouraging the classifier's decision boundary to move in the direction of classifying the minority classes as correctly as possible. Resampling ensures that the label distribution of the mini-batches sampled from the training dataset is consistent with the label distribution of the test dataset. Owing to the undersampling of the majority class samples and the oversampling of the minority class samples, some samples are not involved in the training process, which has a negative impact on feature representation learning. It is difficult to use the reweighting method to optimize the classifier when the separability of the features is weak (Zhou et al., 2020). Based on the above analysis, we propose to use data augmentation instead of resampling in the first stage to maximize the representation ability of the network.
Our goal in this work is to design an efficient and useful two-stage learning method using currently available rebalancing strategies. Owing to the conflict between reweighting and resampling, we investigate the effects of combining logit adjustment and reweighting on DNNs. We find that network performance degrades when existing classic reweighting methods are combined with logit adjustment. This is because both logit adjustment and reweighting try to give minority class samples more attention while giving majority class samples less attention, which ultimately makes the performance on the majority classes deteriorate drastically. Additionally, because the confidence of majority class hard samples and that of minority class samples are extremely similar, sample confidence-based reweighting methods [such as focal loss (Lin et al., 2017)] assign weights to these samples unfairly, which increases the expected calibration error of the network (Guo et al., 2017). To this end, we propose a sample logit-aware reweighting method (called SLA) that uses the sample with the largest logit of each class as the benchmark sample to assign appropriate weights to the remaining samples (Figure 1).
Furthermore, two-stage learning methods are ineffective at dealing with the covariate shift problem, which is an unavoidable but easily neglected issue in imbalanced image classification. It is hard to ensure that the distributions of the training and test datasets are entirely consistent. When the distribution of the training dataset is imbalanced, a minority class may have dramatically different numbers of samples in the training and test datasets, which exacerbates the inconsistency between the training and test data distributions. In this situation, it is difficult to train a model with good generalizability using just a two-stage learning method. Inspired by ensemble learning, we propose a multi-domain expert specialization model to enhance the feature extraction ability for a specific data distribution. In particular, in the first training stage, three different levels of data augmentation are employed to specialize the original data distribution into three distinct data distributions. Additionally, mixup is used to blend the original smaller feature distribution space into a larger feature space, thereby enhancing the model's feature extraction ability. At the same time, the model includes a two-stage training loss strategy, which helps the classifier learn a more reliable decision boundary. Under the guidance of the two-stage learning method, our proposed model demonstrates excellent performance on existing imbalanced datasets.
In summary, our main contributions are as follows: (1) For two-stage learning methods, we show that combining existing reweighting methods with logit adjustment leads to performance degradation for the majority classes or causes significant calibration errors.
(2) We propose a new reweighting method that repairs the weights of majority class hard samples and minority class samples calculated by sample confidence-based reweighting methods without significantly reducing the majority class accuracy.
(3) We propose a new ensemble learning framework that provides three deep specialized feature extractors for three different levels of data augmentation, which significantly improves the representation ability of the network. Under the guidance of our proposed two-stage training loss strategy, it significantly increases classification accuracy and reduces the expected calibration error.

Related Work

Reweighting
The reweighting method assigns weights to each class or sample to alleviate the model performance degradation caused by imbalanced data. The weight can be determined by a weighting function that maps the loss (or gradient) of each sample to a weight. This weighting function can be estimated from prior knowledge or by a simple neural network.
FIGURE
In the process of reweighting based on probability, some hard samples from the majority class will have similar weights to the samples from the minority class. As shown in the curve on the right of the figure, our method can effectively focus on the hard samples of each class, and the loss of our proposed method rapidly decays in the low probability areas.

Initially, Huang et al. (2016, 2019) used the reciprocal of class frequency as a weighting factor applied to class loss (Wang et al., 2017). Subsequently, Lin et al. (2017) extended the class frequency from a fixed prior to an adjustable parameter version. Khan et al. (2019) further extended the weighting method from the class level to the instance level (Cui et al., 2019). Although this approach is effective, the complex parameter adjustment rules are tedious and not universal. In addition, hard samples from the majority class are frequently weighted improperly because their loss values are very similar to those of minority class samples. To solve this problem, Ren et al. (2018) and Shu et al. (2019) proposed a robust weighting function that maps samples to instance losses based on a meta-learner. However, it is difficult to estimate the parameters of the weighting network in the meta-learning method. The meta-learning method requires nested training, which is time-consuming. Also, meta-learners need a meta dataset that is close to the distribution of the test dataset (Finn et al., 2017; Shu et al., 2019; Jamal et al., 2020; Li et al., 2021). Zhang and Pfister (2021) adjusted the meta-learning process, which greatly reduced its training cost and alleviated the excessive dependence on the metadata distribution. Although meta-learning is currently the best reweighting method for specific datasets, its demanding prerequisites and high training cost precluded us from using it to search for a weighting function.
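To make the simplest scheme discussed above concrete, the following is a minimal NumPy sketch of reweighting by the reciprocal of class frequency. The helper names `inverse_frequency_weights` and `reweighted_cross_entropy`, and the choice to normalize the weights so that they average to 1, are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1 / class frequency (hypothetical helper)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against empty classes
    weights = 1.0 / counts
    # Normalize so that the weights average to 1 over classes (illustrative choice).
    return weights * num_classes / weights.sum()

def reweighted_cross_entropy(logits, labels, class_weights):
    """Mean softmax cross-entropy with each sample scaled by its class weight."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(class_weights[labels] * per_sample))
```

With 90 samples of one class and 10 of another, the minority class receives nine times the weight of the majority class, which is exactly the behavior that later sections find to conflict with logit adjustment.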

Logit adjustment
The idea of logit adjustment was expressed earlier as margin loss. The essence of margin loss is to apply a margin to the logits of a specific class to obtain a greater classification interval (Liu et al., 2016, 2017; Wang et al., 2018). To address the imbalanced image classification task, LDAM (Cao et al., 2019), EQL (Tan et al., 2020), and BALMS (Ren et al., 2020) suggest that minority classes need a large margin while majority classes need a small margin, and the margin is determined by an optimal trade-off boundary (Cao et al., 2019) or by using a meta-learner (Jamal et al., 2020; Ren et al., 2020). Menon et al. (2021) summarized the previous margin-based methods and proposed the concept of logit adjustment. To find a suitable logit adjustment method more effectively and quickly, adding the label distribution as prior information to the logits has become a stable improvement method (Hong et al., 2021; Menon et al., 2021; Xu et al., 2021; Aimar et al., 2022).

Two-stage learning
The two-stage training method usually defers the use of a rebalancing strategy, such as reweighting or resampling, to the second stage (Hong et al., 2021). By using a smaller learning rate, the classifier can obtain a better decision boundary on the features extracted by the feature extractor. Although the two-stage learning method can achieve decoupled training and improve the generalization performance of the model, combining two conflicting rebalancing strategies leads to a decrease in model performance (Zhou et al., 2020). Therefore, it is important to carefully select and evaluate rebalancing strategies to ensure that they are compatible with each other and lead to improved overall performance. In this study, we found that the combination of logit adjustment and existing reweighting methods causes conflicts, making it difficult for the model to converge to the optimal solution. Based on this finding, we propose a new reweighting method to address this issue.

Analysis
For a multi-class classification task, we assume a dataset with $N$ samples, in which $X = \{x_1, x_2, ..., x_N\}$ denotes the samples and $Y = \{y_1, y_2, ..., y_N\}$ denotes the labels. The dataset can be defined as $D = \{(x_i, y_i), 1 \le i \le N\}$, where $x_i$ denotes the $i$th sample and $y_i \in \{0, 1\}^c$ is a $c$-dimensional vector. Our goal is to train a network that minimizes the misclassification error, i.e., $\min \sum_{i=1}^{N} P(y_i \neq \arg\max p_{y_i}(x_i))$, where $p_{y_i}(x_i)$ represents the probability of $x_i$ belonging to class $y_i$. In general, we use the softmax cross-entropy (CE) to represent this error,

$$\mathcal{L}_{\mathrm{CE}}(x_i, y_i) = -\log \frac{e^{f_{y_i}(x_i)}}{\sum_{j} e^{f_{y_j}(x_i)}},$$

where $f_{y_i}(x_i)$ and $f_{y_j}(x_i)$ represent the output logits of $x_i$ for classes $y_i$ and $y_j$. For the class imbalance problem, the direct use of the CE loss function may bias training toward the majority classes and neglect the learning of minority classes, resulting in some minority class samples being mistakenly classified as majority classes during the testing phase. To address this issue, most reweighting methods apply a learnable or pre-designed weighting factor $w$ to modulate the CE loss function, which increases the contribution of minority classes to the average loss and makes network learning more focused on minority classes. The reweighting loss function can be expressed by the following equation,

$$\mathcal{L}_{\mathrm{RW}}(x_i, y_i) = -w \log \frac{e^{f_{y_i}(x_i)}}{\sum_{j} e^{f_{y_j}(x_i)}}.$$

However, it is challenging to derive an explicit reweighting function without prior knowledge. In most reweighting methods, the weighting factor is naturally defined as a small weight for the majority class and a large weight for the minority class. Although this viewpoint is empirically correct, it does not consider the imbalanced distribution within a class; the samples of the same class can also be divided into common samples and rare samples.

We used the mixup α = 0.4 on the CIFAR-100-LT dataset (ρ = 100).

Compensation training classifier
From the perspective of data distribution, we can rapidly identify why a model trained on the training dataset often performs poorly in the test phase in imbalanced image classification tasks. The training and test objectives can be expressed by the following probability, where $s$ represents the source domain (training dataset) and $t$ represents the target domain (test dataset). According to Equations (3) and (4), we can further express it in a form that measures the difference between the training and test objectives (Jamal et al., 2020). Covariate shift is a common issue in deep learning tasks that refers to the situation in which the input data or feature distribution differs between the training dataset and the test dataset, leading to poor generalization performance of the trained model on the test dataset. The network will inevitably suffer from this damage during training. For the imbalanced image classification task, this damage becomes more serious (Jamal et al., 2020). Prior shift refers to a common problem that arises when the label distribution differs between the training and test datasets. Specifically, it is caused by the difference in the distribution of the number of samples per class between the training and test datasets (Menon et al., 2021). This makes the algorithm learn a biased representation, resulting in decreased performance in the test phase. Owing to the difficulty in estimating covariate shift, we discuss strategies for mitigating this problem in Section 4.2 and temporarily ignore its impact here. In previous training processes, the softmax classifier was typically used for both training and testing. However, as indicated by Equation (6), two shifts between the training and test objectives exist. To address these problems, we can adjust the training loss as follows: where $\mu_i = P_{train}(y_i) / P_{test}(y_i)$; $\mu$ is a factor that measures the label distribution difference between the training and test datasets. Furthermore, Equation (7) can be expressed as follows: If $y_i$ represents a majority class and $\mu_j < \mu_i$, the loss value calculated based on Equation (8) decreases compared with CE. This makes the network tend to learn from minority classes during parameter updates and reduces the attention paid to majority classes, thereby improving the performance of the network. For convenience, we use logit adjustment (LA) to denote the above training loss.
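As an illustration of this compensation, the following NumPy sketch implements one common form of logit-adjusted cross-entropy, in which $\log \mu_j$ is added to the logit of every class $j$ before the softmax (in the spirit of Menon et al., 2021). This specific form, the function name `logit_adjusted_ce`, and the default of a uniform test prior are assumptions made for the sketch rather than a transcription of the loss used in this study.

```python
import numpy as np

def logit_adjusted_ce(logits, labels, train_prior, test_prior=None):
    """Per-sample CE on logits shifted by log(mu), mu_j = P_train(y_j) / P_test(y_j)."""
    train_prior = np.asarray(train_prior, dtype=float)
    if test_prior is None:
        # Assumption: a balanced test set, i.e. a uniform test prior.
        test_prior = np.full_like(train_prior, 1.0 / len(train_prior))
    mu = train_prior / np.asarray(test_prior, dtype=float)
    adjusted = logits + np.log(mu)  # compensate every class logit
    shifted = adjusted - adjusted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]
```

Under this form, a majority class sample with equal logits incurs a smaller loss than under plain CE and a minority class sample a larger one, which matches the qualitative behavior described above.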

Mixed reweighting and LA
Compensating the output logit can effectively alleviate the learning bias caused by imbalanced data distribution. To further improve the effectiveness of boundary correction, we combine reweighting with LA into a new paradigm and explore effective combination strategies. Specifically, we conduct experiments using ResNet-32 trained on the CIFAR-100-LT dataset with different combinations of reweighting and LA. The reweighting methods, which include reweight (RW) (Wang et al., 2017), class-balanced loss (CB) (Cui et al., 2019), and focal loss (FL) (Lin et al., 2017), were introduced in the 180th epoch (out of a total of 200 epochs) for ResNet-32.
Table 1 presents the results obtained from the aforementioned settings. We can infer that (1) the combination of existing reweighting methods and LA leads to a decline in overall accuracy, especially in the majority classes. This indicates that there is a conflict between the existing reweighting methods and LA: providing both large margins and large weights for the minority classes is redundant, which ultimately leads to a significant decline in the performance of the majority classes. (2) Although focal loss can maintain the accuracy of the majority classes to a certain extent, its expected calibration error is still large. This is because focal loss assigns similar weights to the hard samples from majority classes and the samples from minority classes.
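The second observation follows directly from the form of focal loss (Lin et al., 2017), whose modulating factor $(1 - p)^{\gamma}$ depends only on the predicted probability $p$ of the true class and not on the class itself. A minimal sketch, with `focal_weight` and `focal_loss` as illustrative helper names:

```python
import numpy as np

def focal_weight(p_true, gamma=2.0):
    """Focal modulating factor (1 - p)^gamma for the true-class probability p."""
    return (1.0 - np.asarray(p_true, dtype=float)) ** gamma

def focal_loss(p_true, gamma=2.0):
    """Focal loss -(1 - p)^gamma * log(p), evaluated per sample."""
    p = np.asarray(p_true, dtype=float)
    return -focal_weight(p, gamma) * np.log(p)

# A hard majority class sample and a minority class sample that are both
# predicted with p = 0.3 receive exactly the same weight.
w_majority_hard = focal_weight(0.3)
w_minority = focal_weight(0.3)
```

Because the weight is a function of confidence alone, a purely confidence-based scheme cannot tell these two kinds of samples apart.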

Method
Sample logit-aware reweighting
The purpose of the two-stage training method is to obtain a powerful feature extractor and classifier in the first stage and to reduce the difference between the sample confidence and the overall class confidence in the second stage. From the perspective of sample confidence, assigning higher weights to samples with low confidence is an effective solution. However, the confidence levels of hard samples in the majority classes are often indistinguishable from those of samples in the minority classes. To overcome this issue, we propose a sample logit-aware reweighting method (called SLA in this study) that reduces the gap between the single sample confidence and the overall class average confidence without significantly sacrificing accuracy. The sample confidence can be calculated as follows: where $p_i$ represents the predicted probability that sample $x_i$ belongs to the correct label after adjusting the output logit. In addition, to make the weighting factor $w_i$ of the probability-based reweighting method pay more attention to hard samples, we use the sample with the maximum logit of each class to guide the learning of the remaining samples. The sample weight can be expressed as follows: where $x^*$ is the sample with the largest logit among all training samples belonging to $y_i$, and $\gamma$ is a weighted rate adjustment factor. Commonly, $f_{y_i}(x_i) = W_{y_i} z_i$, where $W$ is the weight matrix of the linear layer and $z_i$ is the feature embedding of $x_i$. To obtain more stable sample weights, we calculate the cosine value by normalizing $W_{y_i}$ and $z_i$.
Therefore, after transforming the logit into the corresponding cosine representation (Figure 2), the final SLA reweighting formula can be expressed as follows: where $\theta_{y_i}$ corresponds to the angle between $z_i$ and $W_{y_i}$, $\theta_{y^*}$ corresponds to the angle between $z^*$ and $W_{y_i}$, and $\tau$ is a hyperparameter.
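The two ingredients described above, the cosine representation of the logit and the per-class benchmark sample, can be sketched in NumPy as follows. The sketch covers only this bookkeeping and does not reproduce the SLA weight formula itself; the helper names `cosine_logits` and `per_class_max_cosine` and the small `eps` term are illustrative assumptions.

```python
import numpy as np

def cosine_logits(features, class_weights, eps=1e-12):
    """cos(theta) between each embedding z_i and each class weight vector W_c."""
    z = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    w = class_weights / (np.linalg.norm(class_weights, axis=1, keepdims=True) + eps)
    return z @ w.T  # shape (N, C)

def per_class_max_cosine(cos, labels, num_classes):
    """Largest true-class cosine per class, i.e. the value for the benchmark sample x*."""
    best = np.full(num_classes, -1.0)  # -1 is the lower bound of a cosine
    true_cos = cos[np.arange(len(labels)), labels]
    np.maximum.at(best, labels, true_cos)  # unbuffered per-class running maximum
    return best
```

Classes that do not occur in the batch keep the initial value of -1, so a caller that maintains a running list across iterations would need to merge the result with its previous state.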

Multi-domain expert specialization model
The main objective of the first stage of training in the two-stage method is to enhance the feature extraction capability of the network. However, it is challenging for a single-channel feature extractor to learn robust parameters when the data distribution is extremely imbalanced, particularly when complex data augmentation techniques are applied. To address this problem, we propose a multi-domain expert specialization model for augmenting data across multiple domains (Algorithm 1).

Algorithm 1. The training process of our proposed method. Recoverable steps: obtain $\{(\tilde{x}_i, \tilde{y}_i)\}$ using Equation (14); compute each expert loss using Equation (16); calculate the total loss using Equation (18); create a list $S$ to store $\cos\theta_{y^*}$ of each class and update it ($S_t \rightarrow S_{t+1}$); use SGD to update the network parameters $\theta$.

Multiple data augment header with mixup
Frontiers in Computational Neuroscience frontiersin.org

FIGURE
An overview of our proposed two-stage learning method is as follows: in the first stage, we employ LA to train a robust feature extractor by learning feature representations of each class on a larger feature space that is guided by the mixup technique.In the second stage, we introduce the SLA reweighting method and remove augmentation and mixup to optimize the decision boundary of the classifier.
(Figure 3), thereby alleviating the severe covariate shift caused by the imbalanced distribution.
where $T_k$ represents the result of applying the $k$-th data augmentation function (Aug) to input $x_i$. To make better use of data augmentation, we apply the mixup strategy to the augmented data during the first stage of training. By using mixup, the resulting data can be represented as if it was sampled from a new sampling space: $D_l = \{(\tilde{x}_i, \tilde{y}_i)\}, 1 \le i \le N'$. After combining two augmented samples using mixup, the newly generated sample $\{\tilde{x}, \tilde{y}\}$ can be expressed as follows: where $\epsilon \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha \in (0, 1)$, which allows for flexible adjustment of the mixing ratio during training. By introducing this sampling procedure, the model can be trained on a new sample space that comprises mixtures of the original augmented inputs, allowing it to learn more robust representations and improve its ability to generalize to new samples.
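A minimal NumPy sketch of this mixing step is given below, assuming the standard mixup formulation in which inputs and one-hot labels are combined convexly with the same coefficient, $\tilde{x} = \epsilon x_a + (1 - \epsilon) x_b$ and $\tilde{y} = \epsilon y_a + (1 - \epsilon) y_b$. Pairing each sample with a random permutation of the same batch and drawing a single $\epsilon$ per batch are common implementation choices assumed here, and `mixup_batch` is a hypothetical helper name.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix a batch of (already augmented) inputs and one-hot labels with eps ~ Beta(alpha, alpha)."""
    eps = rng.beta(alpha, alpha)      # mixing ratio in [0, 1]
    perm = rng.permutation(len(x))    # partner sample for every sample in the batch
    x_mix = eps * x + (1.0 - eps) * x[perm]
    y_mix = eps * y_onehot + (1.0 - eps) * y_onehot[perm]
    return x_mix, y_mix, eps
```

For small $\alpha$ the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals; larger $\alpha$ yields more even mixtures.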

Early shared and deep special feature extractor
During the feature extraction process in the early layers of a CNN, the network tends to learn low-level features such as points and lines. As a result, we use the same early shared feature extractor for the differently augmented data during the first stage. However, during deep feature extraction, the three levels of data augmentation differ enough to require specialized deep feature extractors. To achieve this goal, we employ three distinct deep feature extractors, with their outputs expressed as $f^k(x_i) = \psi_{\theta_k}(\phi_\theta(x_i))$, where $k \in [1, 3]$ and $f^k(x_i)$ represents the output logit after $x_i$ passes through the early shared feature extractor $\phi_\theta$ and the $k$-th deep special feature extractor $\psi_{\theta_k}$.
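The layout of one shared early extractor followed by three specialized deep extractors can be sketched with a toy NumPy model. The fully connected layers, layer sizes, and the names `MultiExpertNet` and `ensemble_probs` are stand-ins chosen for brevity (the paper uses ResNet backbones), and averaging the expert logits with equal weights before the softmax is shown only as one simple way to combine the experts.

```python
import numpy as np

class MultiExpertNet:
    """Toy stand-in: one shared early layer followed by `num_experts` deep branches."""

    def __init__(self, in_dim, hidden_dim, num_classes, num_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
        self.experts = [
            (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
             rng.normal(scale=0.1, size=(hidden_dim, num_classes)))
            for _ in range(num_experts)
        ]

    def forward(self, x):
        h = np.maximum(x @ self.shared, 0.0)  # shared early features (ReLU)
        # Each expert applies its own deep layer and classifier to the shared features.
        return [np.maximum(h @ w1, 0.0) @ w2 for w1, w2 in self.experts]

def ensemble_probs(expert_logits):
    """Average the expert logits with equal weights, then apply a softmax."""
    mean_logits = np.mean(expert_logits, axis=0)
    shifted = mean_logits - mean_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)
```

Sharing the early layer keeps the parameter count well below that of three independent networks while still giving each augmentation level its own deep representation.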

Two-stage training loss strategy
As analyzed in Section 3.1, the two-stage training method requires training a better feature extractor in the first stage. Therefore, we only compensate the classifier and do not use any reweighting method during the first stage of training. The model then uses a reweighting method in the second stage to optimize the decision boundary of the classifier and reduce the ECE.
Equation (16) represents the loss function $L_k$ for the $k$-th expert, and $w_i^k$ can be expressed in the following form: Here, $\gamma$ and $\tau$ are hyperparameters, $p_i^k$ is the probability predicted by the $k$-th expert that sample $x_i$ belongs to its true class after compensating the output logit, and $f^k_{y_i}(x_i)$ is the output logit of the $k$-th expert for class $y_i$. Thus, the final loss function can be expressed as the weighted sum of the losses obtained by the three experts. We use $\epsilon_k$ to indicate the degree of attention given to the $k$-th expert; increasing $\epsilon_k$ makes the model more inclined to learn from expert $k$. To make the results comparable with those of other ensemble learning methods and to ensure a fair comparison, we set $\epsilon_k$ to 1 in all the experiments conducted in this study. The final expression for the total loss function is represented by Equation (18).

Test time prediction
Considering that the loss function used in the training stage was the weighted sum of the individual losses from multiple experts, we employ the weighted average logit output of the three experts during the test process as our final prediction to minimize empirical risk. The probability that $x_i$ belongs to a certain class can be calculated using the following formula:

Experiments

Datasets
CIFAR-10-LT and CIFAR-100-LT
The CIFAR-10 and CIFAR-100 datasets are common image classification datasets that contain 50,000 training images and 10,000 test images with 10 or 100 classes (Krizhevsky et al., 2009). Following Cao et al. (2019), we create the long-tailed version by randomly removing training samples while keeping the distribution of the test dataset balanced. We use the imbalance ratio ρ to represent the degree of imbalance of the dataset, where ρ = N_max / N_min and N_max (N_min) is the number of samples in the most (least) frequent class. In this study, we used imbalance ratios of 10, 50, 100, and 200 to carry out experiments.
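The construction can be sketched as follows. The text above only states that training samples are removed at random, so the exponential decay of the per-class counts used here, $n_i = n_{max} \cdot \rho^{-i/(C-1)}$, is an assumption based on the profile commonly used with the protocol of Cao et al. (2019); the helper names are hypothetical.

```python
import numpy as np

def long_tailed_counts(n_max, num_classes, rho):
    """Per-class sample counts decaying exponentially from n_max down to n_max / rho."""
    return [int(n_max * (1.0 / rho) ** (i / (num_classes - 1.0)))
            for i in range(num_classes)]

def subsample_indices(labels, counts, rng):
    """Randomly keep counts[c] training samples of every class c."""
    labels = np.asarray(labels)
    keep = []
    for cls, n in enumerate(counts):
        idx = np.flatnonzero(labels == cls)
        keep.append(rng.choice(idx, size=min(n, len(idx)), replace=False))
    return np.sort(np.concatenate(keep))
```

For CIFAR-10 (5,000 training images per class) and ρ = 100, this profile keeps all 5,000 images of the most frequent class and 50 images of the least frequent class.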

ImageNet-LT
ImageNet (Russakovsky et al., 2015) is a large-scale dataset for object classification. Based on this, Liu et al. (2019) made ImageNet-LT by sampling a subset following the Pareto distribution with power value α = 0.6 from ImageNet, which contains ∼115.8K images with 1,000 classes. This choice is crucial because it controls the proportion of frequent and infrequent categories in the long-tailed distribution. In addition, the Pareto distribution has a characteristic long tail, which is desirable as it can generate more extreme long-tailed datasets that are closer to real-world scenarios. The most frequent class has 1,280 images, whereas the least frequent class has only five images, i.e., the imbalance ratio is ρ = 256.

Evaluation protocol
Expected calibration error
The purpose of model calibration is to ensure that the predictive confidence of the model for a sample is consistent with the true empirical probability of being correct. Therefore, we use the expected calibration error (ECE) to measure the calibration degree of the network. To compute the ECE, we group all $N$ predictions into $B$ interval bins of equal size. The ECE can be defined as:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|T_b|}{N} \left| \mathrm{acc}(T_b) - \mathrm{conf}(T_b) \right|,$$

where $T_b$ is the set of samples whose prediction confidence falls into Bin-$b$, $\mathrm{acc}(\cdot)$ is the accuracy of $T_b$, and $\mathrm{conf}(\cdot)$ is the average predicted confidence of $T_b$.
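The definition above translates into a few lines of NumPy. The default of 15 bins and the convention that bin $b$ covers the half-open confidence interval $((b-1)/B, b/B]$ are assumptions of this sketch; the number of bins used in the experiments is not stated here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE = sum_b |T_b| / N * |acc(T_b) - conf(T_b)| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin b (0-based) holds confidences in (b / B, (b + 1) / B]; clipping keeps 0.0 in bin 0.
    bins = np.clip(np.ceil(confidences * num_bins).astype(int) - 1, 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() equals |T_b| / N
    return float(ece)
```

Here `confidences` holds the maximum softmax probability of each prediction and `correct` marks whether the predicted class matches the label. A model that is right 80% of the time whenever it reports 0.8 confidence scores an ECE of zero.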

Implementation details
For the CIFAR-10-LT and CIFAR-100-LT datasets, we used ResNet-32 as the benchmark network. We used three different levels of data augmentation; the specific details are shown in the Appendix. Following common practice, we set the batch size to 128 and the weight decay to 5e-4. We used the SGD optimizer with an initial learning rate of 0.1. For all experiments in the main results, the hyperparameter α was set to 0.2, and τ was set to 1. For a fair comparison, we trained for 200 and 400 epochs, respectively, based on the above settings. When training for 200 epochs, the learning rate was decreased by a factor of 10 at epochs 160 and 180. When training for 400 epochs, the learning rate was decreased by a factor of 10 at epochs 320 and 360. The switch from the first stage to the second stage was set to epochs 160 and 320, respectively.
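The 200-epoch CIFAR schedule described above can be written as two small functions. This is a sketch of the stated settings only; the function names are hypothetical, and treating the milestone epoch itself as the first epoch with the reduced learning rate (and as the first epoch of the second stage) is an assumption about the boundary convention.

```python
def step_lr(epoch, base_lr=0.1, milestones=(160, 180), factor=0.1):
    """Learning rate that is multiplied by `factor` at every milestone epoch."""
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * factor ** num_decays

def training_stage(epoch, switch_epoch=160):
    """Stage 1 (LA only) before the switch epoch, stage 2 (LA with reweighting) from it on."""
    return 1 if epoch < switch_epoch else 2
```

For the 400-epoch runs, the same functions apply with `milestones=(320, 360)` and `switch_epoch=320`.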
For ImageNet-LT, we adopted ResNet-50 and ResNeXt-50 as the benchmark networks. As with CIFAR-LT, three different levels of data augmentation were employed. The batch size was set to 128 for ResNet-50 and 64 for ResNeXt-50, with a weight decay of 5e-4. We used the SGD optimizer with an initial learning rate of 0.025 and a cosine annealing learning rate schedule. For all experiments in the main results, the parameter α was set to 0.1, and τ was set to 1. Training ran for 180 epochs, during which the learning rate followed the cosine annealing schedule. The switch from the first stage to the second stage was set to epoch 160.
Main results

Result for CIFAR-LT
Table 2 presents a comparison of the results obtained by our proposed method and various other methods on CIFAR-LT. All experiments were trained for 200 epochs. First, we observed that our method outperformed existing methods across all class imbalance ratios. Specifically, our proposed method achieved improvements of 4.7, 4.3, 3.2, and 1.4% on CIFAR-10-LT, and 3.9, 4.2, 4.2, and 4.1% on CIFAR-100-LT for imbalance ratios of 200, 100, 50, and 10, respectively, when compared with the state-of-the-art method. Second, it is worth noting that our method maintained a significant performance gap compared with other methods regardless of the class imbalance ratio, which demonstrates the effectiveness of our method. Furthermore, we observed that, compared with existing multi-expert methods, the accuracy gap between our proposed method and theirs gradually decreased with a decrease in the imbalance ratio. This phenomenon can be explained by the fact that when the imbalance ratio is small, data from the minority classes already cover a large data distribution space in the training dataset, thus weakening the effect of data augmentation on alleviating the covariate shift caused by an imbalanced distribution. At the same time, we compared the ECE of different methods, and the results showed that our proposed method achieved the lowest ECE in addition to achieving considerable accuracy (Figure 4).

All experiments used ResNet-32 as the backbone and trained for 200 epochs.

FIGURE
Test accuracy (%) and ECE (%) of different methods trained for epochs on CIFAR--LT (ρ = ), including the contrastive learning method PaCo and the ensemble learning methods RIDE, SADE, and ours.
At the same time, we performed long-term training for 400 epochs on CIFAR-100-LT (ρ = 100), and the corresponding results are presented in Table 3. Compared with those in Table 2, our proposed method demonstrated continued improvement in accuracy beyond 200 epochs. This is attributed to the inclusion of multiple data augmentation headers in our network architecture, which significantly enhances the representation ability of the network's feature extractor and mitigates the representation difficulties introduced by covariate shift, leading to enhanced overall accuracy. More importantly, the performance of our proposed method on the few-shot classes is far better than that of other methods. This is because we have assigned a specialized feature extractor to each level of data augmentation, which prevents the representation coupling caused by different levels of data augmentation.

Results for ImageNet-LT
Tables 4 and 5 present the comparison between our proposed method and existing methods on the long-tailed dataset ImageNet-LT. Compared with the multi-expert models RIDE (Wang et al., 2021) and SADE (Zhang et al., 2022), our method introduces multiple data augmentation headers with mixup on top of the deep specialized feature extractor. This improves performance on minority classes because our proposed two-stage adjustment strategy effectively maintains the model's strong representation ability from the first stage to the second stage. Like methods based on contrastive learning, such as PaCo (Cui et al., 2021) and BCL (Zhu et al., 2022), our method uses various data augmentation techniques. However, our proposed multi-channel deep feature extraction strategy can learn the optimal representation for each degree of data augmentation to maximize its effectiveness. This is the main difference between our approach and the others. By exploiting the different levels of data augmentation, we achieve better performance.
To further verify the effectiveness of our proposed reweighting method, we report the test accuracy (%) and ECE (%) of combinations of logit adjustment (LA) with different reweighting methods on ImageNet-LT using ResNet-50. All experiments used the same model structure and experimental settings as our proposed multi-domain expert specialization model. Table 6 presents the results, which show that our reweighting method outperformed the other reweighting techniques on the minority classes while only slightly compromising performance on the majority classes. The results suggest that an appropriate reweighting method can alleviate the overfitting of model parameters to the majority classes caused by the long-tailed distribution. In contrast, inappropriate reweighting methods lead to biased models or significant performance decreases on the majority classes.
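The ECE values reported alongside accuracy can be estimated from the predicted confidences and the correctness of each prediction. The following is a minimal sketch of the standard equal-width binned estimator; the function name `expected_calibration_error` and the choice of 15 bins are assumptions made for illustration and are not taken from our experimental setup.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # top predicted probability per sample
    predictions = probs.argmax(axis=1)     # predicted class per sample
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

A well-calibrated model yields a value close to zero; an overconfident model, which is typical under a long-tailed distribution, yields a larger value.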

Feature distribution
To gain further insights into the effectiveness of our proposed method, we visualized the extracted features using t-SNE. As depicted in Figure 5, feature-1 and feature-2 correspond to the features obtained after dimensionality reduction. We observed that strong data augmentation could enhance feature separability, but at the expense of increasing intraclass distance. By leveraging the domain expertise of three different experts and averaging their augmented features, we were able to obtain distinctive features that preserve intraclass similarity while improving interclass discrimination. This allowed us to achieve a clear decision boundary between different classes, even when using a simple linear classifier.
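The averaging step can be illustrated with a small synthetic sketch. Here, each of three hypothetical experts observes the same class centers plus independent noise. This is an assumption made only for illustration, since real expert features are correlated and come from different augmentations. The names `average_expert_features` and `separability` are likewise illustrative.

```python
import numpy as np

def average_expert_features(expert_features):
    """Average features over the expert axis: (E, N, D) -> (N, D)."""
    return np.mean(np.asarray(expert_features, dtype=float), axis=0)

def separability(features, labels):
    """Ratio of mean interclass center distance to mean intraclass scatter."""
    classes = np.unique(labels)
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([
        np.linalg.norm(features[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    inter = dists[np.triu_indices(len(classes), k=1)].mean()
    return inter / intra

rng = np.random.default_rng(0)
n_experts, n_per_class, dim = 3, 200, 16
labels = np.repeat(np.arange(2), n_per_class)
class_centers = rng.normal(size=(2, dim))
# Synthetic assumption: every expert sees the class center plus independent noise.
expert_features = class_centers[labels][None] + rng.normal(
    scale=1.0, size=(n_experts, len(labels), dim)
)

single = separability(expert_features[0], labels)
averaged_features = average_expert_features(expert_features)
averaged = separability(averaged_features, labels)
```

Under this toy assumption, the averaged features have a higher separability ratio than those of any single expert, which mirrors the tighter clusters seen in panel (D) compared with panels (A-C).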

Ablation study
The effect of different mixup parameters α
To study the influence of the mixup parameter α on our proposed method, we conducted a thorough ablation experiment on CIFAR-LT with ρ = 100 to find the optimal parameter range. Figure 6 shows the results. We observed that (1) when α exceeds 0.4, the accuracy of the tail classes fluctuates greatly; this phenomenon is most obvious when the number of classes is small. The main reason is that, as α increases, the value of u tends to be uniformly distributed, which causes a drastic change in the mixing degree between different epochs, and this is compounded by the lack of tail class samples.
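The role of α can be made concrete with a short sketch of standard mixup, in which the mixing coefficient is drawn from Beta(α, α). We assume here that this coefficient corresponds to u in the text; the function names `mixup_batch` and `strongly_mixed_fraction` are illustrative. As α approaches 1, Beta(α, α) approaches the uniform distribution on [0, 1], so heavily mixed samples become more frequent.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Standard mixup: convex combination of a batch with a shuffled copy of itself."""
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))         # partner sample for each item
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

def strongly_mixed_fraction(alpha, rng, n_draws=20000):
    """Fraction of Beta(alpha, alpha) draws that fall in [0.25, 0.75]."""
    lam = rng.beta(alpha, alpha, size=n_draws)
    return np.mean((lam >= 0.25) & (lam <= 0.75))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 32, 32))            # toy batch of image-shaped inputs
y = np.eye(10)[rng.integers(0, 10, size=8)]    # one-hot labels for 10 classes
x_mixed, y_mixed, lam = mixup_batch(x, y, alpha=0.4, rng=rng)

frac_small = strongly_mixed_fraction(0.2, rng)  # small alpha: mostly near 0 or 1
frac_large = strongly_mixed_fraction(1.0, rng)  # alpha = 1: uniform on [0, 1]
```

With a small α most draws are close to 0 or 1, so each mixed sample stays near one of its two sources; with a larger α the mixing degree varies widely from batch to batch, which is consistent with the fluctuation observed for the tail classes.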

The effect of the hyperparameter τ
As reported in Table 7, we explored how the hyperparameter τ influences the model. When the imbalance factor is fixed, both the accuracy and the ECE decrease as τ increases. The main reason for this phenomenon is that increasing τ enhances the effect of SLA, which changes the decision boundary while reducing intraclass spacing. As the decision boundary no longer tends to reduce the overall empirical risk, some of the model's performance is lost.
Table 7 note: We chose several values from 1 to 0.5 for 1/τ to perform our ablation experiment.
Table 8 note: We conducted a thorough ablation experiment. MU, using mixup in the first stage of learning; SLA, using SLA in the second stage of learning; TL, using the two-stage learning method.

The effect of different modules
Table 8 presents the results of our ablation study on the use of mixup in the first stage (MU), reweighting in second-stage learning (SLA), and two-stage learning (TL). As expected, we observed a decrease in accuracy and an increase in ECE for all datasets as the imbalance ratio increased. Combining the MU or SLA module with TL consistently led to improved accuracy and reduced ECE. Notably, our proposed SLA method had a more positive impact on TL than MU did under multiple data augmentations, which supports its effectiveness. Additionally, when all three modules were combined, our proposed algorithm maximized the model's generalization ability while maintaining a low, though not the lowest, ECE.

Conclusion
In this study, we addressed the problem of poor model performance due to prior shift and covariate shift caused by imbalanced distribution. To investigate the impact of logit adjustment and reweighting on model performance, we employed the two-stage learning method, which is currently a popular research direction. Our analysis revealed that combining existing reweighting methods and logit adjustment not only reduces model performance but also increases ECE. Therefore, we proposed a sample logit-aware reweighting method that assigns more suitable weights to hard samples from majority classes and samples from minority classes. Additionally, to tackle the covariate shift problem, we introduced a multi-domain expert specialization model designed to enhance the feature extraction ability of the model. Through experiments conducted on various datasets, we demonstrated the effectiveness of our proposed method. Furthermore, ablation experiments reinforced our findings and emphasized that our proposed model outperforms current state-of-the-art methods. Overall, our study highlights the necessity of addressing prior and covariate shift in imbalanced datasets and provides an effective solution to improve model performance.

FIGURE (A) The hard samples from the majority class and some samples from the minority class lie very close to the decision boundary. (B) SLA allows these samples to converge toward their class centers, thereby improving accuracy and reducing ECE.
Before inputting data into the network, multiple data augmentation techniques (including mixup) should be applied to the data. The purpose of data augmentation is to expand the domain boundary of the source domain data as much as possible

Input: Training data D = {(x_i, y_i)}_{i=1}^{N}, batch size n.
Output: Optimized network parameters θ.
1: Initialize θ
2: while t <= MaxEpoch do
3:   {(x_i, y_i)}_{i=1}^{n} ← Sample a minibatch from D.

FIGURE Feature distribution of the test set on CIFAR--LT (IR = ). We show the t-SNE feature distributions for some majority and minority classes. (A-C) t-SNE on three experts performing different data augmentations. (D) t-SNE on the mean of the three experts.

TABLE Top - accuracy (%) and ECE (%) from the different combinations of logit adjustment and different reweighting methods.
TABLE Test accuracy (%) on CIFAR--LT for various methods with different imbalance ratios ρ.
All experiments used ResNet-32 as the backbone and trained for 400 epochs.
TABLE Test accuracy (%) on ImageNet-LT on ResNet- and ResNetx- for various methods.
TABLE Test accuracy (%) on ImageNet-LT on ResNetx- for various methods.

TABLE Ablation study of different imbalance ratios and τ.
TABLE Ablation study of various combinations of the modules to verify the effectiveness of different modules.