We are delighted to introduce Hard-Bench, a benchmark that carefully curates training data from existing datasets, posing a greater challenge than simple random sampling. Our benchmark draws on data from a variety of areas across NLP and CV, including eight tasks from the GLUE benchmark, CIFAR-10, CIFAR-100, and ImageNet.
In Hard-Bench (GradNorm), we create a k-shot dataset by selecting, for each label, the k data points with the highest gradient norms. Specifically, for each data point we compute the gradient vector of the predictor being trained and take its Euclidean norm. We then sort the data points within each label by gradient norm and select the top k to form the k-shot dataset. This method yields a challenging dataset that supports a rigorous evaluation of model performance.
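The selection procedure above can be sketched in a few lines. This is a minimal illustration only, assuming a logistic-regression predictor with weights `w` (for which the per-example gradient is `(p - y) * x`, so its Euclidean norm is `|p - y| * ||x||`); Hard-Bench itself computes gradient norms for whatever predictor is being trained.

```python
import numpy as np

def per_example_grad_norms(X, y, w):
    """Euclidean norm of each example's logistic-loss gradient w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    # gradient for example i is (p_i - y_i) * x_i, so its norm factorizes
    return np.abs(p - y) * np.linalg.norm(X, axis=1)

def select_k_shot_by_gradnorm(X, y, w, k):
    """Per label, keep the indices of the k examples with the largest gradient norm."""
    norms = per_example_grad_norms(X, y, w)
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        keep.extend(idx[np.argsort(-norms[idx])[:k]].tolist())  # descending, top k
    return sorted(keep)
```

For a deep network one would instead compute per-sample gradients of the full model (e.g. with a backward pass per example), but the sort-and-select logic is unchanged.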
                            
In Hard-Bench (Loss), we construct the k-shot dataset by selecting, for each label, the k data points with the highest loss values. Specifically, for each data point in the original dataset we compute the loss of the predictor being trained. We then sort the data points within each label by loss value and select the top k to form the k-shot dataset. This approach yields a challenging dataset that supports a more comprehensive evaluation of model performance.
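The loss-based variant follows the same per-label sort-and-select pattern, only scoring examples by loss rather than gradient norm. As before, this is a minimal sketch assuming a logistic-regression predictor with weights `w` and binary cross-entropy loss; any trained predictor and loss function can be substituted.

```python
import numpy as np

def per_example_losses(X, y, w):
    """Binary cross-entropy loss of a logistic predictor, per example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)         # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def select_k_shot_by_loss(X, y, w, k):
    """Per label, keep the indices of the k examples with the highest loss."""
    losses = per_example_losses(X, y, w)
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        keep.extend(idx[np.argsort(-losses[idx])[:k]].tolist())  # descending, top k
    return sorted(keep)
```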
                            
The content on this website was proofread by ChatGPT.
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RoBERTa | 57.33 | 51.01 | 66.10 | 38.42 | 48.61 | 82.55 | 56.69 | 60.36 | 54.93 | 
| 2 | Transformer | 55.98 | 51.88 | 69.15 | 35.11 | 50.59 | 68.38 | 62.41 | 54.01 | 56.34 | 
| 3 | GPT-2 | 52.90 | 51.44 | 51.93 | 35.98 | 48.62 | 65.98 | 55.40 | 57.76 | 56.06 | 
| 4 | T5 | 50.69 | 52.34 | 55.09 | 34.27 | 48.99 | 55.88 | 55.72 | 48.88 | 54.37 | 
| 5 | BERT | 47.88 | 47.94 | 45.77 | 33.96 | 46.24 | 56.08 | 52.60 | 51.12 | 49.30 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 53.70 | 51.38 | 69.11 | 34.98 | 50.57 | 65.64 | 48.17 | 53.43 | 56.34 | 
| 2 | GPT-2 | 48.69 | 49.79 | 56.18 | 31.41 | 51.01 | 50.54 | 40.33 | 54.73 | 55.49 | 
| 3 | T5 | 48.64 | 49.86 | 55.32 | 32.76 | 47.15 | 53.19 | 48.84 | 48.45 | 53.52 | 
| 4 | RoBERTa | 44.13 | 50.55 | 48.32 | 31.66 | 41.79 | 38.14 | 31.74 | 55.09 | 55.77 | 
| 5 | BERT | 41.50 | 45.64 | 40.92 | 30.55 | 40.11 | 38.24 | 35.55 | 47.44 | 53.52 | 
| Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet | 
|---|---|---|---|---|---|
| 1 | ViT-B/16 | 59.92 | 97.39 | 82.36 | - | 
| 2 | EfficientNetV2-S | 54.02 | 92.51 | 69.56 | - | 
| 3 | DenseNet-121 | 36.60 | 59.87 | 20.96 | 28.96 | 
| 4 | ResNet-18 | 28.73 | 46.87 | 15.50 | 23.81 | 
| 5 | VGG-16 | 27.28 | 55.11 | 17.22 | 9.51 | 
| 6 | FFN | 13.84 | 29.64 | 8.75 | 3.13 | 
| Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet | 
|---|---|---|---|---|---|
| 1 | ViT-B/16 | 59.24 | 96.85 | 80.87 | - | 
| 2 | EfficientNetV2-S | 50.10 | 89.88 | 60.42 | - | 
| 3 | DenseNet-121 | 26.13 | 44.81 | 11.59 | 22.00 | 
| 4 | ResNet-18 | 17.83 | 33.20 | 6.96 | 13.34 | 
| 5 | VGG-16 | 14.00 | 27.58 | 7.14 | 7.27 | 
| 6 | FFN | 7.70 | 17.26 | 3.18 | 2.66 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.48 | 52.41 | 69.17 | 35.25 | 50.68 | 68.43 | 63.45 | 56.10 | 56.34 | 
| 2 | GPT-2 | 55.19 | 53.46 | 65.77 | 34.77 | 53.34 | 65.34 | 55.04 | 57.76 | 56.06 | 
| 3 | RoBERTa | 54.24 | 51.93 | 62.84 | 36.87 | 50.08 | 60.34 | 63.02 | 53.94 | 54.93 | 
| 4 | BERT | 52.26 | 51.38 | 61.25 | 34.74 | 49.98 | 58.77 | 60.11 | 52.56 | 49.30 | 
| 5 | T5 | 51.86 | 52.25 | 57.49 | 34.27 | 49.99 | 57.79 | 57.54 | 51.19 | 54.37 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 55.35 | 51.74 | 68.78 | 35.11 | 50.54 | 66.67 | 60.54 | 53.07 | 56.34 | 
| 2 | GPT-2 | 52.95 | 51.22 | 64.60 | 33.86 | 51.80 | 64.07 | 51.27 | 51.26 | 55.49 | 
| 3 | T5 | 50.20 | 51.01 | 56.72 | 32.98 | 48.91 | 56.52 | 53.35 | 48.59 | 53.52 | 
| 4 | RoBERTa | 48.81 | 50.57 | 59.64 | 33.64 | 50.23 | 41.47 | 48.10 | 51.05 | 55.77 | 
| 5 | BERT | 47.89 | 49.33 | 57.09 | 33.13 | 47.26 | 48.24 | 47.14 | 47.44 | 53.52 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-2 | 56.65 | 57.73 | 66.06 | 34.75 | 53.70 | 67.11 | 56.03 | 60.36 | 57.46 | 
| 2 | Transformer | 56.48 | 52.50 | 69.13 | 35.45 | 51.36 | 68.43 | 63.26 | 55.38 | 56.34 | 
| 3 | RoBERTa | 54.98 | 57.36 | 66.27 | 36.55 | 50.67 | 55.34 | 61.82 | 54.95 | 56.90 | 
| 4 | BERT | 54.08 | 54.70 | 62.05 | 34.45 | 50.85 | 61.72 | 61.12 | 52.85 | 54.93 | 
| 5 | T5 | 52.57 | 54.06 | 58.39 | 33.87 | 50.74 | 58.04 | 56.21 | 52.06 | 57.18 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.58 | 52.27 | 69.15 | 35.42 | 50.78 | 68.48 | 63.18 | 55.02 | 58.31 | 
| 2 | GPT-2 | 54.10 | 51.72 | 66.04 | 34.00 | 52.86 | 63.97 | 55.00 | 52.85 | 56.34 | 
| 3 | BERT | 51.06 | 50.28 | 60.44 | 33.41 | 50.08 | 54.80 | 54.83 | 49.46 | 55.21 | 
| 4 | T5 | 50.53 | 51.06 | 57.81 | 33.13 | 49.50 | 55.78 | 54.53 | 49.17 | 53.24 | 
| 5 | RoBERTa | 50.19 | 50.09 | 59.85 | 33.81 | 49.61 | 35.39 | 62.81 | 52.78 | 57.18 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.40 | 52.89 | 68.74 | 35.45 | 50.95 | 68.38 | 63.23 | 54.95 | 56.62 | 
| 2 | GPT-2 | 55.39 | 52.52 | 66.19 | 34.40 | 53.65 | 66.23 | 54.64 | 58.84 | 56.62 | 
| 3 | RoBERTa | 55.06 | 58.12 | 65.23 | 35.48 | 50.80 | 57.60 | 63.18 | 53.14 | 56.90 | 
| 4 | BERT | 54.63 | 56.86 | 64.99 | 34.21 | 50.79 | 63.19 | 59.19 | 53.43 | 54.37 | 
| 5 | T5 | 51.86 | 51.95 | 59.54 | 33.44 | 50.85 | 59.90 | 56.61 | 49.03 | 53.52 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.44 | 52.34 | 68.88 | 35.28 | 51.22 | 68.33 | 63.19 | 54.80 | 57.46 | 
| 2 | GPT-2 | 54.35 | 51.54 | 66.56 | 33.88 | 52.49 | 63.87 | 56.84 | 52.71 | 56.90 | 
| 3 | BERT | 51.43 | 50.28 | 58.16 | 34.03 | 50.10 | 54.61 | 57.27 | 50.40 | 56.62 | 
| 4 | T5 | 50.59 | 51.19 | 56.80 | 33.39 | 49.69 | 56.32 | 56.14 | 49.10 | 52.11 | 
| 5 | RoBERTa | 48.52 | 50.25 | 49.38 | 33.38 | 49.78 | 33.33 | 63.10 | 52.56 | 56.34 |