We are delighted to introduce Hard-Bench, a benchmark that carefully curates training data from existing datasets to pose a greater challenge than simple random sampling. Hard-Bench draws on data from both NLP and CV, covering the eight GLUE tasks listed below as well as CIFAR-10, CIFAR-100, and ImageNet.
In Hard-Bench (GradNorm), we create a k-shot dataset by selecting, for each label, the k data points with the highest gradient norms. Specifically, for each data point we compute the gradient vector of the predictor being trained and take its Euclidean norm. We then sort the data points within each label by their gradient norms and select the top k to form the k-shot dataset. This method lets us construct a challenging dataset that provides a rigorous evaluation of model performance.
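As a rough illustration (not the released Hard-Bench code), the sketch below shows one way to perform this per-label top-k selection by gradient norm in PyTorch. The names `grad_norm`, `build_k_shot_gradnorm`, and the assumption that `dataset` yields `(input_tensor, integer_label)` pairs for a classification `model` are illustrative assumptions, not part of the benchmark's released tooling.

```python
# Minimal sketch: pick the k examples with the largest gradient norm per label.
import torch
import torch.nn.functional as F
from collections import defaultdict

def grad_norm(model, x, y):
    """Euclidean norm of the loss gradient for a single example."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad]
    )
    return torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()

def build_k_shot_gradnorm(model, dataset, k):
    """Return indices of the k highest-gradient-norm examples for each label."""
    per_label = defaultdict(list)
    model.eval()
    for idx, (x, y) in enumerate(dataset):
        score = grad_norm(model, x, torch.as_tensor(y))
        per_label[int(y)].append((score, idx))
    selected = []
    for label, scored in per_label.items():
        scored.sort(reverse=True)              # largest gradient norm first
        selected += [idx for _, idx in scored[:k]]
    return selected
```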
In Hard-Bench (Loss), we construct the k-shot dataset by selecting, for each label, the k data points with the highest loss values. Specifically, for each data point in the original dataset we calculate the loss of the predictor being trained. We then sort the data points within each label by their loss values and select the top k to form the k-shot dataset. Like the GradNorm variant, this approach yields a challenging dataset for evaluating model performance.
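The loss-based variant admits an analogous sketch, shown below under the same illustrative assumptions as above (classification `model`, `dataset` of `(input_tensor, integer_label)` pairs); only the per-example score changes from gradient norm to loss.

```python
# Minimal sketch: pick the k examples with the largest loss per label.
import torch
import torch.nn.functional as F
from collections import defaultdict

@torch.no_grad()
def build_k_shot_loss(model, dataset, k):
    """Return indices of the k highest-loss examples for each label."""
    per_label = defaultdict(list)
    model.eval()
    for idx, (x, y) in enumerate(dataset):
        y_t = torch.as_tensor(y).unsqueeze(0)
        loss = F.cross_entropy(model(x.unsqueeze(0)), y_t)
        per_label[int(y)].append((loss.item(), idx))
    selected = []
    for label, scored in per_label.items():
        scored.sort(reverse=True)              # largest loss first
        selected += [idx for _, idx in scored[:k]]
    return selected
```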
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | RoBERTa | 57.33 | 51.01 | 66.10 | 38.42 | 48.61 | 82.55 | 56.69 | 60.36 | 54.93 |
2 | Transformer | 55.98 | 51.88 | 69.15 | 35.11 | 50.59 | 68.38 | 62.41 | 54.01 | 56.34 |
3 | GPT-2 | 52.90 | 51.44 | 51.93 | 35.98 | 48.62 | 65.98 | 55.40 | 57.76 | 56.06 |
4 | T5 | 50.69 | 52.34 | 55.09 | 34.27 | 48.99 | 55.88 | 55.72 | 48.88 | 54.37 |
5 | BERT | 47.88 | 47.94 | 45.77 | 33.96 | 46.24 | 56.08 | 52.60 | 51.12 | 49.30 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 53.70 | 51.38 | 69.11 | 34.98 | 50.57 | 65.64 | 48.17 | 53.43 | 56.34 |
2 | GPT-2 | 48.69 | 49.79 | 56.18 | 31.41 | 51.01 | 50.54 | 40.33 | 54.73 | 55.49 |
3 | T5 | 48.64 | 49.86 | 55.32 | 32.76 | 47.15 | 53.19 | 48.84 | 48.45 | 53.52 |
4 | RoBERTa | 44.13 | 50.55 | 48.32 | 31.66 | 41.79 | 38.14 | 31.74 | 55.09 | 55.77 |
5 | BERT | 41.50 | 45.64 | 40.92 | 30.55 | 40.11 | 38.24 | 35.55 | 47.44 | 53.52 |
Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet |
---|---|---|---|---|---|
1 | ViT-B/16 | 59.92 | 97.39 | 82.36 | - |
2 | EfficientNetV2-S | 54.02 | 92.51 | 69.56 | - |
3 | DenseNet-121 | 36.60 | 59.87 | 20.96 | 28.96 |
4 | ResNet-18 | 28.73 | 46.87 | 15.50 | 23.81 |
5 | VGG-16 | 27.28 | 55.11 | 17.22 | 9.51 |
6 | FFN | 13.84 | 29.64 | 8.75 | 3.13 |
Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet |
---|---|---|---|---|---|
1 | ViT-B/16 | 59.24 | 96.85 | 80.87 | - |
2 | EfficientNetV2-S | 50.10 | 89.88 | 60.42 | - |
3 | DenseNet-121 | 26.13 | 44.81 | 11.59 | 22.00 |
4 | ResNet-18 | 17.83 | 33.20 | 6.96 | 13.34 |
5 | VGG-16 | 14.00 | 27.58 | 7.14 | 7.27 |
6 | FFN | 7.70 | 17.26 | 3.18 | 2.66 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 56.48 | 52.41 | 69.17 | 35.25 | 50.68 | 68.43 | 63.45 | 56.10 | 56.34 |
2 | GPT-2 | 55.19 | 53.46 | 65.77 | 34.77 | 53.34 | 65.34 | 55.04 | 57.76 | 56.06 |
3 | RoBERTa | 54.24 | 51.93 | 62.84 | 36.87 | 50.08 | 60.34 | 63.02 | 53.94 | 54.93 |
4 | BERT | 52.26 | 51.38 | 61.25 | 34.74 | 49.98 | 58.77 | 60.11 | 52.56 | 49.30 |
5 | T5 | 51.86 | 52.25 | 57.49 | 34.27 | 49.99 | 57.79 | 57.54 | 51.19 | 54.37 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 55.35 | 51.74 | 68.78 | 35.11 | 50.54 | 66.67 | 60.54 | 53.07 | 56.34 |
2 | GPT-2 | 52.95 | 51.22 | 64.60 | 33.86 | 51.80 | 64.07 | 51.27 | 51.26 | 55.49 |
3 | T5 | 50.20 | 51.01 | 56.72 | 32.98 | 48.91 | 56.52 | 53.35 | 48.59 | 53.52 |
4 | RoBERTa | 48.81 | 50.57 | 59.64 | 33.64 | 50.23 | 41.47 | 48.10 | 51.05 | 55.77 |
5 | BERT | 47.89 | 49.33 | 57.09 | 33.13 | 47.26 | 48.24 | 47.14 | 47.44 | 53.52 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | GPT-2 | 56.65 | 57.73 | 66.06 | 34.75 | 53.70 | 67.11 | 56.03 | 60.36 | 57.46 |
2 | Transformer | 56.48 | 52.50 | 69.13 | 35.45 | 51.36 | 68.43 | 63.26 | 55.38 | 56.34 |
3 | RoBERTa | 54.98 | 57.36 | 66.27 | 36.55 | 50.67 | 55.34 | 61.82 | 54.95 | 56.90 |
4 | BERT | 54.08 | 54.70 | 62.05 | 34.45 | 50.85 | 61.72 | 61.12 | 52.85 | 54.93 |
5 | T5 | 52.57 | 54.06 | 58.39 | 33.87 | 50.74 | 58.04 | 56.21 | 52.06 | 57.18 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 56.58 | 52.27 | 69.15 | 35.42 | 50.78 | 68.48 | 63.18 | 55.02 | 58.31 |
2 | GPT-2 | 54.10 | 51.72 | 66.04 | 34.00 | 52.86 | 63.97 | 55.00 | 52.85 | 56.34 |
3 | BERT | 51.06 | 50.28 | 60.44 | 33.41 | 50.08 | 54.80 | 54.83 | 49.46 | 55.21 |
4 | T5 | 50.53 | 51.06 | 57.81 | 33.13 | 49.50 | 55.78 | 54.53 | 49.17 | 53.24 |
5 | RoBERTa | 50.19 | 50.09 | 59.85 | 33.81 | 49.61 | 35.39 | 62.81 | 52.78 | 57.18 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 56.40 | 52.89 | 68.74 | 35.45 | 50.95 | 68.38 | 63.23 | 54.95 | 56.62 |
2 | GPT-2 | 55.39 | 52.52 | 66.19 | 34.40 | 53.65 | 66.23 | 54.64 | 58.84 | 56.62 |
3 | RoBERTa | 55.06 | 58.12 | 65.23 | 35.48 | 50.80 | 57.60 | 63.18 | 53.14 | 56.90 |
4 | BERT | 54.63 | 56.86 | 64.99 | 34.21 | 50.79 | 63.19 | 59.19 | 53.43 | 54.37 |
5 | T5 | 51.86 | 51.95 | 59.54 | 33.44 | 50.85 | 59.90 | 56.61 | 49.03 | 53.52 |
Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI |
---|---|---|---|---|---|---|---|---|---|---|
1 | Transformer | 56.44 | 52.34 | 68.88 | 35.28 | 51.22 | 68.33 | 63.19 | 54.80 | 57.46 |
2 | GPT-2 | 54.35 | 51.54 | 66.56 | 33.88 | 52.49 | 63.87 | 56.84 | 52.71 | 56.90 |
3 | BERT | 51.43 | 50.28 | 58.16 | 34.03 | 50.10 | 54.61 | 57.27 | 50.40 | 56.62 |
4 | T5 | 50.59 | 51.19 | 56.80 | 33.39 | 49.69 | 56.32 | 56.14 | 49.10 | 52.11 |
5 | RoBERTa | 48.52 | 50.25 | 49.38 | 33.38 | 49.78 | 33.33 | 63.10 | 52.56 | 56.34 |