We are delighted to introduce Hard-Bench, a benchmark that carefully curates training data from existing datasets, posing a greater challenge than simple random sampling. Our benchmark draws on data from a variety of areas across NLP and CV, including eight tasks from the GLUE benchmark, CIFAR-10, CIFAR-100, and ImageNet.
In Hard-Bench (GradNorm), we create a k-shot dataset by selecting, for each label, the k data points with the highest gradient norms. Specifically, for each data point we compute the gradient vector of the predictor being trained and take its Euclidean norm. We then sort the data points within each label by gradient norm and select the top k to form the k-shot dataset. This method yields a challenging dataset that supports a rigorous evaluation of model performance.
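The selection procedure above can be sketched in a few lines. This is a minimal illustration only, assuming a logistic-regression predictor with weights `w` (for which the per-example gradient is `(p - y) * x`, so its Euclidean norm is `|p - y| * ||x||`); Hard-Bench itself computes gradient norms for whatever predictor is being trained.

```python
import numpy as np

def per_example_grad_norms(X, y, w):
    """Euclidean norm of each example's logistic-loss gradient w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    # gradient for example i is (p_i - y_i) * x_i, so its norm factorizes
    return np.abs(p - y) * np.linalg.norm(X, axis=1)

def select_k_shot_by_gradnorm(X, y, w, k):
    """Per label, keep the indices of the k examples with the largest gradient norm."""
    norms = per_example_grad_norms(X, y, w)
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        keep.extend(idx[np.argsort(-norms[idx])[:k]].tolist())  # descending, top k
    return sorted(keep)
```

For a deep network one would instead compute per-sample gradients of the full model (e.g. with a backward pass per example), but the sort-and-select logic is unchanged.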
                            
In Hard-Bench (Loss), we construct the k-shot dataset by selecting, for each label, the k data points with the highest loss values. Specifically, for each data point in the original dataset we compute the loss of the predictor being trained. We then sort the data points within each label by loss value and select the top k to form the k-shot dataset. This approach yields a challenging dataset that supports a more comprehensive evaluation of model performance.
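The loss-based variant follows the same per-label sort-and-select pattern, only scoring examples by loss rather than gradient norm. As before, this is a minimal sketch assuming a logistic-regression predictor with weights `w` and binary cross-entropy loss; any trained predictor and loss function can be substituted.

```python
import numpy as np

def per_example_losses(X, y, w):
    """Binary cross-entropy loss of a logistic predictor, per example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)         # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def select_k_shot_by_loss(X, y, w, k):
    """Per label, keep the indices of the k examples with the highest loss."""
    losses = per_example_losses(X, y, w)
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        keep.extend(idx[np.argsort(-losses[idx])[:k]].tolist())  # descending, top k
    return sorted(keep)
```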
                            
The content on this website was proofread by ChatGPT.
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RoBERTa | 57.33 | 51.01 | 66.10 | 38.42 | 48.61 | 82.55 | 56.69 | 60.36 | 54.93 | 
| 2 | Transformer | 55.98 | 51.88 | 69.15 | 35.11 | 50.59 | 68.38 | 62.41 | 54.01 | 56.34 | 
| 3 | GPT-2 | 52.90 | 51.44 | 51.93 | 35.98 | 48.62 | 65.98 | 55.40 | 57.76 | 56.06 | 
| 4 | T5 | 50.69 | 52.34 | 55.09 | 34.27 | 48.99 | 55.88 | 55.72 | 48.88 | 54.37 | 
| 5 | BERT | 47.88 | 47.94 | 45.77 | 33.96 | 46.24 | 56.08 | 52.60 | 51.12 | 49.30 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 53.70 | 51.38 | 69.11 | 34.98 | 50.57 | 65.64 | 48.17 | 53.43 | 56.34 | 
| 2 | GPT-2 | 48.69 | 49.79 | 56.18 | 31.41 | 51.01 | 50.54 | 40.33 | 54.73 | 55.49 | 
| 3 | T5 | 48.64 | 49.86 | 55.32 | 32.76 | 47.15 | 53.19 | 48.84 | 48.45 | 53.52 | 
| 4 | RoBERTa | 44.13 | 50.55 | 48.32 | 31.66 | 41.79 | 38.14 | 31.74 | 55.09 | 55.77 | 
| 5 | BERT | 41.50 | 45.64 | 40.92 | 30.55 | 40.11 | 38.24 | 35.55 | 47.44 | 53.52 | 
| Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet | 
|---|---|---|---|---|---|
| 1 | ViT-B/16 | 59.92 | 97.39 | 82.36 | - | 
| 2 | EfficientNetV2-S | 54.02 | 92.51 | 69.56 | - | 
| 3 | DenseNet-121 | 36.60 | 59.87 | 20.96 | 28.96 | 
| 4 | ResNet-18 | 28.73 | 46.87 | 15.50 | 23.81 | 
| 5 | VGG-16 | 27.28 | 55.11 | 17.22 | 9.51 | 
| 6 | FFN | 13.84 | 29.64 | 8.75 | 3.13 | 
| Rank | Models | Average | CIFAR-10 | CIFAR-100 | ImageNet | 
|---|---|---|---|---|---|
| 1 | ViT-B/16 | 59.24 | 96.85 | 80.87 | - | 
| 2 | EfficientNetV2-S | 50.10 | 89.88 | 60.42 | - | 
| 3 | DenseNet-121 | 26.13 | 44.81 | 11.59 | 22.00 | 
| 4 | ResNet-18 | 17.83 | 33.20 | 6.96 | 13.34 | 
| 5 | VGG-16 | 14.00 | 27.58 | 7.14 | 7.27 | 
| 6 | FFN | 7.70 | 17.26 | 3.18 | 2.66 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.48 | 52.41 | 69.17 | 35.25 | 50.68 | 68.43 | 63.45 | 56.10 | 56.34 | 
| 2 | GPT-2 | 55.19 | 53.46 | 65.77 | 34.77 | 53.34 | 65.34 | 55.04 | 57.76 | 56.06 | 
| 3 | RoBERTa | 54.24 | 51.93 | 62.84 | 36.87 | 50.08 | 60.34 | 63.02 | 53.94 | 54.93 | 
| 4 | BERT | 52.26 | 51.38 | 61.25 | 34.74 | 49.98 | 58.77 | 60.11 | 52.56 | 49.30 | 
| 5 | T5 | 51.86 | 52.25 | 57.49 | 34.27 | 49.99 | 57.79 | 57.54 | 51.19 | 54.37 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 55.35 | 51.74 | 68.78 | 35.11 | 50.54 | 66.67 | 60.54 | 53.07 | 56.34 | 
| 2 | GPT-2 | 52.95 | 51.22 | 64.60 | 33.86 | 51.80 | 64.07 | 51.27 | 51.26 | 55.49 | 
| 3 | T5 | 50.20 | 51.01 | 56.72 | 32.98 | 48.91 | 56.52 | 53.35 | 48.59 | 53.52 | 
| 4 | RoBERTa | 48.81 | 50.57 | 59.64 | 33.64 | 50.23 | 41.47 | 48.10 | 51.05 | 55.77 | 
| 5 | BERT | 47.89 | 49.33 | 57.09 | 33.13 | 47.26 | 48.24 | 47.14 | 47.44 | 53.52 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-2 | 56.65 | 57.73 | 66.06 | 34.75 | 53.70 | 67.11 | 56.03 | 60.36 | 57.46 | 
| 2 | Transformer | 56.48 | 52.50 | 69.13 | 35.45 | 51.36 | 68.43 | 63.26 | 55.38 | 56.34 | 
| 3 | RoBERTa | 54.98 | 57.36 | 66.27 | 36.55 | 50.67 | 55.34 | 61.82 | 54.95 | 56.90 | 
| 4 | BERT | 54.08 | 54.70 | 62.05 | 34.45 | 50.85 | 61.72 | 61.12 | 52.85 | 54.93 | 
| 5 | T5 | 52.57 | 54.06 | 58.39 | 33.87 | 50.74 | 58.04 | 56.21 | 52.06 | 57.18 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.58 | 52.27 | 69.15 | 35.42 | 50.78 | 68.48 | 63.18 | 55.02 | 58.31 | 
| 2 | GPT-2 | 54.10 | 51.72 | 66.04 | 34.00 | 52.86 | 63.97 | 55.00 | 52.85 | 56.34 | 
| 3 | BERT | 51.06 | 50.28 | 60.44 | 33.41 | 50.08 | 54.80 | 54.83 | 49.46 | 55.21 | 
| 4 | T5 | 50.53 | 51.06 | 57.81 | 33.13 | 49.50 | 55.78 | 54.53 | 49.17 | 53.24 | 
| 5 | RoBERTa | 50.19 | 50.09 | 59.85 | 33.81 | 49.61 | 35.39 | 62.81 | 52.78 | 57.18 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.40 | 52.89 | 68.74 | 35.45 | 50.95 | 68.38 | 63.23 | 54.95 | 56.62 | 
| 2 | GPT-2 | 55.39 | 52.52 | 66.19 | 34.40 | 53.65 | 66.23 | 54.64 | 58.84 | 56.62 | 
| 3 | RoBERTa | 55.06 | 58.12 | 65.23 | 35.48 | 50.80 | 57.60 | 63.18 | 53.14 | 56.90 | 
| 4 | BERT | 54.63 | 56.86 | 64.99 | 34.21 | 50.79 | 63.19 | 59.19 | 53.43 | 54.37 | 
| 5 | T5 | 51.86 | 51.95 | 59.54 | 33.44 | 50.85 | 59.90 | 56.61 | 49.03 | 53.52 | 
| Rank | Models | Average | SST2 | COLA | MNLI | QNLI | MRPC | QQP | RTE | WNLI | 
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Transformer | 56.44 | 52.34 | 68.88 | 35.28 | 51.22 | 68.33 | 63.19 | 54.80 | 57.46 | 
| 2 | GPT-2 | 54.35 | 51.54 | 66.56 | 33.88 | 52.49 | 63.87 | 56.84 | 52.71 | 56.90 | 
| 3 | BERT | 51.43 | 50.28 | 58.16 | 34.03 | 50.10 | 54.61 | 57.27 | 50.40 | 56.62 | 
| 4 | T5 | 50.59 | 51.19 | 56.80 | 33.39 | 49.69 | 56.32 | 56.14 | 49.10 | 52.11 | 
| 5 | RoBERTa | 48.52 | 50.25 | 49.38 | 33.38 | 49.78 | 33.33 | 63.10 | 52.56 | 56.34 |