Tokenization and the Noiseless Channel

Zouhar, Vilém; Meister, Clara Isabel; Gastaldi, Juan Luis; Du, Li; Sachan, Mrinmaya; Cotterell, Ryan

doi:10.18653/v1/2023.acl-long.284

Download

Full text (published version) (PDF, 699.7Kb)

Open access

Author

Zouhar, Vilém

Meister, Clara Isabel

Du, Li

Date

2023-07

Type

Conference Paper

ETH Bibliography

yes

Rights / license

Creative Commons Attribution 4.0 International

Abstract

Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by w…

Permanent link

https://doi.org/10.3929/ethz-b-000643307

Publication status

published

External links

https://doi.org/10.18653/v1/2023.acl-long.284

Editor

Rogers, Anna

Boyd-Graber, Jordan

Okazaki, Naoaki

Book title

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers