University of Sussex

File(s) not publicly available

Co-occurrence retrieval: a general framework for lexical distributional similarity

journal contribution
posted on 2023-06-07, 22:27 authored by Julie Weeds, David Weir
Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of Natural Language Processing. For example, in language modeling, the sparse data problem can be alleviated by estimating the probabilities of unseen co-occurrences of events from the probabilities of seen co-occurrences of similar events. In other applications, distributional similarity is taken to be an approximation to semantic similarity. However, due to the wide range of potential applications and the lack of a strict definition of the concept of distributional similarity, many methods of calculating distributional similarity have been proposed or adopted.

In this work, a flexible, parameterized framework for calculating distributional similarity is proposed. Within this framework, the problem of finding distributionally similar words is cast as one of co-occurrence retrieval (CR), for which precision and recall can be measured by analogy with the way they are measured in document retrieval. As will be shown, a number of popular existing measures of distributional similarity can be simulated with parameter settings within the CR framework. In this article, the CR framework is then used to systematically investigate three fundamental questions concerning distributional similarity. First, is the relationship of lexical similarity necessarily symmetric, or are there advantages to be gained from considering it as an asymmetric relationship? Second, are some co-occurrences inherently more salient than others in the calculation of distributional similarity? Third, is it necessary to consider the difference in the extent to which each word occurs in each co-occurrence type?

Two application-based tasks are used for evaluation: automatic thesaurus generation and pseudo-disambiguation. It is possible to achieve significantly better results on both these tasks by varying the parameters within the CR framework rather than using other existing distributional similarity measures; it will also be shown that any single unparameterized measure is unlikely to be able to do better on both tasks. This is due to an inherent asymmetry in lexical substitutability and therefore also in lexical distributional similarity.
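As a rough illustration of the co-occurrence retrieval idea summarised in the abstract, the sketch below (not taken from the paper; the example feature dictionaries, weights, and the simple additive weighting are illustrative assumptions) computes CR precision and recall for a target word and a candidate neighbour from their weighted co-occurrence features, and combines them with a harmonic mean. Note that swapping the roles of the two words swaps precision and recall, which is one way of seeing the asymmetry the abstract refers to.

```python
# Illustrative sketch of co-occurrence retrieval (CR) precision and recall.
# Feature weights and example data are hypothetical; the paper explores a range
# of weighting and combination schemes, of which this additive variant is one.

def cr_precision_recall(target_feats, neighbour_feats):
    """target_feats / neighbour_feats: dicts mapping co-occurrence types
    (e.g. grammatical-relation/word pairs) to non-negative weights."""
    shared = set(target_feats) & set(neighbour_feats)
    # Precision: proportion of the target word's co-occurrence weight that is
    # "retrieved" by co-occurrences the neighbour also has.
    precision = sum(target_feats[f] for f in shared) / sum(target_feats.values())
    # Recall: proportion of the neighbour's co-occurrence weight accounted for
    # by co-occurrences of the target word.
    recall = sum(neighbour_feats[f] for f in shared) / sum(neighbour_feats.values())
    return precision, recall

def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

if __name__ == "__main__":
    # Hypothetical grammatical-relation co-occurrence weights for two nouns.
    dog = {("obj-of", "walk"): 3.0, ("obj-of", "feed"): 2.0, ("mod", "big"): 1.0}
    cat = {("obj-of", "feed"): 4.0, ("mod", "big"): 1.0, ("mod", "stray"): 2.0}
    p, r = cr_precision_recall(dog, cat)
    print(f"precision={p:.2f} recall={r:.2f} F={harmonic_mean(p, r):.2f}")
```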

History

Publication status

  • Published

Journal

Computational Linguistics

ISSN

0891-2017

Issue

4

Volume

31

Page range

439-475

Pages

37

Department affiliated with

  • Informatics Publications

Notes

Originality: This paper examines measures of lexical distributional similarity, presenting a novel framework that both generalises existing measures and provides a basis for understanding how these measures compare with one another. Rigour: The framework is formally specified, and extensive experimental evaluations are performed. Significance: Measuring the distributional similarity of words is emerging as one of the field's hottest topics, and a substantial number of rather different methods have been adopted and applied to a wide variety of tasks. The challenge that this paper addresses is a lack of understanding as to which measures are most effective for which tasks. This is the first paper to present a clear basis for understanding the nature of the space of possible measures, providing a methodology for establishing measures that are tailored to particular applications and will outperform the competition. Impact: Total citations in Google Scholar for this paper and its preceding conference paper are 23.

Full text available

  • No

Peer reviewed?

  • Yes

Legacy Posted Date

2012-02-06
