Problem 4

Analyze the words of a text.

We consider not only the content of a word as a string, but also its position in the text. A word can appear at multiple locations in the text. (We treat inflections as different words: “go” and “goes” are considered two different words.)

We define an order on the words. A word x is smaller than a word y if the difference between the positions of the first and last occurrence of x is smaller than the corresponding difference for y. For example, in the text “to be or not to be”, “or” is smaller than “to”: “or” occurs only at position 2 (difference 0), while “to” occurs first at position 0 and last at position 4 (difference 4).

Provide a function to extract words from a text.

Provide a function find_top to return the largest n words under this order.

Data:

WORD_RE

Express a normalized word of a text.

TOKEN_RE

Express a token of a text.

Classes:

Token(text)

Represent a word as a token of the text.

WordOccurrence(first, last, text)

Represent a word occurrence in the text.

Functions:

tokens_to_words(tokens)

Convert the tokens to word occurrences.

tokenize(text)

Tokenize the text into normalized word tokens, ignoring punctuation.

find_top(word_occurrences, limit)

Find the top limit occurrences in word_occurrences.

WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

Express a normalized word of a text.
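
As an illustration, the pattern accepts only lower-case, optionally hyphenated words. The examples below are ours, not part of the original specification:

    import re

    WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

    assert WORD_RE.match('merry-go-round')  # lower-case, hyphen-separated parts
    assert not WORD_RE.match('Merry')       # upper case means not normalized
    assert not WORD_RE.match('well-')       # a trailing hyphen is rejected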

class Token(text: str)

Represent a word as a token of the text.

Methods:

__new__(cls, text)

Enforce the properties on the text of the word.

static __new__(cls, text: str) → Token

Enforce the properties on the text of the word.

Requires
  • WORD_RE.match(text)
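
A minimal sketch of the class, assuming Token subclasses str and the precondition is expressed with the icontract library (as the Requires block suggests); the actual implementation may differ:

    import re

    import icontract

    WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

    class Token(str):
        """Represent a word as a token of the text."""

        @icontract.require(lambda text: WORD_RE.match(text))
        def __new__(cls, text: str) -> 'Token':
            """Enforce the properties on the text of the word."""
            return super().__new__(cls, text)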

class WordOccurrence(first: int, last: int, text: Token)

Represent a word occurrence in the text.

Methods:

__init__(first, last, text)

Initialize with the given values.

__lt__(other)

Compare against other based on the difference between last and first.

__le__(other)

Compare against other based on the difference between last and first.

__repr__()

Represent the word occurrence as a string for easier debugging.

Attributes:

first

Index of the first occurrence

last

Index of the last occurrence

text

Text of the word

__init__(first: int, last: int, text: Token) → None

Initialize with the given values.

Requires
  • last >= 0

  • first >= 0

  • first <= last

first

Index of the first occurrence

last

Index of the last occurrence

text

Text of the word

__lt__(other: WordOccurrence) → bool

Compare against other based on the difference between last and first.

__le__(other: WordOccurrence) → bool

Compare against other based on the difference between last and first.

__repr__() → str

Represent the word occurrence as a string for easier debugging.
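
A sketch of the class consistent with the contracts above and with the order from the problem statement (the spread between first and last); again assuming icontract, and not necessarily the original code:

    import icontract

    class WordOccurrence:
        """Represent a word occurrence in the text."""

        @icontract.require(lambda last: last >= 0)
        @icontract.require(lambda first: first >= 0)
        @icontract.require(lambda first, last: first <= last)
        def __init__(self, first: int, last: int, text: Token) -> None:
            """Initialize with the given values."""
            self.first = first
            self.last = last
            self.text = text

        def __lt__(self, other: 'WordOccurrence') -> bool:
            """Compare against other based on the difference between last and first."""
            return self.last - self.first < other.last - other.first

        def __le__(self, other: 'WordOccurrence') -> bool:
            """Compare against other based on the difference between last and first."""
            return self.last - self.first <= other.last - other.first

        def __repr__(self) -> str:
            """Represent the word occurrence as a string for easier debugging."""
            return f'WordOccurrence({self.first}, {self.last}, {self.text!r})'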

tokens_to_words(tokens: List[Token]) → List[WordOccurrence]

Convert the tokens to word occurrences.

Ensures
  • not (len(tokens) > 0) or len(result) > 0

  • len(result) <= len(tokens)

  • all(
        tokens[word_occurrence.first] == word_occurrence.text
        and tokens[word_occurrence.last] == word_occurrence.text
        for word_occurrence in result
    )
    
  • (
            word_texts := [word_occurrence.text for word_occurrence in result],
            len(word_texts) == len(set(word_texts))
    )[1]
    

    (No duplicate word occurrences)
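
One way to satisfy these postconditions is a single pass that records the first and the last index of every distinct token. A sketch, assuming the Token and WordOccurrence classes above, and not necessarily the original implementation:

    from typing import Dict, List

    def tokens_to_words(tokens: List[Token]) -> List[WordOccurrence]:
        """Convert the tokens to word occurrences."""
        first: Dict[Token, int] = {}
        last: Dict[Token, int] = {}
        for i, token in enumerate(tokens):
            first.setdefault(token, i)  # keep only the earliest index
            last[token] = i             # overwrite with the latest index
        return [
            WordOccurrence(first[token], last[token], token)
            for token in first
        ]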

TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z]+)*')

Express a token of a text.

tokenize(text: str) → List[Token]

Tokenize the text into normalized word tokens, ignoring punctuation.

Ensures
  • sum(len(token) for token in result) <= len(text)
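
A sketch that satisfies this postcondition by lower-casing every non-overlapping match of TOKEN_RE; this is our reading of the contract, not necessarily the original code:

    import re
    from typing import List

    TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z]+)*')

    def tokenize(text: str) -> List[Token]:
        """Tokenize the text into normalized word tokens, ignoring punctuation."""
        return [
            Token(match.group().lower())
            for match in TOKEN_RE.finditer(text)
        ]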

find_top(word_occurrences: List[WordOccurrence], limit: int) → List[WordOccurrence]

Find the top limit occurrences in word_occurrences.

Requires
  • limit > 0

Ensures
  • len(result) == min(len(word_occurrences), limit)

  • all(
        result[i] >= result[i + 1]
        for i in range(len(result) - 1)
    )
    
  • (
            word_set := set(word_occurrences),
            all(
                word_occurrence in word_set  # pylint: disable=used-before-assignment
                for word_occurrence in result
            )
    )[1]
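
Since WordOccurrence defines __lt__ and __le__, sorting in descending order and truncating satisfies all three postconditions. A sketch under that assumption:

    from typing import List

    def find_top(
            word_occurrences: List[WordOccurrence], limit: int
    ) -> List[WordOccurrence]:
        """Find the top limit occurrences in word_occurrences."""
        return sorted(word_occurrences, reverse=True)[:limit]

For instance, with the sketches above, find_top(tokens_to_words(tokenize('to be or not to be')), limit=2) yields the occurrences of “to” and “be”, whose first-to-last spread of 4 is the largest in the text.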