Problem 4

Analyze the words of a text.

We consider not only the content of a word as a string, but also its position in the text. A word can appear at multiple locations in the text. (We treat inflections as different words: “go” and “goes” are considered two different words.)

We define an order on the words. A word x is smaller than a word y if the difference between the positions of the first and last occurrence of x is smaller than the corresponding difference for y. For example, in the text “to be or not to be”, “or” is smaller than “to”: “or” occurs only at position 2 (difference 0), while “to” occurs first at position 0 and last at position 4 (difference 4).

Provide a function to extract words from a text.

Provide a function find_top to return the largest n words under this order.

Data:

WORD_RE

Express a normalized word of a text.

TOKEN_RE

Express a token of a text.

Classes:

Token(text)

Represent a word as a token of the text.

WordOccurrence(first, last, text)

Represent a word occurrence in the text.

Functions:

tokens_to_words(tokens)

Convert the tokens to word occurrences.

tokenize(text)

Tokenize the text into normalized word tokens, ignoring punctuation.

find_top(word_occurrences, limit)

Find the top limit occurrences in word_occurrences.

WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

Express a normalized word of a text.
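
As an illustration, the pattern accepts only lower-case, optionally hyphenated words. The examples below are ours, not part of the original specification:

    import re

    WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

    assert WORD_RE.match('merry-go-round')  # lower-case, hyphen-separated parts
    assert not WORD_RE.match('Merry')       # upper case means not normalized
    assert not WORD_RE.match('well-')       # a trailing hyphen is rejected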

class Token(text: str)

Represent a word as a token of the text.

Methods:

__new__(cls, text)

Enforce the properties on the text of the word.

static __new__(cls, text: str) → Token

Enforce the properties on the text of the word.

Requires
  • WORD_RE.match(text)
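
A minimal sketch of the class, assuming Token subclasses str and the precondition is expressed with the icontract library (as the Requires block suggests); the actual implementation may differ:

    import re

    import icontract

    WORD_RE = re.compile('^[a-z]+(-[a-z]+)*$')

    class Token(str):
        """Represent a word as a token of the text."""

        @icontract.require(lambda text: WORD_RE.match(text))
        def __new__(cls, text: str) -> 'Token':
            """Enforce the properties on the text of the word."""
            return super().__new__(cls, text)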

class WordOccurrence(first: int, last: int, text: Token)

Represent a word occurrence in the text.

Methods:

__init__(first, last, text)

Initialize with the given values.

__lt__(other)

Compare against other based on the difference between last and first.

__le__(other)

Compare against other based on the difference between last and first.

__repr__()

Represent the word occurrence as a string for easier debugging.

Attributes:

first

Index of the first occurrence

last

Index of the last occurrence

text

Text of the word

__init__(first: int, last: int, text: Token) → None

Initialize with the given values.

Requires
  • last >= 0

  • first >= 0

  • first <= last

first

Index of the first occurrence

last

Index of the last occurrence

text

Text of the word

__lt__(other: WordOccurrence) → bool

Compare against other based on the difference between last and first.

__le__(other: WordOccurrence) → bool

Compare against other based on the difference between last and first.

__repr__() → str

Represent the word occurrence as a string for easier debugging.
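
A sketch of the class consistent with the contracts above and with the order from the problem statement (the spread between first and last); again assuming icontract, and not necessarily the original code:

    import icontract

    class WordOccurrence:
        """Represent a word occurrence in the text."""

        @icontract.require(lambda last: last >= 0)
        @icontract.require(lambda first: first >= 0)
        @icontract.require(lambda first, last: first <= last)
        def __init__(self, first: int, last: int, text: Token) -> None:
            """Initialize with the given values."""
            self.first = first
            self.last = last
            self.text = text

        def __lt__(self, other: 'WordOccurrence') -> bool:
            """Compare against other based on the difference between last and first."""
            return self.last - self.first < other.last - other.first

        def __le__(self, other: 'WordOccurrence') -> bool:
            """Compare against other based on the difference between last and first."""
            return self.last - self.first <= other.last - other.first

        def __repr__(self) -> str:
            """Represent the word occurrence as a string for easier debugging."""
            return f'WordOccurrence({self.first}, {self.last}, {self.text!r})'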

tokens_to_words(tokens: List[Token]) → List[WordOccurrence]

Convert the tokens to word occurrences.

Ensures
  • not (len(tokens) > 0) or len(result) > 0

  • len(result) <= len(tokens)

  • all(
        tokens[word_occurrence.first] == word_occurrence.text
        and tokens[word_occurrence.last] == word_occurrence.text
        for word_occurrence in result
    )
    
  • (
            word_texts := [word_occurrence.text for word_occurrence in result],
            len(word_texts) == len(set(word_texts))
    )[1]
    

    (No duplicate word occurrences)
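
One way to satisfy these postconditions is a single pass that records the first and the last index of every distinct token. A sketch, assuming the Token and WordOccurrence classes above, and not necessarily the original implementation:

    from typing import Dict, List

    def tokens_to_words(tokens: List[Token]) -> List[WordOccurrence]:
        """Convert the tokens to word occurrences."""
        first: Dict[Token, int] = {}
        last: Dict[Token, int] = {}
        for i, token in enumerate(tokens):
            first.setdefault(token, i)  # keep only the earliest index
            last[token] = i             # overwrite with the latest index
        return [
            WordOccurrence(first[token], last[token], token)
            for token in first
        ]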

TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z]+)*')

Express a token of a text.

tokenize(text: str) → List[Token]

Tokenize the text into normalized word tokens, ignoring punctuation.

Ensures
  • sum(len(token) for token in result) <= len(text)
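
A sketch that satisfies this postcondition by lower-casing every non-overlapping match of TOKEN_RE; this is our reading of the contract, not necessarily the original code:

    import re
    from typing import List

    TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z]+)*')

    def tokenize(text: str) -> List[Token]:
        """Tokenize the text into normalized word tokens, ignoring punctuation."""
        return [
            Token(match.group().lower())
            for match in TOKEN_RE.finditer(text)
        ]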

find_top(word_occurrences: List[WordOccurrence], limit: int) → List[WordOccurrence]

Find the top limit occurrences in word_occurrences.

Requires
  • limit > 0

Ensures
  • len(result) == min(len(word_occurrences), limit)

  • all(
        result[i] >= result[i + 1]
        for i in range(len(result) - 1)
    )
    
  • (
            word_set := set(word_occurrences),
            all(
                word_occurrence in word_set  # pylint: disable=used-before-assignment
                for word_occurrence in result
            )
    )[1]
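
Since WordOccurrence defines __lt__ and __le__, sorting in descending order and truncating satisfies all three postconditions. A sketch under that assumption:

    from typing import List

    def find_top(
            word_occurrences: List[WordOccurrence], limit: int
    ) -> List[WordOccurrence]:
        """Find the top limit occurrences in word_occurrences."""
        return sorted(word_occurrences, reverse=True)[:limit]

For instance, with the sketches above, find_top(tokens_to_words(tokenize('to be or not to be')), limit=2) yields the occurrences of “to” and “be”, whose first-to-last spread of 4 is the largest in the text.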