\$\begingroup\$

This is a Python script that generates n-grams using a dictionary of rules that lists which letters can follow each letter.

The output is first preprocessed by another script, then filtered further through an API of sorts by the number of words containing each n-gram. The result will be used for pseudoword generation.

This is the generation part:

from string import ascii_lowercase
import sys

LETTERS = set(ascii_lowercase)
VOWELS = set('aeiouy')
CONSONANTS = LETTERS - VOWELS

BASETAILS = {
    'a': CONSONANTS,
    'b': 'bjlr',
    'c': 'chjklr',
    'd': 'dgjw',
    'e': CONSONANTS,
    'f': 'fjlr',
    'g': 'ghjlrw',
    'h': '',
    'i': CONSONANTS,
    'j': '',
    'k': 'hklrvw',
    'l': 'l',
    'm': 'cm',
    'n': 'gn',
    'o': CONSONANTS,
    'p': 'fhlprst',
    'q': '',
    'r': 'hrw',
    's': 'chjklmnpqstw',
    't': 'hjrstw',
    'u': CONSONANTS,
    'v': 'lv',
    'w': 'hr',
    'x': 'h',
    'y': 'sv',
    'z': 'hlvw'
}

tails = dict()

for i in ascii_lowercase:
    v = BASETAILS[i]
    if type(v) == set:
        v = ''.join(sorted(v))
    tails.update({i: ''.join(sorted('aeiou' + v))})

def makechain(invar, target, depth=0):
    depth += 1
    if type(invar) == str:
        invar = set(invar)
    chain = invar.copy()
    if depth == target:
        return sorted(chain)
    else:
        for i in invar:
            for j in tails[i[-1]]:
                chain.add(i + j)
        return makechain(chain, target, depth)

if __name__ == '__main__':
    invar = sys.argv[1]
    target = int(sys.argv[2])
    if invar in globals():
        invar = eval(invar)
    print(*makechain(invar, target), sep='\n')

I want to ask about the makechain function. I used sets because the results can contain duplicates if I use lists (although the result could be cast to a set afterwards). I used a nested for loop and a recursive function to simulate a variable number of nested for loops.

For example, makechain(LETTERS, 4) is equivalent to:

chain = set()

for a in LETTERS:
    chain.add(a)

for a in LETTERS:
    for b in tails[a]:
        chain.add(a + b)

for a in LETTERS:
    for b in tails[a]:
        for c in tails[b]:
            chain.add(a + b + c)

for a in LETTERS:
    for b in tails[a]:
        for c in tails[b]:
            for d in tails[c]:
                chain.add(a + b + c + d)

Obviously makechain(LETTERS, 4) is much better than the nested for loop approach because it is much more flexible.

I want to know: is there any way I can use a function from itertools instead of the nested for loop to generate the same results more efficiently?

I am thinking about itertools.product and itertools.combinations but I just can't figure out how to do it.

Any help will be appreciated.

\$\endgroup\$

    2 Answers

    \$\begingroup\$

    Type checking

    Don't do if type(var) == cls, instead do:

    if isinstance(var, cls): 
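To illustrate the difference, here is a minimal sketch. The Letters subclass is hypothetical and exists only to show that isinstance accepts subclasses, while an exact type comparison does not.

```python
class Letters(set):
    """Hypothetical subclass of set, used only for this illustration."""


v = Letters('abc')

# An exact type comparison rejects subclasses.
exact_match = type(v) == set

# isinstance accepts the class itself and any subclass.
instance_match = isinstance(v, set)

print(exact_match, instance_match)  # False True
```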

    tails

    This can be built via a dictionary comprehension rather than iterating over ascii_lowercase and looking up against BASETAILS every time:

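    # BASETAILS values are either sets or strings, so combine them with a
    # set union; 'aeiou' matches the literal used in the original loop.
    tails = {
        i: ''.join(sorted(set('aeiou') | set(v)))
        for i, v in BASETAILS.items()
    }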
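As a quick sanity check, the sketch below builds tails both ways on a truncated BASETAILS (three entries, chosen only for illustration) and confirms that the results agree. It uses a set union so that both set and string values are handled; that detail is an assumption about how to cope with the mixed value types in the question.

```python
from string import ascii_lowercase

VOWELS = set('aeiouy')
CONSONANTS = set(ascii_lowercase) - VOWELS

# Truncated table for illustration only; the real one has 26 entries.
BASETAILS = {'a': CONSONANTS, 'b': 'bjlr', 'h': ''}

# Loop version, mirroring the logic of the question's code.
loop_tails = {}
for i in BASETAILS:
    v = BASETAILS[i]
    if isinstance(v, set):
        v = ''.join(sorted(v))
    loop_tails[i] = ''.join(sorted('aeiou' + v))

# Comprehension version; the set union accepts both sets and strings.
comp_tails = {
    i: ''.join(sorted(set('aeiou') | set(v)))
    for i, v in BASETAILS.items()
}

print(comp_tails == loop_tails)          # True
print(comp_tails['b'], comp_tails['h'])  # abeijloru aeiou
```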

    makechain vs loops

    Your n-depth loops are a cartesian product of LETTERS with itself, restricted to the strings in which each letter is an allowed tail of the one before it. As you suspected, this can be done with itertools.product, filtering out the combinations that break the rules:

    from itertools import product

    def build_chain(letters, tails, depth, chain=None):
        chain = chain if chain is not None else set()
        # Collect strings of every length from 1 up to depth.
        for length in range(1, depth + 1):
            chain.update(
                ''.join(c)
                for c in product(letters, repeat=length)
                # Keep only strings where each letter may follow the previous one.
                if all(b in tails[a] for a, b in zip(c, c[1:]))
            )
        return chain

    This assumes that every entry in tails.values() is a string and that letters is also a string.
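For comparison, here is a sketch of a different technique that does not rely on itertools.product: extending only the strings added in the previous round (a breadth-first frontier), so that each string is extended exactly once. The function name chain_by_frontier and the three-letter tails table are hypothetical and used only for illustration.

```python
def chain_by_frontier(letters, tails, depth):
    """Return all valid strings of length 1..depth as a sorted list."""
    frontier = list(letters)  # strings added in the latest round
    chain = set(frontier)
    for _ in range(depth - 1):
        # Extend only the newest strings; older ones were already extended.
        frontier = [word + t for word in frontier for t in tails[word[-1]]]
        chain.update(frontier)
    return sorted(chain)


# Hypothetical toy rules: which letters may follow each letter.
toy_tails = {'a': 'bc', 'b': 'a', 'c': ''}

print(chain_by_frontier('abc', toy_tails, 3))
# ['a', 'ab', 'aba', 'ac', 'b', 'ba', 'bab', 'bac', 'c']
```

Because nothing can follow 'c' in the toy table, strings ending in 'c' simply drop out of the frontier, which is the same pruning the nested loops perform.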

    \$\endgroup\$
      \$\begingroup\$

      Any help will be appreciated

      A few suggestions on something that I noticed:

      • In the function makechain the else after the return is not necessary.

      • Typo in: VOWELS = set('aeiouy'), there is an extra y.

      • This part:

        LETTERS = set(ascii_lowercase)
        VOWELS = set('aeiou')
        CONSONANTS = LETTERS - VOWELS

        BASETAILS = {
            'a': CONSONANTS,
            'b': 'bjlr',
            'c': 'chjklr',
            'd': 'dgjw',
            ....
        }

        tails = dict()

        for i in ascii_lowercase:
            v = BASETAILS[i]
            if type(v) == set:
                v = ''.join(sorted(v))
            tails.update({i: ''.join(sorted('aeiou' + v))})

        seems to do the following:

        1. Create a dictionary with values of mixed types (strings and sets)
        2. Convert all values to strings
        3. Sort the dictionary's values

        It could be simplified to:

        1. Create a dictionary where all values are strings
        2. Sort the dictionary's values

        Additionally, having VOWELS and CONSONANTS as sets while the other tails are strings is a bit confusing. It would be better to use only one type.

        Code with suggestions above:

        LETTERS = ascii_lowercase
        VOWELS = 'aeiou'
        CONSONANTS = ''.join(set(LETTERS) - set(VOWELS))

        BASETAILS = {
            'a': CONSONANTS,
            'b': 'bjlr',
            'c': 'chjklr',
            'd': 'dgjw',
            ....
        }

        tails = dict()

        for i in ascii_lowercase:
            v = BASETAILS[i]
            tails.update({i: ''.join(sorted(VOWELS + v))})

        In this way, you also avoid sorting twice.
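The first suggestion above (dropping the else after the return) can be sketched as follows. The body mirrors the question's makechain, but tails is passed in as a parameter so that the snippet is self-contained. That parameter and the toy table are assumptions made for illustration, and isinstance is used in place of the type comparison.

```python
def makechain(invar, target, tails, depth=0):
    depth += 1
    if isinstance(invar, str):
        invar = set(invar)
    chain = invar.copy()
    if depth == target:
        return sorted(chain)
    # No else needed: this point is reached only when the return above is not.
    for i in invar:
        for j in tails[i[-1]]:
            chain.add(i + j)
    return makechain(chain, target, tails, depth)


# Hypothetical toy rules: which letters may follow each letter.
toy_tails = {'a': 'bc', 'b': 'a', 'c': ''}

print(makechain('abc', 2, toy_tails))
# ['a', 'ab', 'ac', 'b', 'ba', 'c']
```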

      \$\endgroup\$
      • Correction: the y in 'aeiouy' is not a typo. I know y isn't a full vowel, but it is a half-vowel that acts as a vowel in many English words, such as many, cyber, psycho, dry... So it can be used as a vowel for pseudoword generation. (Commented Aug 2, 2023 at 15:40)
