
I have a given set S of strings and a length l, and I am looking for the number of strings of length l that contain every string of S as a substring. A naive approach would be to generate every string of length l (I have a given alphabet), and for each one check whether it contains every string of S. However, this is clearly very inefficient, so I was wondering if there is a better way that uses dynamic programming.

I couldn't find any thread about such a problem. Some were quite similar (such as finding the number of strings of a given length that contain one given string), but none gave me the answer.

EDIT: an example

Suppose I am given the strings "ab" and "ca", the length 4, and the alphabet {a, b, c}. I can make the following strings: abca, caab, caba, cabb, cabc, acab, bcab, ccab, for a total of 8 possible strings.
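For reference, the naive approach mentioned above would look roughly like this in Rust (just a sketch to make the problem concrete; it is exponential in l, and the names are only illustrative, not the code I actually submit):

    // Naive approach: enumerate every string of length l over the alphabet
    // and keep those that contain every word of S as a substring.
    fn contains_all(candidate: &str, words: &[&str]) -> bool {
        words.iter().all(|w| candidate.contains(*w))
    }

    fn count_naive(alphabet: &[char], l: usize, words: &[&str]) -> usize {
        // Grow all candidate strings one letter at a time.
        let mut candidates = vec![String::new()];
        for _ in 0..l {
            let mut next = Vec::new();
            for s in &candidates {
                for &c in alphabet {
                    let mut t = s.clone();
                    t.push(c);
                    next.push(t);
                }
            }
            candidates = next;
        }
        candidates
            .into_iter()
            .filter(|s| contains_all(s, words))
            .count()
    }

    fn main() {
        // The example above: alphabet {a, b, c}, words "ab" and "ca", length 4.
        println!("{}", count_naive(&['a', 'b', 'c'], 4, &["ab", "ca"])); // 8
    }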

EDIT: code from answers

From @KilianFoth's and @Jasmijn's answers, I came up with the following code, with a few modifications:

  1. @KilianFoth I think you might actually count a string twice (or many more times) if you don't cache, in some way, the ones you've already counted, but...
  2. @Jasmijn storing the actual strings found, even in a set (which in Rust is a HashSet), takes too much memory (in the sense that I tried it and it failed)

Therefore I created a "hash" function that is really more of a compression scheme: since all my strings are over a 4-letter alphabet, each one is uniquely defined by a number in base 4 (for example, with A=0, C=1, G=2, T=3 as in the code below, "GAT" becomes 2·4^0 + 0·4^1 + 3·4^2 = 50). So I store that number instead of the string, which takes less memory but serves the same purpose.

Moreover, I don't recreate tons of templates like in @Jasmijn's answer; I recycle a single buffer for every string (notice that whenever I fill a slot, a few lines later I empty it again).

However, the main algorithm is the same, and that's the problem: it is too slow (I submitted it and it didn't pass the fifth test out of twelve). Thus there must be a better algorithm.

    use std::io;
    use std::collections::HashSet;

    const ALPHABET: [char; 4] = ['A', 'C', 'G', 'T'];

    type Word = Vec<char>;
    type Buffer = Vec<Option<char>>;

    // Encode a word over the 4-letter alphabet as a base-4 number, so that a
    // u32 can be stored in the HashSet instead of the string itself.
    fn hash(word: Word) -> u32 {
        word.iter()
            .enumerate()
            .map(|(i, x)| match x {
                'A' => 0,
                'C' => 1,
                'G' => 2,
                'T' => 3,
                _ => 4,
            } * 4_u32.pow(i as u32))
            .sum()
    }

    // Check whether `word` can be placed at index `i` of the buffer without
    // conflicting with already fixed letters; also return which slots it would fill.
    fn can_insert(word: &Word, i: usize, buffer: &Buffer) -> (bool, Vec<usize>) {
        if word.len() + i > buffer.len() {
            return (false, Vec::new());
        }
        let mut changes = Vec::new();
        for j in 0..word.len() {
            if let Some(chr) = buffer[i + j] {
                if chr != word[j] {
                    return (false, Vec::new());
                }
            } else {
                changes.push(j);
            }
        }
        (true, changes)
    }

    // Fill the remaining empty slots with every possible letter and record
    // each completed string (as its hash) in `done`.
    fn fill(buffer: &mut Buffer, index: usize, done: &mut HashSet<u32>) {
        if index == buffer.len() {
            done.insert(hash(buffer.iter().map(|x| x.unwrap()).collect()));
        } else if buffer[index].is_none() {
            for &chr in ALPHABET.iter() {
                buffer[index] = Some(chr);
                fill(buffer, index + 1, done);
            }
            buffer[index] = None;
        } else {
            fill(buffer, index + 1, done);
        }
    }

    // Place each word (and its reverse) at every compatible index, recursing on
    // the remaining words; once every word is placed, enumerate the completions.
    fn count(buffer: &mut Buffer, words: &[Word], index: usize, done: &mut HashSet<u32>) {
        if index == words.len() {
            fill(buffer, 0, done);
        } else {
            let word = &words[index];
            let rev: Word = word.iter().rev().copied().collect();
            for i in 0..buffer.len() {
                if let (true, changes) = can_insert(word, i, buffer) {
                    for &j in &changes {
                        buffer[i + j] = Some(word[j]);
                    }
                    count(buffer, words, index + 1, done);
                    for &j in &changes {
                        buffer[i + j] = None;
                    }
                }
                if let (true, changes) = can_insert(&rev, i, buffer) {
                    for &j in &changes {
                        buffer[i + j] = Some(rev[j]);
                    }
                    count(buffer, words, index + 1, done);
                    for &j in &changes {
                        buffer[i + j] = None;
                    }
                }
            }
        }
    }

    fn solve_problem(n: usize, _m: usize, words: Vec<Word>) {
        let mut buffer = vec![None; n];
        let mut done = HashSet::new();
        count(&mut buffer, &words, 0, &mut done);
        println!("{}", done.len());
    }

    fn main() {
        let n = read_line().parse().unwrap();
        let m = read_line().parse().unwrap();
        let words = read_words(m);
        solve_problem(n, m, words);
    }

    // What is left are just io functions to read the problem

    fn read_line() -> String {
        let mut line = String::new();
        io::stdin()
            .read_line(&mut line)
            .expect("Failed to read line");
        line.trim().to_string()
    }

    fn read_words(m: usize) -> Vec<Word> {
        let mut words = Vec::new();
        for _ in 0..m {
            let word = read_line();
            words.push(word);
        }
        words.iter().map(|x| x.chars().collect()).collect()
    }
  • Adding an example input and expected output to your question would make it much clearer
    – Alexander
    Commented Dec 22, 2020 at 14:10

3 Answers


Well, any of the strings from S that your target string has to contain must start at some index i < l, right? So

  • just start searching states of the world where the first string is at index 0, 1, 2 etc. in a buffer of length l
  • recursively put other strings at other indices - you'll see easily enough whether it's possible or not.
  • once you've placed all strings, you know that you've found 26^n possible solutions (n being the number of empty slots; 26 stands in for the size of your alphabet)
  • sum the number of solutions in all leaf nodes

(Of course, if you can't place another string, you've hit a dead end and must simply backtrack.)
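A rough sketch of that search in Rust, to match the question (illustrative only, not tuned or tested against the judge):

    // Backtracking placement sketch: put each word of S somewhere in a buffer
    // of length l, skip placements that conflict with letters already fixed,
    // and once every word is placed, add alphabet_size^(number of free slots).
    fn place(buffer: &mut Vec<Option<char>>, words: &[&str], alphabet_size: u64) -> u64 {
        let (word, rest) = match words.split_first() {
            Some((w, r)) => (*w, r),
            None => {
                // All words placed: every still-empty slot can hold any letter.
                let free = buffer.iter().filter(|c| c.is_none()).count();
                return alphabet_size.pow(free as u32);
            }
        };
        let chars: Vec<char> = word.chars().collect();
        if chars.len() > buffer.len() {
            return 0;
        }
        let mut total = 0;
        for start in 0..=buffer.len() - chars.len() {
            // Does the word fit here without contradicting fixed letters?
            let fits = chars
                .iter()
                .enumerate()
                .all(|(j, &c)| buffer[start + j].map_or(true, |b| b == c));
            if !fits {
                continue;
            }
            // Fill only the slots that were empty, so we can undo them afterwards.
            let changed: Vec<usize> = (0..chars.len())
                .filter(|&j| buffer[start + j].is_none())
                .collect();
            for &j in &changed {
                buffer[start + j] = Some(chars[j]);
            }
            total += place(buffer, rest, alphabet_size);
            for &j in &changed {
                buffer[start + j] = None;
            }
        }
        total
    }

On the example in the question (length 4, words "ab" and "ca", alphabet size 3), place(&mut vec![None; 4], &["ab", "ca"], 3) should come out to 8, which happens to match the exact answer there. In general, though, the same final string can be reached through different placements, so this raw sum over-counts and some deduplication is still needed, as the question's edit points out.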

  • If I understood you well, you are suggesting a search over a tree in which, at each step, I add a word at a given index i. If I can reach the bottom of the tree, it means I have actually found a word that contains every string in S, and conversely every such word can be constructed like that. But, if that's it, and m is the size of S, I have to do O(l^m) operations, which seems quite a lot (not really an improvement over the naive algorithm, which is also basically exponential).
    – jthulhu
    Commented Dec 22, 2020 at 10:12
  • Well, my impression is that this problem should be NP-complete and therefore only have exponential exact solutions. The reason would be that in the general case, the strings may interact with each other (overlap wholly or partially) in ways that you can't compute without actually checking, and this makes the normal (polynomial) dynamic programming solution useless. Could be mistaken, though.
    Commented Dec 22, 2020 at 10:54
  • I see your point; however, it might be feasible to precompute these overlaps and get a running time that is polynomial in the size of the strings in S. That doesn't mean the problem isn't NP-complete, but as long as the strings in S aren't too big, the rest of the algorithm is poly-time.
    – jthulhu
    Commented Dec 22, 2020 at 10:58
  • @BlackBeans: the approach of testing every possible string requires O(a^l * m * w) operations, where a is the alphabet size and w is the average word length, so Kilian's suggestion is definitely an improvement when m is small and l is large. The question here is: which one do you prefer as exponent - l or m? Note also that with increasing m, the search tree will get pruned automatically by the backtracking, ...
    – Doc Brown
    Commented Dec 23, 2020 at 10:42
  • ... and I guess it will be possible to find a data structure for the buffer which allows finding the possible placements for the next string with a decreasing number of operations as the buffer gets filled more and more.
    – Doc Brown
    Commented Dec 23, 2020 at 10:44

I don't see an easy way to efficiently do this. The closest I've come is the following (works for Python 3.9):

    from collections.abc import Iterable, Iterator
    from typing import Optional

    Template = list[Optional[str]]


    def possible_match(template_item: Optional[str], fixed_item: str) -> bool:
        '''
        Can the length-1 string fixed_item fit in the slot template_item?
        '''
        return template_item is None or template_item == fixed_item


    def fill_templates(base_template: Template, fixed: str) -> Iterator[Template]:
        '''
        Fill a string in all the slots that it fits.

        For example, with a template of **ab and a fixed string ca, we would get:
        caab and *cab
        '''
        for i in range(len(base_template) - len(fixed) + 1):
            if all(possible_match(template_item, fixed_item)
                   for template_item, fixed_item in zip(base_template[i:], fixed)):
                yield base_template[:i] + list(fixed) + base_template[i + len(fixed):]


    def fill_all_templates(base_template: Template, S: list[str]) -> Iterator[Template]:
        '''
        Recursively fill all the strings in all the slots they fit.
        '''
        if S:
            fixed, *rest_S = S
            for new_template in fill_templates(base_template, fixed):
                yield from fill_all_templates(new_template, rest_S)
        else:
            yield base_template


    def generate_possible_solutions(template: Template, alphabet: frozenset[str]) -> Iterator[str]:
        '''
        Recursively generate all the solutions from a single template.

        For example, given a template of *bb* and an alphabet of {a, b},
        this would generate: abba, abbb, bbba and bbbb
        '''
        if template:
            first, *rest = template
            if first is None:
                for solution in generate_possible_solutions(rest, alphabet):
                    for letter in alphabet:
                        yield letter + solution
            else:
                for solution in generate_possible_solutions(rest, alphabet):
                    yield first + solution
        else:
            yield ''


    def generate_solutions(templates: Iterable[Template], alphabet: frozenset[str]) -> Iterator[str]:
        '''
        Generate all the solutions from a number of templates.
        Note that this will contain duplicates.

        For example, given the templates *a and a* and the alphabet {a, b},
        this will generate: aa, ba, aa and ab
        '''
        for template in templates:
            yield from generate_possible_solutions(template, alphabet)


    def black_beans(S: list[str], l: int, alphabet: frozenset[str]) -> frozenset[str]:
        return frozenset(generate_solutions(fill_all_templates([None] * l, S), alphabet))


    if __name__ == '__main__':
        example = black_beans(['ab', 'ca'], 4, {'a', 'b', 'c'})
        print(example)
        # => frozenset({'cabb', 'caba', 'acab', 'cabc', 'ccab', 'caab', 'bcab', 'abca'})
        print(len(example))  # => 8

This will generate a lot of duplicates, but if the strings in S are long enough, it should be more efficient than brute-forcing it.

It works by recursively building a sequence of "templates", which are lists in which each slot holds either a character or a wildcard, then expanding the templates into all possible combinations, and lastly removing all of the duplicates. For the example, no duplicates ever actually get generated, so it doesn't seem to be doing too much extra work, if that is a representative example.

  • Reading your code, I think yours is an implementation of the algorithm proposed by @KilianFoth, am I correct?
    – jthulhu
    Commented Dec 29, 2020 at 11:32
  • I think so, yes.
    – Jasmijn
    Commented Dec 29, 2020 at 12:18
  • I've edited my original post to give a more complete picture, but basically, it's still too slow...
    – jthulhu
    Commented Dec 29, 2020 at 12:20

I'm not sure I fully understand the problem, but it sounds like you are looking for a "Trie"

https://en.wikipedia.org/wiki/Trie

This structure is commonly used for tasks such as auto-completion, spell checking, etc.
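For illustration, a bare-bones trie might look like this in Rust (a sketch only, not a solution to the counting problem itself):

    use std::collections::HashMap;

    // Each node maps a character to a child node; a flag marks the end of a word.
    #[derive(Default)]
    struct Trie {
        children: HashMap<char, Trie>,
        is_word: bool,
    }

    impl Trie {
        fn insert(&mut self, word: &str) {
            let mut node = self;
            for c in word.chars() {
                node = node.children.entry(c).or_default();
            }
            node.is_word = true;
        }

        // Does some inserted word start with this prefix?
        // (This is the kind of query behind auto-completion.)
        fn has_prefix(&self, prefix: &str) -> bool {
            let mut node = self;
            for c in prefix.chars() {
                match node.children.get(&c) {
                    Some(next) => node = next,
                    None => return false,
                }
            }
            true
        }
    }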

  • I've already looked at it (it was literally the first thing that came to mind), but I couldn't come up with a solution using it...
    – jthulhu
    Commented Dec 22, 2020 at 10:07
  • What would you build it from? A trie is usually built from a string (or a set of strings), and basically allows you to check whether another given string is a substring of it (and, if not, it indicates the first character that is wrong). Here, what I am looking for is the words that are superstrings of every string in S.
    – jthulhu
    Commented Dec 22, 2020 at 10:15
