Generate unique random strings considering collisions

Question

I'm coding a random strings generator. It's working, but I'm not sure if this is efficient.

/** * @param n total number of strings to generate * @param length length of generated strings * @param charsetStr charset to be used when generate */ public static List<String> generateRandomStrings(int n, int length, @NonNull String charsetStr) { if (n < 1 || length < 1 || charsetStr.length() < 1) { throw new IllegalArgumentException(String.format("Illegal argument(s): %d, %d, %s", n, length, charsetStr)); } //remove duplicate chars Set<Character> charset = new HashSet<>(charsetStr.length()); StringBuilder sb = new StringBuilder(); for (int i = 0; i < charsetStr.length(); i++) { char c = charsetStr.charAt(i); if (charset.add(c)) { sb.append(c); } } charsetStr = sb.toString(); BigDecimal caseNum = BigDecimal.valueOf(charsetStr.length()).pow(length); BigDecimal nBd = BigDecimal.valueOf(n); if (caseNum.compareTo(nBd) < 0) { throw new IllegalArgumentException( String.format("Number of possible strings cannot exceed the requested size: (length of '%s') ^ %d < %d", charsetStr, length, n)); } if (nBd.divide(caseNum, 1, RoundingMode.HALF_UP).compareTo(BigDecimal.valueOf(5, 1)) < 0) { //when fill factor is below 50% //generate until target size is reached Set<String> results = new HashSet<>(n); while (results.size() < n) { results.add(RandomStringUtils.random(length, charsetStr)); } return new ArrayList<>(results); } // when fill factor is above 50% // pre-generate all possible strings List<String> results = new ArrayList<>(caseNum.intValue()); generateAllPossibleStrings(results, "", charsetStr, length); // then shuffle the collection and pick target sized sublist Collections.shuffle(results); return results.subList(0, n); } public static void generateAllPossibleStrings(@NonNull List<String> results, @NonNull String prefix, @NonNull String charset, int length) { if (prefix.length() >= length) { results.add(prefix); return; } for (int i = 0; i < charset.length(); i++) { generateAllPossibleStrings(results, prefix + charset.charAt(i), charset, length); } }

Any advice would be grateful, but my main focus here is performance.

I'm using org.apache.commons.lang3.RandomStringUtils to generate strings if n is less than 50% of the number of all possible strings(let's call it "fill factor"). If not, to avoid unnecessary collisions, I generate all possible strings using generateAllPossibleStrings, then shuffle the result and get a sublist from it.

The reasons I used 2 methods for generating strings are:

If the fill factor is low(like 1~2%), using generateAllPossibleStrings seemed overkill. So I used org.apache.commons.lang3.RandomStringUtils, hoping collisions not matter much.
If the fill factor is high(like more than 90%), using org.apache.commons.lang3.RandomStringUtils seems inefficient since a lot of collisions will occur as time goes by. So I used generateAllPossibleStrings in this case, dropping without any collision looks more efficient than generating.

There is no big reason why I choose 50% as a diverging point; I just used my hunch. This too may need to be fixed.

While I mostly wrote about the collision, I'm looking for any advice on performance enhancements. How can I make this more time-efficient?

@vnp A wild guess of string length is about 10~15, and max charset is alphanumeric(62). Maybe I need to limit the max string length, not sure. — user2652379, CommentedNov 18, 2020 at 4:08
Without doing a full review. You should not shorten names because you can, full names will always make the code more readable, even if they are longer. They also have the upside that lines without context becoming much more readable. — Bobby, CommentedNov 19, 2020 at 17:01
@Bobby I'm assuming you're talking about sb, n, and nBd. Thank you. — user2652379, CommentedNov 20, 2020 at 1:13
You seem to be asking for ideas on how to pick the "diverging point". That's really a math problem, and the exact answer will be a formula that depends on n, length and charsetStr.length(). But if you can put realistic bounds on these parameters, you may be able to avoid doing the math. — Stephen C, CommentedMay 2, 2021 at 3:32

Martin Spamer · Accepted Answer · 2024-07-12 10:21:42Z

Your code broadly follows the Java conventions.
Your objective is unclear, why do you need random strings? If they are for hashkeys there are plenty of strong methods for generating these. If you need simple tokens then is there any reason you cannot use random UUIDs?
Your code is named using the programming domain, not the problem domain. Naming things relevantly improves maintainability and the potential for reuse. So from your response comment below, generateRandomStrings should be called generateCouponCodes.
Use Unit tests, they can be used to gauge performance for optimisations, but avoid premature optimisation.
Static methods are an anti-pattern with OOP and should be avoided except in specific circumstances that do not apply here.

Thank you. About the objective, it's the requirements from our client for generating their coupon codes per coupon type. A coupon type and a coupon code together must be unique(composite key). But the coupon type is out of the scope of this question. Maybe I can use UUIDs. — user2652379, CommentedJun 17, 2021 at 2:00
Oh, and the client should be able to choose charsets(numbers, lower cases, upper cases) for each code generation. Maybe that requirement could be changed, but it's too late for that. Already shipped. — user2652379, CommentedJun 17, 2021 at 2:10

Stack Exchange Network

Generate unique random strings considering collisions

1 Answer 1

Hot Network Questions

Generate unique random strings considering collisions

1 Answer 1

Related

Hot Network Questions