I'm coding a random strings generator. It's working, but I'm not sure if this is efficient.
/** * @param n total number of strings to generate * @param length length of generated strings * @param charsetStr charset to be used when generate */ public static List<String> generateRandomStrings(int n, int length, @NonNull String charsetStr) { if (n < 1 || length < 1 || charsetStr.length() < 1) { throw new IllegalArgumentException(String.format("Illegal argument(s): %d, %d, %s", n, length, charsetStr)); } //remove duplicate chars Set<Character> charset = new HashSet<>(charsetStr.length()); StringBuilder sb = new StringBuilder(); for (int i = 0; i < charsetStr.length(); i++) { char c = charsetStr.charAt(i); if (charset.add(c)) { sb.append(c); } } charsetStr = sb.toString(); BigDecimal caseNum = BigDecimal.valueOf(charsetStr.length()).pow(length); BigDecimal nBd = BigDecimal.valueOf(n); if (caseNum.compareTo(nBd) < 0) { throw new IllegalArgumentException( String.format("Number of possible strings cannot exceed the requested size: (length of '%s') ^ %d < %d", charsetStr, length, n)); } if (nBd.divide(caseNum, 1, RoundingMode.HALF_UP).compareTo(BigDecimal.valueOf(5, 1)) < 0) { //when fill factor is below 50% //generate until target size is reached Set<String> results = new HashSet<>(n); while (results.size() < n) { results.add(RandomStringUtils.random(length, charsetStr)); } return new ArrayList<>(results); } // when fill factor is above 50% // pre-generate all possible strings List<String> results = new ArrayList<>(caseNum.intValue()); generateAllPossibleStrings(results, "", charsetStr, length); // then shuffle the collection and pick target sized sublist Collections.shuffle(results); return results.subList(0, n); } public static void generateAllPossibleStrings(@NonNull List<String> results, @NonNull String prefix, @NonNull String charset, int length) { if (prefix.length() >= length) { results.add(prefix); return; } for (int i = 0; i < charset.length(); i++) { generateAllPossibleStrings(results, prefix + charset.charAt(i), charset, length); } }
Any advice would be grateful, but my main focus here is performance.
I'm using org.apache.commons.lang3.RandomStringUtils
to generate strings if n
is less than 50% of the number of all possible strings(let's call it "fill factor"). If not, to avoid unnecessary collisions, I generate all possible strings using generateAllPossibleStrings
, then shuffle the result and get a sublist from it.
The reasons I used 2 methods for generating strings are:
- If the fill factor is low(like 1~2%), using
generateAllPossibleStrings
seemed overkill. So I usedorg.apache.commons.lang3.RandomStringUtils
, hoping collisions not matter much. - If the fill factor is high(like more than 90%), using
org.apache.commons.lang3.RandomStringUtils
seems inefficient since a lot of collisions will occur as time goes by. So I usedgenerateAllPossibleStrings
in this case, dropping without any collision looks more efficient than generating.
There is no big reason why I choose 50% as a diverging point; I just used my hunch. This too may need to be fixed.
While I mostly wrote about the collision, I'm looking for any advice on performance enhancements. How can I make this more time-efficient?
sb
,n
, andnBd
. Thank you.\$\endgroup\$n
,length
andcharsetStr.length()
. But if you can put realistic bounds on these parameters, you may be able to avoid doing the math.\$\endgroup\$