\$\begingroup\$

I have a dictionary with a lot of symbols, each of which is encoded as a Huffman binary string.

Example:

Symbol   Huffman Code
you      010
shall    0111
not      00111
pass     00001
...      ...

Therefore I encode the sentence "you shall not pass" like this: "01001110011100001".

Now I would like to divide the binary string into chunks of six characters and interpret each sextuplet as a character according to the base64 encoding. If the length of the string is not divisible by 6, I simply append zeros at the end until it is.

So "you shall not pass" would be represented like this: 010011 100111 000010 which in base64 should be just TnC.

To do this I managed to write the following PHP code, but I suspect that it could be done in a more elegant way, using a more efficient built-in function. I was looking for something like base64_encode(), but if I try base64_encode("01001110011100001"), the function treats each character as one byte and outputs "MDEwMDExMTAwMTExMDAwMDE=". Could you give me some suggestions, please?

<?php
function binaryToChars($binarystring){
    static $chars = array(
        "000000"=>"A","010000"=>"Q","100000"=>"g","110000"=>"w",
        "000001"=>"B","010001"=>"R","100001"=>"h","110001"=>"x",
        "000010"=>"C","010010"=>"S","100010"=>"i","110010"=>"y",
        "000011"=>"D","010011"=>"T","100011"=>"j","110011"=>"z",
        "000100"=>"E","010100"=>"U","100100"=>"k","110100"=>"0",
        "000101"=>"F","010101"=>"V","100101"=>"l","110101"=>"1",
        "000110"=>"G","010110"=>"W","100110"=>"m","110110"=>"2",
        "000111"=>"H","010111"=>"X","100111"=>"n","110111"=>"3",
        "001000"=>"I","011000"=>"Y","101000"=>"o","111000"=>"4",
        "001001"=>"J","011001"=>"Z","101001"=>"p","111001"=>"5",
        "001010"=>"K","011010"=>"a","101010"=>"q","111010"=>"6",
        "001011"=>"L","011011"=>"b","101011"=>"r","111011"=>"7",
        "001100"=>"M","011100"=>"c","101100"=>"s","111100"=>"8",
        "001101"=>"N","011101"=>"d","101101"=>"t","111101"=>"9",
        "001110"=>"O","011110"=>"e","101110"=>"u","111110"=>"+",
        "001111"=>"P","011111"=>"f","101111"=>"v","111111"=>"/");
    $result = "";
    while (strlen($binarystring)%6!=0){$binarystring .= "0";}
    while (strlen($binarystring)>0){
        $substr = substr($binarystring,0,6);
        if (array_key_exists($substr, $chars)){$result .= $chars[$substr]; $binarystring = substr($binarystring,6);}
        else{throw new Exception("The variable is not a binary string.");}
    }
    return $result;
}
?>
\$\endgroup\$
  • \$\begingroup\$Hmmm... Homebrew data compression... A fixed 'palette' of words encoded with a few bits... With 18 bits (3x6), one could number 262,144 individual messages, some as complex as "It's the weekend, so I'm not answering calls" or more specific. The English alphabet has 'Q' and 'X', but I don't think there's a word using both Q and X (i.e., some token combinations have no value). OP has encoded "shall" with 4 bits and "not" with 5... It's unlikely this represents optimal accounting for "frequency of use"...\$\endgroup\$
    – Fe2O3
    Commented Apr 11 at 3:09
  • \$\begingroup\$The example using words of sentences was just to simplify the question! I'm not really trying to compress plaintext. I'm actually doing it for chess positions. My "words" are blocks of pieces on the board.\$\endgroup\$
    – Benzio
    Commented Apr 15 at 21:09
  •
    \$\begingroup\$I’m voting to close this question because OP admits to deliberate misrepresentation of their true intent and purpose.\$\endgroup\$
    – Fe2O3
    Commented Apr 15 at 21:51
  • \$\begingroup\$I suggest not to use strings of characters to represent binary values: indexing instead of key matching.\$\endgroup\$
    – greybeard
    Commented Apr 18 at 17:31

2 Answers

3
\$\begingroup\$

array_key_exists vs. isset

if (array_key_exists($substr, $chars)){ 

You almost never want to use array_key_exists in PHP.

if (isset($chars[$substr])) { 

does the same thing faster unless null is a valid value for which you want to return true.
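To make the difference concrete, here is a minimal sketch. The `$table` array and its null entry are hypothetical and exist only to show where the two checks disagree; the question's table has no null values, so either check would behave the same there.

```php
<?php
// Hypothetical lookup table; the null entry exists only to show the difference.
$table = ['010011' => 'T', '000000' => null];

$existsHit  = array_key_exists('010011', $table); // key present, value 'T'
$issetHit   = isset($table['010011']);            // key present, value not null

$existsNull = array_key_exists('000000', $table); // key present, so true
$issetNull  = isset($table['000000']);            // value is null, so false

$issetMiss  = isset($table['2']);                 // no such key, so false
```

Because every value in the question's table is a one-character string, the null case cannot occur and isset is a safe substitute.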

String padding

while (strlen($binarystring)%6!=0){$binarystring .= "0";} 

That's about as efficient a way as you'll find for fewer than six characters. I'd write it with more whitespace though.

while (strlen($binarystring) % 6 != 0) { $binarystring .= '0'; } 

An alternative would be

if (strlen($binarystring) % 6 != 0) {
    $binarystring = str_pad($binarystring, strlen($binarystring) + 6 - strlen($binarystring) % 6, '0');
}

I'm not sure that's more efficient though. It would be if you might add more characters.
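As a sketch of how the str_pad variant could be packaged, the helper below pads a bit string on the right up to the next multiple of a given size. The name `padToMultiple` and its parameters are made up for illustration. Note that str_pad returns the padded string rather than modifying its argument, so the result has to be assigned or returned.

```php
<?php
// Hypothetical helper: right-pad a bit string with '0' up to a multiple of $size.
function padToMultiple(string $bits, int $size): string
{
    $remainder = strlen($bits) % $size;
    if ($remainder === 0) {
        return $bits; // already aligned, nothing to add
    }
    // str_pad pads on the right by default and returns a new string.
    return str_pad($bits, strlen($bits) + $size - $remainder, '0');
}

$padded = padToMultiple('01001110011100001', 6); // 17 bits padded to 18
```

The same helper covers the eight-bit alignment that the byte-based approach needs, by passing 8 as the size.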

base64_encode

The rest of your code is where you could make it simpler. Consider

$excess_count = strlen($binarystring) % 8;
if ($excess_count != 0) {
    $binarystring = str_pad($binarystring, strlen($binarystring) + 8 - $excess_count, '0');
}

$binary_data = '';
for ($i = 0; $i < strlen($binarystring); $i += 8) {
    $binary_data .= chr(bindec(substr($binarystring, $i, 8)));
}

$encoded = base64_encode($binary_data);

Then you don't have to generate the string manually. You can check the string to make sure that it's composed only of 0s and 1s if you want. You'd do that before this code.

The zero padding is needed in this case: without it, bindec would read a final chunk shorter than eight bits as the low-order bits of a byte, which changes its value.

The bindec converts from a string to a number (decbin reverses it). The chr converts from a number to a character in a string (the character might be unprintable; ord reverses it). Concatenating gives you a string of these characters (str_split reverses it so that you can use foreach on the resulting array). The base64_encode converts from binary data that may include unprintable characters to only printable characters (base64_decode reverses it).
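Putting those pieces together, a minimal sketch might look like the following. The function name `bitsToBase64` is hypothetical, and the validation via preg_match is an assumption about how one might do the check mentioned above.

```php
<?php
// Hypothetical wrapper around the byte-packing approach; the name is illustrative.
function bitsToBase64(string $bits): string
{
    // Reject anything that is not made up solely of 0s and 1s.
    if (preg_match('/^[01]*\z/', $bits) !== 1) {
        throw new InvalidArgumentException('The variable is not a binary string.');
    }
    if ($bits === '') {
        return '';
    }

    // Right-pad with zeros to a whole number of bytes.
    $remainder = strlen($bits) % 8;
    if ($remainder !== 0) {
        $bits = str_pad($bits, strlen($bits) + 8 - $remainder, '0');
    }

    // Pack each group of eight bits into one raw byte.
    $bytes = '';
    foreach (str_split($bits, 8) as $octet) {
        $bytes .= chr(bindec($octet));
    }

    return base64_encode($bytes);
}

$encoded = bitsToBase64('01001110011100001'); // 'TnCA'
```

The result is not identical to the question's output: padding to whole bytes instead of sextets gives "TnCA" rather than "TnC", and inputs that do not fill a multiple of three bytes will end in "=" padding. If the exact original format matters, keeping only the first ceil(n / 6) characters of the result should recover it, since base64 consumes the same bit stream six bits at a time.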

\$\endgroup\$
    \$\begingroup\$

    I suspect that it could be done in some more elegant way, using some more efficient built-in function

    One could:

    • convert binary numbers to decimal using bindec()
    • convert the decimal number to an ASCII character using chr()
      • because the ASCII characters are arranged a bit differently, one would need to add to or subtract from the value returned by bindec() to get to the correct spot, and likely just have a special case for the two non-alphanumeric characters, i.e. + and /

    That would allow for the elimination of static $chars. One could use a function like preg_match() to determine if non-binary characters are present in the string (i.e. to throw the exception).
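A rough sketch of that idea follows. The function names `sextetToChar` and `binaryToCharsViaChr` are hypothetical; the offsets come from the ASCII codes of 'A' (65), 'a' (97) and '0' (48) relative to the base64 indices 0, 26 and 52.

```php
<?php
// Hypothetical helper: map a 6-bit value (0..63) to its base64 character.
function sextetToChar(int $value): string
{
    if ($value < 26) {
        return chr($value + 65); // 0..25  -> 'A'..'Z'
    }
    if ($value < 52) {
        return chr($value + 71); // 26..51 -> 'a'..'z'
    }
    if ($value < 62) {
        return chr($value - 4);  // 52..61 -> '0'..'9'
    }
    return $value === 62 ? '+' : '/'; // the two special cases
}

// Hypothetical replacement for the table-driven function in the question.
function binaryToCharsViaChr(string $binarystring): string
{
    if (preg_match('/^[01]*\z/', $binarystring) !== 1) {
        throw new Exception('The variable is not a binary string.');
    }

    // Right-pad with zeros to a multiple of six bits.
    $remainder = strlen($binarystring) % 6;
    if ($remainder !== 0) {
        $binarystring = str_pad($binarystring, strlen($binarystring) + 6 - $remainder, '0');
    }

    $result = '';
    $length = strlen($binarystring);
    for ($i = 0; $i < $length; $i += 6) {
        $result .= sextetToChar((int) bindec(substr($binarystring, $i, 6)));
    }
    return $result;
}

$result = binaryToCharsViaChr('01001110011100001'); // 'TnC'
```

Stepping an index through the string also avoids repeatedly copying the remainder with substr($binarystring, 6), as the original loop does.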

    Review

    Consider making whitespace more consistent

    A space is added after some keywords like if but not else. For the sake of readability, it would be good to have a space after each keyword, as well as around each binary operator. For idiomatic PHP consider following recommendations in PSR-12.

    Closing PHP tag can be eliminated

    At the end of the PHP file there is a closing PHP tag:

    ?> 

    Per the PHP documentation:

    If a file contains only PHP code, it is preferable to omit the PHP closing tag at the end of the file. This prevents accidental whitespace or new lines being added after the PHP closing tag, which may cause unwanted effects because PHP will start output buffering when there is no intention from the programmer to send any output at that point in the script. 1

    $chars could be a constant

    If you continue using $chars it may be worth making it a constant instead of a static variable. That way it can't be modified at run-time.
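As an illustration only, the table could be declared with const at file level (or as a class constant). The name `BASE64_CHARS` and the helper `lookupSextet` are hypothetical, and the table is abbreviated to the three entries the question's example needs; the real one would list all 64 keys.

```php
<?php
// Abbreviated for illustration: only the entries the example needs.
// The full table would list all 64 six-bit keys from the question.
const BASE64_CHARS = [
    '000010' => 'C',
    '010011' => 'T',
    '100111' => 'n',
];

// Hypothetical helper: look up one sextet, rejecting unknown keys.
function lookupSextet(string $sextet): string
{
    $char = BASE64_CHARS[$sextet] ?? null;
    if ($char === null) {
        throw new Exception('The variable is not a binary string.');
    }
    return $char;
}

$result = lookupSextet('010011') . lookupSextet('100111') . lookupSextet('000010');
```

A constant declared this way cannot be reassigned or have elements changed later, which is the protection a static variable does not give.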

    \$\endgroup\$
