UDP vs TCP
UDP is lightweight, stateless, and lossy.
TCP is heavy, stateful, and robust.
UDP is best when transmission errors can be ignored. TCP is best when transmission errors need correction. Voice over IP uses UDP because of how time-sensitive voice conversations are. It's better to just hear a hiccup and be back in sync than to ask for another copy of a lost packet and end up lagging further behind.
In your use case I'd hate to find myself stabbing at the phone trying to get it to move only to see it sporadically show my movements later in a burst. Keep up with me.
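To make the fire-and-forget idea concrete, here is a minimal sketch in Python (your real receiver would live in Unity, so treat this as an illustration only). The wire format, two 32-bit floats for x and y, and the function names are my own assumptions, not anything from your setup.

```python
import socket
import struct

# Hypothetical wire format: two little-endian 32-bit floats (x, y).
POSITION_FORMAT = "<ff"


def send_position(sock, address, x, y):
    """Fire and forget: no acknowledgement, no retry.

    If the datagram is lost, the next one simply supersedes it.
    """
    sock.sendto(struct.pack(POSITION_FORMAT, x, y), address)


def receive_position(sock):
    """Wait for one datagram and decode it as an (x, y) tuple."""
    data, _sender = sock.recvfrom(struct.calcsize(POSITION_FORMAT))
    return struct.unpack(POSITION_FORMAT, data)
```

Note there is no connection setup and nothing to tear down. The sender never learns whether a packet arrived, which is exactly the trade you want here.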
Frequency
Sending every 0.5ms (a tuning variable) is 2000Hz, which is frankly faster than most monitors' 60Hz-120Hz refresh rate. Adjusting this may solve some of your problems. Dropping to around 100Hz (one update every 10ms) should allow you to have about 20 times the people connected before the same problem shows up. Write your software so this number is decided in one place so you can adjust it easily to experiment with your real needs.
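A rough sketch of what "decided in one place" could look like, again in Python for illustration. The constant and function names are made up, and the capacity estimate assumes the server's load scales linearly with packets per second, which is a simplification.

```python
# The one place the send interval is decided. Everything else derives from it.
SEND_INTERVAL_S = 0.0005  # 0.5 ms, the figure from the question


def send_rate_hz(interval_s=SEND_INTERVAL_S):
    """Packets per second that one phone sends at the given interval."""
    return 1.0 / interval_s


def capacity_multiplier(new_interval_s, old_interval_s=SEND_INTERVAL_S):
    """Roughly how many times more senders fit in the same packet budget.

    Assumes load is proportional to packet rate, nothing more.
    """
    return new_interval_s / old_interval_s
```

With these, `send_rate_hz()` comes out at about 2000, and `capacity_multiplier(0.010)` at about 20, which is where the "20 times" estimate comes from.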
If you want to make a bigger impact than just 20 times as big, consider cutting down the overhead. Rather than continually transmitting one x and y per packet, try collecting a few samples and transmitting them together in a burst. The duration of the burst will add to your lag but it will be consistent. Keep it small enough and it will be barely perceptible.
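Here is one way the batching might look. The payload layout (a one-byte count followed by that many x, y float pairs) is an assumption I picked for the sketch, not a standard.

```python
import struct


def pack_burst(samples):
    """Pack a list of (x, y) samples into a single datagram payload.

    Assumed layout: 1-byte sample count, then count pairs of
    little-endian 32-bit floats.
    """
    if len(samples) > 255:
        raise ValueError("sample count must fit in one byte")
    payload = struct.pack("<B", len(samples))
    for x, y in samples:
        payload += struct.pack("<ff", x, y)
    return payload
```

The saving comes from the headers. Over IPv4, every UDP datagram carries at least 28 bytes of IP and UDP header, so ten single-sample packets spend 280 bytes on headers while one ten-sample packet spends 28.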
This idea would work well with frame buffering. Rather than having just one frame being rendered at a time, video games work on several frames at once and let them be consumed in order. If you have 10 frames buffered and a packet comes in with 10 sampled locations, you render each location to its own buffered frame.
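On the receiving side, spreading a burst across buffered frames could be sketched like this. The `FrameBuffer` class, the dictionary-per-frame representation, and the payload layout (one count byte, then float pairs) are all assumptions for illustration. In Unity you would do the equivalent with your own frame structures.

```python
import struct
from collections import deque


def unpack_burst(payload):
    """Decode a payload of: 1-byte count, then count (x, y) float32 pairs."""
    (count,) = struct.unpack_from("<B", payload, 0)
    return [struct.unpack_from("<ff", payload, 1 + 8 * i) for i in range(count)]


class FrameBuffer:
    """Queue of pending frames; each frame maps firefly id -> (x, y)."""

    def __init__(self):
        self.frames = deque()

    def enqueue_burst(self, firefly_id, samples):
        # Make sure there is one pending frame per sample in the burst.
        while len(self.frames) < len(samples):
            self.frames.append({})
        # Sample i goes into pending frame i, preserving order.
        for frame, position in zip(self.frames, samples):
            frame[firefly_id] = position

    def next_frame(self):
        """Consume the oldest pending frame (empty if nothing is buffered)."""
        return self.frames.popleft() if self.frames else {}
```

The renderer calls `next_frame()` once per tick and draws whatever is in it, so a burst of ten samples plays back as ten consecutive frames.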
Doing all this takes your 0.5ms rate to 0.1s and means your audience can be 200 times as big. That might not be acceptable lag for some video games but it should work fine for fireflies.
Some people might be connected but not actually moving their firefly at the moment. Some unreliable savings can be gained by having their smartphones keep quiet until they have something useful to say.
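A small sketch of the "keep quiet" rule on the phone side. The class name and the `min_move` threshold are hypothetical, and coordinates are assumed to be normalised to the 0..1 range.

```python
class QuietSender:
    """Decides whether a new position is worth transmitting."""

    def __init__(self, min_move=0.01):
        self.min_move = min_move  # assumed threshold, in screen fractions
        self.last_sent = None

    def should_send(self, x, y):
        if self.last_sent is None:
            # Always announce the first position.
            self.last_sent = (x, y)
            return True
        dx = x - self.last_sent[0]
        dy = y - self.last_sent[1]
        if dx * dx + dy * dy < self.min_move ** 2:
            return False  # too close to the last transmitted position
        self.last_sent = (x, y)
        return True
```

Comparing against the last position actually sent, rather than the last one sampled, means a slow drift still gets reported once it adds up to more than the threshold.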
UI
Your firefly game reminds me of a MythBusters episode where they tried to burn a ship by having a crowd hold mirrors in the sun. The biggest problem they had was that no one could tell whose reflection was whose. This meant they couldn't focus, because they couldn't tell whether they needed to move up, down, left, or right to be on target.
Consider giving out groups of colors. If I'm one green dot among hundreds I have no hope of following my dot. If I'm one of ten red dots I have a chance, even if there are 90 dots in other colors bouncing around. That assumes I'm not color blind (red/green is the most common form). Good color choices should be able to minimize this impact.
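One simple way to hand out colors is round-robin in join order, which keeps every color group about the same size. The sketch below is an assumption about how you might do it. The hex values are the commonly cited Okabe-Ito color-blind-friendly palette (minus black, which would vanish on a dark screen), but any palette you have checked yourself would do.

```python
# Okabe-Ito palette without black; swap in whatever palette you have tested.
PALETTE = [
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
]


def assign_color(join_index, palette=PALETTE):
    """Give the Nth player to join the Nth color, wrapping around."""
    return palette[join_index % len(palette)]
```

With 70 players and 7 colors, each player shares a color with only 9 others.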
Directly to the Unity app
You are basically making a server when you do this. You will need to listen on a port for users to connect.
Remember to clean up the old location before moving them to a new location if you keep them on screen even after they've stopped transmitting for the moment. If you do, you're stateful again and must remember to keep drawing them in place until they time out. A way to handle that is to make them keep their own state and transmit where they were before, so you can use that to clean up the old location.
Or you can only draw when told to draw. This is simpler but now packet loss turns into firefly flicker.
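If you go the stateful route, the bookkeeping might look something like this sketch. The `FireflyRegistry` name, the two-second timeout, and the injectable clock are all my own assumptions. The Unity version would be a dictionary checked once per frame.

```python
import time


class FireflyRegistry:
    """Remembers each firefly's last position until it times out."""

    def __init__(self, timeout_s=2.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed value; tune to taste
        self.clock = clock
        self.fireflies = {}  # id -> (x, y, last_seen)

    def update(self, firefly_id, x, y):
        """Record a fresh position from an incoming packet."""
        self.fireflies[firefly_id] = (x, y, self.clock())

    def expire(self):
        """Drop fireflies not heard from within the timeout; return their ids."""
        now = self.clock()
        stale = [fid for fid, (_x, _y, seen) in self.fireflies.items()
                 if now - seen > self.timeout_s]
        for fid in stale:
            del self.fireflies[fid]
        return stale

    def positions(self):
        """Positions to draw this frame, including currently silent fireflies."""
        return {fid: (x, y) for fid, (x, y, _seen) in self.fireflies.items()}
```

A dropped packet then costs nothing visible: the firefly just stays where it was until the next update arrives or the timeout removes it.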