
Collision Performance #4

Open

AmityWilder wants to merge 10 commits into anadon:master from AmityWilder:collision-perf

Conversation

@AmityWilder commented Feb 8, 2026

  • Copy element Sets to temporary arrays and ArrayLists, since they will be iterated over far more than they will be searched within the overlap method.
  • Implement a bounding box for the selection and remove any elements not overlapping the bounding box from the temporary unselected element ArrayList before any further checks.
    • Use swap-removal since order doesn't matter, making the removal worst case $O(1)$ instead of $O(n)$
  • Crunch the $O(n^3)$ section in the "else" case for simple intersections down to $O(n^2)$. That part appears to check every element against every element for each selected element, despite only needing to involve the selected element and all other elements, not the selected element and all pairs of other elements.
    i.e.

```java
for (Element sel : selected) {          // O(n)
  for (Element el : elementsArr) {      // O(n^2)
    if (intersecting) { ... }
    else {
      for (Element elm : elementsArr) { // O(n^3)
        if (sel ? elm) // no mention of `el`
```

$$\large\Downarrow$$

```java
for (Element sel : selected) {       // O(n)
  for (Element el : elementsArr) {   // O(n^2)
    if (intersecting) { ... }
  }
  if (none intersected) {
    for (Element el : elementsArr) { // O(n^2)
      if (sel ? el) // ...
```
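The same restructuring can be shown as a self-contained, runnable sketch. The `Box` record and its `intersects` method are stand-ins for illustration, not JLS's actual `Element` API:

```java
import java.util.List;

public class OverlapSketch {
    // Hypothetical stand-in for JLS's Element: an axis-aligned box.
    record Box(int x, int y, int w, int h) {
        boolean intersects(Box o) {
            return x < o.x + o.w && o.x < x + w
                && y < o.y + o.h && o.y < y + h;
        }
    }

    // Restructured check: the "else" scan runs once per selected element
    // (O(n^2) total) instead of once per element pair (O(n^3) total).
    static int countSelectedWithNoIntersection(List<Box> selected, List<Box> elements) {
        int lonely = 0;
        for (Box sel : selected) {
            boolean anyIntersected = false;
            for (Box el : elements) {
                if (sel != el && sel.intersects(el)) {
                    anyIntersected = true;
                    break; // short-circuit once any intersection is found
                }
            }
            if (!anyIntersected) {
                lonely++; // previously, a third nested loop re-scanned all elements here
            }
        }
        return lonely;
    }
}
```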

Related to #3

Collections was imported for swap remove, before realizing swapping wasn't needed and copying would suffice.
…loop

I'm not sure that "for (sel : selected) { for (el : elements) { for (elm : elements) { /* only acknowledging sel and elm*/ } } }" makes sense. Perhaps at some point in development, the "sel.intersects(el)" may have been mistaken for a bulk test of all the elements?
The way it was written prior to this commit, it would check all elements against all elements for every selected element, turning an O(n^2) operation into an O(n^3) operation with n duplicated passes (if I am understanding the intent of the code correctly).
@AmityWilder (Author)

Apologies for my IDE removing all the trailing whitespace in the files I edited, I didn't notice it had done that until creating this PR.

Also, feel free to remove (or ask me to remove, if that's easier) the edits to the README.md; I understand that may be out of scope for this.

@anadon (Owner) commented Feb 8, 2026 via email

@AmityWilder (Author) commented Feb 9, 2026

My Computer Organization and Algorithm Engineering professor, Dr. Lalejini, put me in contact with Dr. Kurmas over email after I offered to look into the collision performance. JLS was running rather slowly when I tried building a 16-bit adder without using subcircuits.

I promise I did not use AI (I personally hate the stuff and 100% understand your apprehension), but I don't exactly have a 5hr recording of myself writing the code. How would you like me to prove this wasn't AI generated? I can explain every change and every step of my process in depth if that would work.

@anadon (Owner) commented Feb 9, 2026 via email

@bsiever commented Feb 10, 2026

@anadon FWIW, I created a fork of your work, made some updates, and continue to use JLS in classes (https://github.com/bsiever/JLS/) --- your work definitely wasn't for naught!

```java
int i = 0;
while (i < elementsArr.size()) {
  if (selected.contains(elementsArr.get(i))) {
    // "swap" remove - slightly cheaper because we don't care about order
```
@anadon (Owner):

If you're very worried about performance for this operation, a linked list has better cache locality.

@AmityWilder (Author) Feb 23, 2026:

I'm a bit confused how that can be true...

An array has all of its elements next to each other in memory, so the entire surrounding chunk of memory can be pulled into memory at once and contain those adjacent elements.

A linked list has pointers to adjacent elements, meaning it couldn't be cached without dereferencing those pointers--which in addition to being extra operations (and slow ones at that: loading data from RAM) may be unsafe to do speculatively since the CPU can't be certain those pointers belong to the program, or are even valid memory addresses.

Am I incorrect?

@anadon (Owner):

You'd be almost right if this were C; less so for Java. In a traditional singly or doubly linked list, each node has one to three pointers: one for the next node, optionally one for the previous node, and the value either embedded or behind a pointer to it. However, optimized implementations keep these nodes densely packed in an array or something more exotic.

So you're comparing a long jump, needed to cross 2+ GB changes in virtual or physical address (we're skipping the kernel considerations for now), against a standard jump in a correctly optimized implementation. Those long jumps are, practically, cache-busters. Intel has some fancy stuff in its pre-fetcher, but relying on that from a programmer's perspective is faulty at best, meaning you should count on missing L1, L2, L3 (and sometimes L4) cache, hitting RAM, and killing the performance.

Now, when swapping to the end of a dense array, you're busting this cache on every swap just by virtue of touching the end of the array while you were just working on the beginning. With Java's presumably optimized implementation, however, these removals only ever touch bytes within one or two words of each other, so the cache characteristics are great. Even if you're performing 5x the operations, the fact that the CPU sees you working on adjacent memory allows simpler pre-fetch algorithms to fetch that memory ahead of time and avoid the cache misses in the first place. And this is without the extremely optimized data structures which are probably actually in use, which are going to have even better characteristics. For these reasons, trust the standard library. Neither of us is going to out-smart it.

Once you hit system organization, operating systems, and compilers you'll formally be introduced to these topics if your profs are good.

One tip for compilers -- learn how to use stack machines. They aren't always covered but they allow you to entirely skip register management and it makes your life immensely easier.
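The removal trade-off being debated above can be checked empirically with a crude timing sketch. This is a naive harness (no JMH warmup or dead-code control, so treat the numbers as a rough signal only), and `swapRemove` is an assumed helper, not JLS code:

```java
import java.util.ArrayList;

public class RemovalBench {
    // Swap-remove: move the tail element into slot i, then drop the tail.
    // O(1) per call, at the cost of element order.
    static <T> void swapRemove(ArrayList<T> list, int i) {
        T tail = list.remove(list.size() - 1);
        if (i < list.size()) list.set(i, tail);
    }

    // Crude wall-clock timing; a JMH benchmark would handle warmup and
    // JIT effects properly. This only gives a ballpark figure.
    static long timeNanos(Runnable body) {
        long t0 = System.nanoTime();
        body.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 20_000;
        ArrayList<Integer> a = new ArrayList<>(), b = new ArrayList<>();
        for (int i = 0; i < n; i++) { a.add(i); b.add(i); }
        // Shift-remove from the front is O(n) per call; swap-remove is O(1).
        long shift = timeNanos(() -> { while (!a.isEmpty()) a.remove(0); });
        long swap  = timeNanos(() -> { while (!b.isEmpty()) swapRemove(b, 0); });
        System.out.println("shift-remove: " + shift + " ns, swap-remove: " + swap + " ns");
    }
}
```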

@AmityWilder (Author) Feb 24, 2026:

I see! I hadn't realized at all that accessing the end is what causes the slowdown (and I had NO idea about any of those linked-list optimizations), but it makes perfect sense now that you've explained it!

I'll try benchmarking this against an alternative linked-list implementation and keep whichever is faster. However, I'm not entirely sure how to benchmark this code without its speed being influenced by the speed of my own movements. As far as I can tell, there isn't a good way of isolating the overlap code and running it with a variety of specific circuits; I would have to trigger the overlap method manually. Do you have any suggestions?
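One possible shape for such an isolated benchmark: generate deterministic synthetic "circuits" of increasing size and time the check against each, removing mouse input from the picture entirely. The `Box` type and the `Consumer` wrapper are assumptions for the sketch; the real harness would wrap a call into JLS's actual overlap method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

public class OverlapBench {
    // Stand-in for an element's footprint.
    record Box(int x, int y, int w, int h) {}

    // Deterministic pseudo-random "circuit" of n boxes: the fixed seed
    // makes every run repeatable, independent of user interaction.
    static List<Box> randomBoxes(int n, long seed) {
        Random rng = new Random(seed);
        List<Box> boxes = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            boxes.add(new Box(rng.nextInt(1000), rng.nextInt(1000), 10, 10));
        }
        return boxes;
    }

    // Time `check` against growing inputs; the real overlap method would
    // be invoked inside the Consumer.
    static void sweep(Consumer<List<Box>> check) {
        for (int n : new int[]{100, 1_000, 10_000}) {
            List<Box> boxes = randomBoxes(n, 42);
            long t0 = System.nanoTime();
            check.accept(boxes);
            long dt = System.nanoTime() - t0;
            System.out.println("n=" + n + ": " + dt + " ns");
        }
    }
}
```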

```java
// so it makes sense to use a temporary array for cache locality and O(n) traversal.
Element[] selectedArr = selected.toArray(Element[]::new);
ArrayList<Element> elementsArr = new ArrayList<>(circuit.getElements());
// elementsArr.removeAll(selected);
```
@anadon (Owner):

These top level functions often have significant hidden optimizations to them. I'd benchmark before doing this and claiming this is the part that improves performance.

```java
int ymax = Integer.MIN_VALUE;

// build up a bounding box
for (Element sel : selectedArr) {
```
@anadon (Owner):

This is incorrect for concave geometries.

@anadon (Owner):

Further on this point: say a student wants to work on a component involving several parts of a CPU pipeline in a ring-like layout on the diagram, and then they want to rejoin it to the full CPU area. It is most likely that the lines do validly connect, but with this, the rectangularized area of each will indicate a false positive for overlap and reject the drop.

Rather than this approach, you should be looking at partitioning algorithms over non-zero-area elements. You're likely to find something which generates a layout in O(n log(n)) time. If you then couple this with a de-bouncer and an early-termination signal for a separate thread, you'll be able to move the selected elements freely and only allow placement after the thread for the current position terminates and indicates success.

@AmityWilder (Author):

I have encountered such a false-negative on this branch myself, so I can confirm with 100% certainty that you are correct.

For now, I have removed the bounding box part of the code in 9275399. I can try working on your suggested alternative to a bounding box once I'm sure everything else is correct.

@AmityWilder (Author):

I apologize: you were talking about a false positive, and I encountered a false negative and mistook it for what you were saying.

The false negative I encountered was that, when selecting wires, the user could place them in such a way that they overlap a wire or pin that does not overlap the selection bounding box.

Either way, I believe the bounding box check may be a bust and/or need work.

@anadon (Owner) commented Feb 17, 2026

Thinking further, I think the move-and-drop implementation should take advantage of the fact that elements are voxels in 2D space, and detect their collisions by generating a sorted list for each of: the non-moving elements' X-values, the moving elements' X-values, the non-moving elements' Y-values, and the moving elements' Y-values. Use counting sort. Have two threads with kill logic handle the X and Y dimensions. Use a transform of the granularized cursor coordinates to detect movements and launch threads with the correct parameters.

If the line segments for the elements in each merged array overlap for the same element, there is a collision. Segment collision detection is trivial. These sorted lists will need to be keyed on the integer coordinate value, with a value of a list containing references to element objects. This should yield O(n) performance.
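The counting sort mentioned above can be sketched generically over non-negative integer grid coordinates; this is a textbook version, not tied to JLS's element lists:

```java
public class CountingSortSketch {
    // Counting sort over coordinates in [0, k): O(n + k), stable.
    static int[] countingSort(int[] values, int k) {
        int[] count = new int[k + 1];
        for (int v : values) count[v + 1]++;                  // histogram, offset by one
        for (int i = 0; i < k; i++) count[i + 1] += count[i]; // prefix sums = start offsets
        int[] out = new int[values.length];
        for (int v : values) out[count[v]++] = v;             // place each value at its slot
        return out;
    }
}
```

For keyed records (coordinate plus element reference), the same prefix-sum placement applies; only the final pass copies whole records instead of bare ints.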

@anadon (Owner) commented Feb 19, 2026

@AmityWilder Are you about?

@AmityWilder (Author) commented Feb 20, 2026

Sorry, I've been getting pulled in a lot of directions lately and I wanted to research your response before giving a yes or no.

@AmityWilder (Author) commented Feb 20, 2026

My immediate thought is whether two $O(n \log(n))$ sortings (one for $x$ and one for $y$) replacing the $O(1)$ insertion and $O(1)$ translation in exchange for an $O(n)$ overlap detection would be more performant on average than an $O(n^2)$ overlap detection.

I imagine, given the way students are encouraged to prefer smaller circuits containing subcircuits, most might not exceed a value of $n$ that makes an $O(n^2)$ overlap detection that significantly more expensive than an $O(n \log(n))+O(n)$ translation+overlap detection.

Plus, the bigger part from my perspective (as the person who will possibly be implementing this):
Getting the existing algorithm down from $O(n^3)$ to $O(n^2)$ was a surprisingly small refactor with no tradeoffs. It only involved noticing that a particular chunk of code was in the wrong place. It didn't involve changing any data structures or trading complexity between two things--just a direct reduction of complexity with minimal changes to existing code.

Your suggestion for changing it from $O(n^2)$ to $O(n)$ will certainly require modifying at least one data structure (to store the sorted lists; creating them on the spot in overlap() would make the complexity at minimum that of the sorting algorithm, if it is more complex than the overlap detection itself). It will also likely require a full audit to make sure there are no cases where Elements can be added, removed, or transformed without the sorted lists being updated (which I can tell you from experience will make maintenance harder, since you can't just audit once and trust that future changes will never cause the cached position map to desync). And that's on top of essentially having to implement a wholly different algorithm, without unit tests (due to the lack of a framework), rather than just tweaking the algorithm that's already there.

I'm not saying no, I just want to make sure it's worth it before spending what could be an entire week on what could very well amount to a minimal, possibly even negative performance change.
I may be 100% wrong. Maybe it would triple performance in all cases. Maybe I'm misinterpreting you, or maybe it would be a rather trivial change with great value. I just don't know.

That's why I want to give it some extra thought before jumping in headfirst. It just happens that I'm having to wait for the right time to research it: we're approaching the 7th week of this 14-week semester, I have 5 classes, I've been doing bug fixes on Bill Siever's repo, and Dr. Kurmas has suggested some additional projects for me.

I will absolutely look into this when I have the chance. I apologize for not answering sooner, your previous reply was automatically marked as read when I read it but I had to leave for class before I could get a response back to you, and then I forgot to reply when I got home.

@AmityWilder (Author) commented Feb 20, 2026

My bad, I misremembered parts of your response when I made mine. You suggested using specifically counting sort ($O(n+k)$) and I was only thinking of sort in general (often $O(n\log(n))$).

However, I think there is then the problem of how to sort and store segments. I've run into this problem before: they aren't single values, they're a combination of a start and an end. Is a segment that starts earlier but ends later "less than" or "greater than" a segment that starts later but ends earlier? Does it "eat" the contained segment? Does a segment that starts and ends earlier get merged with a segment whose start intersects it but which ends later?

And how do we handle the fact that there isn't just one $x$ lane and one $y$ lane, but one for every gridspace? Do we store them in a 2D array, or reuse two 1D arrays for each moving element? Do we sort the entire non-moving set for each moving element? Filter the ones overlapping their axes? That could be incredibly expensive: transforming an $O(n^2)$ algorithm into an $O(2n(2n+k))$ one.

```
for each element a:
  select elements b
    where b.xrange overlaps a.xrange
    order by b.xrange
  select elements c
    where c.yrange overlaps a.yrange
    order by c.yrange
  if any b or c overlaps a:
    do remaining tests
  else:
    do other tests
```

It might be a better idea to use a quad tree. I have yet to implement one successfully, especially when wires are involved, but I think it would be more likely to improve performance than sorting segments every tick, since a quad tree has inherent locality and positional sorting, and there is empirical evidence that they boost performance in video games. It would be just as much work to implement, however.

@AmityWilder (Author) commented Feb 20, 2026

I've been avoiding your suggestion regarding multithreading, because

  1. I'm not familiar enough with Java to use multithreading reliably.
  2. With a small enough quantity of Elements, asking the OS for extra threads and waiting for one to free up could actually slow the algorithm down. Since JLS is not "industry accepted", as you mentioned in your first reply to this PR, I believe most users will be working with hundreds to thousands of elements, not hundreds of thousands. (I'm not saying this definitively; it's another point I will need to research.)

@anadon (Owner) commented Feb 20, 2026

Segment overlap is actually shockingly easy! The gist is that each start pushes onto a stack, and each end marks a collision from the top of the stack to the related start. Then remove that start and continue pushing, marking, and removing.
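A self-contained sketch of a 1D segment-overlap sweep along these lines, using an active list rather than a strict stack (a common variant of the push/mark/remove scheme described; the `Seg` type is an assumption for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentSweep {
    record Seg(int id, int start, int end) {}

    // Sweep over sorted endpoints with an active set: every segment that
    // starts while others are active overlaps each of them. O(n log n + m)
    // for n segments and m reported overlap pairs.
    static List<int[]> overlaps(List<Seg> segs) {
        record Event(int pos, boolean isStart, Seg seg) {}
        List<Event> events = new ArrayList<>();
        for (Seg s : segs) {
            events.add(new Event(s.start(), true, s));
            events.add(new Event(s.end(), false, s));
        }
        // Sort by position; on ties, process ends before starts so touching
        // endpoints don't count as overlap (flip the order if they should).
        events.sort((a, b) -> a.pos() != b.pos()
                ? Integer.compare(a.pos(), b.pos())
                : Boolean.compare(a.isStart(), b.isStart()));
        List<Seg> active = new ArrayList<>();
        List<int[]> result = new ArrayList<>();
        for (Event e : events) {
            if (e.isStart()) {
                for (Seg other : active) result.add(new int[]{other.id(), e.seg().id()});
                active.add(e.seg());
            } else {
                active.remove(e.seg());
            }
        }
        return result;
    }
}
```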

Quad trees are plausible, but I'm not that familiar with their details. I understand the complications with multi-threading. While they have some great benefits with responsiveness, I'd drop them until you have a more optimal single-threaded solution.

Lastly, I understand that these are more work, but the false rejection on concave geometries is a blocking bug.

If you need some help, guidance, or to be taught some of these algorithms let me know! I understand your class-load. Been there.

@AmityWilder (Author)

I could simply remove the bounding box part for now, if that's what you mean by "false rejection on concave geometries". The bigger speed boost came from the $O(n^3) \to O(n^2)$ change.

If that's not what you mean, could you please clarify?

@AmityWilder (Author) commented Feb 20, 2026

The details of a quad tree are that it's much like a binary tree, but in 2D. You split the usable area into quadrants, then separate elements by "above/below the y-midpoint" and "left/right of the x-midpoint". You keep splitting the space until each leaf holds only a manageable number of elements, leaving you with $O(\log(n))$ complexity when searching for nearby elements ($O(n\log(n))$ to visit each element and find the elements within touching range, as opposed to $O(n^2)$ to filter out all non-overlapping elements).

The trouble I have with them is that wires have a tendency of occupying multiple leaves at once...
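A minimal point quadtree along the lines described, as a hedged sketch (points only; elements with extent, such as wires, would need something like a loose quadtree or storage at the smallest enclosing node, which is exactly the complication mentioned above):

```java
import java.util.ArrayList;
import java.util.List;

public class QuadTree {
    static final int CAPACITY = 4;  // max points per leaf before splitting
    record Pt(int x, int y) {}

    final int x, y, w, h;           // this node's region: [x, x+w) x [y, y+h)
    final List<Pt> points = new ArrayList<>();
    QuadTree[] children;            // null until subdivided

    QuadTree(int x, int y, int w, int h) { this.x = x; this.y = y; this.w = w; this.h = h; }

    boolean contains(Pt p) { return p.x() >= x && p.x() < x + w && p.y() >= y && p.y() < y + h; }

    boolean insert(Pt p) {
        if (!contains(p)) return false;
        if (children == null && points.size() < CAPACITY) { points.add(p); return true; }
        if (children == null) subdivide();
        for (QuadTree c : children) if (c.insert(p)) return true;
        return false;
    }

    // Split into four quadrants and push existing points down.
    void subdivide() {
        int hw = w / 2, hh = h / 2;
        children = new QuadTree[]{
            new QuadTree(x, y, hw, hh),          new QuadTree(x + hw, y, w - hw, hh),
            new QuadTree(x, y + hh, hw, h - hh), new QuadTree(x + hw, y + hh, w - hw, h - hh)};
        for (Pt p : points) for (QuadTree c : children) if (c.insert(p)) break;
        points.clear();
    }

    // Collect points inside the query rectangle, skipping non-overlapping quadrants.
    void query(int qx, int qy, int qw, int qh, List<Pt> out) {
        if (qx >= x + w || x >= qx + qw || qy >= y + h || y >= qy + qh) return;
        for (Pt p : points)
            if (p.x() >= qx && p.x() < qx + qw && p.y() >= qy && p.y() < qy + qh) out.add(p);
        if (children != null) for (QuadTree c : children) c.query(qx, qy, qw, qh, out);
    }
}
```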

@anadon (Owner) commented Feb 24, 2026

> I could simply remove the bounding box part for now, if that's what you mean by "false rejection on concave geometries". The bigger speed boost came from the $O(n^3) \to O(n^2)$.
>
> If that's not what you mean, could you please clarify?

I'll try to properly respond to this later in the week. I'm cognitively loaded myself at the moment and can't give you a satisfying and meaningful response.

@anadon (Owner) commented Feb 24, 2026 via email
