Editorial for ECOO '12 R3 P4 - Splitsville
Submitting an official solution before solving the problem yourself is a bannable offence.
This problem is inspired by the field of Machine Learning, specifically Concept Learning Systems. In a concept learning system, you have a list of training objects (e.g. satellite pictures of possible oil spills, data from medical scanners, etc.) with category labels (e.g. is/is not an oil spill, is/is not a tumor, etc.). Each object is described using a list of properties (usually numbers) and the task is to try to form a theory that separates the objects into their different categories. This theory can be viewed as a partitioning of a multi-dimensional space.
In this problem the training objects are houses, the labels are A
and B
, and the list of properties for each object are just the and location of the house. The "theory" is the set of fences that separate the A
's from the B
's. In machine learning, the theory you form should eventually do well at categorizing previously unseen objects, but that would only work in this case if A
and B
families tend to live close to other families of the same type.
There are a number of ways to partition spaces like this, including "nearest neighbour" classification and neural network classifiers. The algorithm used in this question is a simplification of machine learning algorithms that produce decision trees or decision rules as their output.
Recommended Approach
This problem lends itself well to a recursive solution in which you partition a region of space into two parts, then recursively subdivide the two new regions created. The base case for the recursion is a "pure" region (i.e. a region with only A
or B
houses in it). Within each region, the simplest strategy is to try all possible partitions and count the number of houses out of place in each one, where the number of houses out of place is just the sum of the minima of the counts of A
and B
houses on each side. So if the left side has A
and B
's while the right side has A
's and B
's, the total number of out of place houses is .
There are at least two ways to represent a region in this space. The first is to think of it as a 2-dimensional space bounded by integer coordinates above and below and on the left and right. To process it, you try all partitions half way between each pair of integer coordinates. (To keep life simpler and avoid floating point numbers, you could also double the original coordinates and then only consider splits on odd numbered coordinates.) This approach is not very efficient but it's fast enough for problems of this size. One potential pitfall is that you have to rule out splits that would not separate at least one house from the others.
The second, more efficient, method is to think of a region as just a list of houses. If there are houses in the region, you only need to try at most different cuts on each dimension, even if the houses are separated by thousands of units. To do this, you would have to sort the houses first, then find a split between each pair. Because of the sorting step, and the problems of creating and managing lists, this solution is more difficult to implement.
Comments