The more you play with the parts, the more it makes sense. The App has a clever snapshot feature (I don't know if that's what they call it, but that's what it seems to do). Once you acquire an object, pushing the button locks the image so the controls stay in a fixed place, but you can still slide them and interact with them.
The Vuforia software associates an image "marker" with the hybrid object through characterization files (which include the image, the object's basic info, the app's overlay controls, and so on). These files live on the server and get loaded by the App when it connects. The server is the "brains" and the "storehouse" for the information, while the App is a clever user interface (UX, really): camera, display, touchscreen, etc.
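To make that concrete, here's a rough sketch of the idea (hypothetical names, endpoint, and record layout on my part, not Vuforia's actual schema or API) of what a characterization record might carry and how the App could fetch it from the server:

    import json
    import urllib.request

    SERVER = "http://my-server:8080"  # placeholder address

    def load_target(target_id):
        # Fetch a target's characterization record from the server.
        with urllib.request.urlopen(f"{SERVER}/targets/{target_id}") as resp:
            return json.loads(resp.read())

    # Conceptually, the record ties the marker image to the hybrid object:
    # {
    #   "marker_image": "pump_label.png",                       # image the camera recognizes
    #   "object_info":  {"name": "Pump 7", "model": "XL-200"},  # the object's basic info
    #   "overlay":      [{"type": "slider", "binds_to": "flow_rate"}]  # app overlay controls
    # }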
The server also has interfaces to the "real world" on the backside, or that's how it sits in my head. The server glues together the App interface and the real world, and is the core of the hybrid reality. It also hosts the Vuforia application that builds the hybrid object, but that's just a "setup" step used to build the configuration.
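Here's a minimal sketch of that "glue" role as I picture it (stdlib Python, entirely hypothetical, just to show the shape): one side serves records to the App, while the other side reads a real-world data source and folds live values into the response.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def read_sensor(object_id):
        # Stand-in for the real-world interface (sensor bus, PLC, etc.)
        return {"flow_rate": 42.0}

    class GlueHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /targets/pump7 -> object record plus live values
            object_id = self.path.rsplit("/", 1)[-1]
            record = {"object": object_id, "live": read_sensor(object_id)}
            body = json.dumps(record).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), GlueHandler).serve_forever()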
To me, it is helpful to divorce the setup/configuration phase from what happens during usage.
Hope this helps,