Non-uniform memory access (NUMA) architectures are modern shared-memory, multi-core machines that offer varying access latencies between different memory banks. Cores are organised into regions (NUMA nodes), and cores within the same region share the same local memory; this organisation poses challenges to efficient shared-memory access and thus limits the scalability of parallel applications. This paper studies the effect of state-of-the-art physical shared-memory NUMA architectures on the performance scalability of parallel applications, using a range of programs and language technologies. In particular, parallel programs with different communication libraries and patterns are evaluated in two sets of experiments. The first experiment examines the performance of the mainstream, widely used parallel technologies MPI and OpenMP, which utilise message passing and shared-memory communication respectively. In addition, the performance implications of message passing versus shared-memory access on NUMA are compared using a concordance application. The second experiment assesses the performance of two parallel Haskell implementations as examples of a high-level language with automatic memory management. The results revealed that OpenMP scaled well for up to six threads, i.e. threads allocated within a single NUMA node; beyond that point, however, performance decreased dramatically as the number of threads grew, confirming the cost of inefficient remote memory access. MPI exhibited similar behaviour, reaching its optimum speedup at six cores, but, unlike OpenMP, its performance did not drop sharply beyond that point, illustrating the benefits of message passing over shared-memory access. The standard shared-memory parallel Haskell implementation scaled only to between 10 and 25 cores out of 48 across three parallel programs, with high memory management overheads. In contrast, our parallel Haskell implementation, GUMSMP, which combines distributed- and shared-heap abstractions, scaled consistently, achieving a speedup of up to 24 on 48 cores and an overall performance improvement of up to 57% compared with the shared-memory implementation.
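To make the Haskell side of the comparison concrete, the following is a minimal sketch (our illustration, not code from the paper) of the style of shared-memory parallel Haskell program evaluated in such experiments: `parMap` with `rdeepseq` from the standard Strategies library sparks one evaluation per list element, and the same source can be compiled for GHC's threaded runtime (and, in principle, run unchanged on a GUM-style runtime such as GUMSMP). The `fib` workload is a placeholder chosen only to give each spark CPU-bound work.

```haskell
-- Minimal sketch of a shared-memory parallel Haskell program (illustrative only).
-- Compile with: ghc -threaded -O2; run with +RTS -Nk to select k capabilities.
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately naive, CPU-bound function so that each spark has real work.
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = do
  -- One spark per list element; the runtime schedules them across
  -- the available capabilities (and, on NUMA machines, across nodes).
  let results = parMap rdeepseq fib [28 .. 40]
  print (sum results)
```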