The discrete cosine transform (DCT) is one of the major operations in image compression standards, and it is computationally intensive. Recent computer systems and handheld devices are equipped with high-performance computing devices, such as a general-purpose graphics processing unit (GPGPU), in addition to the traditional multicore CPU. We develop an optimized parallel implementation of the forward DCT algorithm for JPEG image compression using the recently proposed Open Computing Language (OpenCL). This OpenCL implementation combines a multicore CPU and a GPGPU in a single solution and performs the DCT computations efficiently by applying optimization techniques that reduce kernel execution time and data movement. Separate optimized OpenCL kernels were developed for the CPU and for the GPU, each tuned according to device-specific optimization factors such as thread mapping, thread granularity, vector-based memory access, and the given workload. The performance of the DCT is evaluated in a heterogeneous environment, where our OpenCL parallel implementation speeds up the execution of the DCT by factors of 3.68 and 5.58 for different image sizes and formats, with respect to workload allocation and data transfer mechanism. The obtained speedup indicates the scalability of the DCT performance.
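For orientation, the forward DCT used in JPEG operates on 8x8 blocks of samples. The following is a minimal sequential reference sketch in Python, not the OpenCL kernels described above; the function name `forward_dct_8x8` and the list-of-lists block layout are assumptions made for illustration only.

```python
import math

N = 8  # JPEG processes the image in 8x8 blocks


def forward_dct_8x8(block):
    """Naive 2D DCT-II of one 8x8 block (a list of 8 rows of 8 numbers).

    Hypothetical reference helper: a direct evaluation of the textbook
    formula, with no optimization.
    """
    def c(k):
        # Normalization factor: 1/sqrt(2) for the zero frequency, else 1.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * acc
    return out
```

Each 8x8 block is transformed independently of every other block, which is what makes the computation amenable to a parallel split across devices.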