Weighted MinHash on GPU helps to find duplicate GitHub repositories.

We describe how we filtered out very similar GitHub repositories using our new open-source project, MinHashCuda.

kmcuda (K-Means on GPU) version 4 is released

kmcuda v4 is out, featuring multi-GPU support, float16, Spherical K-Means and improved precision.

Native GNU nano text editor in CoreOS.

CoreOS ships with Vim as its only text editor by default. Here is how to compile the GNU nano text editor for CoreOS as a first-class citizen.

Hands on with the most starred GitHub repositories.

Playing with the metadata of the most popular repositories.

Reading PySpark pickles locally

How to load Hadoop SequenceFiles containing Python serialized objects without having to install Spark, using src-d/sparkpickle.

Adding LZO support to Dataproc

How to leverage splittable LZO compression in Dataproc

Topic Modeling of GitHub Repositories

Data mining of 18M GitHub repositories

397 Languages, 18,000,000 GitHub repositories, 1.2 billion files, 20 terabytes of code: Spaces or Tabs

A comprehensive study of space and tab usage in the source code of GitHub repositories

Setting up Google Cloud Dataproc with Jupyter and Python 3 stack

A how-to article on setting up a data science stack with Dataproc, Jupyter and Python 3

Towards Yinyang K-means on GPU

K-means is a nice and simple clustering algorithm. It can be implemented efficiently with NVIDIA CUDA, and we elaborate on how.
