How to create combinations and not die trying

When you are building a website, it is very important to generate good keywords in order to get visits from search engines quickly. First of all, analyze your market and your likely competitors and see which keywords they are using. But even with the best study money can buy, in the end you will need to look at your own content and see what it is talking about.

For example, if your website is about video game reviews, each of your documents will describe one game, so you can start by analyzing one of them:

1: { title: "guitar hero III", description: "You can play your own guitar and emulate….", category: "Music", platforms: { 0: "Xbox360", 1: "Play Station 3" }, age: "+13", multiplayer: "YES", release_date: "29th September 2012", "on sale": "y", reviews: { 0: "Awesome!", 1: "I expected something better" } }

After analyzing this document you will see some potential keywords like “games xbox360”, “games play station 3”, “multiplayer games” or “multiplayer music games”. But how do you deal with this when you have thousands or millions of documents? Will you analyze all of them one by one?

One of my favorite ways to create keywords is to use the CombinationGenerator class. It takes an array of options and generates all the possible combinations of them, with no repetitions and in every size: if you combine 4 elements, it returns the combinations of 1, 2, 3 and 4 elements. Order is not considered, so [1,2] and [2,1] count as the same combination.

So for [1,2,3] this would generate [1], [2], [3], [1,2], [2,3], [1,3] and [1,2,3]. In this specific use case we could try to create combinations for [platform:xbox360, platform:play station 3, multiplayer:y, category:music]. This will give us all our potential keywords for one of our documents. If you repeat this process for all your documents you will be able to aggregate this data to see which keywords are more frequent (and more important by extension).
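The source of CombinationGenerator is not reproduced in this post, so the following is only a minimal sketch of the same idea, not the class itself. The class name `CombinationSketch` and the method `combinations` are hypothetical names chosen for illustration. It treats each combination as a bitmask over the input list, which is one common way to enumerate every non-empty subset; the results come out in bitmask order rather than grouped by size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; NOT the CombinationGenerator class from the post.
public class CombinationSketch {

    // Returns every non-empty subset of items (order ignored, no repetitions).
    static <T> List<List<T>> combinations(List<T> items) {
        int n = items.size();
        if (n > 30) {
            // 1 << n would overflow an int, and the result would not fit in memory anyway.
            throw new IllegalArgumentException("too many elements: " + n);
        }
        List<List<T>> result = new ArrayList<>();
        // Each mask from 1 to 2^n - 1 selects one distinct subset.
        for (int mask = 1; mask < (1 << n); mask++) {
            List<T> combo = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    combo.add(items.get(i));
                }
            }
            result.add(combo);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> attributes = List.of(
                "platform:xbox360",
                "platform:play station 3",
                "multiplayer:y",
                "category:music");
        // Prints one candidate keyword (a combination of attributes) per line.
        for (List<String> combo : combinations(attributes)) {
            System.out.println(String.join(" + ", combo));
        }
    }
}
```

For the four attributes above this yields 15 candidate keywords per document, which you can then filter or turn into phrases such as “multiplayer music games”.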

TIP 1: Running this aggregation as a Hadoop map/reduce job is a good way to gain some performance and to avoid keeping every keyword from every document in memory at once:

map => for (keyword : document) emit(keyword, 1)

reduce => (keyword, [1,1,…,1]) => emit(keyword, total)
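To make the two steps concrete, here is a small in-memory stand-in written in plain Java. It deliberately does not use the Hadoop API: the class `KeywordCountSketch` and the method `countKeywords` are hypothetical names, and the whole data set is assumed to fit in memory, which is exactly the assumption a real Hadoop job removes. The inner loop plays the role of map (each keyword contributes a 1) and `Map.merge` plays the role of reduce (the ones are summed per keyword).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the map/reduce counting idea; not Hadoop code.
public class KeywordCountSketch {

    // Each inner list holds the keywords generated for one document.
    static Map<String, Integer> countKeywords(List<List<String>> documents) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<String> keywords : documents) {
            for (String keyword : keywords) {
                // "map": emit (keyword, 1); "reduce": add it to the running total.
                totals.merge(keyword, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
                List.of("multiplayer games", "music games"),
                List.of("multiplayer games", "games xbox360"));
        // Keywords with the highest totals are the most frequent across documents.
        countKeywords(documents).forEach(
                (keyword, total) -> System.out.println(keyword + " = " + total));
    }
}
```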

TIP 2: As you will see, the class is not memory-efficient, since it creates more than one StringBuilder to generate the combinations. It also has a hard limit: n elements produce 2^n - 1 combinations, so if you try to generate combinations for more than 27 elements you will get an OutOfMemoryError. If you work with fewer than 20 elements, though, it usually works fine.
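The reason the limit arrives so quickly is that a set of n elements has 2^n - 1 non-empty combinations, so every extra element doubles the output. The short sketch below only computes that count; the class `CombinationGrowth` and the method `combinationCount` are hypothetical names used for illustration.

```java
// Hypothetical illustration of how fast the number of combinations grows.
public class CombinationGrowth {

    // Number of non-empty combinations of n distinct elements: 2^n - 1.
    static long combinationCount(int n) {
        if (n < 0 || n > 62) {
            throw new IllegalArgumentException("n must be between 0 and 62");
        }
        return (1L << n) - 1;
    }

    public static void main(String[] args) {
        System.out.println(combinationCount(4));  // 15
        System.out.println(combinationCount(20)); // 1048575
        System.out.println(combinationCount(27)); // 134217727
    }
}
```

At 27 elements that is already more than 134 million combinations to hold at once, which is consistent with the out-of-memory failure described above.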



