Crawling with Python and WordFreq library

Today I published a mini-repo that was part of a training session at Atrápalo (https://github.com/raimonbosch/wordfreq.crawler). The objective of this training was to come up with a crawler that could find a specific word as fast as possible. To do so, we used the wordfreq library to analyze which links could be closer to…

Calibrate your algorithms with Random Walk

When you are designing an algorithm that depends on a set of tunable numeric parameters, the random-walk algorithm is an excellent way to refine it and improve its accuracy. I made a small project in PHP that provides you the ability to generate numeric parameters and create series of iterations o…
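
As a rough illustration of the idea (not the PHP project itself), here is a minimal Python sketch of a greedy random walk over numeric parameters: each iteration perturbs the current parameters by a small random step and keeps the change only if the score improves. The function names, the step size, and the toy scoring function are all assumptions made for this example.

```python
import random


def random_walk_tune(score, start, step=0.1, iterations=200, seed=0):
    """Greedy random walk: perturb the parameters, keep a step only if the score improves."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best = list(start)
    best_score = score(best)
    for _ in range(iterations):
        # Move every parameter by a small random amount in [-step, step].
        candidate = [p + rng.uniform(-step, step) for p in best]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def toy_score(params):
    """Hypothetical accuracy measure: highest (0.0) at x = 1.0, y = -2.0."""
    x, y = params
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)


best_params, best_value = random_walk_tune(
    toy_score, start=[0.0, 0.0], step=0.25, iterations=2000
)
```

In a real setting, `toy_score` would be replaced by whatever accuracy metric your algorithm produces for a given set of parameters.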

A comprehensive summary for Elastic Search

What is ElasticSearch? ElasticSearch is a search server that works on top of Lucene. A search engine is a type of technology that allows you to do text searches very efficiently. The best-known equivalent of this use case is Google, although there are others such as Yahoo or Bing. Although today it is not necessary…

Elastic Tour Madrid 2019 conference highlights

Last week I was in Madrid to attend Elasticsearch’s European conference. We had the opportunity to talk to the experts and see the latest features of the search engine. It seems that one of the main focuses for 2019 is machine learning. Since the system has evolved into a log store system, one of the…

SparkSQL-Dashboard: Create your own dashboard based on millions of rows

This year I released a small framework based on Python and JavaScript for graphing SQL queries. The main idea behind this project is to map 1 SQL query to 1 two-dimensional graph line or to 1 table row. By using Spark instead of MySQL we get all the conveniences provided by…

Released the date-extraction library for Ruby

Hi all, on my GitHub account you will find a Ruby version of the date.extraction library: https://github.com/raimonbosch/ruby-date-extraction This is a project that uses regular grammars to understand raw dates and convert them into the string format “%Y-%m-%d %H:%M:%S”, although it can also deliver the timestamp format or any format you may need. The methodology can…

UCC 2016, Shanghai, A Methodology for Full-System Power Modeling in Heterogeneous Data Centers

On the 6th of December I was in Shanghai presenting one of our latest papers, written at Barcelona Supercomputing Center, at the 9th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2016). Our current research work focuses on the area of Energy-aware Management for Modern Distributed Computing Systems. Our goal is to develop management algorithms for…

Multihost configuration on Docker

Entire tutorial here: http://chunqi.li/2015/11/09/docker-multi-host-networking/ This is an excellent alternative if your system is based on Docker and you want to enable communication between containers placed on different hosts. Using the overlay network, each container will get a 10.0.0.X IP that will be visible to all containers regardless of their host. In our use case we have used…

Probabilistic actions to extract artists from a text

Author: Raimon Bosch. Abstract: One of the main problems of software technologies is that they often provide only one possible solution for a problem. This forces us to create human validation systems to cover the less general cases, the exceptions that a computer can’t detect. But what if we provide several solutions…

Interesting SEO factors on your access log: Bot’s efficiency, bounce rate and average page views

This week I have been experimenting with a new database based on data taken from an access log. This database basically keeps information about all the pages that were accessed on our website in the last month. An example of a record in this database would be: path user_visits bot_visits google_visits page_views bounces filters /live-music-barcelona 20 2…

Sentiment Analysis: Incremental learning to build domain models

Raimon Bosch, Master thesis – Intelligent Interactive Systems, Universitat Pompeu Fabra (2013), Prof. Dr. Leo Wanner. Abstract: Nowadays, social contacts are vital to find relevant content. We need to connect with people with similar interests because they provide content that matters. It becomes clearer every day that, in the future of document recommendation, it will be necessary to…

Text Categorization with K-Nearest Neighbors using Lucene

Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. Text categorization is a complex problem: to solve it, you need to provide a variable for each important word in your text. Maybe not stopwords or very common…
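
To make the "one variable per important word" idea concrete, here is a small sketch in plain Python (deliberately not Lucene) of k-nearest-neighbors categorization over bag-of-words vectors with cosine similarity. The helper names, the stopword list, and the tiny training set are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def vectorize(text, stopwords=frozenset()):
    """Bag of words: one counter entry (variable) per non-stopword token."""
    return Counter(t for t in text.lower().split() if t not in stopwords)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(text, labeled_docs, k=3, stopwords=frozenset()):
    """Return the majority label among the k most similar labeled documents."""
    query = vectorize(text, stopwords)
    scored = sorted(
        ((cosine(query, vectorize(doc, stopwords)), label) for doc, label in labeled_docs),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, only for illustration.
STOPWORDS = frozenset({"the", "a", "and", "in"})
TRAINING = [
    ("the band played a live concert", "music"),
    ("new album and guitar solo", "music"),
    ("the team won the football match", "sports"),
    ("goal scored in the final match", "sports"),
]

category = knn_classify("live guitar concert tonight", TRAINING, k=3, stopwords=STOPWORDS)
```

A search index can play the role of `scored` here by returning the most similar documents for a query, which is presumably where Lucene comes in.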

How to use Near Real Time Search in Solr

As you might know, Solr has prepared a cool new feature for its 4.0 release: Near Real Time Search. With this new feature, our search engine will be able to perform in-memory commits (a.k.a. soft commits) without having to perform a real commit, which can cause some seconds of bad performance for your users. If you…

How to create a Solr index and speed up your data

If you are designing a website and you want a solid backend, Solr is an exceptional choice, not only because of its search capabilities and its integration with the Lucene ecosystem but also because of its capacity to shard your data and deliver very good response times. But which is the best approach in order…

Raimon Bosch ./blog

Software & Code Examples