Tuesday, May 16, 2017

Technology Report: Apache Flink

I. What is Apache Flink?:
Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
More information can be found on the Flink website. Below are some use cases from the website:

  • Optimization of e-commerce search results in real-time: Alibaba’s search infrastructure team uses Flink to update product detail and inventory information in real-time, improving relevance for users.
  • Stream processing-as-a-service for data science teams: King (the creators of Candy Crush Saga) makes real-time analytics available to its data scientists via a Flink-powered internal platform, dramatically shortening the time to insights from game data.
  • Network / sensor monitoring and error detection: Bouygues Telecom, one of the largest telecom providers in France, uses Flink to monitor its wired and wireless networks, enabling a rapid response to outages throughout the country.
  • ETL for business intelligence infrastructure: Zalando uses Flink to transform data for easier loading into its data warehouse, converting complex payloads into relatively simple ones and ensuring that analytics end users have faster access to data.

We can tease out common threads from these use cases. Based on the examples above, Flink is well-suited for:
  • A variety of (sometimes unreliable) data sources: When data is generated by millions of different users or devices, it’s safe to assume that some events will arrive out of the order they actually occurred–and in the case of more significant upstream failures, some events might come hours later than they’re supposed to. Late data needs to be handled so that results are accurate.
  • Applications with state: When applications become more complex than simple filtering or enhancing of single data records, managing state within these applications (e.g., counters, windows of past data, state machines, embedded databases) becomes hard. Flink provides tools so that state is efficient, fault tolerant, and manageable from the outside so you don’t have to build these capabilities yourself.
  • Data that is processed quickly: There is a focus in these use cases on real-time or near-real-time scenarios, where insights from data should be available at nearly the same moment that the data is generated. Flink is fully capable of meeting these latency requirements when necessary.
  • Data in large volumes: These programs would need to be distributed across many nodes (in some cases, thousands) to support the required scale. Flink can run on large clusters just as seamlessly as it runs on small ones.
II. Download and Setup:
  • Requirement: Java must be installed, and the folder containing the Java runtime (JRE) must be included in the %PATH% variable.
After downloading and extracting the package, go to the bin folder and run start-local.bat to start the Flink job manager.
After that, we can open localhost:8081 in a browser to check that everything works.
The steps above are for downloading and installing on Windows.
For Mac, we can install Homebrew and run: $ brew install apache-flink
For Linux: 
III. Extract, Transform and Load Data:
Before working with a large, complex dataset like the Yelp dataset, I installed the Flink Quick Start project and tried to run it locally.
There are several ways to get the quickstart.
For Mac, we can use: $ curl https://flink.apache.org/q/quickstart.sh | bash
Another way is to install Maven and generate the project from the Flink Maven archetype as follows:
      $ mvn archetype:generate \
       -DarchetypeGroupId=org.apache.flink \
       -DarchetypeArtifactId=flink-quickstart-java \
       -DarchetypeVersion=1.2.0
Or we can use an IDE to work with Flink. I prefer IntelliJ IDEA, which offers a free one-month evaluation.
Finished Job for QuickStart:

As the Quick Start comes with a project skeleton, we can easily import the Flink libraries and work with the BatchJob or StreamJob files in the main folder.
I used csvkit to extract and transform only the data needed for this test. For the Yelp Business file, I kept only the city, review_count and name of each business and saved them to a file called business.csv.

After that, we can load the data in the IDE. I immediately wrote the data back out to csv files using a Flink data sink to check that the file was loaded correctly.
The program ran successfully and produced a folder with 8 csv files, each about 550 KB.
Now we can delete all the files and test some transformations with Flink using filter(), map() and reduce().
First of all, we need to check that the data was loaded correctly by counting the number of lines in the file.



The program printed out 144072 lines, which excludes the first line holding the schema, so we got the correct number.

The first query I tried was to find the top 5 businesses in each city by review_count.
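The code for this query is not reproduced in the post. As a rough, plain-Java sketch of the logic only (group by city, sort each group by review_count in descending order, keep the first 5), assuming a hypothetical BusinessRow type for one line of business.csv; in the Flink DataSet API, one way to express the same steps is a groupBy, sortGroup and first chain:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical row type: one line of business.csv (city, review_count, name).
class BusinessRow {
    final String city;
    final int reviewCount;
    final String name;

    BusinessRow(String city, int reviewCount, String name) {
        this.city = city;
        this.reviewCount = reviewCount;
        this.name = name;
    }
}

class TopPerCity {
    // Groups rows by city, sorts each group by review_count in descending
    // order, and keeps at most n rows per city.
    static Map<String, List<BusinessRow>> topN(List<BusinessRow> rows, int n) {
        Map<String, List<BusinessRow>> byCity = new TreeMap<>();
        for (BusinessRow row : rows) {
            byCity.computeIfAbsent(row.city, c -> new ArrayList<>()).add(row);
        }
        Map<String, List<BusinessRow>> result = new TreeMap<>();
        for (Map.Entry<String, List<BusinessRow>> entry : byCity.entrySet()) {
            List<BusinessRow> group = entry.getValue();
            group.sort(Comparator.comparingInt((BusinessRow r) -> r.reviewCount).reversed());
            result.put(entry.getKey(), new ArrayList<>(group.subList(0, Math.min(n, group.size()))));
        }
        return result;
    }
}
```

Calling TopPerCity.topN(rows, 5) returns, for each city, up to five rows with the highest review_count.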

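The results folder again had 8 files of similar size, about 55 KB each. I first thought this was how IntelliJ IDEA was set up to output files, but it more likely reflects Flink's parallelism, since each parallel sink task writes its own file.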

Next, I added another transformation, filter, to get the top-reviewed businesses in Las Vegas.
I tried to do it all in one chain, but got an error saying that the IDE did not recognize my static function. So I commented it out and applied the filter before grouping and sorting.
Above is the result of that query. Everything is stored in file 2 instead of being divided evenly among 8 files as before.
Next, I tried to extract and transform data from the whole business file without using csvkit.
The important line here is .includeFields with a mask of 1s and 0s. The mask has 16 positions, one per column: 1 means the column is included and 0 means it is skipped. So I put 1 in the 8th, 9th and 10th positions to include city, review_count and name.
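To make the mask idea concrete, here is a small plain-Java sketch that builds such a mask and applies it to one parsed row. It only illustrates the selection logic; the FieldMask class and its methods are hypothetical helpers, not part of Flink, and writing the mask as a string of 0s and 1s is an assumption about the form used in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldMask {
    // Builds a mask of `width` characters with '1' at the given 1-based
    // positions and '0' everywhere else.
    static String build(int width, int... positions) {
        char[] mask = new char[width];
        Arrays.fill(mask, '0');
        for (int p : positions) {
            mask[p - 1] = '1';
        }
        return new String(mask);
    }

    // Keeps only the columns whose mask character is '1'.
    static List<String> select(String mask, String[] columns) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < mask.length() && i < columns.length; i++) {
            if (mask.charAt(i) == '1') {
                kept.add(columns[i]);
            }
        }
        return kept;
    }
}
```

FieldMask.build(16, 8, 9, 10) yields 0000000111000000, and select with that mask keeps only the 8th, 9th and 10th columns of a 16-column row.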
Another problem I encountered while loading data was with the review file. First, I tried to load it the same way as the business file, but got the error "Row too short", which means some rows in the file do not match the schema. I added ".ignoreInvalidLines()" and then checked the number of loaded lines.
Only 2353660 lines were loaded (a little more than 50%). So I used csvgrep to get business_id and stars, then loaded the file again.
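The effect of skipping invalid rows can be pictured with the plain-Java sketch below: rows with fewer columns than the schema expects are dropped, and comparing the number kept with the total shows how much data was lost. RowFilter is a hypothetical helper, and the naive comma split is an assumption made for brevity; it does not handle quoted fields and is not how Flink parses CSV.

```java
import java.util.ArrayList;
import java.util.List;

class RowFilter {
    // Returns the rows that split into at least `expectedFields` columns.
    // Naive comma split: quoted fields containing commas are NOT handled.
    static List<String[]> keepValid(List<String> lines, int expectedFields) {
        List<String[]> valid = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.split(",", -1);
            if (columns.length >= expectedFields) {
                valid.add(columns);
            }
        }
        return valid;
    }
}
```

Comparing keepValid(lines, n).size() with lines.size() gives the same kind of check as counting the loaded lines after ignoring invalid ones.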

This time 4153150 lines were loaded, so we successfully extracted and loaded the data that we need.
The Checkin, User and Tip files were loaded the same way as Business without any problems.
The last query I worked on with Flink's DataSet API was to output the top 100 businesses with the most stars.
To do that, I first need to add up the stars of all reviews in the review dataset that share the same business_id.

To add up the stars, we use the reduce() transformation with my StarsCounter() implementation.
Then we join the two datasets by business_id (field 0 for both) and output business.name (field 3 for business) and the total stars (field 1 for review_sum) using my own JoinBusinessReview class, which implements JoinFunction.
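The StarsCounter and JoinBusinessReview classes are not shown in the post. The plain-Java sketch below only mirrors the logic described: sum the stars per business_id, join the totals with business names on business_id, and emit tab-separated lines for the highest totals. StarsJoin and its method names are hypothetical, and in-memory maps stand in for the two Flink datasets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StarsJoin {
    // Sums stars per business_id; each review is {business_id, stars}.
    static Map<String, Integer> sumStars(List<String[]> reviews) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] review : reviews) {
            totals.merge(review[0], Integer.parseInt(review[1]), Integer::sum);
        }
        return totals;
    }

    // Inner-joins totals with business names on business_id and returns the
    // n highest totals as "name<TAB>total" lines.
    static List<String> topByStars(Map<String, String> names, Map<String, Integer> totals, int n) {
        List<Map.Entry<String, Integer>> joined = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            if (names.containsKey(entry.getKey())) {
                joined.add(entry);
            }
        }
        // Highest total first; ties broken by business_id for a stable order.
        joined.sort((a, b) -> {
            int byTotal = Integer.compare(b.getValue(), a.getValue());
            return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : joined.subList(0, Math.min(n, joined.size()))) {
            lines.add(names.get(entry.getKey()) + "\t" + entry.getValue());
        }
        return lines;
    }
}
```

With n set to 100, topByStars corresponds to the top-100 query; businesses that have reviews but no matching name are dropped, as in an inner join.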

We can see the following output with field delimiter set to "\t":
IV. Table API and DataStream API:
Flink also has a Table API, a high-level language similar to SQL.
Flink's main strength is the DataStream API. Below is the sample program from Flink that reads an input stream on port 9999 and emits results every 5 seconds.
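The sample program itself is not reproduced in the post. To illustrate what emitting results every 5 seconds means, the plain-Java sketch below assigns timestamped words to non-overlapping 5-second windows and counts them per window. The TumblingCount class, the word-count use case and the use of non-overlapping windows are assumptions for illustration; no socket or Flink API is involved.

```java
import java.util.Map;
import java.util.TreeMap;

class TumblingCount {
    static final long WINDOW_MS = 5_000L; // 5-second windows, as in the text

    // Start of the non-overlapping window that contains the timestamp
    // (assumes non-negative timestamps in milliseconds).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Counts words per window: windowStart -> (word -> count).
    static Map<Long, Map<String, Integer>> count(long[] timestampsMs, String[] words) {
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (int i = 0; i < words.length; i++) {
            windows.computeIfAbsent(windowStart(timestampsMs[i]), w -> new TreeMap<>())
                   .merge(words[i], 1, Integer::sum);
        }
        return windows;
    }
}
```

Each key of the returned map is the start of one 5-second window, and its value is the result that would be emitted when that window closes.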
V. Optimization:
Quality: we can use GroupReduce with a Java Set, adding each record of a group to the Set so that it eliminates duplicate data for us.
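A minimal plain-Java sketch of the Set idea, assuming the records of one group arrive as an Iterable of strings (similar to what a group-reduce function receives). Dedup is a hypothetical helper, and LinkedHashSet is chosen here only so that the first-seen order is preserved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class Dedup {
    // Adds every record of the group to a Set, which silently drops
    // duplicates, then returns the distinct records in first-seen order.
    static List<String> distinct(Iterable<String> group) {
        Set<String> seen = new LinkedHashSet<>();
        for (String record : group) {
            seen.add(record);
        }
        return new ArrayList<>(seen);
    }
}
```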
Performance: we can use partitioning in Flink to run work in parallel, and the rebalance function to even out the load across all parallel partitions.
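To show what evening out the partitions means, the plain-Java sketch below redistributes a list of items across a fixed number of partitions in round-robin order, so that partition sizes differ by at most one. It assumes a round-robin strategy and is only an illustration; RoundRobin is a hypothetical helper, not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class RoundRobin {
    // Distributes items over `partitions` lists in round-robin order.
    static <T> List<List<T>> rebalance(List<T> items, int partitions) {
        List<List<T>> result = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            result.get(i % partitions).add(items.get(i));
        }
        return result;
    }
}
```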
VI. Conclusion:
Flink has both a DataSet API and a DataStream API, and it excels at streaming data. We can use Flink with the following structure:
Flink takes data from an input stream, processes it with real-time queries and outputs it to a live data view, then transfers the data into the data warehouse (the Batch Layer), where we can run complicated queries with MapReduce-style transformations. Both layers output data as csv files, which can be stored and analyzed later within the Batch Layer.
Let us take another, closer look at Flink use cases in big corporations.
Most of them use Flink to process and analyze real-time data. Below is the last picture, showing Flink's advantages and why we should use Flink.







Tuesday, May 17, 2016

CS 108 - Game Study


I have been playing games since I was born. I learned how to play Chinese Chess at 3 and video games at 6, and I continue to discover new games around me. Games are an essential part of what makes me who I am today. Therefore, when I had to choose an elective course, CS 108 - Game Study was my first choice, and the course even exceeded my expectations.
There are so many more purposes and types of games than I had known before. For me, playing games is about having fun, challenging myself and earning the happiness of victory or the experience of defeat. After taking the course, I learned that some games are made so that we can understand each other, feel other people's feelings, and have an environment where we can put ourselves in their position and look at things from a different perspective.
Reading about games and watching videos about how games can benefit us helped me understand more about the importance of games in our lives. Then I learned about GameMaker, software that can easily turn beautiful but simple ideas into a wonderful game. I enjoy spending my time applying all the ideas I have had since I was a little kid to make my own games, played by my own rules.
Moreover, the final presentation helped to bring us together. I could learn what other people think about games, see the beautiful ideas my classmates put into their games, and have a chance to show our game to the whole class. I felt proud making a game with my group, and the pride kept growing when I presented it to the whole class and took down comments, compliments and ideas I could add. It is like creating my own world, piece by piece, with contributions from the people around me, and the feeling is wonderful.

Tuesday, May 3, 2016

Blog 8 - Dogfight

For our final project, our group of four, Jeffrey Tran, Hoai Nguyen, Marlowe De Vera, and me, Huy Nguyen, is working together to make an aircraft shooting game. Marlowe and Hoai are in charge of the artwork; Jeffrey and I are in charge of programming and design. We came up with ideas for different types of enemies, and each one needs its own mechanics. For example, you cannot shoot down meteors; all you can do is move out of their way. On the other hand, you cannot hide from a heat-seeking missile because it always follows you; the only way to deal with the missiles is to shoot them down before they reach you. I made the game so that if you cannot shoot all the missiles down then you are dead, but Jeff thought it was too hard, so he changed the gameplay: once you leave the missiles behind, they no longer follow you.
We still need to do more work on the boss level and add several enemies to increase the variety of mechanics in the game. One of the enemies I thought about is the shielded bomber. Its ability is to keep its shield on for several seconds and off for 1 second. If you shoot it while it is shielded, it will explode and kill the player, so you have to time your shot for the moment the shield is down.
The boss should have all the mechanics of the meteors, missiles and bomber. The boss's cannon will be shielded, so we need to time our shots to take it down; if the cannon is destroyed, it will be torn off and fly toward us. The boss also shoots random missiles at the player. Those are all the ideas and designs we have in mind; we will try to implement them and make the game better.

Saturday, April 23, 2016

Ingress Game

Our professor just introduced us to a cool new game called Ingress. The game runs on mobile platforms and uses your GPS to locate your position. The rules are simple: each historical or famous place in the real world is a portal in the game world. We need to deploy the portals to generate energy and gain territory from them. All players are divided into two factions: the Enlightened and the Resistance. When you hack a portal, you can gain items such as resonators of various levels to deploy and upgrade, guns to destroy the other faction's portals, and power cubes to regain your XM. XM stands for Exotic Matter and is similar to a mana pool or energy. We need XM to spend on our actions, and while we are walking in real life with the game on, we gain more XM. If we have no XM, our scanner is disabled and cannot locate our position, so we had better store some power cubes to regain XM in an emergency. The game requires the players in each faction to play together as a group and keep coming back to each portal site to recharge the resonators, or we can use a portal key signet to recharge them remotely. Each player is just a small part of the game, but working together as a faction to build resonators and turrets will help you gain more territory and energy and bring your chosen faction to victory.

Wednesday, April 13, 2016

Flying Monkey Prototyping

Jeffrey and I continued to work on the game, developing the background and gameplay. I had an idea about creating a jungle background with grass on the ground and huge trees covering the sun and sky, and Jeffrey sent me this: not dark and sad as I expected, but exciting and lively, as a monkey game should look. The rules and controls are simple: use the arrow keys to move around and space to shoot bananas, kill the giant spider, and pick up the giant banana to win the game. We made the background scroll so it looks like the monkey is flying through the jungle. We would like to make it an endless running game, but I am still working on how to make random monsters appear after launch, so our game is really easy right now: the small spiders need only one shot to die, and the giant spider does not have any harmful mechanics that can kill the player or surprise him/her. Even though it is just a simple game, we are proud that it is our first game. Jeffrey Tran, as designer, and I, Huy Nguyen, as programmer, spent a lot of time and hard work to make this happen. The more I work with GameMaker, the more interesting I find this software: it is wonderful and easy to use, even for a beginner. This first game is practice for making better games in the future. Here is the link to our game: download here.

Wednesday, March 23, 2016

Flying Monkey Game

Jeffrey and I are working together to make our first game. Both of us wanted to contribute as programmers; however, an artist is as important as a programmer and I can't draw, so Jeffrey agreed to take on the drawing job. In our first meeting, we talked about ideas, what we should do and what our game should be. I wanted to create a game about a monkey, because this year is the Year of the Monkey. At first, I just wanted to create a monkey climbing around in the jungle, avoiding danger and completing objectives. Then Jeffrey thought it would be great to have a flying monkey riding a jet pack with a gun, shooting enemies with bananas, so we came up with this game: Flying Monkey.
My role is to create the interactions between all the characters. We planned to make it an endless runner in which random spiders or snakes appear and we have to avoid them or shoot them down, but I have not figured out how to create random enemies and an endless room. Therefore, for our first playable, I just made one simple game with five big black spiders running around on a simple path; our monkey can shoot them down, then pick up the giant banana to win the game. Here is the link to our first playable ever: link.
My main question is whether I should create a fence around the room that limits our monkey's movement. I think that would be great, and it would make it easier to show the spiders crawling up and down from there. We will keep developing the game and will come up with a better version of Flying Monkey soon.

Wednesday, March 9, 2016

Video Games Experience


On Monday, March 7, 2016, Garrett and I played several games, but the three that impressed me most were "This is the only level", "The Beagles", and "Wizard Wizard." What the three games have in common is simple design, jumping to avoid dangerous traps, and using the arrow keys for movement. "This is the only level" is really interesting; if I had not watched Garrett play it, I would have had no idea how to get through the stages. There are different stages with different environments but the same objective: to go through the gate at the right corner. At each stage, there is one short line at the left corner as a hint to the player on how to pass the stage. The game shows us different ways to present one problem, and sometimes it requires an open-minded player to come up with new, out-of-the-box thinking to reach the final stage (for example: refreshing the stage, reading the credits, etc.). "Wizard Wizard" is the most difficult of the three. We died more than fifty times that night just to pass seven stages. The game's mechanics are simple and common to many games: jump around to avoid the saws and get the key to reach the gate. The wizard can also double jump, which makes him more flexible when moving up and down levels. The last game is "The Beagles", where we need to go to different height levels to rescue all the beagles; if we miss one, we lose. To go to a higher level, we have to use a rope; to go to a lower level, we can either jump straight down or use a parachute if it is too high. We have to fight the doctors and pet control by shooting owls at them; ropes, parachutes and owls are picked up along the way. All three games are fun to play and watch. Garrett played "This is the only level" first, and he had played it many times before, which made him an expert who could get through the level really fast and gave me the idea of how to pass it with an open mind.
New tricks and skills are easy to learn if you watch others play first, so that you can make them your own and develop them further.