This repository was archived by the owner on Mar 19, 2019. It is now read-only.

Test results

jbytecoder edited this page Jun 18, 2015 · 2 revisions

Scenario 1 results:

The scenario was tested by loading keywords to look for from a predefined dictionary of English words. At most 250 words were available, which was sufficient to stress both the big and the small Spark cluster. The function that tested the delay started by erasing any existing trace of the word from the database. It then inserted the word so that it would be looked for, and started the timer. After that, the database was queried for the existence of data related to this keyword. The following SQL query was used to detect data presence:

 select k.FLD_NAME from TBL_HASHTAG_KEYWORD_USAGE as u
                        join TBL_KEYWORD as k on u.FKF_KEYWORD = k.FLD__ID
                        where k.FLD_NAME = ?
                        group by k.FLD_NAME;

If data was not found, the thread slept for 1 second and repeated the sequence until data was ready or one minute elapsed.
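The polling loop described above can be sketched like this (a minimal sketch in Python; `query_db` is a hypothetical placeholder for the SQL presence check shown above, not part of the project's code):

```python
import time

def measure_delay(word, query_db, poll_interval=1.0, timeout=60.0):
    """Poll the database until data for `word` appears, or the timeout passes.

    `query_db(word)` stands in for the presence-check SQL query above;
    it should return True once rows for the keyword exist.
    Returns the observed delay in seconds, or None if one minute elapsed.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if query_db(word):               # data related to the keyword was found
            return time.monotonic() - start
        time.sleep(poll_interval)        # data not ready yet: sleep 1 second
    return None                          # gave up after the timeout
```

The returned value is the delay measurement plotted in the charts below.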

The measurements for the smaller Spark cluster look like this:

keyword delay chart

The data were collected in 10 measurements for each data size. The blue line represents the minimal reported timing, the red line the maximal reported time, and the yellow line the median of all measurements. As we can see, the delay depends linearly on the number of keywords in the database; we can also see that above 100 keywords the delay becomes unacceptable.
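The aggregation of the 10 repeated measurements into the three plotted lines can be sketched as follows (the sample timings here are made up for illustration, not taken from the actual runs):

```python
import statistics

def summarize(timings):
    """Collapse the repeated delay measurements for one data size into
    the three values plotted per point: minimum, maximum and median."""
    return {
        "min": min(timings),                    # blue line
        "max": max(timings),                    # red line
        "median": statistics.median(timings),   # yellow line
    }

# e.g. 10 hypothetical delay measurements (seconds) for one keyword count
sample = [5.1, 5.3, 6.0, 5.2, 7.4, 5.8, 6.1, 5.5, 6.3, 5.9]
summary = summarize(sample)
```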

The second chart shows measurements from the bigger Spark cluster (1+3):

Bigger cluster keyword delay

As we can see, the tendency is about the same as on the smaller cluster. There appears to be an inherent lower limit on the delay, about 5 seconds, which we attribute to network communication between Spark and the database; this kind of delay is tolerable.

We also prepared a diagram in which we present how the delays on the different clusters relate to each other:

Comparison

We can clearly see that the delay grows more gradually on the bigger cluster. This leads us to the conclusion that Spark capacity, not the database, was the bottleneck in this example.

Scenario 3 results

The specific SQL query used to benchmark the database was:

 select k.FLD_NAME, SUM(u.FLD_COUNT) from TBL_HASHTAG_KEYWORD_USAGE as u
                         join TBL_KEYWORD as k
                         on u.FKF_KEYWORD = k.FLD__ID
                         join TBL_HASHTAG_CATEGORY_USAGE as c
                         on u.FKF_CATEGORY = c.FLD__ID
                         where u.FLD_COUNT > 0
                         group by k.FLD_NAME
                         having MIN(u.FLD_COUNT) > 0;

This query was chosen because the root table contains the most records in the database, and it allows some joins to further stress the database.

The benchmark executed this query and also read all the data returned by it; the whole elapsed time was reported as the benchmark time.
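A sketch of how such a benchmark can be timed (Python with the standard `sqlite3` module standing in for the real database; the table and query here are trivial stand-ins for illustration, not the project's schema):

```python
import sqlite3
import time

def benchmark(conn, sql):
    """Execute the query and fetch every returned row; the total
    elapsed time, including reading the results, is the benchmark time."""
    start = time.monotonic()
    cursor = conn.execute(sql)
    rows = cursor.fetchall()   # reading all data is part of the measurement
    elapsed = time.monotonic() - start
    return elapsed, len(rows)

# demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("create table t(x integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(1000)])
elapsed, count = benchmark(conn, "select x from t")
```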

The results are as follows:

stress query

We can see that the database did well up to about 100,000 records in the root table, after which the delay starts to be a little too much. This leads to the conclusion that an SQL database wouldn't stand up to real-life big-data datasets.
