Test results
The scenario was tested by loading keywords from a predefined dictionary of English words. At most 250 words were available, which was sufficient to stress both the big and the small Spark cluster. The function that measured the delay started by erasing any existing trace of the word from the database. It then inserted this word so that it would be looked for, and started the timer. After that, the database was queried for the existence of data related to this keyword. The following SQL query was used to detect data presence:
select k.FLD_NAME from TBL_HASHTAG_KEYWORD_USAGE as u
join TBL_KEYWORD as k on u.FKF_KEYWORD = k.FLD__ID
where k.FLD_NAME = ?
group by k.FLD_NAME;
If no data was found, the thread slept for 1 second and repeated the sequence until the data was ready or one minute elapsed.
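The polling loop described above can be sketched as follows. This is a minimal illustration, not the actual test code: the in-memory SQLite database and the `wait_for_keyword` helper are assumptions, while the query and the table/column names come from this document.

```python
import sqlite3
import time

# Presence-detection query quoted above; table and column names are the
# ones used in this document.
PRESENCE_QUERY = """
    select k.FLD_NAME from TBL_HASHTAG_KEYWORD_USAGE as u
    join TBL_KEYWORD as k on u.FKF_KEYWORD = k.FLD__ID
    where k.FLD_NAME = ?
    group by k.FLD_NAME
"""

def wait_for_keyword(conn, word, timeout=60.0, poll=1.0):
    """Poll until usage data for `word` appears. Returns the elapsed
    time in seconds, or None if the timeout (one minute in the test)
    elapsed without the data showing up."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        row = conn.execute(PRESENCE_QUERY, (word,)).fetchone()
        if row is not None:
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

The 1-second sleep bounds the measurement resolution, which is acceptable here since the delays of interest are several seconds long.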
The measurements for the smaller Spark cluster look like this:

The data was collected in 10 measurements for each data size. The blue line represents the minimal timings reported, the red line the maximal reported time, and the yellow line the median of all measurements. As we can see, the delay depends linearly on the number of keywords in the database; we can also see that above 100 keywords the delay becomes unacceptable.
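The three chart lines can be derived from the 10 raw samples per data size with standard-library statistics. The sample values below are made up for illustration; only the min/max/median roles come from the chart description.

```python
import statistics

# Hypothetical raw delays (seconds) from 10 runs at one data size;
# illustrative values, not the measured data.
samples = [5.1, 5.3, 5.2, 6.0, 5.8, 7.4, 5.5, 6.1, 5.9, 6.3]

low = min(samples)                 # blue line: minimal reported timing
high = max(samples)                # red line: maximal reported timing
mid = statistics.median(samples)   # yellow line: median of all runs
```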
The second chart shows measurements from the bigger Spark cluster (1+3):

As we can see, the tendency is about the same as on the smaller cluster. There also seems to be an inherent lower limit on the delay, at about 5 seconds; we think it can be attributed to network communication between Spark and the database. This kind of delay is tolerable.
We also prepared a diagram showing how the delays on the different clusters relate to each other:

We can clearly see that the delay grows more gradually on the bigger cluster. This leads us to the conclusion that Spark capacity was the bottleneck in this example, and not the database.
The specific SQL query used to benchmark the database was:
select k.FLD_NAME, SUM(u.FLD_COUNT) from TBL_HASHTAG_KEYWORD_USAGE as u
join TBL_KEYWORD as k
on u.FKF_KEYWORD = k.FLD__ID
join TBL_HASHTAG_CATEGORY_USAGE as c
on u.FKF_CATEGORY = c.FLD__ID
where u.FLD_COUNT > 0
group by k.FLD_NAME
having MIN(u.FLD_COUNT) > 0;
This query was chosen because the root table contains the most records in the database, and the joins further stress the database.
The benchmark executed this query and also read all the data it returned; the whole elapsed time was reported as the benchmark time.
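A sketch of that benchmark procedure, timing both query execution and the full read of the result set. An in-memory SQLite database stands in for the real SQL server here; only the query itself and the table/column names come from this document.

```python
import sqlite3
import time

# The benchmark query quoted above (joins plus aggregation over the
# root table), with table/column names as used in this document.
BENCH_QUERY = """
    select k.FLD_NAME, SUM(u.FLD_COUNT)
    from TBL_HASHTAG_KEYWORD_USAGE as u
    join TBL_KEYWORD as k on u.FKF_KEYWORD = k.FLD__ID
    join TBL_HASHTAG_CATEGORY_USAGE as c on u.FKF_CATEGORY = c.FLD__ID
    where u.FLD_COUNT > 0
    group by k.FLD_NAME
    having MIN(u.FLD_COUNT) > 0
"""

def benchmark(conn):
    """Execute the query and fetch every row; the returned elapsed time
    covers both, matching how the benchmark time was measured."""
    start = time.monotonic()
    rows = conn.execute(BENCH_QUERY).fetchall()
    return time.monotonic() - start, rows
```

Reading the whole result set matters: many drivers stream rows lazily, so timing only `execute()` would undercount the real cost of the query.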
The results are as follows:

We can see that the database did well up to about 100,000 records in the root table, after which the delay starts to be a little too high. This leads to the conclusion that an SQL database would not stand up to real-life big-data datasets.