| View previous topic :: View next topic |
| Author |
Message |
Yash Guest
|
Posted: Sat Oct 11, 2008 12:21 pm Post subject: Analytical processing problem |
|
|
We collect TV viewership data to provide details of ‘who watched what’
to advertising agencies. A typical report would show the distribution
of the number of people who watched a given advertisement. The report
will be further sliced into channels, categories of viewers, etc.
We were thinking of OLAP as the solution as it keeps the data pre-
processed.
The problem we face is the following:
User A watches from T0 to T20
User B watches from T0 to T10
User C watches from T10 to T20
If a single fact record represents 10 using of the time,
A report for T0 to T10 will show 2 users
A report for T10 to T20 shows 2 users
A report for T0 to T20 would show an aggregated count of 2 + 2 = 4
which is wrong. This should ideally be 3 as 3 distinct users have
watched between T0 and T20
Apart from this, the number of fact records created on a per 10 second
timeframe would be very large.
What would be an ideal solution to this problem? Is there a different
way we should be creating fact data? Is there a different approach
other than OLAP?
The raw logs have the fields:
UserID, advt ID, start time, end time.
The records represent the time frame for which the advertisement was
being watched without switching the channel.
We were thinking of extracting out of this to create the fact records.
Are there existing products that do something similar?
Thanks,
Yash |
|
| |
|
Back to top |
H. S. Lahman Guest
|
Posted: Sat Oct 11, 2008 8:46 pm Post subject: Re: Analytical processing problem |
|
|
Responding to Yash...
| Quote: | We collect TV viewership data to provide details of ‘who watched what’
to advertising agencies. A typical report would show the distribution
of the number of people who watched a given advertisement. The report
will be further sliced into channels, categories of viewers, etc.
We were thinking of OLAP as the solution as it keeps the data pre-
processed.
The problem we face is the following:
User A watches from T0 to T20
User B watches from T0 to T10
User C watches from T10 to T20
If a single fact record represents 10 using of the time,
A report for T0 to T10 will show 2 users
A report for T10 to T20 shows 2 users
A report for T0 to T20 would show an aggregated count of 2 + 2 = 4
which is wrong. This should ideally be 3 as 3 distinct users have
watched between T0 and T20
Apart from this, the number of fact records created on a per 10 second
timeframe would be very large.
|
I think this is a basic design problem rather than a pattern problem. I
also suspect that optimization will be a bigger problem that simply
ensuring the viewers in the count are unique.
[Advertisement]
+ adID
| 1
|
| R1
|
| presented in
| *
[TimeFrame]
+ startTime
+ endTime
| *
| views in
|
R2 |--------- [LoginRecord]
|
| *
[Viewer]
+ viewerID
Basically the R2 association is managed with a sorted lookup table that
maps viewers to time slices. So to get the count of unique viewers for
any given advertisement, one just navigates the R1 -> R2 collections.
Note that there are relatively few [TimeFrame] instances (8640 for 24
hrs). R1 is instantiated from knowing when the ads play. The raw records
are only used to instantiate entries in the R2 lookup table. But that
table is, at worst, O(number of viewers x number of time frames) in
size. One can also play implementation games by making viewerID
consecutive integers and encoding the times as integer codes to reduce
the lookup table's physical size in memory.
If there is a very large number of viewers (i.e., the log source is not
a sampling), then you might want to preprocess by creating the lookup
table temporarily in the DB from the raw data and then reading it in
sequentially to actually count the viewers per ad.
| Quote: |
What would be an ideal solution to this problem? Is there a different
way we should be creating fact data? Is there a different approach
other than OLAP?
The raw logs have the fields:
UserID, advt ID, start time, end time.
|
This seems strange. I would expect the raw data to be viewer logs
without any adID because you would probably know when each ad played
from other sources. IOW, adID may be a 3NF violation because it is
dependent on {start time, end time}, depending on what the key of the
raw log records is.
If the raw data is coming from a viewer logging system, then I would
expect the records to be naturally sorted on {start time, end time},
which would would be the key. That would make the construction of the R2
lookup table much easier. If you need to create the R2 lookup table in
the DB, that should be fairly easy if the raw data is sorted
conveniently. Since the lookup table is sorted, the counting can be done
by reading the lookup table records sequentially and just incrementing a
count.
--
There is nothing wrong with me that could
not be cured by a capful of Drano.
H. S. Lahman
hsl@pathfindermda.com
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
info@pathfindermda.com for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH |
|
| |
|
Back to top |
Walter Mitty Guest
|
Posted: Sat Oct 11, 2008 11:20 pm Post subject: Re: Analytical processing problem |
|
|
"H. S. Lahman" <hsl@pathfindermda.com> wrote in message
news:BT3Ik.234$Rx2.69@nwrddc01.gnilink.net...
| Quote: | Responding to Yash...
We collect TV viewership data to provide details of ‘who watched what’
to advertising agencies. A typical report would show the distribution
of the number of people who watched a given advertisement. The report
will be further sliced into channels, categories of viewers, etc.
We were thinking of OLAP as the solution as it keeps the data pre-
processed.
The problem we face is the following:
User A watches from T0 to T20
User B watches from T0 to T10
User C watches from T10 to T20
If a single fact record represents 10 using of the time,
A report for T0 to T10 will show 2 users
A report for T10 to T20 shows 2 users
A report for T0 to T20 would show an aggregated count of 2 + 2 = 4
which is wrong. This should ideally be 3 as 3 distinct users have
watched between T0 and T20
Apart from this, the number of fact records created on a per 10 second
timeframe would be very large.
I think this is a basic design problem rather than a pattern problem. I
also suspect that optimization will be a bigger problem that simply
ensuring the viewers in the count are unique.
[Advertisement]
+ adID
| 1
|
| R1
|
| presented in
| *
[TimeFrame]
+ startTime
+ endTime
| *
| views in
|
R2 |--------- [LoginRecord]
|
| *
[Viewer]
+ viewerID
Basically the R2 association is managed with a sorted lookup table that
maps viewers to time slices. So to get the count of unique viewers for any
given advertisement, one just navigates the R1 -> R2 collections.
Note that there are relatively few [TimeFrame] instances (8640 for 24
hrs). R1 is instantiated from knowing when the ads play. The raw records
are only used to instantiate entries in the R2 lookup table. But that
table is, at worst, O(number of viewers x number of time frames) in size.
One can also play implementation games by making viewerID consecutive
integers and encoding the times as integer codes to reduce the lookup
table's physical size in memory.
If there is a very large number of viewers (i.e., the log source is not a
sampling), then you might want to preprocess by creating the lookup table
temporarily in the DB from the raw data and then reading it in
sequentially to actually count the viewers per ad.
What would be an ideal solution to this problem? Is there a different
way we should be creating fact data? Is there a different approach
other than OLAP?
The raw logs have the fields:
UserID, advt ID, start time, end time.
This seems strange. I would expect the raw data to be viewer logs without
any adID because you would probably know when each ad played from other
sources. IOW, adID may be a 3NF violation because it is dependent on
{start time, end time}, depending on what the key of the raw log records
is.
If the raw data is coming from a viewer logging system, then I would
expect the records to be naturally sorted on {start time, end time}, which
would would be the key. That would make the construction of the R2 lookup
table much easier. If you need to create the R2 lookup table in the DB,
that should be fairly easy if the raw data is sorted conveniently. Since
the lookup table is sorted, the counting can be done by reading the lookup
table records sequentially and just incrementing a count.
--
There is nothing wrong with me that could
not be cured by a capful of Drano.
H. S. Lahman
hsl@pathfindermda.com
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
info@pathfindermda.com for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH
|
My two cents...
Do a web search on "overlapping time intervals". You'll find several
solutions to the exact same problem. |
|
| |
|
Back to top |
Yash Guest
|
Posted: Sun Oct 12, 2008 5:45 am Post subject: Re: Analytical processing problem |
|
|
I agree with the DB design proposed by H. S. Lahman. However, one
thing to note is that the we do not want to query the relational DB to
count the viewership of ads every time a report is submitted. Typical
queries would be:
Show the count of distinct people who watched XYZ ad between 10:00am
and 11:am on 25th Sep
Show the count of distinct people who watched XYZ ad between 10:00am
and 10:pm on 25th Sep
OLAP would keep the counts(measures) precomputed in a different
structure(cube) and would not go to the DB for every request. The
counts of individual minutes would add up to the count of the hour
which futher adds up to the day.
But the time overlap explained in the original post prevents us from
using this technique as the aggregated counts woul be wrong. We would
not get the distinct number of viewers.
So the main challenge in the problem is how to pre-process the records
to compute and aggregate the counts the way OLAP does. I am still
looking for a this solution which may not involve OLAP but something
similar.
I could not find a solution on searching for "overlapping time
intervals". The problem we are dealing with is not only to find
overlapping time intervals but to aggregate the measures efficiently.
Thanks,
Yash
Walter Mitty wrote:
| Quote: | "H. S. Lahman" <hsl@pathfindermda.com> wrote in message
news:BT3Ik.234$Rx2.69@nwrddc01.gnilink.net...
Responding to Yash...
We collect TV viewership data to provide details of �who watched what�
to advertising agencies. A typical report would show the distribution
of the number of people who watched a given advertisement. The report
will be further sliced into channels, categories of viewers, etc.
We were thinking of OLAP as the solution as it keeps the data pre-
processed.
The problem we face is the following:
User A watches from T0 to T20
User B watches from T0 to T10
User C watches from T10 to T20
If a single fact record represents 10 using of the time,
A report for T0 to T10 will show 2 users
A report for T10 to T20 shows 2 users
A report for T0 to T20 would show an aggregated count of 2 + 2 = 4
which is wrong. This should ideally be 3 as 3 distinct users have
watched between T0 and T20
Apart from this, the number of fact records created on a per 10 second
timeframe would be very large.
I think this is a basic design problem rather than a pattern problem. I
also suspect that optimization will be a bigger problem that simply
ensuring the viewers in the count are unique.
[Advertisement]
+ adID
| 1
|
| R1
|
| presented in
| *
[TimeFrame]
+ startTime
+ endTime
| *
| views in
|
R2 |--------- [LoginRecord]
|
| *
[Viewer]
+ viewerID
Basically the R2 association is managed with a sorted lookup table that
maps viewers to time slices. So to get the count of unique viewers for any
given advertisement, one just navigates the R1 -> R2 collections.
Note that there are relatively few [TimeFrame] instances (8640 for 24
hrs). R1 is instantiated from knowing when the ads play. The raw records
are only used to instantiate entries in the R2 lookup table. But that
table is, at worst, O(number of viewers x number of time frames) in size.
One can also play implementation games by making viewerID consecutive
integers and encoding the times as integer codes to reduce the lookup
table's physical size in memory.
If there is a very large number of viewers (i.e., the log source is not a
sampling), then you might want to preprocess by creating the lookup table
temporarily in the DB from the raw data and then reading it in
sequentially to actually count the viewers per ad.
What would be an ideal solution to this problem? Is there a different
way we should be creating fact data? Is there a different approach
other than OLAP?
The raw logs have the fields:
UserID, advt ID, start time, end time.
This seems strange. I would expect the raw data to be viewer logs without
any adID because you would probably know when each ad played from other
sources. IOW, adID may be a 3NF violation because it is dependent on
{start time, end time}, depending on what the key of the raw log records
is.
If the raw data is coming from a viewer logging system, then I would
expect the records to be naturally sorted on {start time, end time}, which
would would be the key. That would make the construction of the R2 lookup
table much easier. If you need to create the R2 lookup table in the DB,
that should be fairly easy if the raw data is sorted conveniently. Since
the lookup table is sorted, the counting can be done by reading the lookup
table records sequentially and just incrementing a count.
--
There is nothing wrong with me that could
not be cured by a capful of Drano.
H. S. Lahman
hsl@pathfindermda.com
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
info@pathfindermda.com for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH
My two cents...
Do a web search on "overlapping time intervals". You'll find several
solutions to the exact same problem. |
|
|
| |
|
Back to top |
H. S. Lahman Guest
|
Posted: Sun Oct 12, 2008 10:24 pm Post subject: Re: Analytical processing problem |
|
|
Responding to Yash...
| Quote: | I agree with the DB design proposed by H. S. Lahman. However, one
thing to note is that the we do not want to query the relational DB to
count the viewership of ads every time a report is submitted. Typical
queries would be:
Show the count of distinct people who watched XYZ ad between 10:00am
and 11:am on 25th Sep
Show the count of distinct people who watched XYZ ad between 10:00am
and 10:pm on 25th Sep
|
There is nothing to prevent you from computing the counts once and
storing them in the Timeframe objects. (Once the raw data is recorded
for a time frame, the number of viewers won't change.) You are
deliberately denormalizing the data but in this context that should be
pretty benign precisely because the counts never change once they are
computed.
Then you can persist the TimeFrame objects separately and the individual
requests just access those TimeFrame objects. Unless there is something
else going on, that should not bog down the system very much since there
are so few TimeFrame objects compared to raw data records.
| Quote: |
OLAP would keep the counts(measures) precomputed in a different
structure(cube) and would not go to the DB for every request. The
counts of individual minutes would add up to the count of the hour
which futher adds up to the day.
|
Note that aggregation to larger time frame units is also a one-time
computation and you can save those flavors of TimeFrame as well. Better
yet, that aggregation is done from the existing TimeFrame objects rather
than the raw data.
| Quote: | But the time overlap explained in the original post prevents us from
using this technique as the aggregated counts woul be wrong. We would
not get the distinct number of viewers.
|
There is no free lunch. OLAP has to compute the counts so that solution
will have to process the raw data, probably pretty much the same way as
I suggested. IOW the queries will have to be more complicated. More to
the point, OLAP has to save the counts somehow and that requires
resources. In the end it is probably going to disk anyway with the OLAP
server managing LRU access issues for the counts.
--
There is nothing wrong with me that could
not be cured by a capful of Drano.
H. S. Lahman
hsl@pathfindermda.com
Pathfinder Solutions
http://www.pathfindermda.com
blog: http://pathfinderpeople.blogs.com/hslahman
"Model-Based Translation: The Next Step in Agile Development". Email
info@pathfindermda.com for your copy.
Pathfinder is hiring:
http://www.pathfindermda.com/about_us/careers_pos3.php.
(888)OOA-PATH |
|
| |
|
Back to top |
|