Projects/SCADS/Indexes

From RAD Lab

Jump to: navigation, search

Contents

Conclusions

Index function classes

  • simple: update from one value in main data is update to one index
  • more complex: update from one of value is update to multiple values in index (didn't see more than two yet)
  • collection membership: (two-way join) does an object belong to some group of objects, e.g. tasks and lists; also member count
  • reversing relationships --store one way and get lookups both ways: (self join) e.g. friendships
  • updated computation: (aggregation, self lookup and store agg state, if not holistic) e.g. photo viewcount, or seller rating
  • sort orders: (interesting orders) to support different sort orders, need index for each one
  • classifiers: (drill-down, roll-up) "tag cloud with constraints" tree-like or DAGs for topic; have index for hierarchy structure and for describing object membership.

Other thoughts

  • should support serialized column values to avoid second lookup on primary key... implications of either technique?
  • may need support for serialized column value, if having pointer won't make sense. e.g. seller rating if ratings are stored with auctions
  • implications on running time of having mix of simple updates and complicated update computation (like collaborative filtering) in one function, e.g. adding photo to favorite set means updating interestingness
  • topic hierarchies result in "prefix searches" for membership; column values can be treated as the prefix, e.g. "gardening, vegetables, organic". should there be an index for each prefix to check membership? or can we do prefix matching on columns? (this assumes topic membership is stored with object) what about queries like "top 10 organics"?

Making a new index

Need original data

  • Making a new sort order: since other sort orders have stored some top k, the new order may call for a different set of k items
  • set of ANDed predicates, e.g. tag1 AND tag2 AND tag3:

Use existing indexes

  • set of ORed predicates, e.g. tag1 OR tag2 OR tag3: can use individual indexes on each component to generate an index of their union; this will bound the new index cardinality by the number of arguments
  • distributive aggregation, e.g. count(tag1) + count(tag2), assuming individual counts are indexes

Example schemas and index functions

facebook

Main data

<!-- main tables -->
	<ColumnFamily ColumnType="Super" Index="Time">events <!-- key is event_id -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: time/, location/, etc. -->
		</ColumnFamily>
		<ColumnFamily Index="Name">guests
			<!-- columns: uid1/, uid2/, ... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">users <!-- key is uid -->
		<ColumnFamily Index="Name">profile
			<!-- columns: email/, hometown/, birthday/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">friends
			<!-- columns: uid1/, uid2/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">photos
			<!-- columns: pid1/, pid2/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Time">pokes
			<!-- columns: uid1/, uid2/, ... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">groups <!-- key is gid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: name/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">admins
			<!-- columns: uid1/, uid2/,... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">members
			<!-- columns: uid1/, uid2/... -->
		</ColumnFamily>
	</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">photos <!-- key is pid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: owner_uid/, caption/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">albums
			<!-- columns: name1/, name2/,... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">tags
			<!-- columns: uid1/, uid2/,... -->
		</ColumnFamily>
		<ColumnFamily Index="Time">comments
			<!-- columns: uid1/comment, uid2/comment,... -->
		</ColumnFamily>
	</ColumnFamily>


Index functions

<!-- indexes -->
	<ColumnFamily Index="Name">friendship <!-- key is uid -->
		<!-- columns: uid1/, uid2/,... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">events_upcoming <!-- key is uid -->
		<!-- columns: event_id1/, event_id2/, ... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">events_friends <!-- key is uid -->
		<!-- columns: event_id1/, event_id2/, ... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">photos_of_us <!-- key is uid -->
		<!-- columns: pid/uid1, pid/uid2,... -->
	</ColumnFamily>


// given an event id, produce a tuple for each event that a user is affiliated with, in sorted order
[uid->eid] events_upcoming(eid) {
	x = get(eid,Events,guests) // get all uid columns for this event id
	for each x
		put(x,index_events_upcoming,eid)
}
// given an event id, produce tuple for each event that a user's friend is affiliated with
[uid,eid] events_friends(eid) {
	x = get(eid,Events,guests) // for all guests of this event
	for each x
		f = get(x,index_friends)//find all their friends
		for each f
		put(f,index_events_friends)
}
// for a given photo, do pairwise of everyone in it
[uid1,uid2,pid],[uid2,uid1,pid] photos_of_us(pid) {
	x =y = get(pid, Photos,tags) // get everyone in photo
	for each x != y // pair up everyone
		put(x,index_photos_us,y,pid)
		put(y,index_photos_us,x,pid)
}

thoughts:

  • cascading index updates of adding another friend, this implies order of index updates
  • cascades are also a consequence of allowing joins
  • logic language to express bounded recursion? (may be tricky in SQL to express number of times to do something)

digg

Main data

<!-- Main data  -->
		<ColumnFamily ColumnType="Super" Index="Name">users
			<!-- key is user_id -->
			<ColumnFamily Index="Name">metadata
				<!-- columns: profile key/val -->
			</ColumnFamily>
			<ColumnFamily Index="Time">dugg_items
				<!-- columns: item_id/+-1 -->
			</ColumnFamily>
			<ColumnFamily Index="Time">dugg_comments
				<!-- columns: item_id:comment_text/+-1 -->
			</ColumnFamily>
			<ColumnFamily Index="Name">friends
				<!-- columns: user_id/ -->
			</ColumnFamily>
		</ColumnFamily>
		<ColumnFamily ColumnType="Super" Index="Time">items
			<!-- key is item_id -->
			<ColumnFamily Index="Name">metadata
				<!-- columns: topic/, title/ , added_by_user/-->
			</ColumnFamily>
			<ColumnFamily Type="Super" Index="Time">comments
				<ColumnFamily Index="Time">comment_id
					<!-- columns: user_id/comment_text, parent_comment/ -->
				</ColumnFamily>
			</ColumnFamily>
		</ColumnFamily>
		<ColumnFamily Index="Name">topics <!-- key is topic_id -->
			<!-- columns: text/parent_topic_id -->
		</ColumnFamily>

Index functions

<!-- Indexes -->
		<ColumnFamily Index="Time">diggs
			<!-- key is item_id -->
			<!-- column: num_diggs -->
		</ColumnFamily>
		<ColumnFamily ColumnType="Super" Index="Time">top
			<!-- key is topic_id -->
			<ColumnFamily Index="Name">dugg_items
				<!-- columns: item_id/num_diggs-->
			</ColumnFamily>
		</ColumnFamily>
		<ColumnFamily Index="Time">recent
			<!-- key is topic_id -->
			<ColumnFamily Index="Time">dugg_items
				<!-- columns: item_id/num_diggs-->
			</ColumnFamily>
		</ColumnFamily>
		<ColumnFamily Index="Time">comment_diggs
			<!-- key is comment_id -->
			<!-- columns: num_diggs/ -->
		</ColumnFamily>
		<ColumnFamily Index="Name">subtopics
			<!-- key is topic_id -->
			<!-- columns: topic_id(s) -->
		</ColumnFamily>

urls: digg.com/topic/thing_title

thoughts:

  • topics: tree hierarchy/ancestry structure? limit depth of tree?
  • threaded conversation, can has arbitrary depth?
  • an index for different sort orders or granularities or type? (affected by cardinality constraints)
  • sorted order by count? when count changes, order can change too... need aggregation state
  • for cascading updates, can we decide which ones have ordering constraints and which ones don't? can affect meeting SLAs


// given a vote on an item, update that item's diggcount
[item_id->num_diggs] diggs(item_id, +-)
{
	// get current num_diggs from index diggs:item_id
	// set diggs:item_id:num_diggs to num_diggs+-1 
}
// given vote on a comment on an item, update that comment's diggcount 
[item_id:comment_id->num_diggs] comment_diggs(item_id, +-)
{
	// get current num_diggs from index comment_diggs:item_id
	// set comment_diggs:item_id:num_diggs to num_diggs+-1 
}
// given a new topic, update any topics that should include it as a subtopic
[topic_id->sub_topic] subtopics(topic_id)
{
	// get parent topic from topics:topic_id:parent_topic
	// while have parent_topic_id:
	// 	add column to subtopics:parent_topic_id
	// 	set parent_topic_id to topics:parent_topic_id:parent_topic_id
	// end while
}
// intuition, read all topic to get all
[topic_id->item_id,num_diggs] top(item_id) // need to update all topics that item belongs to
{
	// get num_diggs from index diggs:item_id:num_diggs
	// get topic_id from items:metadata:topic
	// while have topic_id:
	// 	set top:topic_id:item_id to num_diggs
	// 	set topic_id to topics:topic_id:parent_topic_id
	// end while
}
[topic_id->item_id,num_diggs] recent(item_id)
{
	// get num_diggs from index diggs:item_id:num_diggs
	// get topic_id from items:metadata:topic
	// while have topic_id:
	// 	set top:topic_id:item_id to num_diggs
	// 	set topic_id to topics:topic_id:parent_topic_id
	// end while
}
// given a new comment on an item, update the list of comments for that item
[item_id->comment_id,num_diggs] item_comments(item_id:comment_id)
{
	// get num_diggs from comment_diggs:item_id
	// set item_comments:item_id:comment_id to num_diggs

}

eBay

Main data

<!-- main data tables -->
	<ColumnFamily ColumnType="Super" Index="Name">users <!-- key is uid -->
		<ColumnFamily Index="Name">profile
			<!-- columns: name/, location/,... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">items <!-- key is item_id -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: owner/uid, name/, location/, etc. -->
		</ColumnFamily>
		<ColumnFamily Index="Name">photos
			<!-- columns: photo/binary -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Time">auctions <!-- key is item_id -->
		<ColumnFamily Index="Time">metadata
			<!-- columns: start/, end/, start_price/ -->
		</ColumnFamily>
		<ColumnFamily Index="Time">bid_id
			<!-- columns: uid/, bid_amount -->
		</ColumnFamily>
		<ColumnFamily Index="Name">rate_buyer
			<!-- columns: buyer/uid, comment/text, feedback/, axis/detail_rating... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">rate_seller
			<!-- columns: buyer/uid, comment/text, feedback/, axis/detail_rating... -->
		</ColumnFamily>
	</ColumnFamily>

Index functions

<!-- indexes -->
	<ColumnFamily Index="Name">seller_rating <!-- key is uid -->
                  <ColumnFamily Index="Name">feedback
                              <!-- columns: feedback_avg/, feedback_sum, feedback_count/ -->
                   </ColumnFamily>
                 <ColumnFamily Index="Name">axis1
                              <!-- columns: num_stars/, axis1_sum, axis1_count/ -->
                   </ColumnFamily>
                    .....
	</ColumnFamily>
	<ColumnFamily Index="Name">buyer_rating <!-- key is uid -->
                  <ColumnFamily Index="Name">feedback
                              <!-- columns: feedback_avg/, feedback_sum, feedback_count/ -->
                   </ColumnFamily>
                 <ColumnFamily Index="Name">axis1
                              <!-- columns: num_stars/, axis1_sum, axis1_count/ -->
                   </ColumnFamily>
                    .....
	</ColumnFamily>
	<ColumnFamily Index="Name">user_items <!-- key is uid -->
		<!-- columns: item_id1/, item_id2/, ... -->
	</ColumnFamily>


// given a new seller rating on an item, update the seller's aggregatation of ratings
public class SellerRatingIndex implements IndexUpdateFunction {
	public List<RowMutation> computeUpdate(IndexUpdateMessage message) {
		ArrayList<RowMutation> list = new ArrayList<RowMutation>();
		ColumnFamily scf_auction = message.getColumnFamilies().get("auction");
		ColumnFamily cf_rating = scf_auction.getColumn("rate_seller");
		Collection<IColumn> cols_rating = cf_rating.getAllColumns();

		// update the old values with the new
		IColumn uid = cols_rating.get("buyer"");
		ColumnFamily cf_previous = CassandraServer.getPeerStorageServer().get_cf(" // broken?
			TheTable","seller_rating_index","seller_rating_index:"+uid);
			
		Double new_feedback_sum = cf_previous.getColumn("feedback_sum") +cols_rating.get("feedback")
		Double new_feedback_count = cf_previous.getColumn("feedback_count")+1
		Double new_feedback_avg = new_feedback_sum/new_feedback_count
		
		RowMutation rmc = new RowMutation("TheTable","seller_rating_index:"+uid);
		rmc.add(("seller_rating_index:feedback_sum"),(""+new_feedback_sum)).getBytes());
		rmc.add(("seller_rating_index:feedback_count"),(""+new_feedback_count)).getBytes());
		rmc.add(("seller_rating_index:feedback_avg"),(""+new_feedback_avg)).getBytes());
	
          	list.add(rmc);
		return list;
	}
}
// given a new seller rating on an item, update the buyer's aggregatation of ratings
(same as above)
// given a new item, update list of user's items
 public class UserItemIndex implements IndexUpdateFunction {
	public List<RowMutation> computeUpdate(IndexUpdateMessage message) {
		ArrayList<RowMutation> list = new ArrayList<RowMutation>();
		ColumnFamily scf_items = message.getColumnFamilies().get("items");
		ColumnFamily cf_metada = scf_auction.getColumn("metadata");
		Collection<IColumn> cols_meta = cf_metadata.getAllColumns();

		IColumn uid = cols_meta.getColumn("owner");
		
		RowMutation rmc = new RowMutation("TheTable","user_item_index:"+uid);
		rmc.add(("seller_rating_index:"+item_id),(""+item_id)).getBytes());
		list.add(rmc);
		
		return list;
	}
}

remember the milk

Main data

<!-- main data -->
<ColumnFamily ColumnType="Super" Index="Name">users <!-- key is uid -->
	<ColumnFamily Index="Name">metadata
		<!-- columns: name/, ... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">friends
		<!-- columns: uid1/, uid2/,... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">list1
		<!-- columns: tid1, tid2.... -->
	</ColumnFamily>
	...
</ColumnFamily>
<ColumnFamily ColumnType="Super" Index="Name">tasks <!-- key is tid -->
	<ColumnFamily Index="Name">metadata
		<!-- columns: text/, due_date/, priority/,... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">owners
		<!-- columns: uid1, uid2,... -->
	</ColumnFamily>
</ColumnFamily>

Index functions

<!-- indexes -->
<ColumnFamily ColumnType="Super" Index="Name">list_tasks_priority <!-- key is uid -->
	<ColumnFamily Index="Name">list1
		<!-- columns: tid1, tid2, ... -->
	</ColumnFamily>
	....
</ColumnFamily>
<ColumnFamily ColumnType="Super" Index="Name">list_tasks_due <!-- key is uid -->
	<ColumnFamily Index="Time">list1
		<!-- columns: tid1, tid2, ... -->
	</ColumnFamily>
	....
</ColumnFamily>
<ColumnFamily Index="Time">friends <!-- key is uid -->
	<!-- columns: uid1, uid2, ... -->
</ColumnFamily>

- tasks can be part of multiple smart lists, but still belong to one list - users have contacts and belong to groups - users can share or publish tasks and lists

// task postpone, update index for when things are due

flickr

Main data

	<!-- main data -->"
	<ColumnFamily ColumnType="Super" Index="Name">users <!-- key is uid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: name/, email/, etc. -->
		</ColumnFamily>
		<ColumnFamily Index="Name">photo_views
			<!-- columns: pid1/count, pid2/count... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">photos <!-- key is pid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: owner/uid, title/, filename/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">properties
			<!-- columns: camera/, exposure/, date/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">tags
			<!-- columns: tag1/, tag2, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">friends
			<!-- columns: uid1/, uid2/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">family
			<!-- columns: uid1/, uid2/, ... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">groups <!-- key is gid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: title/, privacy/,... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">members
			<!-- columns: uid1/, uid2/... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">sets <!-- key is setid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: owner/uid, title/... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">photos
			<!-- columns: pid1/, pid2/, ... -->
		</ColumnFamily>
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">collections <!-- key is collectid -->
		<ColumnFamily Index="Name">metadata
			<!-- columns: owner/uid, title/... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">sets
			<!-- columns: setid1/, setid/, ... -->
		</ColumnFamily>
		<ColumnFamily Index="Name">collections
			<!-- columns: collectid1/, collectid2/, ... -->
		</ColumnFamily>
	</ColumnFamily>

Index functions

<!-- indexes -->
	<ColumnFamily Index="Name">photo_in <!-- key is pid -->
		<!-- columns: photostream/uid, set1/, set2/,... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">location_photos <!-- key is lat, long -->
		<!-- columns: pid1, pid2... -->
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">user_tags <!-- key is uid -->
		<ColumnFamily Index="Name">tag1 
			<!-- columns: pid1, pid2,... -->
		</ColumnFamily>
		...
	</ColumnFamily>
	<ColumnFamily Index="Name">global_tags <!-- key is tag -->
		<!-- columns: count/, pid1/, pid2/... -->
	</ColumnFamily>
	<ColumnFamily Index="Time">recent_photos <!-- key is tag -->
		<!-- columns: pid1/, pid2/... -->
	</ColumnFamily>
	<ColumnFamily Index="Name">photo_views <!-- key is pid -->
		<!-- column: count -->
	</ColumnFamily>
	<ColumnFamily Index="Name">in_grop <!-- key is uid -->
		<!-- column: gid1,gid2... -->
	</ColumnFamily>
	<ColumnFamily ColumnType="Super" Index="Name">are_friends_family <!-- key is uid -->
		<ColumnFamily Index="Name">uid1
			<!-- columns: uid2, uid3... -->
		</ColumnFamily>
		...
	</ColumnFamily>

- global tag cloud - global tags, personal photostream tags - tags sorted: most recent, most 'interesting' (lots of comments?) - users upload photos, have "pro" status - users can be friends or family members - groups have members, public/private, pool of photos - photo metadata: location, usage rights, photography properties - photo advanced meta: viewcount, tags (from diff users) - photos can belong to group pools - photos can have descriptions/comments on parts of them - photos can belong to multiple sets (main set is photostream) - sets and/or collections can belong to multiple collections - comment conversation around a photo - tag map search

// new photo? update photo in, location photos
// add photo to set?  add set to collection?
[pid->photostream and/or pid->setid] photo_in(pid) // which user's photostream, and other sets/collection, and favorites
{
	if favorite, update interesting photo thing
	// get photos:metadata:owner
	// set photo_in:photostream
	// set photo_in:setid
}
[location->pid] location_photos(pid) // given a location, find photos tagged as being there
{
	// get photos:metadata:location
	// set location_photos:pid
}
// new tag? update user_tags, global_tags, interesting tags, recent_tags
[uid->tag] user_tags(pid:tags:tag)
{
	// get photos:metadata:owner
	// set user_tags:tag:pid
}
[tag->count, pid] global_tags(pid:tags:tag)
{
	// get global_tags:tag:count
	// set global_tags:tag:count = count+1 (or 1 if didn't exist)
}
// new comment or view? update interesting photos
[pid->count] photo_views(users:views:pid)
{
	// get photo_views:count
	// set photo_views:count++
	// get interesting:
}
// group membership?
[uid->gid] in_group(gid:members:uid)
{
	// set in_group:uid:gid
}
[uid1->uid2, uid2->uid1] are_friends_family(uid1, uid2)
{
	// set are_friends_family:uid1:uid2
	// set are_friends_family:uid2:uid1
}

tag_cluster() // statistic groups of tags and their photos interesting_photos() look at favorites, comments, views


Todo

  • Write index functions in java syntax
  • Come up with generic templates for the classes of index functions (or rules to govern them)
  • Come up with cardinality implications for index function classes
  • Change perspectives: given a value that's being updated, which indexes will be affected? Is there an implied ordering?

Notes

  • lots of predicates in WHERE clause would result in explosion of index functions... need generic way to deal with this
  • quantify tradeoff cost of maintaining index and not having it... but SCADS system should tell you! (but is that reality!?)